glLockArraysEXT/glUnlockArraysEXT?

Bwb: I tried with the vertices packed and 16-byte aligned, and there is no difference in the speed increase.

Zed: I tried rendering all the triangles outside the view frustum, so nothing is displayed on screen. I got a "nice" constant +4 fps increase… but this was, I believe, because I was doing 19 (!) passes, so the transforms could be reused a lot. After some arithmetic, I found that:

  • With CVAs, I got 19 fps for 19 passes, each pass = 10,000 tris => that's 3,610,000 transforms per second.
  • Without CVAs, I got 15 fps for 19 passes, each pass = 10,000 tris => that's 2,850,000 transforms per second.
    So CVAs gave roughly a 27% increase. Unfortunately that's not very useful, since in general there are far fewer passes, fewer triangles, and more fillrate load…

Y.

CVA (like display lists) may allow the card to download the data to faster memory than your malloc()-ed array memory, so with re-use, geometry transfer will be faster. However, if you're using the NV_ vertex buffer allocation extension, or if geometry transfer (or processing, on non-HTL cards) is not your bottleneck, it might not help much.

Hi!
bgl: As I said before, I limited the texture size to one eighth and ran my app at 320x200x16 to make sure it is not a fillrate problem…

Greets, XBTC!


I was talking with some of the driver folks here, and there are several things that are important for CVAs to provide a significant improvement in speed. Your vertex arrays should be aligned on 16- or 32-byte boundaries and should have a stride of 4 floats (16 bytes) due to SSE restrictions. You should also make sure your vertex arrays are reasonably sized, since there is some overhead involved in using CVAs (something like 200 or more vertices is probably fine). Lastly, and maybe most importantly, your CPU speed and bus speed are a big factor. If you have a fast CPU but a slow bus, the transformations may be essentially "free", since those calculations take negligible time compared to actually pushing the transformed vertex data up to the card.
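For what it's worth, here is a minimal sketch of one way to meet those alignment and stride suggestions in C. The struct and function names are made up for illustration, and it assumes MSVC's _aligned_malloc; use whatever aligned allocator your platform provides:

    #include <stddef.h>
    #include <malloc.h>   /* _aligned_malloc / _aligned_free (MSVC-specific) */
    #include <GL/gl.h>

    /* 12 bytes of position padded out to a 16-byte (4-float) stride for SSE */
    typedef struct {
        GLfloat x, y, z;
        GLfloat pad;
    } PaddedVertex;

    /* 32-byte alignment satisfies both the 16- and 32-byte recommendations */
    static PaddedVertex *AllocVertices(size_t count)
    {
        return (PaddedVertex *)_aligned_malloc(count * sizeof(PaddedVertex), 32);
    }

    static void SetVertexPointer(const PaddedVertex *v)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, sizeof(PaddedVertex), v);  /* 16-byte stride */
    }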

dave

Daveg: that could explain a lot of things. My computer is quite old (a P200 MMX), and it was already upgraded from a 486 DX4-100. I'm quite sure the bus is extremely slow… :)

Y.

OK, this may not apply to ALL graphics cards, but the following text was pulled from NVIDIA's OpenGL FAQ (http://www.nvidia.com/Marketing/Developer/DevRel.nsf/3e0a464ce391addc8825681700740113/f706b8da926e1c548825685c006763d8/$FILE/OpenGL_Perf_FAQv2.doc)… messy URL.


7. What do compiled vertex arrays (CVAs) buy me in terms of performance?
Although your mileage may vary, compiled vertex arrays can yield a large increase in performance over other modes of transport – specifically, if you frequently reuse vertices within a vertex array, have the appropriate arrays enabled and use glDrawElements. Only one data format is specifically optimized for use within CVAs:

Vertex Size/Type - 3/GLfloat
Normal Type - NONE
Color Size/Type - 4/GLubyte
Texture Unit 0 Size/Type - 2/GLubyte
Texture Unit 1 Size/Type - 2/GLubyte

Note that there is no corresponding glInterleavedArrays enumerant for this format (i.e. you must use glVertexPointer, glColorPointer and glTexCoordPointer to specify the arrays).

When using compiled vertex arrays with this format, it’s important to maximize use of the vertices that have been locked. For example, if you lock down 100 vertices and only use 25 of them in subsequent glDrawElements calls before unlocking, you will have relatively poor performance.

For more flexibility in accelerated data formats, it's recommended that the vertex_array_range extension be used (see below).

The two things to note here are to lock only the range you are going to use, and which vertex format to use. Note that other IHVs may optimize for different vertex formats in their drivers, but this one is probably universally supported, as I suspect it is the Quake format (it seems like the most likely one to optimize for).
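To make that concrete, here is a rough sketch (mine, not from the FAQ) of specifying that kind of layout with the separate pointer calls and the lock/draw/unlock sequence. It assumes the glLockArraysEXT, glUnlockArraysEXT and glClientActiveTextureARB entry points have already been fetched with wglGetProcAddress. One caveat: core glTexCoordPointer only accepts GL_SHORT, GL_INT, GL_FLOAT and GL_DOUBLE, so the sketch uses GLfloat tex coords (the Quake-style layout) rather than GLubyte:

    static void DrawLockedMesh(const GLfloat *pos, const GLubyte *col,
                               const GLfloat *uv0, const GLfloat *uv1,
                               const GLushort *indices,
                               GLsizei vertexCount, GLsizei indexCount)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, pos);            /* 3 x GLfloat */

        glEnableClientState(GL_COLOR_ARRAY);
        glColorPointer(4, GL_UNSIGNED_BYTE, 0, col);     /* 4 x GLubyte */

        /* No normal array, matching "Normal Type - NONE" above */

        glClientActiveTextureARB(GL_TEXTURE0_ARB);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glTexCoordPointer(2, GL_FLOAT, 0, uv0);          /* unit 0 */

        glClientActiveTextureARB(GL_TEXTURE1_ARB);
        glEnableClientState(GL_TEXTURE_COORD_ARRAY);
        glTexCoordPointer(2, GL_FLOAT, 0, uv1);          /* unit 1 */

        /* Lock only the range the glDrawElements call will reference */
        glLockArraysEXT(0, vertexCount);
        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
        glUnlockArraysEXT();
    }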

In future drivers (as in, changes already made, drivers not released yet) we won’t be picky about the format like that…

  • Matt

Just thought of another thing. It's kinda hard to tell exactly what everyone here is doing in their code, so I'll just explicitly state this to make sure everyone understands how to take advantage of CVAs (if you already know, don't take this posting as an insult).

The two biggest advantages of compiling vertex arrays come when you either reuse vertices or make multiple passes over the same geometry.

In the case of reusing vertices, make sure you actually do this in your geometry. If you have two triangles that share one or more common vertices, make sure that you don't duplicate those vertices in your array. In other words, if you have triangle A (V1, V2, V3) and triangle B (V4, V5, V6), and vertices V3 and V4 have the same position, tex coords, normal, and color, then you should remove V4 from your array and define triangle B as (V3, V5, V6) instead.
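As a rough illustration (made-up coordinates, nobody's real data), the sharing is expressed through the index array passed to glDrawElements rather than by duplicating the vertex:

    /* Five unique vertices instead of six: V4 was a duplicate of V3 */
    GLfloat verts[] = {
        /* V1 */ 0.0f, 0.0f, 0.0f,
        /* V2 */ 1.0f, 0.0f, 0.0f,
        /* V3 */ 1.0f, 1.0f, 0.0f,   /* shared by triangles A and B */
        /* V5 */ 2.0f, 1.0f, 0.0f,
        /* V6 */ 2.0f, 0.0f, 0.0f,
    };

    GLushort indices[] = {
        0, 1, 2,   /* triangle A: V1, V2, V3 */
        2, 3, 4,   /* triangle B: V3, V5, V6 (V4 removed, V3 reused) */
    };

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_SHORT, indices);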

In the case of multiple passes, you need to lock only those arrays that will not change between passes. For example, say you have a set of triangles that have 4 textures rendered in 2 passes, and each texture needs its own set of tex coords (i.e. you have 4 sets of tex coords per vertex). You should do things in the following order (a code sketch follows the list):

  1. Start with all arrays disabled.
  2. Enable and set any array that will not change between passes (such as the vertex array)
  3. Lock the arrays
  4. Enable and set other arrays for the first pass. In this example we will set TexCoord Array for TMU 0 to the 1st set of tex coords, and set TexCoord Array for TMU 1 to the 2nd set of tex coords
  5. Render geometry
  6. Change whatever arrays need changing for the second pass. In this example we will set TexCoord Array for TMU 0 to the 3rd set of tex coords, and set TexCoord Array for TMU 1 to the 4th set of tex coords.
  7. Render geometry again.
  8. Repeat steps 4 and 5 as necessary
  9. Unlock the arrays

It might also be possible to enable all arrays at step one and have the driver detect when you change the tex coord array pointer after the lock, but I don't know if any drivers support this. It may also be problematic if the driver tries to compile the arrays while a tex coord array pointer points to a nonexistent array, so I don't think this would be the best idea.

Hi!
Daveg & Lord Kronos: No offense meant, 'cause detailed explanations are very cool for people who want to learn this stuff, but personally there is not a single thing you mention that I hadn't already tried in my code.
This leads me to the conclusion that the speed-up simply isn't that great…

And thanx a lot, guys… I like discussions like this one a lot and just want to thank you for all your input…

Greets, XBTC!


I've been doing some testing recently. I've set up a test app that draws a simplistic mesh of 10,000-15,625 triangles using different rendering techniques, starting from immediate mode. The mesh is deformed every frame so that it waves (sorta like Q3A banners). I'm filling in the vertex buffer every frame and using DrawElements on it, limiting the number of elements in a single call to 1024. There is minimal performance gain when I turn on CVAs.
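In case it's useful, a sketch of the kind of batching I mean (my own placeholder names; 1023 is the largest multiple of 3 under the 1024 limit, so a call never splits a triangle):

    static void DrawBatched(const GLushort *indices, GLsizei indexCount)
    {
        const GLsizei maxPerCall = 1023;   /* largest multiple of 3 <= 1024 */
        GLsizei offset = 0;

        while (offset < indexCount) {
            GLsizei count = indexCount - offset;
            if (count > maxPerCall)
                count = maxPerCall;
            glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, indices + offset);
            offset += count;
        }
    }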

However, there is a substantial performance gain when I use wglAllocate’d memory for the vertex buffer using NV_vertex_array_range extension. This is no surprise, since the memory with the write-combine characteristics is perfect for this kinda thing. The trick is to fill data sequentially (otherwise it sucks) and rotate between multiple buffers because of synchronization (or use NV_fence).
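Roughly, the setup looks like this (a sketch only: the NV entry points are assumed to have been fetched with wglGetProcAddress, the names are placeholders, and 0.0f/0.0f/0.5f are the usual read-frequency/write-frequency/priority values for requesting AGP write-combined memory):

    /* Allocate the write-combined buffer once at startup */
    void *agpMem = wglAllocateMemoryNV(bufferBytes, 0.0f, 0.0f, 0.5f);
    if (agpMem) {
        glVertexArrayRangeNV(bufferBytes, agpMem);
        glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    }
    /* else: fall back to ordinary system-memory vertex arrays */

    /* Per frame: fill the buffer sequentially (write-combining hates
     * scattered writes), then draw from it.  In a real app, rotate between
     * several buffers, or use NV_fence, so you never overwrite data the
     * GPU is still reading. */
    GLfloat *dst = (GLfloat *)agpMem;
    GLsizei i;
    for (i = 0; i < vertexCount * 3; ++i)
        dst[i] = srcPositions[i];

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, agpMem);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);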

So basically, if you're already using vertex arrays, the next performance optimization will come from NV_vertex_array_range, not the CVAs. Of course CVAs won't hurt, barring the degenerate cases.

Actually, I kinda wish you could get WC memory directly from the OS rather than through the extension (to my knowledge you can't, at least on NT), so that you could use it on other cards that don't have this extension. Currently you're limited to the GeForce family, am I correct?

just my two cents,
bart

Hi!

>However, there is a substantial performance gain when I use wglAllocate’d memory for the vertex buffer using NV_vertex_array_range extension. This is no surprise, since the memory with the write-combine characteristics is perfect for this kinda thing. The trick is to fill data sequentially (otherwise it sucks) and rotate between multiple buffers because of synchronization (or use NV_fence).<

Hey, this sounds interesting… I'll look into that, although I won't be able to test it 'cause I have a Rage 128…

Thanx…

Greets, XBTC!

Even if you could allocate AGP memory directly from the OS (and you sort of can, with DirectDraw), it wouldn’t be good enough. The HW must be capable of DMA’ing vertex data and must know the size and location of the buffer in advance.

  • Matt