I’ve been playing with the GL_NV_fence and GL_NV_vertex_array_range extensions lately and have a few questions for anyone who is using or has used them.
I’m using them to render a huge terrain mesh. The array I have set up goes like this:
[X, Y, Z, nX, nY, nZ, U, V],etc…
All elements are of type float.
The array has been allocated with wglAllocateMemoryNV() as well.
As you can see, the array is 4-byte aligned (float) and the strides are a multiple of 4. According to the docs from NVIDIA, this all checks out and is legit. I draw the arrays normally (glVertexPointer, etc…) as an indexed primitive with glDrawElements(). The problem is, it seems that I actually LOSE performance doing this. Rendering was faster with just standard vertex arrays. I’ve also made use of the GL_NV_fence extension, setting a fence after drawing the array and finishing the fence before I draw the array again in the next frame. This didn’t help either.
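For anyone who wants to double-check the layout described above, here is a minimal sketch in plain C. The struct name TerrainVertex and the helper names are made up for illustration; only the [X, Y, Z, nX, nY, nZ, U, V] ordering comes from the post. It assumes a 4-byte float, which holds on the usual desktop compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical struct mirroring the [X, Y, Z, nX, nY, nZ, U, V] layout. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} TerrainVertex;

/* Byte stride between consecutive vertices in the interleaved array. */
static size_t vertex_stride(void)
{
    return sizeof(TerrainVertex);
}

/* Returns the address of one attribute of the first vertex.
 * Asserts that the attribute starts on a float boundary. */
static const unsigned char *attribute_pointer(const unsigned char *base,
                                              size_t offset)
{
    assert(offset % sizeof(float) == 0);
    return base + offset;
}

/* Returns 1 if the stride and all attribute offsets are multiples of 4. */
static int layout_is_4byte_aligned(void)
{
    return vertex_stride() % 4 == 0
        && offsetof(TerrainVertex, position) % 4 == 0
        && offsetof(TerrainVertex, normal) % 4 == 0
        && offsetof(TerrainVertex, texcoord) % 4 == 0;
}

/* With a GL context, the pointers would then be set up roughly as:
 *   glVertexPointer(3, GL_FLOAT, stride, attribute_pointer(base, position offset));
 *   glNormalPointer(GL_FLOAT, stride, attribute_pointer(base, normal offset));
 *   glTexCoordPointer(2, GL_FLOAT, stride, attribute_pointer(base, texcoord offset));
 */
```

Under that assumption the stride comes out to 32 bytes, with the normal at byte 12 and the texture coordinates at byte 24.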
So my question is, has anyone gotten similar results where the performance is actually lower? Or maybe doing this is too much overhead and performance gains won’t be seen unless I work on a larger-scale mesh? The demo from NVIDIA was doing 30 FPS on my machine at 300,000+ polys a frame; I’m doing about 8,000 (textured and lit) and getting ~60 FPS, or ~70 without the extensions.
Incidentally, I analyzed the demo program yesterday and found that it works nicely under Windows 2000 with the 6.18 drivers.
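To put the numbers quoted above side by side, here is a small arithmetic sketch in C. The function name is made up, and the inputs are the approximate figures from the post ("300,000+" is treated as a lower bound), so the results are rough estimates only.

```c
#include <assert.h>

/* Polygons submitted per second = polygons per frame * frames per second. */
static long polys_per_second(long polys_per_frame, long fps)
{
    assert(polys_per_frame >= 0 && fps >= 0);
    return polys_per_frame * fps;
}

/* Figures quoted in the post:
 *   demo:        300,000 polys/frame at 30 FPS -> 9,000,000 polys/s
 *   terrain:       8,000 polys/frame at 60 FPS ->   480,000 polys/s (with VAR)
 *   terrain:       8,000 polys/frame at 70 FPS ->   560,000 polys/s (without)
 */
```

By this rough measure the terrain pushes well under a tenth of the demo's polygon rate, which may suggest that raw vertex transfer is not what limits it.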
So the questions are:
- Where was the memory allocated? (AGP or video? I got 4 MB of video memory with the sample at 1152x864x32 on a 64 MB board.)
- Which AGP mode (1x, 2x, 4x, FastWrites) is supported by your system?
- What is the performance difference under the same conditions as in the demo (only vertex and normal pointers, no texture, with a directional or point light)?
- Have you split the buffer into sub-buffers and used the fence to get the hardware and your software running in parallel?
I’m allocating about 2 MB of video memory, not AGP. I have a GeForce2 GTS 32 MB card, 4x AGP, fast writes disabled. I have split the buffer into 2 sub-buffers, setting a fence after each one is drawn. On further investigation, I found that I do get a significant increase in performance when lighting is disabled, along with a few other features of this program. With just texturing, it ran at ~100 FPS with vertex array range and fences, versus only ~80 FPS with ordinary vertex arrays. So I guess my performance hit is coming from all the other stuff that’s going on (lighting and a few other things).
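The two-sub-buffer scheme described above can be sketched roughly as follows. This is a CPU-only simulation, not the poster's code: the VarRing and SubBuffer types and the ring_* functions are hypothetical, and the real GL_NV_fence calls (glSetFenceNV, glFinishFenceNV) appear only in comments at the points where they would go.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_SUBBUFFERS 2

typedef struct {
    size_t offset;    /* byte offset inside the VAR allocation */
    size_t size;      /* size of this sub-buffer in bytes */
    int    fence_set; /* 1 while a fence for this sub-buffer is outstanding */
} SubBuffer;

typedef struct {
    SubBuffer sub[NUM_SUBBUFFERS];
    int next;         /* index of the sub-buffer to hand out next */
    int finish_calls; /* how many times glFinishFenceNV would have been called */
} VarRing;

/* Splits total_bytes evenly across the sub-buffers. */
static void ring_init(VarRing *r, size_t total_bytes)
{
    int i;
    for (i = 0; i < NUM_SUBBUFFERS; ++i) {
        r->sub[i].size = total_bytes / NUM_SUBBUFFERS;
        r->sub[i].offset = (size_t)i * r->sub[i].size;
        r->sub[i].fence_set = 0;
    }
    r->next = 0;
    r->finish_calls = 0;
}

/* Hands out the next sub-buffer for the CPU to fill. If a fence is still
 * outstanding on it, real code would call glFinishFenceNV(fence) here so
 * the GPU is done reading before the data is overwritten. */
static SubBuffer *ring_acquire(VarRing *r)
{
    SubBuffer *s = &r->sub[r->next];
    if (s->fence_set) {
        r->finish_calls++; /* stand-in for glFinishFenceNV(fence) */
        s->fence_set = 0;
    }
    return s;
}

/* Called after the draw call that reads the current sub-buffer. Real code
 * would call glSetFenceNV(fence, GL_ALL_COMPLETED_NV) at this point. */
static void ring_submit(VarRing *r)
{
    r->sub[r->next].fence_set = 1;
    r->next = (r->next + 1) % NUM_SUBBUFFERS;
}
```

The idea is that the CPU fills one half while the GPU is still drawing from the other, and only has to finish a fence when it wraps around to a half that was already submitted.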
Glad to see you found the issue. When writing the demo, it took me a while to find the bottlenecks that kept it from maximum performance.
Turning off lighting and texgen are two good ways to determine if the T&L engine is your bottleneck. Incidentally, prefer infinite lights w/ non-local viewer for best lit performance.
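As a rough illustration of the "infinite light, non-local viewer" setup mentioned above: in standard OpenGL a light whose GL_POSITION has w = 0 is treated as directional (infinite), and GL_LIGHT_MODEL_LOCAL_VIEWER set to GL_FALSE selects the non-local viewer. The helper below is a made-up name and only builds the position vector; the GL calls are shown in comments since they need a live context.

```c
#include <assert.h>
#include <stddef.h>

/* Builds a GL_POSITION value for an infinite (directional) light.
 * dir is the direction toward the light; w = 0 marks it as infinite. */
static void make_infinite_light(const float dir[3], float out[4])
{
    assert(dir != NULL && out != NULL);
    out[0] = dir[0];
    out[1] = dir[1];
    out[2] = dir[2];
    out[3] = 0.0f; /* w = 0: directional light, not positional */
}

/* With a GL context, the result would be used roughly as:
 *   float pos[4];
 *   make_infinite_light(dir, pos);
 *   glLightfv(GL_LIGHT0, GL_POSITION, pos);
 *   glLightModeli(GL_LIGHT_MODEL_LOCAL_VIEWER, GL_FALSE);
 */
```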
Regarding performance being lower with VAR: be sure to read the caveats in the VAR/fence whitepaper. If the driver has to do any massaging of the vertex array data, having that data in VAR memory is a liability and will definitely hurt performance.
Well, I’ve tried the infinite lighting model, with no significant increase or decrease in performance. I’ve also started rendering my meshes as triangle strips rather than triangles, and this actually hurt performance, though not by much. Is my code just that unoptimized or what? Heheh. I don’t think it is. But I’ve gotta be doing something wrong. Oh well…back to my wonderful text editor…better put some coffee on while I’m at it. I’ll beat this bastard yet. =) Thanks for all your insight/information guys!
How can I tell whether the allocated memory is video memory or AGP memory? And how big is it on a GeForce3 card?
Thanks a lot!
In my experience it’s impossible to get anything out of VAR with hardware local lights. I think I discussed this with you quite some time ago, Cass. I don’t remember what was said, but I really didn’t think it was due to the lighting being the bottleneck as such. More like falling into a path that couldn’t exploit VAR appropriately.
Right - if lighting costs over a certain amount, T&L won’t be able to keep up with AGP transfer rates. Exactly where this limit exists is architecture dependent, but if you’re doing a lot of vertex processing, slow vertex transfer probably won’t slow you down.
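One way to picture the point above is a toy two-stage pipeline model, sketched here in C. It assumes vertex transfer and T&L overlap perfectly, so sustained throughput is set by the slower stage. The function names are hypothetical and any rates you plug in are illustrative, not measurements of real hardware.

```c
#include <assert.h>

/* Sustained vertices/s of a two-stage pipeline: the slower stage wins. */
static double pipeline_vertices_per_second(double transfer_rate,
                                           double tnl_rate)
{
    assert(transfer_rate > 0.0 && tnl_rate > 0.0);
    return transfer_rate < tnl_rate ? transfer_rate : tnl_rate;
}

/* Returns 1 if speeding up vertex transfer (e.g. via VAR) could raise
 * throughput in this model, i.e. transfer is the slower stage. */
static int faster_transfer_helps(double transfer_rate, double tnl_rate)
{
    return transfer_rate < tnl_rate;
}
```

In this model, once expensive lighting drops the T&L rate below the transfer rate, making transfer faster changes nothing, which is consistent with VAR only showing a gain after lighting was turned off.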