NVIDIA VBO + glDrawElements() bug or feature?

I am 100% sure that it’s perfectly possible to draw more than 64K vertices per call with glDrawElements() (with or without VBO). I’m using a GF3 with the 52.16 drivers, but I don’t recall this ever being a problem with older drivers either.

There is a hard limit of 64K vertices when using VAR on GF2-class cards, but this limit was raised to 1M vertices on the GF3 and up. It also does not exist when not using VAR.

There is also a recommended maximum index/vertex count for glDrawRangeElements(), but it is nothing more than that: a recommendation. This number is 4096 for all GeForces. For Radeon 9x00 cards, it’s 64K indices/2M vertices.
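
Those recommended limits can be queried at run time if you want to see what a particular driver reports. A minimal sketch (assuming a GL 1.2-capable driver; the enums are defined by hand in case an older gl.h lacks them):

#include <windows.h>
#include <stdio.h>
#include <GL/gl.h>

/* GL 1.2 / EXT_draw_range_elements enums; older gl.h headers may not define them */
#ifndef GL_MAX_ELEMENTS_VERTICES
#define GL_MAX_ELEMENTS_VERTICES 0x80E8
#endif
#ifndef GL_MAX_ELEMENTS_INDICES
#define GL_MAX_ELEMENTS_INDICES  0x80E9
#endif

void print_draw_range_hints(void)
{
    GLint max_verts = 0, max_inds = 0;

    /* these are hints only; exceeding them is legal, just potentially slower */
    glGetIntegerv(GL_MAX_ELEMENTS_VERTICES, &max_verts);
    glGetIntegerv(GL_MAX_ELEMENTS_INDICES,  &max_inds);

    printf("recommended max vertices per glDrawRangeElements(): %d\n", max_verts);
    printf("recommended max indices per glDrawRangeElements():  %d\n", max_inds);
}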

Your new version consistently reports between 5 and 8 MTris/sec on my machine. The numbers look more reliable, but they also look low. I still don’t trust your timing code, though. Could you explain exactly what you’re measuring to produce all these numbers? :

glDrawXXX(): 85.959003% [CPU BUSY]
SwapBuffers(): 13.900877% [IDLE]
fps: 38.000000
batches/s: 38.000000
triangle rate: 7.600000 MilTris/s

I’m particularly interested in what conclusion you think you can draw from those first two lines.

– Tom

SwapBuffers(): 13.900877% [IDLE]

VSYNC off?

Originally posted by Relic:
VSYNC off?

Yes.
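
For reference, one way to force vsync off in a GLFW-based app like this one is a swap interval of zero. A minimal sketch, not necessarily how the test app handles it, and the driver control panel can still override the setting:

#include <GL/glfw.h>

/* call after the window / GL context has been opened */
void disable_vsync(void)
{
    /* 0 = swap immediately, 1 = wait for vertical retrace */
    glfwSwapInterval(0);
}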

Your new version consistently reports between 5 and 8 MTris/sec on my machine. The numbers
look more reliable, but they also look low. I still don’t trust your timing code, though. Could you
explain exactly what you’re measuring to produce all these numbers? :

You have the source, and here are the measured loops. It's such simple, basic OpenGL usage that I really wonder if anything could be made more straightforward. And it works as expected on ATI cards.

/* timed section reported as glDrawXXX(): issuing the per-batch draw calls */
QueryPerformanceCounter((LARGE_INTEGER*)&start_time);

unsigned int count = 1;

for( j = 0; j < batches_per_frame; j++ ) {
    if( use_vbo == USE_VBO_MULTIPLE_BUFFERS ) {
        /* bind the next vertex/index buffer pair and re-point the vertex array */
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, count++);
        check_for_opengl_error();

        glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, count++);
        check_for_opengl_error();

        glVertexPointer(3, GL_FLOAT, 0, 0);
        check_for_opengl_error();
    }

    glDrawArrays(GL_TRIANGLES, 0, faces_per_batch*3);
    check_for_opengl_error();

    ret.polys_rendered += faces_per_batch;
    ret.objects_rendered++;
}

QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);

/* timed section reported as SwapBuffers(): glFinish() plus the buffer swap */
QueryPerformanceCounter((LARGE_INTEGER*)&start_time);

glFinish();
glfwSwapBuffers();

QueryPerformanceCounter((LARGE_INTEGER*)&cur_time);
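
Roughly, those tick counts turn into the reported numbers like this (a stripped-down sketch of the idea, not the exact code from the source):

#include <windows.h>

/* counter ticks -> seconds, using the counter frequency */
double ticks_to_seconds(__int64 start_time, __int64 cur_time)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);   /* ticks per second */
    return (double)(cur_time - start_time) / (double)freq.QuadPart;
}

/* triangles per second, in millions */
double mil_tris_per_second(unsigned int polys_rendered, double seconds)
{
    return ((double)polys_rendered / seconds) / 1000000.0;
}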


BTW, I am NOT using VBO with multiple buffers! You can try that simply by changing one enum parameter in the run_test() calls:

rp = run_test(GL_DRAW_ELEMENTS, USE_VBO, INDICES_UNSIGNED_INT, FILL_UPPER_HALF_OF_THE_SCREEN, 2, 1, 200000);

->

rp = run_test(GL_DRAW_ELEMENTS, USE_VBO_MULTIPLE_BUFFERS, INDICES_UNSIGNED_INT, FILL_UPPER_HALF_OF_THE_SCREEN, 2, 1, 200000);


Tom, Zengar, thanks for the responses. I think I've gotten close to the bottom of this mess.

I have put up a small web presentation with the measured relation between CPU usage by (e.g.) the game engine and the GPU triangle rate (without rasterization). The source and high-res graphs are included in the archive at the bottom of the page:

http://kickme.to/speedy1

Zengar, tri counts are low because I am not optimizing indices in the test case
(they are 012345678…) and you could be using 52.16 drivers…

E.g., on ATI (a rough index-generation sketch follows after these numbers):

indices 012 012 012 012 … give ~4-5 MilTris/s

indices 012 345 678 … give ~34 MilTris/s

indices 012 123 234 345 … give ~60 MilTris/s
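
For reference, the three patterns above can be generated like this (hypothetical helpers, not the test app's code; idx must hold 3*n_tris entries):

#include <stddef.h>

/* 012 012 012 ... : every triangle reuses the same three vertices */
void fill_indices_same(unsigned int *idx, size_t n_tris)
{
    size_t i;
    for (i = 0; i < n_tris; i++) {
        idx[3*i + 0] = 0;
        idx[3*i + 1] = 1;
        idx[3*i + 2] = 2;
    }
}

/* 012 345 678 ... : every triangle gets three brand-new vertices */
void fill_indices_unique(unsigned int *idx, size_t n_tris)
{
    size_t i;
    for (i = 0; i < n_tris; i++) {
        idx[3*i + 0] = (unsigned int)(3*i + 0);
        idx[3*i + 1] = (unsigned int)(3*i + 1);
        idx[3*i + 2] = (unsigned int)(3*i + 2);
    }
}

/* 012 123 234 ... : each triangle shares two vertices with the previous one,
   which is what lets the post-T&L vertex cache pay off */
void fill_indices_overlapping(unsigned int *idx, size_t n_tris)
{
    size_t i;
    for (i = 0; i < n_tris; i++) {
        idx[3*i + 0] = (unsigned int)(i + 0);
        idx[3*i + 1] = (unsigned int)(i + 1);
        idx[3*i + 2] = (unsigned int)(i + 2);
    }
}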

You can download the new OpenGL test app v4 from the URL below and test this for yourself:
http://galileo.spaceports.com/~speedy1/OpenGL%20test%20app%20v4.rar

new features:

  • supports engine CPU usage simulation
  • erroneous serializing glFinish() removed

PITFALLS encountered:

  • Detonators 45.23 do not support primitive counts > 65536 when using VBO
    (a possible workaround is sketched after this list)
  • ForceWare 52.16 has weak CPU/GPU decoupling, or I have not found a way to exploit it
  • DO NOT USE glFinish() before SwapBuffers(); let the ICD driver be smart and find the way
    on its own
  • index optimizations targeting the T&L pre- and post-transform buffers/caches can be of
    MAJOR significance for triangle rates
  • ForceWare 52.16 has a fully asynchronous SwapBuffers()
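
For anyone hitting the first pitfall, one possible workaround (a rough sketch of my own, not something the test app does, and not verified against 45.23) is to split a large indexed draw into chunks of at most 65536 primitives:

#include <windows.h>
#include <GL/gl.h>

/* draws total_tris triangles from the GL_UNSIGNED_INT index data in the
   currently bound element array buffer, at most 65536 triangles per call */
void draw_elements_chunked(GLsizei total_tris)
{
    GLsizei done = 0;
    while (done < total_tris) {
        GLsizei chunk = total_tris - done;
        if (chunk > 65536)
            chunk = 65536;

        /* with a bound element buffer, the last argument is a byte offset */
        glDrawElements(GL_TRIANGLES, chunk * 3, GL_UNSIGNED_INT,
                       (const GLvoid *)(done * 3 * sizeof(GLuint)));
        done += chunk;
    }
}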

Anyone from NVIDIA, please shed some light on these 52.16 issues?!? Thanks in advance.