I’m trying to push my 3D engine to get the maximum number of polygons per second, and I would like to know whether I need to optimize a bit more or not. My engine is optimized for the GeForce family of cards and uses the NV_vertex_array_range extension. I can draw 2 million polygons per second with a mesh made of 16384 triangles (a plain triangle list, not triangle strips), with a 256x256, 32-bit texture, trilinear filtering, one light and smooth shading. My configuration is a Celeron 400, 256 MB of RAM and a GeForce2 MX with 32 MB… can I do better?
2 million sounds a little low to me.
Make sure v-sync is off.
To best test polygon throughput, switch off anything not directly related to triangle rendering, e.g. AI, physics, collision detection, frustum culling, fog, etc.
You’re not rendering that many triangles and your frame rate is over 120. When benchmarking, it’s better to render your object multiple times; otherwise, operations carried out once per frame, such as clearing the z-buffer and the screen, may affect your results.
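The arithmetic behind the 120 fps remark: 2,000,000 tri/s divided by 16,384 triangles per frame is about 122 frames per second, so fixed per-frame costs are paid very often. Below is a minimal C sketch of how a benchmark could report throughput when one mesh is drawn several times per frame. The function name and parameters are hypothetical, and the frame count and elapsed time are assumed to come from whatever timer the engine already uses.

```c
/* Hypothetical benchmark helper: triangles per second when a mesh of
   `tris_per_mesh` triangles is drawn `copies` times per frame for `frames`
   frames over `elapsed_seconds`. Returns 0 for a non-positive duration. */
double triangles_per_second(long tris_per_mesh, int copies,
                            long frames, double elapsed_seconds)
{
    if (elapsed_seconds <= 0.0)
        return 0.0;
    return (double)tris_per_mesh * copies * frames / elapsed_seconds;
}
```

For example, the original poster’s three 16,384-triangle teapots at 40 fps give triangles_per_second(16384, 3, 40, 1.0) = 1,966,080, the same ~2M tri/s figure. Two 150,000-polygon objects at 50 fps give 15M by the same formula.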
The trick is to make sure it is the geometry that is your bottleneck and not anything else such as fill rate.
I would imagine not using triangle strips is slowing it down.
The NvTriStrip library may help you: http://www.nvidia.com/Marketing/Developer/DevRel.nsf/ProgrammingResourcesFrame?OpenPage
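To see why strips help, compare index counts. An independent triangle list (GL_TRIANGLES) sends three indices per triangle, while a single strip (GL_TRIANGLE_STRIP) of n triangles needs only n + 2. The C sketch below covers only the easy case: one row of a regular grid whose vertices are stored row-major. Stripifying an arbitrary mesh such as a 3DS import is the harder problem that a library like NvTriStrip addresses. The function names are hypothetical.

```c
#include <stddef.h>

/* Indices needed to draw `tri_count` triangles as an independent list. */
size_t list_index_count(size_t tri_count) { return 3 * tri_count; }

/* Indices for one continuous strip: 3 for the first triangle, then 1 each. */
size_t strip_index_count(size_t tri_count)
{
    return tri_count == 0 ? 0 : tri_count + 2;
}

/* Write strip indices for row `row` of a grid that is `quads_wide` quads
   across, with (quads_wide + 1) vertices per row stored row-major.
   Writes 2 * (quads_wide + 1) indices, giving 2 * quads_wide triangles.
   Returns the number of indices written. */
size_t grid_row_strip(unsigned short *out, unsigned row, unsigned quads_wide)
{
    unsigned verts_per_row = quads_wide + 1;
    size_t n = 0;
    for (unsigned c = 0; c <= quads_wide; ++c) {
        out[n++] = (unsigned short)(row * verts_per_row + c);       /* upper edge */
        out[n++] = (unsigned short)((row + 1) * verts_per_row + c); /* lower edge */
    }
    return n;
}
```

For the original 16,384-triangle mesh, list_index_count gives 49,152 indices, whereas strip_index_count gives 16,386 in the ideal case of one continuous strip. Real meshes usually need several strips, or degenerate triangles to join them, so the actual saving is smaller.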
Is it a single mesh with 16k triangles, and that’s all you’re drawing? Are you drawing more than one copy of the mesh per frame? Are you running heavy physics, simulation, decompression or networking code, too?
2 Mtri/s seems low, but if you’re only trying to draw 16k triangles per frame, then it’s about on target. Try making your scene more complex if that’s the case. As long as the number of driver calls stays the same (each call just works on more triangles), you’ll probably see your tri/s number go up.
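The claim that tri/s rises with scene complexity while the number of driver calls stays fixed can be illustrated with a toy cost model: every draw call pays a fixed overhead and every triangle pays a fixed cost. Both costs below are made-up placeholders, not measurements of any driver or card. The point is only the shape of the curve.

```c
/* Toy cost model (an assumption, not measured data): each draw call costs
   `call_overhead_s` seconds and each triangle costs `per_tri_s` seconds.
   Returns the modeled triangles per second, or 0 if nothing is drawn. */
double modeled_tri_rate(long calls, long tris_per_call,
                        double call_overhead_s, double per_tri_s)
{
    double tris = (double)calls * tris_per_call;
    double frame_time = calls * call_overhead_s + tris * per_tri_s;
    return frame_time > 0.0 ? tris / frame_time : 0.0;
}
```

With a hypothetical 0.1 ms per call and 0.1 µs per triangle, 10 calls of 16,384 triangles model to about 9.4M tri/s, and 10 calls of 65,536 triangles model to about 9.8M tri/s. The rate approaches 1 / per_tri_s but never exceeds it.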
My v-sync is off, of course. I draw three teapots of 16k faces each, using a single vertex array with an index array. With 50k faces, my frame rate is around 40 fps. I am running only the drawing function, synchronization (as the light is moving), frustum culling and text printing (for the fps display). When I profiled my engine, the function that eats most of the clock cycles is the main drawing function… any idea for speeding up my engine?
Read through this board, and follow all of the good advice. Also read through the nVidia developer material, and follow their advice.
You could download and run the evaluation version of VTune (if it’s still available) which will give you a MUCH better view of where your program is spending its time than a regular profiler. Just make sure to set the profiling interval down to 0.1 ms or so.
What would be a general maximum number of triangles per scene, on anything newer than a GeForce, if you want at least 60 fps? I know what it is supposed to be able to handle, but what would be a realistic amount?
Using a single vertex array is probably going to harm performance. I believe you want to break your mesh up into chunks smaller than 64 KB.
Zed is right. OpenGL is limited to 64k elements. I don’t know what it would do if it went beyond that though.
I got 15M polys/s (50 fps, windowed) drawing two objects of 150,000 polys each, with VAR, multitexturing (512x512, 32-bit textures) and 32-bit rendering (no lighting).
When cube environment mapping and reflection texgen (very expensive) are enabled, I get 4M polygons/s.
The system specs are: P3-733, 128 MB, GeForce2 MX.
>>OpenGL is limited to 64k elements.<<
Sorry, this statement is nonsense.
Some features or extensions might impose an upper threshold on whatever elements could mean here, but that is implementation-dependent.
Is there a way to find that limit? Through a glGet() query, maybe? I’m using a global vertex array, and I would like to avoid making it a fixed size only to find out later that I can’t use all of it.
Track down the GeForce Optimization FAQ on the NVIDIA site. It will point out the better paths to take for maximum rendering speed.
Also, the 64K limit that was mentioned comes from that same FAQ. It’s not a hard limit; it’s the recommendation for maximum poly throughput: use vertex arrays of 64K (I forget whether that is bytes, triangles or vertices) for max efficiency.
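As noted above, OpenGL itself does not cap vertex arrays at 64K elements. On the glGet() question: OpenGL 1.2 added glDrawRangeElements together with GL_MAX_ELEMENTS_VERTICES and GL_MAX_ELEMENTS_INDICES, which can be read with glGetIntegerv. These are recommended batch sizes for best performance, not hard limits. One plausible reading of the 64K advice is that 16-bit indices (GL_UNSIGNED_SHORT) can address at most 65,536 vertices. The C sketch below greedily splits a triangle list into batches that each touch at most max_verts distinct vertices, so each batch can be drawn with 16-bit indices. The function and its output layout are hypothetical, and max_verts is assumed to come from such a query or from the vendor’s recommendation.

```c
#include <stdlib.h>

/* Hypothetical splitter. `idx` holds 3 * tri_count global vertex indices,
   each < vertex_count (not checked). Each batch references at most
   `max_verts` (<= 65536) distinct vertices. Outputs:
     local_idx[3 * tri_count] : per-corner index into its batch's vertex list
     batch_of_tri[tri_count]  : which batch each triangle was placed in
   Returns the number of batches, or 0 on bad input or allocation failure. */
size_t split_into_batches(const unsigned *idx, size_t tri_count,
                          size_t vertex_count, size_t max_verts,
                          unsigned short *local_idx, size_t *batch_of_tri)
{
    if (tri_count == 0 || max_verts < 3 || max_verts > 65536)
        return 0;
    long *remap = malloc(vertex_count * sizeof *remap); /* global -> local, -1 = absent */
    unsigned *used = malloc(max_verts * sizeof *used);  /* globals in current batch */
    if (!remap || !used) { free(remap); free(used); return 0; }
    for (size_t v = 0; v < vertex_count; ++v)
        remap[v] = -1;

    size_t batch = 0, used_count = 0;
    for (size_t t = 0; t < tri_count; ++t) {
        const unsigned *tri = idx + 3 * t;
        /* Count distinct corners not yet in the current batch. */
        size_t fresh = 0;
        for (int k = 0; k < 3; ++k) {
            int dup = (k > 0 && tri[k] == tri[0]) || (k > 1 && tri[k] == tri[1]);
            if (remap[tri[k]] < 0 && !dup)
                ++fresh;
        }
        if (used_count + fresh > max_verts) { /* batch full: start a new one */
            for (size_t u = 0; u < used_count; ++u)
                remap[used[u]] = -1;
            used_count = 0;
            ++batch;
        }
        for (int k = 0; k < 3; ++k) {
            if (remap[tri[k]] < 0) {          /* first use in this batch */
                remap[tri[k]] = (long)used_count;
                used[used_count++] = tri[k];
            }
            local_idx[3 * t + k] = (unsigned short)remap[tri[k]];
        }
        batch_of_tri[t] = batch;
    }
    free(remap);
    free(used);
    return batch + 1;
}
```

Within each batch, local indices are assigned in first-seen order. You can therefore rebuild the batch’s vertex list by walking its triangles and recording each global index the first time its local index appears.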
Also, don’t Celerons have really crappy caches? You could be held back by that. I avoid Celerons like the Black Death. I even remember reading somewhere that some early ones didn’t even have hardware floating point…
For what it’s worth, I’ve seen 8M polys per second on my PII 450 with an AGP 1 connection to my GeF2 using my own software (full engine running, not just a geometry dump to the card) and 13.5M running that ribbon program from nvidia on the same machine.
Another point to consider: what speed is your AGP port?
You also mentioned that you have some text on the screen. How are you doing that? I’ve seen text on an OGL screen be the bottleneck too many times.
Thanks for your answer,
bsenftner: I have a Celeron, but it has an FPU. A Celeron is just a P2 with only 128 KB of L2 cache and a 66 MHz memory bus! My AGP bus is only 2x, which may be the cause, but when I run the SphereMark benchmark (from NVIDIA) I can reach 7.6M! I looked at the source code, and I do nearly the same thing!? When I allocate 32 KB in video RAM, my fps drops, whereas it works fine for SphereMark… I must have a bug in my engine, I suppose…
There is no miracle to reaching these fabulous speeds. You must reduce the amount of work done by the CPU on each triangle to nil, and reduce the amount of data sent through the bus as much as possible. This is why display lists or static VAs (or VARs) work well, together with triangle strips. So it all comes back to the question: what calculations are you doing per triangle, and how do you send data to the hardware?
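To put rough numbers on the bus side: traffic per second is triangles per second times (vertices sent per triangle times bytes per vertex, plus index bytes per triangle). The C sketch assumes a 32-byte vertex (float position, float normal, two float texture coordinates). That is an illustrative choice, not the original poster’s actual format.

```c
/* Rough bus-traffic estimate (illustrative assumptions only).
   verts_per_tri: vertices actually transferred per triangle
                  (3 for non-indexed lists, ~0.5 for a large indexed grid)
   bytes_per_vertex: size of one vertex record
   index_bytes_per_tri: index data per triangle (0 if non-indexed,
                        3 * 2 = 6 with 16-bit indices)
   Returns bytes per second. */
double bus_bytes_per_second(double tris_per_sec, double verts_per_tri,
                            double bytes_per_vertex, double index_bytes_per_tri)
{
    return tris_per_sec * (verts_per_tri * bytes_per_vertex + index_bytes_per_tri);
}
```

Non-indexed at 2M tri/s, bus_bytes_per_second(2e6, 3.0, 32.0, 0.0) is 192 MB/s, a sizable share of AGP 2x’s nominal peak of roughly 533 MB/s. A large indexed regular grid has about one vertex per two triangles, so bus_bytes_per_second(2e6, 0.5, 32.0, 6.0) drops to 44 MB/s with 16-bit indices. Vertex data kept in video memory, such as static VAR data, does not have to cross the AGP bus every frame at all.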
I do no calculations per triangle. I just import a 3DS model (in this case a sphere made of 720 faces) and draw it 100 times. I send data via CVA or VAR, but with the latter method I cannot exceed 2 million polygons per second…
If you were doing the same as the Sphere benchmark, you would get the same performance. Period.
- Are you running your program in Debug or Release mode?
- Check your other compiler optimizations.
- Drop your Celeron and get a real PIII!
Just a comment about Celerons:
First Celerons (up to 300MHz) - same core as P2, no 2nd level cache
Celerons from 300MHz (Celeron 300A) to 533MHz - P2 core, 128K cache
Celerons from 533MHz (A) up - P3 core, 128K cache
I hope this clarifies things. Anyway, the second-generation Celerons (which is what Arath has) performed almost as well as a P2 of the same clock speed, sometimes even surpassing it. The reason was that while the P2 had a larger L2 cache, that cache ran at half the core clock, whereas the Celeron’s smaller on-die cache ran at full speed.
[This message has been edited by ET3D (edited 05-02-2001).]