Drawing speed: GL_TRIANGLES vs. GL_POLYGON

devdept · November 27, 2009, 6:00am

Hi All,

Is there any performance benefit to using GL_TRIANGLES or GL_QUADS instead of a generic GL_POLYGON with the proper number of vertices?

Thanks,

Alberto

Julien_Gouesse · November 27, 2009, 6:43am

Hi!

GL_POLYGON is extremely slow. Drawing with triangles is highly optimized on most modern graphics cards.

imported_pjcozzi · November 27, 2009, 7:26am

Julien is right; using triangles is the way to go. In particular, use indexed triangle lists to take advantage of the GPU’s vertex caches.

One reason GPUs are optimized for triangles is that the three vertices of a triangle always lie in the same plane, which is not always the case for a polygon. Also, GL_POLYGON won’t render non-convex polygons correctly, and it is removed in the core OpenGL 3.2 profile.
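As a rough illustration of replacing a convex GL_POLYGON with triangles, the sketch below fan-triangulates a polygon whose vertices are numbered 0..n-1 in drawing order and returns an index list suitable for GL_TRIANGLES. The function name `fanTriangulate` is made up for this example, and the approach only holds for convex polygons.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper: turn a convex polygon with `vertexCount` vertices
// (numbered 0..vertexCount-1 in drawing order) into a triangle index list.
// A polygon with n vertices yields n-2 triangles that all share vertex 0.
std::vector<std::uint16_t> fanTriangulate(std::uint16_t vertexCount) {
    std::vector<std::uint16_t> indices;
    if (vertexCount < 3) {
        return indices;  // not a polygon, nothing to draw
    }
    indices.reserve(3u * (vertexCount - 2u));
    for (std::uint16_t i = 1; i + 1 < vertexCount; ++i) {
        indices.push_back(0);                                  // fan centre
        indices.push_back(i);                                  // current rim vertex
        indices.push_back(static_cast<std::uint16_t>(i + 1));  // next rim vertex
    }
    return indices;
}
```

The resulting indices can be drawn with a single indexed GL_TRIANGLES call, so many polygons can be batched into one list instead of one glBegin(GL_POLYGON) block per polygon.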

Regards,
Patrick

devdept · November 27, 2009, 1:20pm

use indexed triangle lists to take advantage of the GPU’s vertex caches

Are indexed triangle lists better than display lists even if the geometry does not change?

Thanks,

Alberto

imported_pjcozzi · November 27, 2009, 3:46pm

I don’t have a solid answer to this, so let me give you some background. I’ve read that NVIDIA’s display list compiler is very good and sometimes even outperforms static VBOs. I’ve never tested this myself, so I cannot confirm it. Display lists were removed from the core 3.2 profile, though, so I no longer use them.

When rendering a large amount of static geometry, I recommend using static VBOs with indexed triangle lists that are reordered for the GPU’s vertex caches. There are several reordering algorithms; see Tom Forsyth’s algorithm and Fast Triangle Reordering for Vertex Locality and Reduced Overdraw. We have implemented the latter algorithm with good results.

I’ve also read that interleaved vertex attributes are faster than non-interleaved, and that unsigned short indices are faster than unsigned ints, but I haven’t noticed much difference in either case.
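To make the two options concrete, here is a minimal sketch of an interleaved vertex layout and of choosing the index width from the vertex count. The struct `InterleavedVertex`, its attribute set, and the helper names are assumptions for illustration, not something taken from a particular engine.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved layout: all attributes of one vertex are adjacent
// in memory, so one fetch brings in position, normal and texcoord together.
struct InterleavedVertex {
    float position[3];
    float normal[3];
    float texcoord[2];
};

// Stride and byte offsets are the values one would pass to the
// gl*Pointer calls when the buffer is interleaved.
constexpr std::size_t kStride = sizeof(InterleavedVertex);
constexpr std::size_t kNormalOffset = offsetof(InterleavedVertex, normal);
constexpr std::size_t kTexcoordOffset = offsetof(InterleavedVertex, texcoord);

// 16-bit indices can address vertices 0..65535, i.e. at most 65536 vertices.
constexpr bool fitsInUnsignedShort(std::size_t vertexCount) {
    return vertexCount <= 65536;
}

// Size of the index buffer when the narrowest usable index type is chosen.
constexpr std::size_t indexBufferBytes(std::size_t indexCount,
                                       std::size_t vertexCount) {
    return indexCount * (fitsInUnsignedShort(vertexCount)
                             ? sizeof(std::uint16_t)
                             : sizeof(std::uint32_t));
}
```

Whether either choice is measurably faster depends on the driver and hardware, so it is worth timing both variants on the target machine.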

Note that this is just for raw rendering horsepower. You can, of course, use culling, LOD, lay down z first, etc., and then render with optimized triangle lists in a static VBO.

Take care,
Patrick

Dark_Photon · November 27, 2009, 6:41pm

…I’ve read that NVIDIA’s display list compiler is very good and sometimes even outperforms static VBOs. I’ve never tested myself so I cannot confirm this.

I have. This is definitely true. And it’s not a tiny outperformance either.

Aleksandar · November 28, 2009, 5:07am

Exactly! It depends on drivers and hardware, but even with the latest NVIDIA drivers DLs can be up to two times faster than static VBOs. I have confirmed this with many test applications.

But DLs have other drawbacks, and they are deprecated, unfortunately.
I hope things will improve once driver developers have less work to do (after dropping all deprecated functionality) and can optimize VBOs a bit better, so that they reach the speed of DLs.

devdept · November 28, 2009, 5:16am

Aleksandar,

So, in practice, instead of doing:

glBegin(GL_TRIANGLES);
 // first tri
 glVertex3d();
 glVertex3d();
 glVertex3d();
 // second tri
 glVertex3d();
 glVertex3d();
 glVertex3d();
glEnd();

What shall we do to use static VBOs?

Thanks,

Alberto

Aleksandar · November 28, 2009, 5:42am

I apologize for the next question, because it cannot be considered a “beginners coding question”, but I would like to avoid starting a new thread/topic, and it is related to performance issues…

Question: Can anyone direct me to an official NVIDIA paper or an academic paper, preferably no more than a few years old, that explains in depth the strategy for rendering on real GPUs? Or, at least, some charts depicting how FPS depends on polygon count.

Reason: I have discovered a non-linear dependency between polygon count and rendering speed. For example, I can raise the number of triangles four times and the frame rendering time rises by only about a third of its value (100K triangles take 7.14 ms and 400K triangles take 10.4 ms). All triangles are distributed across about 3K VBOs of different sizes. Of course, beyond some limit, for example more than 7M triangles, the frame rate drops dramatically.
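One possible reading of these two measurements is a large fixed per-frame cost plus a small per-triangle cost. The sketch below fits that two-parameter model to the two data points quoted above; the model itself and the names `LinearCost`, `fitTwoPoints` and `predictMs` are assumptions for illustration, and two points always fit a line exactly, so this proves nothing about the real hardware.

```cpp
// Assumed cost model: frameMs = fixedMs + msPerTriangle * triangleCount.
struct LinearCost {
    double fixedMs;        // per-frame overhead independent of triangle count
    double msPerTriangle;  // incremental cost of one extra triangle
};

// Fit the model through two measurements (n1, t1) and (n2, t2), n1 != n2.
LinearCost fitTwoPoints(double n1, double t1, double n2, double t2) {
    const double slope = (t2 - t1) / (n2 - n1);
    return {t1 - slope * n1, slope};
}

// Evaluate the fitted model for a given triangle count.
double predictMs(const LinearCost& model, double triangleCount) {
    return model.fixedMs + model.msPerTriangle * triangleCount;
}
```

Calling `fitTwoPoints(100e3, 7.14, 400e3, 10.4)` gives a fixed cost of roughly 6 ms and about 0.011 ms per thousand triangles. Under this assumed model most of the frame time at 100K triangles would not be triangle processing at all, which would be consistent with overhead from issuing around 3K separate VBO draws.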

Aleksandar · November 28, 2009, 6:13am

Presuming you are using fixed functionality (and glVertex*() function calls mean exactly that…), I think the following link will help you:
http://www.opengl.org/wiki/VBO
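For the two-triangle glBegin/glEnd snippet asked about earlier, the data side of a static VBO can be sketched roughly as follows. `Vec3`, `IndexedMesh` and `buildIndexedMesh` are hypothetical names for this example. The OpenGL calls appear only as comments because they need a live GL 1.5+ context; treat them as an outline of the fixed-function path rather than a complete program.

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <vector>

using Vec3 = std::array<float, 3>;

struct IndexedMesh {
    std::vector<Vec3> vertices;          // unique positions, for GL_ARRAY_BUFFER
    std::vector<std::uint32_t> indices;  // 3 per triangle, for GL_ELEMENT_ARRAY_BUFFER
};

// Collapse a "triangle soup" (3 positions per triangle, exactly what the
// glVertex3d calls would have sent) into unique vertices plus indices.
IndexedMesh buildIndexedMesh(const std::vector<Vec3>& soup) {
    IndexedMesh mesh;
    std::map<Vec3, std::uint32_t> seen;  // position -> index of its first use
    for (const Vec3& v : soup) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            const auto index = static_cast<std::uint32_t>(mesh.vertices.size());
            it = seen.emplace(v, index).first;
            mesh.vertices.push_back(v);
        }
        mesh.indices.push_back(it->second);
    }
    return mesh;
}

// Upload once at load time:
//   GLuint vbo, ibo;
//   glGenBuffers(1, &vbo);
//   glBindBuffer(GL_ARRAY_BUFFER, vbo);
//   glBufferData(GL_ARRAY_BUFFER, mesh.vertices.size() * sizeof(Vec3),
//                mesh.vertices.data(), GL_STATIC_DRAW);
//   glGenBuffers(1, &ibo);
//   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
//   glBufferData(GL_ELEMENT_ARRAY_BUFFER,
//                mesh.indices.size() * sizeof(std::uint32_t),
//                mesh.indices.data(), GL_STATIC_DRAW);
// Draw every frame, with both buffers bound:
//   glEnableClientState(GL_VERTEX_ARRAY);
//   glVertexPointer(3, GL_FLOAT, 0, 0);
//   glDrawElements(GL_TRIANGLES, (GLsizei)mesh.indices.size(),
//                  GL_UNSIGNED_INT, 0);
```

For two triangles that share an edge, this stores four vertices and six indices instead of six full vertices, and the per-vertex glVertex3d calls disappear from the frame loop.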

devdept · November 28, 2009, 6:23am

Yes, it is exactly what I was looking for.

Thanks,

Alberto

Ketracel_White · November 29, 2009, 8:56am

That will never happen because way too much software depends on the old stuff.

Anyway, VBOs have one big problem: they require the programmer to do everything and prevent the driver from really optimizing the data. (I ran into that issue with a program that, despite all the optimizations I did, still runs faster in immediate mode.) With display lists the driver can do whatever it wants and organize the data any way it likes, so if done well it will naturally be faster.

Aleksandar · November 29, 2009, 9:42am

Who are you telling about the deprecation problems?
I’ve got a lot of “old” code too. I tore my hair out when I spent many hours getting one old application back on its feet using only GL 3.2 Core functionality, only to realize that I had lost a lot of functionality and gained no speed boost.

I know that the old functionality will stay, and I’m glad of that. But I also hope that NVIDIA/AMD will issue something like “lite” drivers with only Core functionality, where performance would be at a higher level. Then again, a proliferation of drivers would bring other problems. Who knows what the future will bring…

M_dm_n · November 29, 2009, 9:42am

Aleksandar:

I apologize for the next question, because it cannot be considered a “beginners coding question”, but I would like to avoid starting a new thread/topic, and it is related to performance issues…

Question: Can anyone direct me to an official NVIDIA paper or an academic paper, preferably no more than a few years old, that explains in depth the strategy for rendering on real GPUs? Or, at least, some charts depicting how FPS depends on polygon count.

Reason: I have discovered a non-linear dependency between polygon count and rendering speed. For example, I can raise the number of triangles four times and the frame rendering time rises by only about a third of its value (100K triangles take 7.14 ms and 400K triangles take 10.4 ms). All triangles are distributed across about 3K VBOs of different sizes. Of course, beyond some limit, for example more than 7M triangles, the frame rate drops dramatically.

I guess you won’t find such data, the reason being the multiple stages of the pipeline.

The bottleneck can be in any stage other than raw geometry processing, in which case you can increase the geometry load without any problems at all. Then you cross the critical point, geometry becomes the slowest link, and your program takes a nosedive.
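This behaviour can be pictured with a deliberately simplified model in which the stages run in parallel and the frame time equals the slowest stage. The struct `StageCosts`, the function `frameTimeMs` and any numbers fed into them are purely hypothetical; real pipelines overlap less cleanly than this.

```cpp
#include <algorithm>

// Hypothetical per-frame stage costs for a simplified pipeline.
struct StageCosts {
    double cpuSubmitMs;      // CPU/driver time spent issuing draw calls
    double vertexMsPerKTri;  // geometry cost per thousand triangles
    double fragmentMs;       // pixel shading and texturing cost
};

// Assumed model: stages overlap fully, so the slowest stage sets frame time.
double frameTimeMs(const StageCosts& costs, double thousandsOfTriangles) {
    const double geometryMs = costs.vertexMsPerKTri * thousandsOfTriangles;
    return std::max({costs.cpuSubmitMs, geometryMs, costs.fragmentMs});
}
```

With made-up costs of 6 ms submission, 0.01 ms per thousand triangles and 4 ms of fragment work, the model stays at 6 ms for both 100K and 400K triangles and only starts to climb once geometry passes 600K triangles, which is the kind of flat-then-steep curve described above.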

The best strategy is to think about general “best practices” when you design the program, but to worry about the performance of a particular stage only when it is the culprit of the overall slowdown.

Actually, everyone recommends increasing the workload of the other stages, like pixel shading/texture sizes, to get them on par.

But with unified shaders there is a whole new can of worms. The vertex stage will affect the fragment stage and so on.

http://developer.amd.com/media/gpu_assets/PerformanceTuning.pdf

M_dm_n · November 29, 2009, 9:52am

Well, you only have to optimize if geometry transfer/processing is the bottleneck.

If it is, static_draw VBO with indexed draws and cache friendly indices will probably be as fast as display list.

I don’t know how it works nowadays, but older GPUs would fetch an already processed vertex from a cache if it was shared by multiple triangles, instead of recalculating it. So you have bulk vertex data; say you want to draw triangles: you pass the index array and the driver starts to draw.

From index array (GL_TRIANGLES)
Use index 1
Use index 2
Use index 3
---- next tri ----
Use index 2 (taken from cache)
Use index 3 (taken from cache)
Use index 4
---- next tri ----
Use index 4 (taken from cache)
Use index 2 (taken from cache)
Use index 1 (taken from cache)

The cache was about 15 vertices long, so if you preprocess the data for reuse it can get really fast.

glDrawArrays doesn’t have this luxury.
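The walk-through above can be reproduced with a small simulation that counts how many indices miss a FIFO cache of a given size. The FIFO replacement policy and the name `countCacheMisses` are assumptions for this sketch; actual hardware caches may behave differently.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Count how many indices are NOT found in a FIFO cache holding up to
// `cacheSize` recently processed vertices. Each miss stands for one vertex
// that would have to be transformed again.
std::size_t countCacheMisses(const std::vector<std::uint32_t>& indices,
                             std::size_t cacheSize) {
    std::deque<std::uint32_t> fifo;
    std::size_t misses = 0;
    for (std::uint32_t index : indices) {
        if (std::find(fifo.begin(), fifo.end(), index) != fifo.end()) {
            continue;  // hit: vertex result is reused from the cache
        }
        ++misses;
        fifo.push_back(index);
        if (fifo.size() > cacheSize) {
            fifo.pop_front();  // evict the oldest entry
        }
    }
    return misses;
}
```

For the index sequence 1,2,3, 2,3,4, 4,2,1 from the listing, a 15-entry cache gives 4 misses for 3 triangles, while a cache of size 0 (comparable to non-indexed drawing) gives 9. Dividing misses by triangle count yields the average cache miss ratio that reordering algorithms try to minimize.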

Aleksandar · November 29, 2009, 3:40pm

Thank you, M/\dm/!

But those are general terms I already know. I need some reference for citation, and a starting point for my further research. I want to prove that my algorithm is good enough, and I need to measure its performance. The number of rendered primitives is relatively low, but in some cases the number of function calls can explode. I need to measure the impact of the number of function calls on the drop in frame rate, but…

There are two problems with benchmarks:

cold-start
power saving

All modern processors have power management that reduces power consumption, and with it execution speed, if the task is not challenging. So it is almost impossible to measure the real speed of a GPU using the same test on all machines. The other problem (cold start) is that the speed of a test depends on the previous task.
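A common way to reduce both effects, though not to eliminate them, is to run some untimed warm-up iterations first and then report the median of several timed runs. The sketch below shows that idea for an arbitrary callable; `medianMillis` and its parameters are hypothetical names, and it measures CPU wall-clock time only.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <vector>

// Run `work` warmupRuns times untimed (to get past cold start and let clocks
// ramp up), then time it timedRuns times and return the median in ms.
// For GPU work, `work` would need to end with glFinish() so that the timer
// covers the actual rendering and not just command submission.
template <typename Fn>
double medianMillis(Fn&& work, int warmupRuns, int timedRuns) {
    if (timedRuns <= 0) {
        return 0.0;  // nothing to measure
    }
    for (int i = 0; i < warmupRuns; ++i) {
        work();
    }
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(timedRuns));
    for (int i = 0; i < timedRuns; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // median (upper middle if even)
}
```

The median is used instead of the mean so that a few slow frames caused by clock changes or background tasks do not dominate the result.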

Probably there are thousands of other problems, but those two are currently the most important for me.

Trying to solve those problems, I have carried out many experiments to find “the raw power of the GPU” by measuring the number of triangles that can be rendered per unit of time. I discovered a non-linear dependency and a sudden drop in the “triangles per millisecond” rate when changing the size of the VBOs. That was the reason for my previous question.

I’m sorry for this long post, but … testing GPUs has been my main occupation for the last two days.

Aleksandar · November 29, 2009, 3:42pm

M/\dm/:

If it is, static_draw VBO with indexed draws and cache friendly indices will probably be as fast as display list.

Should be even faster than DLs, but unfortunately they are not.

M_dm_n · November 29, 2009, 11:46pm

Should be even faster than DLs, but unfortunately they are not.

Well, in that case I can’t tell you much. I haven’t been coding/researching OpenGL since the 1.5 days and I’m picking everything up myself right now. There’s a lot I’ve missed.

Maybe someone from the advanced forum knows the answer. NVIDIA and ATI developers used to post there.

Aleksandar · November 30, 2009, 11:30am

Thank you anyway, M/\dm/!

And I’m glad you are back to OpenGL programming again.

Ketracel_White · December 1, 2009, 3:57am

Should be even faster than DLs, but unfortunately they are not.

Why should that be faster? I can’t imagine anything being theoretically faster than having the driver create a raw list of GPU commands for a drawing operation, including vertex optimization. If implemented well, I don’t think anything could be faster than a display list.