Vertex Operations - efficiency

I've been testing performance when rendering very large quantities of vertices in OpenGL. More specifically, I've been rendering millions of triangles that are too small to actually show up on screen, using interleaved, mapped VBOs (just to see what I could do :) ). I've done everything I can think of to maximize the efficiency of the rendering.

Anyway, I'm a bit confused about the amount of computation that appears to be necessary for each vertex. Assuming that most of the GPU's processing power is going toward the millions of vertices, these vertex operations seem to take around 500-1000 FLOPs (floating-point operations) each. I suppose this isn't unusual, since vertex operations are traditionally more expensive than texture and pixel ones. But my first question is: does anyone have any insight into the breakdown of the computation required at the various steps of rendering vertices in OpenGL (per-vertex operations, primitive assembly, etc.)?

What confuses me is that, based on some work I did earlier with software rendering of 3D scenes, simply rendering vertices without any lighting or special effects shouldn't take as much as 500-1000 FLOPs per vertex. Is there any way to improve the efficiency of rendering just vertices, or does anyone understand what it is that requires so much computation?

Thanks in advance,
Alex

btw I’m using a GeForce GT 430; I know it isn’t very powerful but that shouldn’t have any large effect on efficiency

Question is, how do you know that assumption is valid?

Could you share some code to help verify that? If you are testing with pre-built display lists (or if on NVidia, VBOs+bindless) and the batches are large, that might be a good assumption. If not, there’s more doubt.

btw I’m using a GeForce GT 430; I know it isn’t very powerful but that shouldn’t have any large effect on efficiency

Based on some info Ilian Dinev posted a while back, that card's 700 MHz graphics core frequency implies 700 Mtris/sec with standard (non-tessellation) triangle rasterization. This has to do with a triangle setup limit of 1 tri/clock on newer GeForces. You might compare that against what you're getting and treat it as a maximum (for non-tessellation rendering).

Multiply that 700 by 2, as GT430 has 2 SMs.

A multiplication of a 4x4 matrix by a vec4 takes fewer than 23 cycles, and it's usually the costliest operation in a vertex shader. So you're probably being limited by triangle setup.
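For a rough sense of scale, here's a plain-Java sketch (purely illustrative; the GPU doesn't execute it like this) of that mat4 * vec4 transform. Counting the arithmetic gives 16 multiplies + 12 adds = 28 FLOPs per vertex, which is nowhere near the 500-1000 FLOPs estimated above:

// Illustrative only: the per-vertex 4x4 matrix * vec4 transform.
// m is a column-major matrix (16 floats), v is (x, y, z, w).
static float[] transformVertex(float[] m, float[] v)
{
    float[] out = new float[4];
    for (int row = 0; row < 4; row++)
    {
        // 4 multiplies + 3 adds per component -> 16 mul + 12 add = 28 FLOPs total
        out[row] = m[row]      * v[0]
                 + m[row + 4]  * v[1]
                 + m[row + 8]  * v[2]
                 + m[row + 12] * v[3];
    }
    return out;
}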

But meanwhile, the GT430 is low on memory bandwidth, so that could be another reason.

P.S. Tessellation should also be governed by that limit, I think; it's just that the GTX 580 has 16 SMs, for 772 * 16 = 12352 million tris/second, so the limitation isn't felt there.

Is it really per SM? I seem to recall reading on this Beyond3D thread (see post #7) that although Fermis have 4 rasterizers, on GeForce specifically only 1 is used for standard triangle rasterization, so you only get 1X. It's only with tessellation that all 4 are used, giving you 4X.

Did I totally misread that thread?

However, I'm no GPU internals guru. If you can point me to where you've read about the triangle setup vs. SM relation, I'd really appreciate a link so I can read it too.

Thanks!

I am not sure, actually. I only knew that with tessellation all rasterizers work, but I've forgotten how many there are and whether it's still valid for simple geometry.
Plus, I couldn't test it myself, as my Fermi card kinda died on day one of research.

700M triangles/sec would be quite an improvement. From the testing I've done I seem to be maxing out at around 84M/sec, which I suppose isn't too bad, but I still wonder if there isn't some way to increase that a few times, as it seems it should be possible.

I’ve tested with VBOs and display lists. I tried with just one VBO with all the data, and also with 130 VBOs, each with 33,000 vertices, which is my card’s number for GL12.GL_MAX_ELEMENTS_VERTICES, but it didn’t make any noticeable difference.

I'm pretty sure memory isn't the bottleneck; if it means anything, when I run it, GPU-Z reports the GPU load at 99% while the memory controller load is around 60%.

The code I’m running is just very basic render code, written with LWJGL. For one of my tests the rendering was just this:

static void drawVBOInterleavedMapped()
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);

    for (int num = 0; num < VBOS; num++)
    {
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, Handle[num]);

        // Interleaved layout: 3 position floats + 3 color floats per vertex.
        // Stride is 6 floats = 6 << 2 = 24 bytes; positions start at byte 0, colors at byte 12.
        glVertexPointer(3, GL_FLOAT, 6 << 2, 0L);
        glColorPointer(3, GL_FLOAT, 6 << 2, 3 << 2);

        glDrawArrays(GL_TRIANGLES, 0, 33000);

        glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
    }

    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_VERTEX_ARRAY);
}

The VBOs aren’t bindless, but from what I’ve seen, that doesn’t seem to make a significant difference except with very large numbers of VBOs.

Are there any other possible ways I could increase performance (hopefully I'm not doing anything stupid), or do you think this is just the best my GPU can do?

Interesting; if you provide an executable/jar, we could test it out on different GPUs for comparison.

Meanwhile, your calculations match up:
268.8 GFLOPS / (1000 FLOPs * 3 vertices) = 89 mil tris/s

89M tris/s / (130 * 11000 tris per frame) = 62 fps
(Just in case, I have to ask - vsync is disabled, right?)
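(In case it helps place that first number: 268.8 GFLOPS is presumably the GT 430's theoretical peak - 96 CUDA cores * 1.4 GHz shader clock * 2 FLOPs per clock for a multiply-add = 268.8 GFLOPS.)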

You mentioned "mapped VBOs" - are you uploading all the VBO data every frame? (And how many megabytes?)

Sorry for the delay, busy times…

Yes, vsync is disabled, and I mapped all the data using glMapBufferARB, but only once in initialization. Also, I originally tried without mapping the data and I don’t think it made much of a difference.
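In case it helps anyone reproduce this, here's a rough sketch of what I mean by the one-time setup (not my exact code - it uses the plain glBufferData upload, which performed about the same as mapping for me, and vertexData is just a placeholder name):

import static org.lwjgl.opengl.GL15.*;

// One-time VBO creation and upload, done during initialization (core GL15 entry
// points rather than the ARB ones, for brevity). vertexData[num] is a FloatBuffer
// holding x,y,z,r,g,b for each of the 33,000 vertices in that VBO.
static void createVBOs(java.nio.FloatBuffer[] vertexData)
{
    for (int num = 0; num < VBOS; num++)
    {
        Handle[num] = glGenBuffers();
        glBindBuffer(GL_ARRAY_BUFFER, Handle[num]);
        // Upload once; GL_STATIC_DRAW hints that the data won't be respecified every frame.
        glBufferData(GL_ARRAY_BUFFER, vertexData[num], GL_STATIC_DRAW);
    }
    glBindBuffer(GL_ARRAY_BUFFER, 0);
}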

So here’s my update on performance:
The only change I’ve found that increases performance at all is the use of glPrimitiveRestartIndex. For my implementation, it increased performance by nearly 30%, bringing triangles/second to over 100M. I think that’s about as good as it will get for my GPU; another 30% performance increase and I would run into problems from the memory controller load anyway.
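For anyone who wants to try the same thing, here's a rough sketch of the primitive-restart path (requires GL 3.1 or NV_primitive_restart; indexHandle and indexCount are placeholder names):

static void drawWithPrimitiveRestart()
{
    // A reserved index value tells the GPU to end the current strip and start a new one,
    // so many strips can be drawn with a single glDrawElements call.
    final int RESTART_INDEX = 0xFFFFFFFF;   // conventionally the max value of the index type
    GL11.glEnable(GL31.GL_PRIMITIVE_RESTART);
    GL31.glPrimitiveRestartIndex(RESTART_INDEX);

    // Index buffer contains strip indices with RESTART_INDEX inserted between strips.
    GL15.glBindBuffer(GL15.GL_ELEMENT_ARRAY_BUFFER, indexHandle);
    GL11.glDrawElements(GL11.GL_TRIANGLE_STRIP, indexCount, GL11.GL_UNSIGNED_INT, 0L);
}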

The executable jar sounds like a cool idea. I’ll see what I can do :slight_smile:

A few more things to add:

One is that I also tried GL_TRIANGLE_STRIP rather than GL_TRIANGLES to see the difference. GL_TRIANGLE_STRIP has the effect of rendering roughly 3 times more triangles for the same number of vertices. Here's some performance data I gathered when doing so:
GL_TRIANGLES: 81.5M tri/sec, 244.5M vertices/sec
GL_TRIANGLE_STRIP: 175.9M tri/sec, 175.9M vertices/sec

So that might give some insight into the computational cost of per-triangle operations compared to per-vertex operations (rough estimate below).
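A very rough back-of-envelope split of those numbers, assuming the per-vertex and per-triangle costs simply add up and that a long strip adds about one vertex per triangle:

GL_TRIANGLES: 3*v + t = 1 / 81.5M ≈ 12.3 ns per triangle
GL_TRIANGLE_STRIP: 1*v + t = 1 / 175.9M ≈ 5.7 ns per triangle

Subtracting gives v ≈ 3.3 ns per vertex and t ≈ 2.4 ns per triangle, so under those (big) assumptions the per-triangle work is the same order of magnitude as the per-vertex work.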

The second thing is that I made that executable jar file. It basically tests how many triangles can be rendered per second using GL_TRIANGLES, plus things like glPrimitiveRestartIndex that can improve efficiency, as long as the hardware it's running on supports them. It only works on Windows and gives an error if the Java version is really out of date, but other than that, I got it to run on a laptop with an AMD Radeon HD 6310, and it rendered around 50M triangles per second.

So the executable jar is pretty much ready, but I don't know where I would put it / attach it.

If it is not too big you can attach it here on the forum.

Well, here’s the executable jar by itself… but it won’t run without lwjgl.jar and the lwjgl.dll native, and they were too big to upload =| (lwjgl.jar is 900KB by itself)

It's not too hard to just download them, though:
Here's the lwjgl download

It's lwjgl-2.8.3.zip.

If you put the executable jar (unzipped, of course), lwjgl.jar, and lwjgl.dll (for whatever operating system) in the same folder, it should run.

Otherwise, if someone has an idea for sharing a larger file, I have the executable jar with everything packed inside of it but it’s 995KB.

270 Mtri/s on a GTX 275, which I thought was strange; it should be 600+.
But then I saw you render just 1 million tris per frame, in one draw call.

I just stuck with one million triangles in one VBO and then calculated the frame rate, to keep it simple. For my graphics card (and lower-performance ones) it didn't make any difference what size the VBOs were. I suppose doing that many draw calls per second is what's bringing down the performance there; it'd be pretty simple to just change the VBO size, if that would work. I'll put a new one here.

Otherwise, feel free to edit it.

That one crashes; but I modified the first one to do more draw calls, and the result stays at 270.

So, I guess it’s the fixed-function state that causes this.

Do you think customized shaders could improve efficiency?
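For context, replacing the fixed-function path would mean something like this minimal pass-through shader pair (just a sketch; whether it actually runs faster than fixed function here is exactly the open question):

static int createPassThroughProgram()
{
    // Minimal GLSL 1.20 shaders reproducing the fixed-function position/color path.
    String vs =
        "#version 120\n" +
        "varying vec3 col;\n" +
        "void main() {\n" +
        "    col = gl_Color.rgb;\n" +
        "    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;\n" +
        "}\n";
    String fs =
        "#version 120\n" +
        "varying vec3 col;\n" +
        "void main() { gl_FragColor = vec4(col, 1.0); }\n";

    int v = GL20.glCreateShader(GL20.GL_VERTEX_SHADER);
    GL20.glShaderSource(v, vs);
    GL20.glCompileShader(v);

    int f = GL20.glCreateShader(GL20.GL_FRAGMENT_SHADER);
    GL20.glShaderSource(f, fs);
    GL20.glCompileShader(f);

    int program = GL20.glCreateProgram();
    GL20.glAttachShader(program, v);
    GL20.glAttachShader(program, f);
    GL20.glLinkProgram(program);
    return program;   // call GL20.glUseProgram(program) before the draw loop
}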