VBO fastpaths

Hi

I just started a new program and am in the early stages of rendering.

Currently I upload interleaved position and color information into the VBO: 12 bytes of position, then 4 bytes of color, so 16 bytes per vertex.

I also upload an index array into another VBO and then render parts of the data using glDrawRangeElements.
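
In case it helps, the setup looks roughly like this (a simplified sketch, not my actual code; the names are made up):

/* Interleaved layout as described: 12 bytes position + 4 bytes color = 16 bytes. */
#include <GL/gl.h>

typedef struct {
    float         pos[3];   /* 12 bytes */
    unsigned char col[4];   /*  4 bytes */
} Vertex;                   /* 16 bytes per vertex */

GLuint vbo, ibo;

void upload(const Vertex* verts, int numVerts,
            const unsigned int* indices, int numIndices)
{
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, numVerts * sizeof(Vertex), verts, GL_STATIC_DRAW);

    glGenBuffers(1, &ibo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glBufferData(GL_ELEMENT_ARRAY_BUFFER, numIndices * sizeof(unsigned int), indices, GL_STATIC_DRAW);
}

void drawRange(GLuint first, GLuint last, GLsizei count, GLsizeiptr byteOffset)
{
    /* Interleaved: same stride for both attributes, different offsets into the VBO. */
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(3, GL_FLOAT,         sizeof(Vertex), (const GLvoid*)0);
    glColorPointer (4, GL_UNSIGNED_BYTE, sizeof(Vertex), (const GLvoid*)12);

    glDrawRangeElements(GL_TRIANGLES, first, last, count,
                        GL_UNSIGNED_INT, (const GLvoid*)byteOffset);
}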

Everything works fine, and I get around 30 million triangles per second, up to 45 million at times and usually no less than 20.

However, I have now added a normal array (an additional 12 bytes) and the speed dropped to less than 10 million tris/sec.

So I thought the vertex is no longer 16-byte aligned, and that might be the problem. To check, I tried adding two 2-component texture coordinates and no normals.

That means 12 bytes position + 4 bytes color + 8 bytes texcoord0 + 8 bytes texcoord1 = 32 bytes.
However, the speed stays at around 10 million tris/sec, sometimes even less. Nothing else changed in the code.

This is on a Radeon X1600 Mobility using the latest drivers. I would expect a much higher throughput, even with 64 or 96 bytes per vertex.

Any suggestions? What speeds do you get with which array layouts? What speed should I expect?

Thanks,
Jan.

Try using non-interleaved arrays.
I was recently wondering whether I should switch to interleaved arrays, but after some tests and thought my conclusion is that interleaved arrays don’t give any advantage.
Another point is that if you want to use vertex attrib arrays you’re forced to use non-interleaved arrays, and I guess that’s the most optimized case today.
Imagine a vertex processor that supports only one cache area - interleaved arrays will help a lot.
Now imagine a GPU that has a separate cache for each attribute - using an interleaved array would make the GPU use only one cache unit (at a time) and swap more often. Having multiple cache units allows all vertex attributes to be read into registers simultaneously.
That’s just an example of course - I’m not referring to any real GPU here, but I guess my considerations are not far from the truth and therefore worth reading.
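
If it helps, by non-interleaved I mean something along these lines (just a sketch; one VBO with the attributes packed back-to-back, though one VBO per attribute works too):

/* Non-interleaved layout: each attribute is a tightly packed array,
   stored one after another in the same VBO. Names/sizes are illustrative. */
#include <GL/gl.h>

void setupNonInterleaved(GLuint vbo, int numVerts,
                         const float* positions,       /* numVerts * 3 floats */
                         const unsigned char* colors)  /* numVerts * 4 bytes  */
{
    GLsizeiptr posBytes = numVerts * 3 * sizeof(float);
    GLsizeiptr colBytes = numVerts * 4 * sizeof(unsigned char);

    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, posBytes + colBytes, NULL, GL_STATIC_DRAW);
    glBufferSubData(GL_ARRAY_BUFFER, 0,        posBytes, positions);
    glBufferSubData(GL_ARRAY_BUFFER, posBytes, colBytes, colors);

    /* Stride 0 means tightly packed; only the start offsets differ. */
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(3, GL_FLOAT,         0, (const GLvoid*)0);
    glColorPointer (4, GL_UNSIGNED_BYTE, 0, (const GLvoid*)posBytes);
}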

My experiences are quite the opposite, and having interleaved data instead of individual arrays should use the caches better. Think cache lines.
If there is a fixed number of bytes read per fetch and you read individual vertex attributes from all over the place, the cache lines are not fully used although they have been fetched.

Yes, I also always thought interleaved data would be faster, which is why I used it.
I did a quick test, and with non-interleaved data I get a constant 20 million tris/sec, even with 64 bytes per vertex. That’s twice the throughput I get with interleaved data.

Except for 16 bytes per vertex (12 bytes position, 4 bytes color): with NON-interleaved data, throughput also reaches up to 45 million tris/sec. It seems the card just loves 16 bytes per vertex, no matter how the data is arranged in memory. This also confuses me, because I thought the interleaved case would be that fast precisely because the data is nicely aligned. With non-interleaved data the position array is not aligned, yet it doesn’t seem to be an issue.

Still, I think this is pretty slow. I read somewhere that the Radeon 9700 has a peak throughput of 325 million triangles per second (PR statistics). I also read elsewhere that someone reached 56 million tris/sec with his ROAM algorithm on his X1600.

So: I don’t use a complex shader, I don’t use texturing at all (it’s basically only transform plus a dot product for simple shading), my index array is a static VBO too, and vertex access is 100% sequential. I think I should be reaching at least twice the speed.

Any other tips?

Thanks,
Jan.

I should add a bit of information:

I have a set of about 700,000 triangles. I optimize them using an octree. When I zoom out, I render about 600,000 of the triangles in about 50 draw calls. The number of draw calls seems to be unimportant: if I cap the maximum number of vertices sent in one call, I can use hundreds of draw calls and it does not affect performance.

Throughput drops as I get close to my model and much of the data is culled away by the octree. I assume the card then simply doesn’t have enough to do; frames per second are pretty high at that point, however.

Jan.

That’s twice the throughput I get with interleaved data.

Vsync off?

Still, I think this is pretty slow. I read somewhere that the Radeon 9700 has a peak throughput of 325 million triangles per second (PR statistics). I also read elsewhere that someone reached 56 million tris/sec with his ROAM algorithm on his X1600.
But that probably wasn’t on a notebook chip?
I wouldn’t expect a notebook chip to be as fast as its desktop twin of the same name.

You could isolate your bottleneck more.
Make the triangles single-pixel sized, shrink the window, enable front and back face culling, or try without a vertex shader.
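
Something like this, for example (sketch):

/* Take rasterization out of the equation: vertices are still transformed,
   but nothing gets drawn. */
glEnable(GL_CULL_FACE);
glCullFace(GL_FRONT_AND_BACK);

/* Or shrink the render area to (almost) nothing: */
glViewport(0, 0, 1, 1);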

V-Sync is off. I disabled shading; it didn’t change anything.
The resolution is 800x600. I tried 1024x768 and 640x480, no difference either.

I enabled front and back culling, speed seems to increase a tiny bit, but not enough to be sure about it.

Another thing I did was to limit my mesh to 65,000 vertices and then use 16-bit indices. That increased speed to 30 million tris/sec (I had to render the mesh 40 times to send roughly the same number of vertices as before).

It seems that slicing up the mesh and using 16-bit indices would be worth it.
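
The batching would look roughly like this (a sketch, assuming the mesh has already been split so each batch’s 16-bit indices are relative to its own first vertex):

/* Draw a big mesh as batches of at most 64K vertices each, using 16-bit
   indices. Assumes 16-byte interleaved vertices (12 bytes position +
   4 bytes color) and per-batch index ranges in one element buffer. */
#include <GL/gl.h>

typedef struct {
    GLuint     firstVertex;      /* offset in vertices into the vertex VBO */
    GLuint     numVertices;      /* <= 65536                               */
    GLsizei    numIndices;
    GLsizeiptr indexByteOffset;  /* offset into the 16-bit index VBO       */
} Batch;

void drawBatch(const Batch* b)
{
    const GLuint stride = 16;

    /* Rebase the array pointers so the batch-local 16-bit indices start at 0. */
    glVertexPointer(3, GL_FLOAT,         stride, (const GLvoid*)(size_t)(b->firstVertex * stride));
    glColorPointer (4, GL_UNSIGNED_BYTE, stride, (const GLvoid*)(size_t)(b->firstVertex * stride + 12));

    glDrawRangeElements(GL_TRIANGLES, 0, b->numVertices - 1, b->numIndices,
                        GL_UNSIGNED_SHORT, (const GLvoid*)b->indexByteOffset);
}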

I don’t utilize the pre- and post-TnL caches right now, since all triangles use unique vertices. Most of the data is built from quads, so this would be another area for improvement. However, would you consider 30 million tris/sec under these circumstances good performance for a laptop card?
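
For the quads, the plan would simply be to index four shared corners instead of storing six unique vertices per quad, e.g. (sketch):

/* One quad as two indexed triangles: 4 shared vertices + 6 indices
   instead of 6 unique vertices, so the post-TnL cache can reuse the
   two corners on the diagonal.
   v0 --- v1
    |   /  |
    |  /   |
   v2 --- v3                                        */
static const unsigned short quadIndices[6] = {
    0, 2, 1,   /* first triangle                     */
    1, 2, 3    /* second triangle: reuses v1 and v2  */
};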

The X1600 Mobility is actually pretty fast. I can even play modern games in high detail without problems (Rainbow Six Vegas (Unreal Engine 3) runs perfectly smoothly). However, I have no numbers to compare against, so I don’t know whether my engine runs fast or slow. My gut tells me it should be possible to make it faster. Does anyone know what throughput other engines reach? Or how many triangles current games render per frame (at interactive frame rates)? I am rendering 660K triangles at 35 fps.

Jan.

My experiences are quite the opposite
Can you tell us which GPUs you have tried?
I only tested on a GeForce 7800 GT. Perhaps these are designed with non-interleaved arrays in mind. Perhaps :)

and having interleaved data instead of individual arrays should use the caches better. Think cache lines.
True if there’s one cache for all attributes.

Note that the GeForce 7800 supports at most 16 vertex attributes. I guess it’s not a problem to have 16 small separate caches (each optimized for 4*float) for vertex attribs - GPUs are targeting maximum parallelism, right?
In that case the performance loss you mentioned would be fully covered by fetching all vertex attribs at once. That’s what I meant. It also seems a more flexible solution than a predefined set of vertex array formats, so I guess that’s the direction GPU vendors are heading.

Also, there would be almost no penalty when fetching data from memory into 16 small caches compared to 1 large cache. Of course, reading 1024x1 byte will be slower than 1x1024 bytes because the memory can have some kind of automatic address incrementing, but when comparing 16x64 vs. 1x1024 you only get 15 address prediction penalties per 1024 bytes.

But, yeah - this is just my speculation and my personal explanation of why my test on the GeForce 7800 indicated that interleaved arrays introduce no benefit on this particular GPU when performing this particular test :)
My test was a grid of 4x16x16 very small cubes (no vertex sharing, so 24K vertices) - each one taking just a few pixels on the screen. No clipping (the entire grid of cubes was inside the frustum at all times) - the speed difference was somewhere around the typical measurement error. With larger polygons and complex shaders it would simply fade away.

Just curious: is your VBO data static, dynamic, or streamed? If it’s not static, I’d suggest trying it for comparison at least.
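
(For reference, I mean the usage hint passed to glBufferData; a sketch of the three options:)

/* The driver may place and treat the buffer differently depending on the hint. */
glBufferData(GL_ARRAY_BUFFER, size, data, GL_STATIC_DRAW);   /* uploaded once, drawn many times       */
glBufferData(GL_ARRAY_BUFFER, size, data, GL_DYNAMIC_DRAW);  /* modified repeatedly, drawn many times */
glBufferData(GL_ARRAY_BUFFER, size, data, GL_STREAM_DRAW);   /* re-specified roughly every frame      */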

I’m also wondering what impact, if any, interleaved vs. non has on the vertex post-transformation cache. It’s an entirely different kind of cache, but I’m wondering if it’s also affected by different vertex formats, e.g., is the cache size constant and the number of vertices variable depending on size, or are the vertices cached with dummy values for non-bound data types, or are there fast-paths for certain arrangements?

I don’t know. But it’s worth checking raw triangle optimization to see if you’re getting the most out of your vertex cache as well.

BTW, I’m also generally switching over to non-interleaved data, though I’m using a system that leaves it flexible under a uniform API for both. My theory, as yet unbenchmarked, is that it will be variable enough that I’ll want to pick some layout at runtime based on the individual HW.

The data is static and so is the VBO. It is also bound to the pipeline only once, at program startup; I don’t unbind and rebind any buffers currently (though that will change in the future).

Static here, too.

Okay. I wish I had time to try some benchmarks with you. This is a big design issue for me too, but I’m mired in other stuff right now.

The thing I’d try next, in case you also want to give it a shot, is to test indexed vs. non-indexed triangles, or at least test good indexing vs. bad.
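
The two cases to time would be roughly this (sketch; the variable names are placeholders):

/* Same geometry, drawn indexed vs. expanded so every triangle has its own
   3 vertices laid out in memory order. */
glDrawRangeElements(GL_TRIANGLES, 0, numVerts - 1, numIndices,
                    GL_UNSIGNED_INT, (const GLvoid*)0);   /* indexed     */

glDrawArrays(GL_TRIANGLES, 0, numExpandedVerts);          /* non-indexed */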

The reason is, in the non-indexed case, the data will stream through the transform stages in memory order. In the indexed case, I’d expect the post-transform cache to hide a lot of the memory hits, i.e., whenever the cache returns a reusable vertex. And when a cache miss occurs, the memory fetch may not be in as coherent an order, which may or may not matter, depending on the cache line configuration on that particular HW. So noticing different timings for these two cases may shed some light on how much the memory fetch costs (coherent vs. non-coherent, which would seem to apply to interleaved vs. non-interleaved, depending on the # of cache lines available).

The trick will be in normalizing the results. The non-indexed case may contain redundant transforms. So it should be slower regardless. I’d probably determine the performance ratio for the same geometry (indexed vs. non-indexed) to give me a rough idea of how much benefit the T&L cache is getting. Once you have this ratio, you could apply it to four combinations of indexed/non and interleaved/non to hopefully see some impact.

To test “bad” indexing instead, you could probably just design a pattern for your indexed case that really pushes memory fetching outside of normal bounds, both in terms of raw mem and the T&L cache. My first stab at that pattern would be something that jumps randomly around the VBO (3 verts at a time to make good triangles) with no-reuse between triangles.
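
That pattern could be generated with something like this (sketch; a random permutation would guarantee zero reuse, plain rand() just makes it very unlikely):

/* Deliberately cache-hostile indices: jump to a random spot in the VBO for
   every triangle, keeping the 3 verts of each triangle together. */
#include <stdlib.h>

void buildWorstCaseIndices(unsigned int* out, int numTris, int numVerts)
{
    int t;
    for (t = 0; t < numTris; ++t) {
        unsigned int base = (unsigned int)(rand() % (numVerts - 2));
        out[3 * t + 0] = base;
        out[3 * t + 1] = base + 1;
        out[3 * t + 2] = base + 2;
    }
}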

BTW, you can also gauge the T&L cache’s expected use with a dummy SW cache that just tracks the N most-recently used indices and counts the number of times they’re found in the cache vs. missed. The misses are what count. And you’ll need very good cache utilization to reach the marketing numbers for your HW.
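
Something along these lines, for example (the cache depth is just a guess; real HW varies):

/* Software FIFO "post-TnL cache" simulator: walk the index buffer and count
   how often an index is already among the last N unique indices seen.
   The miss count is what costs transforms. */
#define CACHE_SIZE 16   /* assumed cache depth */

int countCacheMisses(const unsigned int* indices, int numIndices)
{
    unsigned int cache[CACHE_SIZE];
    int head = 0, used = 0, misses = 0, i, j, hit;

    for (i = 0; i < numIndices; ++i) {
        hit = 0;
        for (j = 0; j < used; ++j)
            if (cache[j] == indices[i]) { hit = 1; break; }

        if (!hit) {
            ++misses;
            cache[head] = indices[i];          /* FIFO replacement */
            head = (head + 1) % CACHE_SIZE;
            if (used < CACHE_SIZE) ++used;
        }
    }
    return misses;
}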

Also just FYI, since cubes have 8 vertices, the T&L cache reuse might be somewhat atypical: probably about 2 hits for every miss (e.g., 8 misses / 24 indices = 33% miss rate). Something "meshier" might be better to test with. I think they often use a uniform grid of small indexed triangles, or non-indexed, decent-length strips, to get the best marketing numbers.

Does that make sense? I’m also guessing like you are, so if someone with more/any HW design experience wants to chime in, that would be most helpful. This would be so much easier if the HW companies just told us the optimal input patterns for their various HW.