Using same VAO for different passes (performance)

noizex · September 26, 2012, 9:15am

Hello,

I have a question that I couldn’t find an answer to anywhere, so I decided to ask here. Let’s assume I have a typical VBO with 4 attributes: position, texcoord, normal, color. I set up a VAO that encapsulates it, setting the attrib pointers and enabling the arrays once and for all.

Now, I use this VAO to render my geometry in a typical pass, with a shader that uses all 4 attributes. This is the usual scenario.

Now, I’m thinking about passes that use the very same VAO but don’t require all 4 attributes. Let’s say an early z-pass or shadow map rendering, which could use only the position attribute.

So the ultimate question is: does the driver cleverly optimize this, seeing that the current program doesn’t use all the attributes that are enabled, and not send the unused data to the vertex shader? Or does it send all of it, even though it’s not needed? I’m curious, because if it sends everything, then maybe it’s worth optimizing by having multiple VAOs with different attributes enabled (using the same VBO), or even going as far as splitting the VBO and using one VBO per attribute instead of one interleaved VBO. I know this is the opposite of what all the sources say, because interleaved data is good. This would also require multiple VAOs: one for full rendering, one for position-only rendering, etc.

What is the common approach to this? Do people even care about such optimizations? If rendering geometry multiple times can be optimized by using only the vertex data that’s actually needed, maybe it’s worth dividing VBOs and having more than one VAO?
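
To make the "multiple VAOs over the same VBO" idea concrete, here is a rough sketch of the two attribute setups as plain data tables. It assumes all four attributes are stored as floats (the post doesn’t say), and the names Vertex, AttribDesc, full_pass and depth_pass are made up for illustration. Each table entry corresponds to one glVertexAttribPointer call plus a glEnableVertexAttribArray on that index; no GL calls are actually made here.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed interleaved layout: every attribute as floats (an assumption,
   the original post does not give the formats). */
typedef struct {
    float position[3];
    float texcoord[2];
    float normal[3];
    float color[4];
} Vertex;

/* Hypothetical description of one enabled attribute. The fields mirror the
   arguments of glVertexAttribPointer(index, size, GL_FLOAT, GL_FALSE,
   stride, (void *)offset). */
typedef struct {
    unsigned index;      /* attribute location */
    int      components; /* number of floats */
    size_t   stride;     /* bytes from one vertex to the next */
    size_t   offset;     /* byte offset of the attribute inside a vertex */
} AttribDesc;

/* VAO for the normal pass: all four attributes enabled. */
static const AttribDesc full_pass[] = {
    {0, 3, sizeof(Vertex), offsetof(Vertex, position)},
    {1, 2, sizeof(Vertex), offsetof(Vertex, texcoord)},
    {2, 3, sizeof(Vertex), offsetof(Vertex, normal)},
    {3, 4, sizeof(Vertex), offsetof(Vertex, color)},
};

/* VAO for the z-pass / shadow pass: same VBO, same stride and offset,
   but only position is enabled. */
static const AttribDesc depth_pass[] = {
    {0, 3, sizeof(Vertex), offsetof(Vertex, position)},
};

enum {
    FULL_PASS_COUNT  = sizeof full_pass / sizeof full_pass[0],
    DEPTH_PASS_COUNT = sizeof depth_pass / sizeof depth_pass[0]
};
```

Note that the position entry is identical in both tables: the second VAO changes which arrays are enabled, not how the buffer is laid out in memory.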

malexander · September 26, 2012, 11:24am

If you have a shadow shader which only has 1 input, position, then yes, the vertex shader will only fetch position. I wouldn’t think the fact that other data is interleaved in the VBO would make any performance difference, but that would be something you’d need to verify by benchmarking a packed position VBO with your interleaved one.

Many GLSL compilers will optimize out unused vertex shader inputs, but it’s easier in the shadow case if you just don’t declare the others at all.

mhagain · September 26, 2012, 4:27pm

I would expect that, since jumping to the next vertex in an interleaved array setup is just a matter of “vertexptr += stride” (to express a hardware op in C code), it’s not really going to be such a big deal. However, I’m not aware of OpenGL specifying any behaviour for this kind of setup, and doing so would be well outside the scope of OpenGL anyway, so you’re getting into implementation-dependent territory as regards performance.
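
Spelled out as a CPU-side sketch, the “vertexptr += stride” idea looks roughly like this. It is only an analogy for what the vertex fetch does, not how any particular GPU works; gather_positions is a made-up helper, and it assumes position is 3 floats.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the position (assumed to be 3 floats) of each vertex out of an
   interleaved buffer. The buffer must hold count * stride bytes. */
static void gather_positions(const unsigned char *vertices, size_t stride,
                             size_t position_offset, size_t count, float *out)
{
    const unsigned char *vertexptr = vertices; /* start of vertex 0 */
    for (size_t i = 0; i < count; ++i) {
        /* Read only the attribute of interest from the current vertex. */
        memcpy(out + 3 * i, vertexptr + position_offset, 3 * sizeof(float));
        /* Skip everything else in this vertex in one step. */
        vertexptr += stride;
    }
}
```

The cost of stepping over the unused attributes is a single addition per vertex, whatever the stride is. What the sketch does not model is the memory traffic behind each read.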

Where it may definitely trip you up is if the hardware has to shuffle any GPU memory resources around, e.g. if it has to swap out textures in order to swap in a larger VBO. If you were tight on video RAM this would hurt for sure.

Alfonse_Reinheart · September 26, 2012, 5:18pm

Well, it all rather depends on how big your vertex data is.

Hardware doesn’t read individual attributes; it reads memory cache lines and then culls the attributes of interest from those. Cache lines are probably 32-64 bytes in size, which is 8-16 floats. If your vertex data fits in a single 32-byte cache line, then one attribute from the same interleaved data will almost certainly require the same GPU read performance as 4. Remember: the bottleneck is in the memory read, not the attribute decoding.
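
A back-of-the-envelope way to check the cache-line argument is to count how many distinct lines get touched for a given stride and attribute range. The function below is a simplified model, not a description of real hardware: lines_touched is a made-up name, the buffer is assumed to start on a line boundary, and each line is counted once no matter how many vertices share it.

```c
#include <assert.h>
#include <stddef.h>

/* Count the distinct cache lines touched when reading `size` bytes at byte
   `offset` inside each of `count` vertices spaced `stride` bytes apart.
   Assumes size > 0, offset + size <= stride, and a buffer that starts on a
   cache-line boundary, so reads happen in increasing address order. */
static size_t lines_touched(size_t count, size_t stride, size_t offset,
                            size_t size, size_t line_size)
{
    size_t lines = 0;
    size_t next_unseen = 0; /* index of the first line not counted yet */
    for (size_t i = 0; i < count; ++i) {
        size_t first = (i * stride + offset) / line_size;
        size_t last  = (i * stride + offset + size - 1) / line_size;
        if (first < next_unseen)
            first = next_unseen; /* lines below this were already counted */
        if (last + 1 > first) {
            lines += last + 1 - first;
            next_unseen = last + 1;
        }
    }
    return lines;
}
```

For 64 vertices with a 32-byte stride and 32-byte lines, lines_touched(64, 32, 0, 12, 32) (position only) and lines_touched(64, 32, 0, 32, 32) (the whole vertex) both come out at 64, which matches the point above. In the same model a tightly packed position-only buffer, lines_touched(64, 12, 0, 12, 32), comes out at 24; whether that turns into a measurable gain on real hardware is something only a benchmark can tell.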

Having a 32-byte vertex size is entirely possible. Positions can be 3 floats, normals can be stored in 10/10/10/2 format in a 4-byte int, colors can be 4 bytes, and texture coordinates can be two shorts or 4 bytes. Total size: 24 bytes, so that’s plenty of room for binormals/bitangents or another set of texture coordinates or colors. In general, you should do what you can to keep your vertex data to one cache line per-vertex. It doesn’t need to be aligned to a cache line.
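
As a sketch, the 24-byte layout described above could be written as the following C struct. PackedVertex and the field names are made up for illustration, and the size check assumes the usual 4-byte float.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 24-byte interleaved vertex. */
typedef struct {
    float    position[3]; /* 12 bytes: three floats */
    uint32_t normal;      /*  4 bytes: 10/10/10/2 packed into one int */
    uint8_t  color[4];    /*  4 bytes: one byte per channel */
    uint16_t texcoord[2]; /*  4 bytes: two shorts */
} PackedVertex;

/* Fails at compile time if the compiler inserts padding. */
static_assert(sizeof(PackedVertex) == 24, "expected a 24-byte vertex");
```

That leaves 8 bytes of a 32-byte budget free, enough for a second packed 10/10/10/2 vector and another pair of short texture coordinates, for example.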

If your data is 64 bytes in size, then it is possible that pulling only one attribute will make no difference in performance. It depends on how the hardware works. Looking at ARB_vertex_attrib_binding, it is possible that some hardware may use the stride as a way of saying, “every vertex pulled from this buffer takes X bytes of memory, so always fetch X bytes.”

Then again, it’s equally possible that pulling only one attribute will make a performance difference.

mhagain · September 27, 2012, 9:00am

There’s also the point of how much you really need such a (theoretical) optimization in proportion to overall performance. If by splitting off position into its own VBO you get a few percent extra on (say) shadows, but at the expense of a few percent loss in the more general case, is it worth it? Optimizing one part of your renderer in isolation can lead to trouble like this; you need to benchmark the entire thing and make decisions like “it doesn’t matter if shadows go a little bit slower than their theoretical max because by accepting that speed hit I get a speed gain elsewhere and on balance I come out faster overall than if I had split them off” - or maybe not, depending on your target hardware.

Keeping them in the same buffer and accepting that you’re going to be throwing some extra, unused, data at the GPU can also give you better code cleanliness by e.g. being able to have a single code path for shadowed and unshadowed objects. That can make the job of debugging and adding new functionality a lot easier over time, which is quite a valuable thing for you, even if not directly visible to your end-users.