VAOs slower than glVertexAttribPointer on all implementations

I was reading the Valve/nVidia presentation on the lessons learned when Valve ported the Source game engine to OpenGL (

One thing that caught my eye is the statement that VAOs are slower than glVertexAttribPointer on all implementations, with a recommendation to skip them (page 57). So the question is, why are they slower?

If VAOs are indeed slower, what is the preferable way to avoid using them on GL3 core, where there is no default VAO? Maybe create a single one, bind it once and use it all the time? Will this be efficient?


PS: The presentation is a very interesting read, by the way.

That’s the way some of us already use VAO. :wink:

AFAIK a VAO cannot beat a single (or several) glVertexAttribPointer calls. But if you have a dozen or several dozen glVertexAttribPointer calls, then a VAO should be the better solution.
It would be nice to see some concrete numbers from real-world applications comparing execution time with and without VAO.

Considering that the same paper had this chestnut:

I would consider most of what’s in that paper to be suspect on that basis alone. Not necessarily wrong. Just suspect.

Also, the NVIDIA bias is really showing (what with the heavy shilling of DSA and all).

I was wondering how you would even manage to implement vaos slower than the “manual setup”? Wouldn’t the most naive implementation be to just “replay” the same command sequence every time the vao is bound?

I was wondering how you would even manage to implement vaos slower than the “manual setup”?

Let’s assume for the moment that all hardware works exactly like D3D does. That is, vertex formats are fixed (ie: changing them is costly), but the source data for those formats is not fixed (changing them is cheap). Given that, how would you implement glVertexAttribPointer?

This function has to be able to change the vertex format and the source data. But a lot of people reuse the same format. Therefore, it would be reasonable to use a simple hashing method (i.e. hash the format parameters passed to glVertexAttribPointer) to check whether they’re changing the format. If that attribute’s format is the same in this new call, then don’t change the internal format. Just change the source data.

Therefore, a series of rendering calls where you use the same vertex format, but with different source data, would perform optimally.
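To make that concrete, here is a minimal sketch of the idea. It is not how any real driver is written: FakeDriver, AttribFormat and the counters are made-up names, and a plain field comparison stands in for the hash. It only shows that repeated calls with the same format and different source data hit the expensive path once.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical driver-side model; none of these names come from a real driver.
struct AttribFormat {
    int size = 0;             // components per vertex (1..4)
    unsigned type = 0;        // stands in for a GLenum such as GL_FLOAT
    bool normalized = false;

    bool operator==(const AttribFormat& o) const {
        return size == o.size && type == o.type && normalized == o.normalized;
    }
};

struct AttribSlot {
    AttribFormat format;       // assumed costly to change
    std::uint64_t source = 0;  // assumed cheap to change
};

struct FakeDriver {
    std::array<AttribSlot, 16> slots{};
    int formatChanges = 0;  // counts trips down the expensive path
    int sourceChanges = 0;  // counts trips down the cheap path

    // Models glVertexAttribPointer: the format is only reprogrammed when it
    // differs from what the slot already holds. A real driver might compare
    // hashes instead; a direct comparison is used here for brevity.
    void vertexAttribPointer(std::size_t index, const AttribFormat& format,
                             std::uint64_t source) {
        assert(index < slots.size());
        AttribSlot& slot = slots[index];
        if (!(slot.format == format)) {
            slot.format = format;
            ++formatChanges;
        }
        slot.source = source;
        ++sourceChanges;
    }
};

// Issues `draws` pointer updates that share one format but use different
// source addresses, and reports how many format changes the model recorded.
int formatChangesForSharedFormat(int draws) {
    AttribFormat format;
    format.size = 3;
    format.type = 0x1406;  // numeric value of GL_FLOAT
    FakeDriver driver;
    for (int i = 0; i < draws; ++i) {
        driver.vertexAttribPointer(0, format, static_cast<std::uint64_t>(i) * 4096);
    }
    return driver.formatChanges;
}
```

In this model, formatChangesForSharedFormat(100) records a single format change (the first call) and 100 cheap source updates.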

Now consider binding a VAO for rendering purposes. Here, you have many attributes worth of vertex format data. Plus, thanks to OpenGL’s “bind to modify”, you can’t even be sure that they’re intending to use that format data yet until they render with it. So when you switch from one VAO to another, it’s easier to just change all the format state even if the format state didn’t actually change from one VAO bind to another.

Therefore, a series of rendering calls where you use VAOs with the same format would not perform optimally.
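A sketch of that "easy route", again purely hypothetical (FakeVao, FakeDriver and the counter are invented for illustration and say nothing about what NVIDIA or anyone else actually ships): binding a VAO re-applies the format of every enabled attribute without checking whether it changed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of driver state; every name here is made up.
struct AttribState {
    bool enabled = false;
    int size = 0;              // format: components per vertex
    unsigned type = 0;         // format: stands in for a GLenum
    std::uint64_t source = 0;  // source data address
};

struct FakeVao {
    std::array<AttribState, 16> attribs{};
};

struct FakeDriver {
    int formatChanges = 0;  // counts expensive format reprogramming

    // Models glBindVertexArray in a driver that does not compare the new
    // VAO's formats against the current ones: every enabled attribute's
    // format is applied again, changed or not.
    void bindVertexArray(const FakeVao& vao) {
        for (const AttribState& attrib : vao.attribs) {
            if (attrib.enabled) {
                ++formatChanges;
            }
        }
    }
};

// Builds a VAO whose attributes all use the same format (3 floats) but
// point at different source addresses.
FakeVao makeVao(std::size_t attribCount, std::uint64_t baseAddress) {
    FakeVao vao;
    assert(attribCount <= vao.attribs.size());
    for (std::size_t i = 0; i < attribCount; ++i) {
        vao.attribs[i].enabled = true;
        vao.attribs[i].size = 3;
        vao.attribs[i].type = 0x1406;  // numeric value of GL_FLOAT
        vao.attribs[i].source = baseAddress + i * 1024;
    }
    return vao;
}

// Binds `vaoCount` VAOs that share one format and reports how many format
// changes the model recorded.
int formatChangesForVaoBinds(int vaoCount, std::size_t attribCount) {
    std::vector<FakeVao> vaos;
    for (int i = 0; i < vaoCount; ++i) {
        vaos.push_back(makeVao(attribCount, static_cast<std::uint64_t>(i) * 65536));
    }
    FakeDriver driver;
    for (const FakeVao& vao : vaos) {
        driver.bindVertexArray(vao);
    }
    return driver.formatChanges;
}
```

Under these assumptions, 100 VAOs with 3 attributes each cost 300 format changes even though the format never actually differs between binds.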

Plus, NVIDIA doesn’t like VAOs. So they have no reason to make VAOs fast. And every reason to make VAOs slow. That way, they can create a self-fulfilling prophecy: “Look at this profiling data: VAOs are slow. Don’t use them.”

Hmm. I too would be curious as to what use cases would make VAOs slower than raw pointer and enable calls. Changing the VAO state every time perhaps? I wonder if he actually applied them properly.

His statement differs from my experience when testing VAOs. VAOs alone (for batch attr/index setup, one per static batch, with the data in VBOs) yield some speedup on NVidia (despite Alfonse’s and the Valve quote’s implication), but NV bindless (for batch attr/index setup) alone yields even more speedup. And last I checked, VAOs+bindless together were slower than bindless alone (which makes sense).

Now IINADE (I am not a driver engineer), but I suspect the reason for this is that with VAOs, you’re collecting state data in a single (likely-) contiguous state struct in the driver. That helps, and perhaps the driver can front-end load some batch setup work caching private state in the VAO, but every time you bind a VAO, you still have to go look this up from main memory among a bazillion others, taking the cache misses to make it available in the GL driver (i.e. this state data is not part of your app-side batch storage object which you’ve already pulled into cache). Whereas with bindless, you actually store exactly what the driver needs on the app side in your batch storage object (i.e. 64-bit GPU addresses) and there’s no reason to go do other mem lookups just to bind and enable your attr and index lists. Further, you can get the batch VBO data in a state where it is hot and ready to render with up-front, once, and then render with it many times.
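Roughly what I mean, as a toy model only. Nothing here is real driver or extension code: Batch, FakeDriver and the lookup counter are invented, and a hash-table lookup merely stands in for "go fetch the VAO state from somewhere else in memory". It measures nothing; it just shows where the data lives in each case.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration.
struct AddressRange {
    std::uint64_t gpuAddress;  // 64-bit GPU address, as bindless would store
    std::uint64_t length;
};

struct VaoState {              // lives inside the (fake) driver
    AddressRange vertices;
    AddressRange indices;
};

struct Batch {                 // lives in the app's own batch storage
    std::uint32_t vaoName;     // used by the VAO path
    AddressRange vertices;     // used by the bindless path
    AddressRange indices;
};

struct FakeDriver {
    std::unordered_map<std::uint32_t, VaoState> vaoTable;
    int tableLookups = 0;      // stand-in for extra memory touched per bind
    std::uint64_t lastVertexAddress = 0;

    // VAO path: the driver has to find its own state object by name.
    void drawWithVao(std::uint32_t name) {
        auto it = vaoTable.find(name);
        assert(it != vaoTable.end());
        ++tableLookups;
        lastVertexAddress = it->second.vertices.gpuAddress;
    }

    // Bindless path: the app hands over the addresses it already holds.
    void drawBindless(const Batch& batch) {
        lastVertexAddress = batch.vertices.gpuAddress;
    }
};

struct LookupCounts {
    int vaoPath;
    int bindlessPath;
};

// Draws the same batches once per path and reports driver-side lookups.
LookupCounts compareLookups(int batchCount) {
    FakeDriver driver;
    std::vector<Batch> batches;
    for (int i = 0; i < batchCount; ++i) {
        Batch batch{};
        batch.vaoName = static_cast<std::uint32_t>(i + 1);
        batch.vertices.gpuAddress = 0x10000ull * static_cast<std::uint64_t>(i + 1);
        batch.vertices.length = 4096;
        batch.indices.gpuAddress = batch.vertices.gpuAddress + 4096;
        batch.indices.length = 1024;
        VaoState state{};
        state.vertices = batch.vertices;
        state.indices = batch.indices;
        driver.vaoTable[batch.vaoName] = state;
        batches.push_back(batch);
    }

    for (const Batch& batch : batches) driver.drawWithVao(batch.vaoName);
    const int vaoLookups = driver.tableLookups;

    driver.tableLookups = 0;
    for (const Batch& batch : batches) driver.drawBindless(batch);
    return {vaoLookups, driver.tableLookups};
}
```

In the toy, the VAO path does one driver-side lookup per batch and the bindless path does none, because everything needed already sits in the Batch the app was iterating over anyway.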

There is a video with the presentation. It didn’t help clear things