Somebody told me recently that VAOs are inefficient (due to cache misses or something) and that one should use only one global VAO for maximum efficiency. This sounds dubious to me, and before I follow some random rumors I would like to ask the people with actual knowledge. Is this rumor true? Are VAOs inefficient? Or are these rumors just artifacts from the past?
Thanks in advance.
There are performance tests that show that using one VAO per object does not yield improved performance vs. setting the state manually. What we don’t have are performance tests that compare more reasonable VAO usage, by which I mean using VAOs to store the vertex formats and only changing VAOs when the format itself changes. To source from different buffers, you have to [separate your format from your buffers](https://www.opengl.org/wiki/Separate Attribute Format). That’s relatively new to OpenGL, so there’s not a lot of performance testing on using it with regard to VAOs.
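To make the "one VAO per vertex format" idea a bit more concrete, here is a minimal sketch written as a CPU-side bind counter. It makes no GL calls at all; the real calls it stands in for (`glBindVertexArray`, `glBindVertexBuffer`) appear only in comments, and `Batch`, `BindCounts`, `countBinds` and `sortedByFormat` are hypothetical names made up for this post. It says nothing about what a bind actually costs in a given driver, only how many you would issue.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical description of one draw batch. The ids are stand-ins for
// "which format VAO" and "which vertex buffer", not real GL handles.
struct Batch {
    int formatId;  // index of the VAO that stores this vertex format
    int bufferId;  // vertex buffer the batch sources its data from
};

struct BindCounts {
    int vaoBinds = 0;     // how many glBindVertexArray calls would be issued
    int bufferBinds = 0;  // how many glBindVertexBuffer calls would be issued
};

// Walk the batches in submission order and count the binds issued by a
// renderer that only switches VAO when the vertex format changes.
BindCounts countBinds(const std::vector<Batch>& batches) {
    BindCounts counts;
    int currentFormat = -1;
    int currentBuffer = -1;
    for (const Batch& b : batches) {
        if (b.formatId != currentFormat) {
            // Real code: glBindVertexArray(formatVao[b.formatId]);
            currentFormat = b.formatId;
            // The buffer binding point is part of VAO state, so after a VAO
            // switch this sketch conservatively rebinds the buffer.
            currentBuffer = -1;
            ++counts.vaoBinds;
        }
        if (b.bufferId != currentBuffer) {
            // Real code: glBindVertexBuffer(0, vbo[b.bufferId], 0, stride);
            currentBuffer = b.bufferId;
            ++counts.bufferBinds;
        }
        // Real code: glDrawArrays(...) or glDrawElements(...)
    }
    return counts;
}

// Group batches by format so each format's VAO is bound once per frame.
// stable_sort keeps the original order within each format.
std::vector<Batch> sortedByFormat(std::vector<Batch> batches) {
    std::stable_sort(batches.begin(), batches.end(),
                     [](const Batch& a, const Batch& b) {
                         return a.formatId < b.formatId;
                     });
    return batches;
}
```

With five batches submitted as (format, buffer) = (0,1), (1,2), (0,3), (1,4), (0,1), `countBinds` reports 5 VAO binds in submission order and 2 after `sortedByFormat`; the buffer binds stay at 5 either way. Whether saving those VAO switches is measurable is exactly the thing that hasn't been benchmarked much.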
As with all such questions, your best bet is to just try them and see on the GPUs that are important to you. It’s going to vary based on GPU, driver, and platform.
Years ago when I tried them on high-end NVidia GPUs on the desktop (with one VAO per batch and static batches; so setup once, reuse many) I got a decent speed-up. But I got even more of a speed-up (as much as 2X) by using NVidia bindless vertex attributes – instead of VAOs – to tell the driver directly where the data for the batch was located in its 64-bit virtual memory address space. Intuitively it makes sense why this would be more efficient. With VAOs, there’s a bunch of little VAO objects that need to be looked up and which can generate cache misses. With bindless, you basically cache (what I suspect is close to) the underlying VAO state in your application’s batch object. You also ensure that the batch data is already in a location where it’s “ready to render” on the GPU and so little or no “prep work” is required internally.
More recently we’ve tested the performance of VAOs on PowerVR GPU drivers (again one VAO per batch, static batches), and the results were underwhelming.
So I’d test the performance on the GPUs you have access to with your use case and see! It’s almost trivial to add VAO support. And please do post the results for others to benefit from.
Note that using one global VAO across all batches (with no reuse) is basically not using a VAO, so I wouldn’t expect a speed difference between these cases. But you never know – try it and see!
On OpenGL / OpenGL ES, it’s try, test, and tune per platform… Here’s hoping Vulkan command buffers reduce the amount of this iteration required. We’ll see.
It wasn’t really a practical question, more of a theoretical one. I was thinking that graphics card vendors would try to optimize VAOs since they have been added to the specs, and that as time goes by there would be no drawbacks to them, or perhaps even benefits because of the optimization, but perhaps I am overly optimistic.
Ben Supnik’s got a good article up on his blog that partially addresses VAOs. Worth a read: