After thinking about it for a while, this opens up possibilities. And serious problems.
Now, everything I said in my original post may still be valid. That is, my explanation of what I took to be the reasoning behind bindless graphics for rendering may still be correct. Indeed, I imagine it is a significant cache issue one way or another. The lock API as it currently stands may get, say, 80% of the performance of bindless.
However, there is also this potential problem: that, for whatever reason, vertex format changes cost the hardware more than changes to the buffers that feed those formats.
The examples in the bindless graphics spec suggest this is the case. But consider this.
The justification given for bindless graphics was a cache issue, not an issue with vertex formats being attached to the GPU addresses they source from. Specifically, it was the CPU's cache. So how exactly does the vertex format affect the CPU's cache?
It may be that there is simply more data. That FIFO chunk I mentioned would be smaller if you keep using the same vertex format than if you change vertex formats. Vertex format information takes up room, clearly more room than the GPU addresses that are the source of those attributes.
Cache lines these days are 64 bytes. That's just enough for 16 32-bit values: the buffer addresses, if every one of the 16 attributes comes from a different buffer. So in the worst case the addresses alone fill an entire cache line, and adding format information on top of that guarantees the format+address data will be larger than one cache line.
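To put rough numbers on that, here's a toy comparison. The per-attribute layouts below are completely made up; I have no idea what the driver actually packs into its FIFO. They only illustrate why addresses alone can fit in one cache line while format+address data cannot.

```cpp
// Toy size comparison. Both structs are invented for illustration; they do
// NOT claim to match what the driver actually puts in its FIFO.
#include <cstddef>
#include <cstdint>
#include <cstdio>

struct AttribAddressOnly {
    std::uint32_t gpuAddress;      // the 32-bit address figure used above
};

struct AttribWithFormat {
    std::uint32_t gpuAddress;
    std::uint8_t  componentCount;  // 1..4
    std::uint8_t  componentType;   // float, short, byte, ...
    std::uint8_t  normalized;
    std::uint8_t  stride;
};

int main() {
    constexpr std::size_t kAttribs = 16, kCacheLine = 64;
    std::printf("addresses only : %zu bytes\n",
                kAttribs * sizeof(AttribAddressOnly));
    std::printf("format+address : %zu bytes (cache line is %zu)\n",
                kAttribs * sizeof(AttribWithFormat), kCacheLine);
}
```

With these made-up layouts, the addresses come to exactly 64 bytes (one line) and the format+address version to 128 bytes (two lines).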
Really, I think the only way to know is to test it: write an application that completely flushes the CPU's cache, then does some rendering. Run it one way with the "common" form of bindless (one vertex format, lots of pointer changes), then again with constant format changes, once per render operation, and see which is faster. The mesh data itself isn't at issue; indeed, it's better to just render a single triangle from each of 200,000 buffer objects. And of course, cull all fragments.
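Something like the following is what I have in mind. It's only a sketch: it assumes a current GL context, that GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV and attribute array 0 are already enabled, that the buffer objects have been created, made resident via glMakeBufferResidentNV, and had their addresses queried with glGetBufferParameterui64vNV, and that rasterization is discarded so only submission cost matters. The names FlushCpuCache, gpuAddrs, and kDraws are made up for the sketch; FlushCpuCache is sketched after the next paragraph.

```cpp
#include <GL/glew.h>    // or however you load the NV entry points
#include <chrono>
#include <cstdio>
#include <vector>

extern std::vector<GLuint64EXT> gpuAddrs;  // GPU addresses of the resident VBOs
extern void FlushCpuCache();               // see the flushing sketch below
constexpr int kDraws = 200000;

// Case A: the "common" bindless form -- one vertex format, a new buffer
// address before every draw.
static void DrawPointerChanges()
{
    glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float));
    for (int i = 0; i < kDraws; ++i) {
        glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                               gpuAddrs[i % gpuAddrs.size()],
                               3 * 3 * sizeof(float));   // one triangle
        glDrawArrays(GL_TRIANGLES, 0, 3);
    }
}

// Case B: a vertex format change before every draw, same buffer address.
static void DrawFormatChanges()
{
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0,
                           gpuAddrs[0], 3 * 3 * sizeof(float));
    for (int i = 0; i < kDraws; ++i) {
        // Alternate between two formats so the driver can't elide the change.
        if (i & 1)
            glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, 3 * sizeof(float));
        else
            glVertexAttribFormatNV(0, 3, GL_SHORT, GL_TRUE,  3 * sizeof(short));
        glDrawArrays(GL_TRIANGLES, 0, 3);
    }
}

// Time one pass. Only the submission loop is timed, since the question is
// about CPU-side cost; glFinish afterwards keeps one pass's GPU work from
// bleeding into the next.
static void RunOnePass(const char* name, void (*drawCase)())
{
    FlushCpuCache();
    auto t0 = std::chrono::steady_clock::now();
    drawCase();
    auto t1 = std::chrono::steady_clock::now();
    glFinish();
    std::printf("%s: %lld us\n", name, (long long)
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count());
}
```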
Unfortunately, my knowledge of cache architecture on x86 chips is insufficient to do something that actually flushes the cache fully.
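As I understand it, the brute-force approach is to write through a scratch buffer several times larger than the last-level cache, so whatever was resident gets evicted; _mm_clflush on specific lines is the more surgical option. Here's a sketch of the brute-force version, where the LLC size is an assumption you'd want to adjust for the test machine:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Crude cache flush: write through a buffer several times the size of the
// last-level cache, evicting (most of) whatever was resident before.
void FlushCpuCache()
{
    // Assumed LLC upper bound; adjust for the machine under test.
    constexpr std::size_t kLlcBytes = 32u * 1024u * 1024u;
    static std::vector<std::uint8_t> scratch(4 * kLlcBytes);

    volatile std::uint8_t* p = scratch.data();
    for (std::size_t i = 0; i < scratch.size(); i += 64)  // one write per line
        p[i] += 1;
}
```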
Also, this won’t answer the other important question: is this an NVIDIA-only issue, or is this something that ATI implementations could use some help on too?