Vertex Cache Replacement

It occurs to me that the traditional FIFO-like post-transform vertex-cache has vanished on recent GPUs, such as Nvidia’s GF100.

However, for indexed primitives, I still observe a definite and remarkable performance advantage when exploiting temporal locality of indices and spatial locality of vertex attributes (positions, normal vectors and so forth).

Does anybody have any publicly available information on what rules you need to apply to get as much performance as possible on current GPUs, such as Nvidia GeForce 4XX or 5XX? How much locality is necessary? What are the GPU mechanisms applied on vertex attributes and primitives to gain extra performance by exploiting locality?
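For reference, the "spatial locality of vertex attributes" part of the question can be illustrated with a minimal sketch: store the vertices in the order in which the index buffer first references them, so that consecutive indices tend to fetch neighbouring memory. The function name `reorder_by_first_use` and the toy data are hypothetical, and nothing here is specific to any GPU. It only shows the reordering itself, not how much a given chip benefits from it.

```python
def reorder_by_first_use(indices, vertices):
    """Return (new_indices, new_vertices) with vertices stored in the
    order in which the index list first references them.

    Vertices that are never referenced are dropped from the result.
    """
    remap = {}          # old vertex index -> new vertex index
    new_vertices = []
    new_indices = []
    for old in indices:
        if old not in remap:
            # First time this vertex is referenced: append it to the
            # new vertex array and remember its new position.
            remap[old] = len(new_vertices)
            new_vertices.append(vertices[old])
        new_indices.append(remap[old])
    return new_indices, new_vertices


# Toy example: two triangles that reference scattered vertices.
indices = [7, 2, 5, 2, 5, 0]
vertices = [f"v{i}" for i in range(8)]
new_indices, new_vertices = reorder_by_first_use(indices, vertices)
```

After the remap, the index buffer walks the vertex array mostly front to back while still describing the same triangles.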

It occurs to me that the traditional FIFO-like post-transform vertex-cache has vanished on recent GPUs, such as Nvidia’s GF100.

No, it hasn’t. What gave you that idea?

It may not be the exact same hardware with the exact logic of previous hardware (it’s more of a traditional memory cache now), but it does the same job.

That’s exactly my question! How do these caches function, and how should primitives be ordered to get the most out of them?

On previous cards, it was documented that the vertex shader stage has a FIFO-like cache. I am pretty sure that this has gone and, as you said, been replaced by something else. The question is: what has it been replaced by? I can’t find any documentation on it.

I have also noticed that “better usage” of the post-transform vertex cache matters less on new graphics cards, but this is not documented in any article (at least I didn’t find any). That is probably the result of larger caches and a greater number of processing units working in parallel. The unified architecture also significantly changed how processing is organized on GPUs.

On the other hand, I’m not quite sure it was ever documented that the cache was always organized as a FIFO. A FIFO is easier to implement, but almost all software algorithms use an LRU mechanism in order to better suit different cache sizes. They are called cache-oblivious techniques, because they don’t depend on the cache size.
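To make the FIFO versus LRU distinction concrete, here is a minimal sketch that counts post-transform cache misses for an index list under both policies. It is a toy model under stated assumptions: a single cache of fixed size fed by one index stream. The names `count_misses` and `acmr` are hypothetical, and real GPUs need not match either policy.

```python
from collections import OrderedDict, deque


def count_misses(indices, cache_size, policy):
    """Count cache misses (vertex-shader runs) for a triangle index list."""
    misses = 0
    if policy == "fifo":
        queue = deque()      # eviction order: oldest insertion first
        resident = set()     # fast membership test
        for idx in indices:
            if idx in resident:
                continue     # hit: a FIFO does not reorder on a hit
            misses += 1
            queue.append(idx)
            resident.add(idx)
            if len(queue) > cache_size:
                resident.discard(queue.popleft())
    elif policy == "lru":
        cache = OrderedDict()  # eviction order: least recently used first
        for idx in indices:
            if idx in cache:
                cache.move_to_end(idx)  # hit: mark as most recently used
                continue
            misses += 1
            cache[idx] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)
    else:
        raise ValueError(f"unknown policy: {policy}")
    return misses


def acmr(indices, cache_size, policy):
    """Average cache miss ratio: misses per triangle (3 indices each)."""
    return count_misses(indices, cache_size, policy) / (len(indices) // 3)


# A small fan around the edge (0, 1): three triangles, cache of 3 entries.
fan = [0, 1, 2, 0, 1, 3, 0, 1, 4]
fifo_misses = count_misses(fan, 3, "fifo")   # 7 misses
lru_misses = count_misses(fan, 3, "lru")     # 5 misses
```

In this example the FIFO evicts vertices 0 and 1 even though they are reused by every triangle, while the LRU keeps them resident, so the same index order costs 7 misses under one policy and 5 under the other.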

Having hundreds of processing units in new graphics cards makes prediction much harder. I don’t want to paraphrase Mark Kilgard’s excellent remarks on the complexity of vertex distribution in modern GPUs, so see Section 7.2 of “Modern OpenGL Usage: Using Vertex Buffer Objects Well”, Mark J. Kilgard, NVIDIA Corporation, Sep. 2008.

Thanks! That is exactly what I needed. Based on some educated and pathological tests, I guessed that something like what is described in Section 7.2 actually happens on my GeForce 470.
Now I’ve got something that I can in fact cite! Great!