Post T&L cache with modern GPU features

The post-T&L cache is still a very important and useful feature of modern hardware. However, it occurs to me that very little seems to be (publicly) known about how this cache behaves when used in concert with a number of modern GPU features. So I have a couple of questions about post-T&L caches nowadays.

1: How does the post T&L cache interact with Image Load/Store and atomic counters?

I suppose this is a specification question more than an implementation one. Does the specification allow two VS invocations (in the same rendering command) that get the same vertex ID (and same instance) to come up with different results due to image load/store or atomic counters?

I looked at the memory model for image load/store. There’s no way to read something in one VS invocation that another VS invocation in the same rendering command wrote. Or more to the point, every possible means you could have to do that results in undefined behavior.

There is one exception to that: atomics. Both image load/store atomics and atomic counters (I think) seem to use a different memory model. Doing an atomic increment in each invocation, and then using the result of that increment (or rather, the result returned by atomicCounterIncrement/imageAtomicAdd) for something in the vertex shader is entirely possible. And according to the memory model, it seems to be defined behavior (though which values pair with which invocations is not). Which means that it’s possible for two vertex shader invocations that use the same vertex ID and instance to come up with different results due to the atomic counter.
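To make the concern concrete, here is a toy CPU-side model, not real GL. The function name `run_draw` and the `reuse` flag are made up for illustration, and the unbounded cache keyed by vertex ID is a deliberate simplification. Each simulated vertex shader invocation does an atomic increment and uses the returned value as its output:

```python
from itertools import count

def run_draw(indices, reuse):
    """Toy model of one draw call; returns the per-index 'VS output'.

    Each simulated invocation performs an atomic increment and outputs
    the returned value. `reuse` models an idealized, unbounded
    post-T&L cache keyed by vertex ID (a hypothetical simplification).
    """
    counter = count()          # stands in for the atomic counter
    cache = {}                 # vertex ID -> cached output
    outputs = []
    for vid in indices:
        if reuse and vid in cache:
            outputs.append(cache[vid])   # cache hit: no new invocation
            continue
        value = next(counter)            # atomicCounterIncrement-like
        cache[vid] = value
        outputs.append(value)
    return outputs

indices = [0, 1, 2, 2, 1, 3]             # two triangles sharing an edge
with_cache = run_draw(indices, reuse=True)
without_cache = run_draw(indices, reuse=False)
print(with_cache)     # [0, 1, 2, 2, 1, 3]
print(without_cache)  # [0, 1, 2, 3, 4, 5]
```

In this model, vertex IDs 1 and 2 each produce a single value when reuse happens, and two different values when it does not, which is exactly the case I am asking about.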

So what does the spec say about this? Does the implementation have to effectively opt out of the post-T&L cache if it detects that atomics contribute to outputs? What about atomics that contribute to writes via image load/store? Or does the spec just say that the number of invocations is undefined, and therefore you get undefined behavior from a VS that does things ultimately based on the number of invocations?

2: Post-T&L and tessellation. Does tessellation affect the post-T&L cache? If you have just a TES shader, is the use of the post T&L cache affected? If you have a TCS shader, does that turn off the cache? I could imagine it going either way, depending on how one makes one’s hardware. But has anyone actually benchmarked this?

Hi Alfonse,

The 4.5 core spec says in section A.5:

When a single shader type within a program accesses an atomic counter with only atomicCounterIncrement, any individual shader invocation is guaranteed to get a unique value returned.
with the corollary:
While a unique value is returned to the shader, even given the same initial state vector and buffer contents, it is not guaranteed that the same unique value will be returned for each individual invocation of a shader (for example, on any single vertex, or any single fragment). It is wholly the shader writer’s responsibility to respect this constraint.

This indicates that there is a distinction between the vertex as identified by vertex ID, etc. and vertex shader invocation. There may be more shader invocations than vertices, and each invocation of the vertex shader may fetch a new atomic counter value.

Section 7.12.1 has some more information about the relation between vertices and vertex shader invocations:

While a vertex or tessellation evaluation shader will be executed at least once for each unique vertex specified by the application (vertex shaders) or generated by the tessellation primitive generator (tessellation evaluation shaders), it may be executed more than once for implementation-dependent reasons. Additionally, if the same vertex is specified multiple times in a collection of primitives (e.g., repeating an index in DrawElements), the vertex shader might be run only once.

This clearly allows the GL implementation to reuse a vertex from the post-T&L cache instead of running the vertex shader again. However, there is no guarantee that such a cache exists, or about how it works. It should be possible to run some experiments using the GL_ARB_pipeline_statistics_query extension, though.
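As a rough idea of what such an experiment would be measuring, here is a small CPU-side simulation. It assumes a plain FIFO cache, which is only a guess at how any particular GPU behaves, and the names `count_vs_invocations` and `cache_size` are hypothetical. On real hardware, the number to compare against would be the extension's vertex shader invocation counter (GL_VERTEX_SHADER_INVOCATIONS_ARB).

```python
from collections import deque

def count_vs_invocations(indices, cache_size):
    """Count simulated VS runs for an index stream with a FIFO cache.

    A FIFO of `cache_size` entries is an assumption; real hardware may
    batch or deduplicate vertices differently. A size of 0 models
    'no reuse at all'.
    """
    fifo = deque(maxlen=cache_size)   # oldest entry drops out when full
    invocations = 0
    for vid in indices:
        if vid in fifo:
            continue                  # vertex reused, no shader run
        invocations += 1
        fifo.append(vid)
    return invocations

# A strip of 4 quads drawn as indexed triangles (10 unique vertices).
indices = []
for q in range(4):
    a, b, c, d = 2 * q, 2 * q + 1, 2 * q + 2, 2 * q + 3
    indices += [a, b, c, c, b, d]

for size in (0, 1, 16):
    print(size, count_vs_invocations(indices, size))
```

For the same 24 indices, the simulated invocation count ranges from one per index (no reuse) down to one per unique vertex (large cache), which is why anything in the shader that depends on the invocation count cannot be relied upon.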

The spec allows implementations to reuse a vertex that has already been processed (e.g. by taking it from a post-T&L cache). Implementations therefore may (and actually do) differ in how many vertex shader invocations a given draw command produces, and thus in how many times the image load/store and atomic counter operations in them are executed.

Unfortunately, there is currently no way in GL to disable this reuse (i.e. to disable the post-T&L cache).

You are correct that it depends on the hardware design as vertex reuse can be implemented in multiple ways, thus there isn’t a generic answer for your question.