I have asked this question on this forum before, but it was not presented very well, buried inside a thread on another topic.
I’m trying to see if it is faster to
a) load some data from a TBO in the vertex (in my scenario, tess control) shader and pass the values to the fragment shader as ‘flat out’, i.e. flat varyings, or
b) do nothing special in the vertex (or tess control) shader, and just load the values for every fragment in the fragment shader.
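To make the two options concrete, here is a minimal GLSL sketch of both (variable names, the samplerBuffer binding, and the index passed down are all illustrative, not my actual shaders):

```glsl
// --- Option (a): fetch once per vertex, pass as a flat varying ---
// vertex (or tess control) shader:
#version 400
uniform samplerBuffer dataTBO;
flat out vec4 perPrimData;   // 'flat' = no interpolation, just pass through

void main() {
    perPrimData = texelFetch(dataTBO, gl_VertexID);
    // ... usual gl_Position setup ...
}

// matching fragment shader:
flat in vec4 perPrimData;    // on Evergreen this read shows up as INTERP_LOAD_P0

// --- Option (b): fetch per fragment in the fragment shader ---
#version 400
uniform samplerBuffer dataTBO;
flat in int dataIndex;       // some index passed down (illustrative)

void main() {
    vec4 perPrimData = texelFetch(dataTBO, dataIndex);  // compiles to a VFETCH
    // ...
}
```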
According to the AMD Evergreen GPU reference doc (*), varyings use the LDS (Local Data Share) memory space of the GPU. LDS is said to be twice as fast as L1 cache (http://devgurus.amd.com/thread/158895)
For a) I have checked the GPU assembly generated with ShaderAnalyzer: varyings are read with the INTERP_LOAD_P0 instruction, which loads the varying value into a GPU register, so 1 register is used. INTERP_LOAD_P0 seems to be just an LDS load instruction, with no hardware interpolation.
For b) a VFETCH is issued (the result of texelFetch()), and 1 register is likewise used as the destination of the read.
As you can see, what I am worried about is the number of GPU registers being used, since high register pressure can dramatically reduce performance.
I can’t think of any reason why using varyings would use more registers - can you?
Of course, I can profile too, but sometimes it’s good to have some technical insight first.
The equivalent of the LDS in NVIDIA terminology seems to be Shared Memory, doesn’t it? And is the reason Shared Memory is so effective, compared to the general L1 cache, that it is optimized for concurrent access (I mean, that each thread can access its portion of shared memory simultaneously)?
Feel free to tell me where I could possibly be wrong here.