Running a compute shader with a large array makes other shaders with arrays faster

I’ve stumbled upon some strange behaviour while playing with compute shaders. I have a basic OIT implementation using linked lists (although my guess is that any application using a fairly large local array in a shader will exhibit the following effect). In a fragment shader, while rendering a full-screen quad, a vec4 array is filled with that pixel’s fragment data from main video memory, sorted, then alpha blended. Using the local array and sorting it is the bottleneck of the whole app. If I then run a compute shader which declares a large local array, just once, I get a speedup of ~40% in OIT for the remainder of the application’s life. Restarting the app brings back the normal speed. Sorting, or some similarly expensive operation, is necessary to notice the speedup. The compute shader I use is below. Note the “random” uniform, which is always zero but required to stop the array from being optimized away. myBigArray must be sufficiently large or the speedup is not observed.

I have a GTX 670, 313.18 drivers. Same thing happens on a 660, I’ve tried with a few other drivers too.

#version 430
layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
uniform int random;
layout(rgba8) uniform image2D someTexture;
void main()
{
    vec4 myBigArray[256];
    myBigArray[0] = vec4(1, 0, 0, 1);
    imageStore(someTexture, ivec2(0), myBigArray[random]);
}

Again, if I run the above compute shader with a single glDispatchCompute(1, 1, 1) call (even with zero bound to the image unit), other shaders using large local arrays speed up for the remainder of the application’s life. It must be a compute shader - vertex and fragment shaders do not trigger this.

I can only guess some state change is triggered by using a compute shader with a big array, enabling an optimization which then gets applied to other shaders too.

Has anyone else noticed this? Can you speculate on a cause?

Interesting observation. My speculation is that the GPU memory is being locked by the compute shader call and reused by subsequent shaders. Do you get the same improvement if the vertex/fragment shader uses a local array that is noticeably bigger than the one allocated in the compute shader?

Running the compute shader seems to give a performance boost to the OIT shader, provided the OIT shader uses a large array. If the OIT shader uses a small array (fewer than 32 elements), performance actually drops. In general, bigger compute shader arrays trigger the effect (and to different magnitudes), which in turn gives better performance to bigger arrays in the OIT shader.

With OIT using and sorting a vec4[64] or vec4[128], the compute shader’s vec4 array must have at least 113 elements (with 112, no effect is observed). Increasing the compute shader array size gives further, smaller, performance boosts.

With a compute shader declaring 113 elements and OIT using 64, a 33% boost is observed. Then if OIT uses 128 elements, a 44% boost is observed.

With OIT using sizes between 64 and 128, the array size required to trigger the initial effect differs (e.g. 100 requires 177 and 120 requires 213).

What do you mean by locking the memory?

I am just guessing, but when a shader is loaded its local memory has to be allocated, presumably using the same local memory buffers as CUDA/OpenCL. I don’t know how the driver does this, but I was thinking that maybe it can take a shortcut if a compute shader has already been run. It would seem unlikely this is by design, so coding around this “feature” may not have long-term benefits unless the driver writers notice it and can make it more consistent.

This topic was automatically closed 183 days after the last reply. New replies are no longer allowed.