Implement memoization/caching for shader

I wanted to know how to go about implementing memoization/caching for an existing fragment shader to improve power performance.

A better question might be why you think memoization/caching is useful for shaders in general. Shader invocations are very ephemeral, so it’s not clear when you would be memoizing something. And in cases where it’s useful, it’s generally pretty obvious how to go about it (compute it once, store it in a variable, use the variable many times in the invocation).
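To make the "compute it once, store it in a variable" case concrete, here is a minimal sketch of a fragment shader that fetches a texel once and reuses the local copy. The names `uTex`, `vTexCoord` and `fragColor` are placeholders assumed for illustration; they are not taken from anyone's actual shader.

```glsl
#version 330 core

// Hypothetical names: uTex, vTexCoord and fragColor are placeholders.
uniform sampler2D uTex;
in vec2 vTexCoord;
out vec4 fragColor;

void main() {
    // Fetch once and keep the result in a local variable (a register).
    vec4 texel = texture(uTex, vTexCoord);

    // Every use below reads the local copy instead of calling texture() again.
    float luma  = dot(texel.rgb, vec3(0.299, 0.587, 0.114));
    vec3 tinted = texel.rgb * vec3(1.0, 0.9, 0.8);

    fragColor = vec4(mix(tinted, vec3(luma), 0.5), texel.a);
}
```

The local variable only lives for the duration of this one invocation, which is why this is about as far as "caching" inside a fragment shader usually goes.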

Thank you for replying. If I have an edge detection fragment shader, what factors should I keep in mind?

If I want to implement a cache, should I create a new .cpp file and create objects for both the vertex shader and the fragment shader in it? I have read a few materials (research papers, blogs) and I want to implement it. I couldn’t find any code, though, so I am not sure how to start.

Are there any code samples for implementing caching/memoization in a shader using GLSL on a mobile GPU?
I could only find hardware methods online, for example:

Eliminating Redundant Fragment Shader Executions on a Mobile GPU via Hardware Memoization

@co.der, please do not cross-post. This is explicitly against the Forum Posting Guidelines.

I’ve combined your recent post to the OpenGL forum with your pre-existing thread here in OpenGL: Advanced Coding. I’ve also added links to the paper you mentioned and its presentation, for the reader’s benefit.


How costly is the texture() function in GLSL? Can the performance be improved by using a LUT?

A texture is a look-up table, just a slightly fancier one. The cost of a texture access is typically dominated by the cost of fetching data from memory.
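As a rough illustration of why a LUT does not change the picture, here is a sketch that reads a LUT stored in a texture. Everything named here is an assumption made for the example: `uLut` is a hypothetical 256x1 texture filled by the application, and `uTex`, `vTexCoord` and `fragColor` are placeholders.

```glsl
#version 330 core

uniform sampler2D uTex;  // source image (placeholder name)
uniform sampler2D uLut;  // hypothetical 256x1 texture used as a look-up table
in vec2 vTexCoord;
out vec4 fragColor;

void main() {
    // First memory read: the source texel.
    float gray = dot(texture(uTex, vTexCoord).rgb, vec3(0.299, 0.587, 0.114));

    // Turn the gray level into an integer table index in [0, 255].
    int index = int(clamp(gray, 0.0, 1.0) * 255.0 + 0.5);

    // Second memory read: the "LUT" lookup is just another texel fetch.
    fragColor = texelFetch(uLut, ivec2(index, 0), 0);
}
```

Both reads go through the same texture hardware, so the table lookup costs about what any other texture access costs.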

Of course, the circumstances around such memory accesses could change things. For example, if the same shader instance is fetching the same texture with the same coordinates, then obviously yes, it’d be cheaper to just cache the value locally (though it is also possible a compiler could detect this and do it for you, but I wouldn’t bet on it). If you have multiple invocations of the same shader accessing the same data (the light passes of a deferred rendering), it could be advantageous to fold as many of those invocations into a loop within a single invocation. But there are limits to the utility of that as well.
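A sketch of what folding the light passes into one invocation might look like: instead of drawing one full-screen pass per light, each of which re-reads the G-buffer, a single pass reads the G-buffer once and loops over the lights. The G-buffer layout, the uniform names, `MAX_LIGHTS` and the simple diffuse falloff are all assumptions made for this example, including that normals and positions are stored as signed floats.

```glsl
#version 330 core

// Hypothetical layout: names, MAX_LIGHTS and the lighting model are assumptions.
const int MAX_LIGHTS = 16;

uniform sampler2D uGBufferAlbedo;
uniform sampler2D uGBufferNormal;    // assumed to hold signed float normals
uniform sampler2D uGBufferPosition;  // assumed to hold view-space positions
uniform int  uLightCount;
uniform vec3 uLightPos[MAX_LIGHTS];
uniform vec3 uLightColor[MAX_LIGHTS];

in vec2 vTexCoord;
out vec4 fragColor;

void main() {
    // The G-buffer is read once per pixel, not once per light.
    vec3 albedo   = texture(uGBufferAlbedo, vTexCoord).rgb;
    vec3 normal   = normalize(texture(uGBufferNormal, vTexCoord).xyz);
    vec3 position = texture(uGBufferPosition, vTexCoord).xyz;

    vec3 sum = vec3(0.0);
    for (int i = 0; i < uLightCount && i < MAX_LIGHTS; ++i) {
        vec3 toLight = uLightPos[i] - position;
        float dist2  = max(dot(toLight, toLight), 1e-4);
        float ndotl  = max(dot(normal, toLight * inversesqrt(dist2)), 0.0);
        // Simple diffuse term with inverse-square falloff.
        sum += albedo * uLightColor[i] * ndotl / dist2;
    }
    fragColor = vec4(sum, 1.0);
}
```

One limit is visible in the sketch itself: every pixel now iterates over every light, including lights that cannot reach it, which is the work that per-light passes with bounding geometry avoid.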


Could you elaborate on “fold as many of those invocations into a loop within a single invocation” please?

I am implementing a LUT using the texture2D() function. I want to extend this to window memoization, i.e., neighboring pixels hash to the same entry in the LUT. How can I do this?

I have a texture function:

uniform sampler2D tex;
vec4 color;
color = texture(tex, texCoord.xy);

I want to replace texture() with LUT:

uniform sampler2D lookupTable;


  1. How can I store values in lookupTable?
  2. How can I fetch values from lookupTable?
  3. How can I retrieve MSB and LSB from the texels?

You want to replace a texture fetch with… a texture fetch (using an outdated texture function is still doing a texture fetch).

It’s not “caching” to replace a memory access with a memory access. It’s not “memoization” to replace a slow operation with the same slow operation.

Caching works by taking a slow memory access and turning it into a fast cache access. But caches are only fast because they’re not main memory; they’re stored locally.

Memoization works by doing a heavy computation once and storing the result, using the stored result later instead of doing the heavy computation. But again, that’s only faster if the heavy computation costs more than storing and reading the data.
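For an edge detection shader, the in-invocation version of this could look like the sketch below: each neighbor's luminance is computed once, kept in a local variable, and reused by both Sobel kernels. The names `uTex`, `vTexCoord`, `fragColor` and the helper `lumaAt` are assumed for the example, as is the choice of a Sobel filter.

```glsl
#version 330 core

uniform sampler2D uTex;  // placeholder name for the input image
in vec2 vTexCoord;
out vec4 fragColor;

// Hypothetical helper: luminance of the texel at a given pixel offset.
float lumaAt(vec2 offset, vec2 texelSize) {
    vec3 rgb = texture(uTex, vTexCoord + offset * texelSize).rgb;
    return dot(rgb, vec3(0.299, 0.587, 0.114));
}

void main() {
    vec2 texelSize = 1.0 / vec2(textureSize(uTex, 0));

    // Each neighbor is fetched and converted to luminance exactly once.
    float tl = lumaAt(vec2(-1.0,  1.0), texelSize);
    float t  = lumaAt(vec2( 0.0,  1.0), texelSize);
    float tr = lumaAt(vec2( 1.0,  1.0), texelSize);
    float l  = lumaAt(vec2(-1.0,  0.0), texelSize);
    float r  = lumaAt(vec2( 1.0,  0.0), texelSize);
    float bl = lumaAt(vec2(-1.0, -1.0), texelSize);
    float b  = lumaAt(vec2( 0.0, -1.0), texelSize);
    float br = lumaAt(vec2( 1.0, -1.0), texelSize);

    // Both Sobel kernels reuse the stored values instead of re-fetching.
    float gx = (tr + 2.0 * r + br) - (tl + 2.0 * l + bl);
    float gy = (tl + 2.0 * t + tr) - (bl + 2.0 * b + br);

    fragColor = vec4(vec3(length(vec2(gx, gy))), 1.0);
}
```

The stored values are plain local variables, so reading them back is essentially free compared with the eight texture fetches they replace on reuse.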

As previously stated, the thing that makes a texture fetch slow is that you’re reading from memory. You’re not going to make reading from memory faster by reading from memory.

Thank you for replying. Could you tell me which container to use for building a LUT, please? I want to store the coordinates as the key and the color as the value, and later use the key to fetch the color.

I found this code on Quora:

float inverse_f(float r)
{
    // Build a fixed-size lookup table over the radius.
    // Each entry is a vec3: x is the radius, y is the distortion f(x),
    // and z is the product x * y.
    vec3[32] lut;

    // Flame has no overflow bbox so we can safely max out at the image edge, plus some cushion
    float max_r = sqrt((adsk_input1_frameratio * adsk_input1_frameratio) + 1.0) + 1.0;
    float incr = max_r / 32.0;
    float lut_r = 0.0;
    float f;
    for(int i=0; i < 32; i++) {
        f = distortion_f(lut_r);
        lut[i] = vec3(lut_r, f, lut_r * f);
        lut_r += incr;
    }

    float df;
    float dr;
    float t;

    // Now find the neighbouring elements that bracket r.
    // Stop at 31 so that lut[i+1] stays inside the table.
    for(int i=0; i < 31; i++) {
        if(lut[i].z <= r && lut[i+1].z > r) {
            // found: interpolate f linearly between entries i and i+1
            df = lut[i+1].y - lut[i].y;
            dr = lut[i+1].z - lut[i].z;
            t = (r - lut[i].z) / dr;
            return lut[i].y + (df * t);
        }
    }

    // Assumed fallback (not in the snippet as posted): r lies outside the
    // table, so return the distortion of the last entry.
    return lut[31].y;
}

Is this a good idea efficiency wise?

No, you don’t. I know that you think you do, but you don’t. I cannot imagine a shader invocation and memory access pattern where this kind of arbitrary, unstructured coding will not end in tears (i.e., being slower than doing the obvious). Except maybe in compute shaders with shared variables, but even then, it would have to be really specific to the particular algorithm.

In what sense? Where and how are you using this?

The thing you’re not understanding is that, if you care about shader performance, you first must understand how shaders work. What shader invocations are, how they deal with memory, how multiple invocations in flight communicate, which operations are actually slow, and so on. You can’t just say “reading memory is slow, so I’ll cache it!” and declare victory. That’s a very easy way to kill your performance, depending on the specific nature of what it is you’re actually doing.

Until you have a working shader program that does what you want (and profiling to tell you that it is too slow and where it is too slow), you are not ready to attempt to figure out how to optimize it. Not unless you truly understand how GPU execution takes place. And thus far, you don’t seem to.

Please stop trying to optimize code this way.