imageLoad/imageStore VERY slow

Hello folks,

I need to write shaders that build per-pixel linked lists, sort them (one list per pixel), and draw the result. This approach is known in the literature as ‘Order-Independent Transparency’.
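For readers unfamiliar with the technique, the sort-and-draw step could look roughly like the sketch below. It is only an illustration under several assumptions that are not stated in this thread: empty pixels hold -1 in head_buffer, at most MAX_FRAGMENTS nodes are read per pixel, every fragment uses the same hypothetical ALPHA (data_buffer stores only RGB plus depth in .w), and the background is black.

```glsl
#version 420 core

// Same three images as in the build pass, opened read-only here.
layout(binding = 0, r32i)    readonly uniform iimage2D     head_buffer;
layout(binding = 1, r32i)    readonly uniform iimageBuffer next_buffer;
layout(binding = 2, rgba32f) readonly uniform imageBuffer  data_buffer;

out vec4 out_col;

const int   MAX_FRAGMENTS = 16;  // hypothetical per-pixel cap
const float ALPHA         = 0.5; // hypothetical constant opacity

void main()
{
    vec4 frags[MAX_FRAGMENTS];
    int count = 0;

    // Walk the list; assumes a negative index marks the end (or an empty pixel).
    int node = imageLoad(head_buffer, ivec2(gl_FragCoord.xy)).r;
    while (node >= 0 && count < MAX_FRAGMENTS)
    {
        frags[count] = imageLoad(data_buffer, node);
        count++;
        node = imageLoad(next_buffer, node).r;
    }

    // Insertion sort by depth (.w), farthest fragment first.
    for (int i = 1; i < count; i++)
    {
        vec4 key = frags[i];
        int j = i - 1;
        while (j >= 0 && frags[j].w < key.w)
        {
            frags[j + 1] = frags[j];
            j--;
        }
        frags[j + 1] = key;
    }

    // Blend back to front over a black background.
    vec3 color = vec3(0.0);
    for (int i = 0; i < count; i++)
        color = mix(color, frags[i].rgb, ALPHA);

    out_col = vec4(color, 1.0);
}
```

Insertion sort is used because the lists are short; it would be drawn with a full-screen quad after the build pass has finished and its image writes have been made visible.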

My fragment shader that builds a list for the current pixel is:


in vec4 gl_FragCoord;

out vec4 out_col;

uniform layout(binding=0, r32i) coherent iimage2D           head_buffer;
uniform layout(binding=1, r32i) coherent iimageBuffer       next_buffer;
uniform layout(binding=2, rgba32f) coherent imageBuffer     data_buffer;
uniform layout(binding=3, offset=0) atomic_uint             ac;

void main()
{
    int index = int(atomicCounterIncrement(ac));
    if (index >= 1024 * 768 * 16) // this is the maximal number of elements in [head/next/data]_buffer
        discard;

    int indexOld = imageAtomicExchange(head_buffer, ivec2(gl_FragCoord.xy), index);
    imageStore(next_buffer, index, ivec4(indexOld, 0, 0, 0));
    float depth = gl_FragCoord.z;
    imageStore(data_buffer, index, vec4(1.0, 1.0, 1.0, depth)); // test-wise only white pixel used for simplification

    out_col = vec4(1.0, 1.0, 1.0, 1.0);
}

It is actually a rather simple shader, but the performance is very poor. Executing the just shown shader needs around 120 ms. I would have expected much less!

My hardware is:

AMD Phenom 9650
4 GB DDR2 800 RAM
Palit GeForce GTX 460 with 768 MB VRAM
Ubuntu 14.04 LTS
nVidia binary driver 340.96

I tried out the newer nVidia driver ‘352.63’, but that one is a catastrophe: my shader runs around two to three times slower on it.

I appreciate ANY comments, including criticism, but please bear with me: I’m a beginner when it comes to GLSL :slight_smile:

[QUOTE]Executing the just shown shader needs around 120 ms.[/QUOTE]

Executing it on what? What are you drawing? How much overdraw is there?

I would expect using a global atomic counter to kill parallelism.

Unfortunately, most of the resources on contention-avoiding algorithms are for CUDA; there’s almost nothing regarding the performance costs of atomics or coherence in GLSL.

Thanks guys for your replies.

I am drawing a ‘sponza’ scene. I tried eight times to post a screenshot, or at least a link to one, but the OpenGL forum software rejected it. Please re-assemble the following URL: http://uploads.gamedev.net/monthly_08_2014/post-222765-0-11808500-1409079548.png
This is not exactly my scene but looks very close to mine.

I got this scene, together with an OpenGL render framework, from a fellow student.

My fragment shaders render the sponza scene smoothly when implementing, e.g., screen space reflections that read the screen buffer data from a buffer texture written by a previous, ‘normal’ render pass of the scene. But as soon as I use one or two imageLoad() calls per pixel to fetch the data from per-pixel linked lists instead of buffer textures, rendering one frame takes between 1.5 and 3 seconds.

As I already said, the fragment shaders I wrote before were all rather fast, but as soon as I add imageLoad or imageStore calls, performance suddenly becomes very poor.

Have you confirmed that it’s imageLoad/imageStore that causes the performance hit, rather than atomicCounterIncrement or imageAtomicExchange?

What effect, if any, does removing the “coherent” qualifier on the images have upon performance?

I’m aware that simply removing the calls and/or qualifier will break functionality, but it may also point toward a more viable approach.
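One way to run that experiment is a variant of the build shader with compile-time switches, sketched below. The macro names are hypothetical, the stand-in index assumes a 1024-pixel-wide framebuffer (taken from the capacity in the original shader), and every disabled path produces wrong lists, so it is only useful for timing one pass.

```glsl
#version 420 core

// Hypothetical switches: change one at a time and re-time the pass.
#define USE_COUNTER  1  // 0: skip atomicCounterIncrement
#define USE_EXCHANGE 1  // 0: skip imageAtomicExchange
#define USE_STORES   1  // 0: skip both imageStore calls
// Define COHERENT as empty to drop the qualifier from all three images.
#define COHERENT coherent

layout(binding = 0, r32i)    COHERENT uniform iimage2D     head_buffer;
layout(binding = 1, r32i)    COHERENT uniform iimageBuffer next_buffer;
layout(binding = 2, rgba32f) COHERENT uniform imageBuffer  data_buffer;
layout(binding = 3, offset = 0) uniform atomic_uint        ac;

out vec4 out_col;

const int MAX_NODES = 1024 * 768 * 16; // capacity from the original shader

void main()
{
    ivec2 pixel = ivec2(gl_FragCoord.xy);

#if USE_COUNTER
    int index = int(atomicCounterIncrement(ac));
#else
    // Contention-free stand-in: one slot per pixel, so later
    // fragments simply overwrite earlier ones.
    int index = pixel.y * 1024 + pixel.x;
#endif
    if (index >= MAX_NODES)
        discard;

#if USE_EXCHANGE
    int indexOld = imageAtomicExchange(head_buffer, pixel, index);
#else
    int indexOld = -1; // no list links are built in this variant
#endif

#if USE_STORES
    imageStore(next_buffer, index, ivec4(indexOld, 0, 0, 0));
    imageStore(data_buffer, index, vec4(1.0, 1.0, 1.0, gl_FragCoord.z));
#endif

    out_col = vec4(1.0);
}
```

Comparing the frame time of each variant against the unmodified shader gives a rough indication of which operation dominates. The numbers are only indicative, since the compiler is free to optimize differently once a call is removed.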

Hi JasonRay. Yes, new forum users aren’t allowed to post links and images. This is to prevent folks from subscribing just to post spam (we’ve had this problem before). Sorry for the inconvenience. After you’ve posted a few more times, you’ll be able to post links and images like everyone else here.

I’ve edited your post above and fixed-up the image link for other readers.

Thanks for your reply and for assembling the link.

[QUOTE=GClements;1280707]Have you confirmed that it’s imageLoad/imageStore that causes the performance hit, rather than atomicCounterIncrement or imageAtomicExchange?

What effect, if any, does removing the “coherent” qualifier on the images have upon performance?

I’m aware that simply removing the calls and/or qualifier will break functionality, but it may also point toward a more viable approach.[/QUOTE]

Yes, it strongly seems that imageLoad() is the call that drastically slows down, e.g., my Screen Space Reflections fragment shader.
That shader ran smoothly when reading the current fragment’s depth value from a buffer texture. I replaced the read from the buffer texture with a traversal of the per-pixel linked list, and now I get render times in the range of seconds instead of milliseconds.

Here’s an excerpt of the original shader code:


float current_z_buffer_value = get_depth(ray_pos_current_on_screen.x, ray_pos_current_on_screen.y, layer_index); // texture(depthbuffer_tex, vec2(ray_pos_current_on_screen.x, ray_pos_current_on_screen.y)).r;

Please note that reading from the buffer texture is now commented out, and I call get_depth() instead, which looks like this:


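// Sentinel depth returned when the list has no entry for the requested layer.
const float NO_FRAGMENT_DEPTH = 100000000.0;

float get_depth(float x, float y, int layer_index)
{
    int iMax = min(debug_val2, layer_index);
    int next = imageLoad(head_buffer, ivec2(int(x * screendim.x), int(y * screendim.y))).r;
    // Assumes a negative index terminates a list, including an empty head entry.
    for (int i = 0; i < iMax && next >= 0; i++)
        next = imageLoad(next_buffer, next).r;
    if (next < 0)
        return NO_FRAGMENT_DEPTH;
    return imageLoad(data_buffer, next).w;
}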

debug_val2 is 4. I tried 0, 1, 2, … as well. With 0 the shader is hardly slowed down; with higher values it quickly becomes too slow to be usable :frowning:

I already tried removing ‘coherent’; it had no significant effect on performance, and in particular the shader did not become fast enough.

I also tried ‘readonly’ and ‘restrict’; neither had a measurable effect.

:doh:
