OpenCL workgroup synch min/max values for a group

I’m trying to work the following problem.
I am trying to create a tiled version of an image where per tile i need to find the min/max luminance value.
To start with i began with a 1024x512 image with tiles of 16x16 each.

Based on what i’ve been reading this is the kernel i came up with using barriers to synch the local threads in a workgroup. I assumed that by synchronising it and creating a __local variable for Min and Max they could actually come up with a correct value for it, but it’s not the case… My results are every frame the image changes in a way i can’t really understand what values it comes up with:

on the host side here’s what i’m doing:

1024*512 work items in workgroups of 16x16

    clProgram->Set( 0, clTexMem );  // 1024x512 res input
    clProgram->Set( 1, clMinMaxLumMem ); // 64x32 res output
    clProgram->Set( 2, 0.0f ); // default min value
    clProgram->Set( 3, 1.0f ); // default max value
    clProgram->Set( 4, tileSize ); // tile size (default: 16)
    clProgram->Set( 5, AppParams::showTileMaxZ?1:0 ); // a condition in the kernel that renders the min OR max just for debugging
    int workItemsX = Math::Min( (int)clDevice->info.maxWorkGroupSize, tex->GetWidth()/tileSize );
    int workItemsY = Math::Min( (int)clDevice->info.maxWorkGroupSize, tex->GetHeight()/tileSize );
    clDevice->Run2D( workItemsX*tileSize, workItemsY*tileSize, tileSize, tileSize ); 

Here’s the kernel code:

__kernel void Image_GetMinMaxLum( read_only image2d_t srcImage, write_only image2d_t dstImage, float nearClipPlane, float farClipPlane, int tileSize, int showTileMaxZ )
    const float3 lumColor = (float3)( 0.30, 0.59, 0.11 );

    int uStart = get_global_id(0);
    int vStart = get_global_id(1);
    int lx = get_local_id(0);
    int ly = get_local_id(1);

    float4 color;

    // input image coords
    int2 coords = (int2)( uStart, vStart );

    // output image coords
    int2 coordsDst = (int2)( get_global_id(0)/tileSize, get_global_id(1)/tileSize );

    // local work group vars that store min/max values
    __local float minZ;
    __local float maxZ;

    if( showTileMaxZ > 0 )
        maxZ = nearClipPlane;
        color = read_imagef( srcImage, CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST, coords );
        float Y = dot( lumColor, color.rgb );  // Luminance
        maxZ = max(maxZ, Y);
        color.rgb = (float3)( maxZ );
//    barrier( CLK_LOCAL_MEM_FENCE );

    if( !showTileMaxZ  )
           // attempt to reset minZ to max value when first thread runs (assuming its the first one that runs??)
//        if( (lx+ly) == 0 )
            minZ = farClipPlane;
        color = read_imagef( srcImage, CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST, coords );
        float Y = dot( lumColor, color.rgb );  // Luminance
        minZ = min(minZ, Y);
        color.rgb = (float3)( minZ );
    barrier( CLK_LOCAL_MEM_FENCE );

    write_imagef( dstImage, coordsDst, color );

I’m pretty new with OpenCL but the current tests above where all based on what i read online and found in sources around the web.

I have to say that i did a test with FOR loops and it seems to work, but the thing is it’s probably not the way to do it.
Basically its a 16x16 loop to compute the min/max of all items in a block of pixels and save it. With this method i create less work items, but also have 16x16 image reads more.
I’m looking for advice on this, hopefully someone can help ?


No one? No ideas?

This is called “reduction” and it is such a usual operation that OpenCL C 2.0 now has built-in functions for it.

You will find a good introduction to reduction with OpenCL here:

Thanks for the help, I’ll have a read an try it.

Your access to the local variables seems to be incorrect. you arent synchronizing the access and all threads of your local workgroup access the same variable. so t1 till write maxZ, t2 will write maxZ and then you sync both threads. you should use an array like __local maxZ[get_local_size(0) * get_local_size(1)], write the max in each element and use the reduction. because your calculations are to simple to use the local memory for such a simple preoblem you could otherwise use the scan reduction utnapishtim talked about.

Hi Clint,

I understood that, but what you telling me now sounds confusing.

So the idea is to create a min/max value from all the local threads, let’s say i have 32x32 workgroup = 1024 threads.
Each will read a pixel and from all those 1024 i need the minimum and maximum values, from 1024 i need 2 values

What would be the reason to save a min/max value per thread? How would that work? Not really clear for me

I also did not understand completely on how to work the reduction part there too.

For a large enough image I’d just make each work item process a 16x16 section (using a pair of nested for loops) and output the single min/max pair. This would avoid doing reduction and using atomics. It would be simple and achieve sufficient parallelism given a large enough image.