After some pointers on ways to improve this kernel's performance

We’ve written a kernel that is fairly complex and according to profiling is understandably very slow to execute compared to the other kernels we have in this app. I was therefore after some advice on how the code might be improved. I’m not looking for a complete rewrite, just some pointers and tips, especially if there are any fundamental issues that can be highlighted. We fumbled our way through implementing this app with no prior experience of OpenCL or GPU programming (and the documentation isn’t the best either). It all “works”, but I’m sure that performance of this kernel can be improved somehow.

What we’re doing is acquiring a stream of records, one after the other, which arrive from an external device into the GPU via DirectGMA (the card is a Radeon Pro WX7100 by the way). The records go through a number of processing stages, with the kernel in question being one of these. The number of values in each record is always the same during an acquisition “run”, although it can vary from one run to the next. It’s typically around 50k values (floats).
The user defines up to 200 “regions” (windows) of interest within the record, each of which also has two further “baseline” regions defined, one either side of the “main” region. None of these region widths are fixed, and will typically vary between 5 and 20 values wide.
The kernel performs two tasks: first it calculates the average of all values across the two baseline regions. It then baseline corrects the values in the main region (using that average) and removes any spikes from within the region.

Here’s the kernel entry point, with the parameters as follows:

  • “regionRangeBuffer” - a buffer containing information about each region, six values for each: the start & end indices of the main region followed by the start & end indices of its leading and trailing baseline ranges, in the format Rs,Re,BLs,BLe,BTs,BTe,Rs,Re,… All values fall within the range: 0 to record_len-1.

  • “input_buffer” - a buffer containing the “input” records to be processed

  • “output_buffer” - a buffer where the noise corrected values will be written. Think of it as holding the same number of records as “input_buffer”, but all initially filled with 0s. The kernel will set values within those records’ regions as it processes the regions in input_buffer. The number of records that these two buffers hold can vary between 1 and 20 (depending on various settings in use for that particular acquisition “run”).

  • “record_len” - the record length

  • “noise_threshold” & “spike_threshold” - used by the noise correction algorithm

    __kernel void noiseCorrectionKernel(
    __global const int* regionRangeBuffer,
    __global const float* input_buffer,
    __global float* output_buffer,
    const int record_len,
    const int noise_threshold,
    const int spike_threshold)
    {
        int record_number = get_global_id(0); // The index of the record to process
        int region_number = get_global_id(1); // The index of the region to process
    
        // Get the region's start & end indices within the record
        int regionBufferOffset = region_number * 6;
        int region_start = regionRangeBuffer[regionBufferOffset];
        int region_end = regionRangeBuffer[regionBufferOffset + 1];
        int region_len = (region_end - region_start) + 1;
    
        // Ditto for the region's leading and trailing baseline regions
        int baseline1_start = regionRangeBuffer[regionBufferOffset + 2];
        int baseline1_end = regionRangeBuffer[regionBufferOffset + 3];
        int baseline1_len = (baseline1_end - baseline1_start) + 1;
        int baseline2_start = regionRangeBuffer[regionBufferOffset + 4];
        int baseline2_end = regionRangeBuffer[regionBufferOffset + 5];
        int baseline2_len = (baseline2_end - baseline2_start) + 1;
    
        int recordStart = record_number * record_len; // Index within input buffer of record to be processed
        int startIndex = recordStart + region_start; // Index within input buffer of region to be processed
    
        // Create pointers to the region's data in both the input & output buffers.
        __global const float* regionData = &(input_buffer[startIndex]);
        __global float* resultData = &(output_buffer[startIndex]);
    
        // Create pointers to the two baseline regions' data in the input buffer.
        __global const float* baseline1Data = &(input_buffer[recordStart + baseline1_start]);
        __global const float* baseline2Data = &(input_buffer[recordStart + baseline2_start]);
    
        // Do the real work:
        float avgBaseline = calculate_baseline(baseline1Data, baseline2Data, baseline1_len,  baseline2_len);
        noise_correct(regionData, resultData, region_len, avgBaseline, noise_threshold, spike_threshold);
    }
    

Here’s the “calculate_baseline” function, which calculates the average of the values from the two baseline regions:

float calculate_baseline(
    __global const float* baseline1Data,
    __global const float* baseline2Data,
    const int baseline1Len,
    const int baseline2Len)
{
    float baseline = 0;
    for (int i = 0; i < baseline1Len; ++i)
    {
        baseline += baseline1Data[i];
    }
    for (int i = 0; i < baseline2Len; ++i)
    {
        baseline += baseline2Data[i];
    }

    return baseline / ((float)(baseline1Len + baseline2Len));
}

And here’s the “noise_correct” function. In a nutshell this iterates over every value in the region, subtracts the baseline, then writes the updated value to the record in the “output” buffer. It also looks for spikes, and when one is found, “erases” the spike from the “output” record by setting the spike’s values to 0:

void noise_correct(
    __global const float* regionData,
    __global float* resultData,
    const int bufferLength,
    const float baseline,
    const int noise_threshold,
    const int spike_threshold)
{
    //First and last value in the region are always set to 0
    resultData[0] = 0.0;
    resultData[bufferLength - 1] = 0.0;
    
    int count = 0;
    int x = bufferLength - 1;
    for (int i = 1; i < x; ++i)
    {
        // Subtract baseline and copy new value to the record in the output buffer
        float value = regionData[i] - baseline;
        if (value <= noise_threshold) value = 0.0;
        resultData[i] = value;

        // This part looks for narrow "spikes" within the region's data. When one is found, the values that
        // make up the spike are replaced by 0.
        if (value > 0.0)
        {
            ++count;
        }
        else
        {
            if (count > 0 && count <= spike_threshold)
            {				
                for (int j = i - count; j < i; ++j)
                {
                    resultData[j] = 0.0;
                }
            }
            count = 0;
        }
    }
}

There are clearly a lot of local variables and “for” loops in all this code, but I can’t see a way to optimise these given that region sizes can vary.

I realise it’s common practice to copy data from the global “input_buffer” to local arrays to work on, but (unless I’m missing something) this seemed unnecessary - the kernel only iterates over the regions’ values once, so whether you do this directly or copy the values to local first, either way it involves accessing the regions’ values in global memory.