I am implementing stream compaction algorithm, more specifically block addition, where work group prefix sums are adjusted with the group total. My kernel code looks as follow:
__kernel void blockAddition(__global int* input, __global int* output)
{
int gid = get_global_id(0);
int bid = get_group_id(0);
int tid = get_local_id(0);
__local float value[1];
if(tid == 0)
value[0] = input[bid];
barrier(CLK_LOCAL_MEM_FENCE);
output[2 * gid] += value[0];
output[2 * gid + 1] += value[0];
}
My test data is: output contains repeating subarray [0,1,2…511,0,1,2…511,0,1,2…511…], input contains prefix sum of each subarray total sum shifted right [0, 262144, 524288, 786432…]. The issue occurs when the number of elements in output exceed 2^24 (16777216). In this case starting from index 16777216 instead the increasing sequence I have a sequence with repetitions:
Expected: [16777214, 16777215, 16777216, 16777217, 16777218, 16777219, 16777220]
Actual: [16777214, 16777215, 16777216, 16777216, 16777218, 16777220, 16777220].
When I introduce a buffer variable and replace the last two lines of the original method with the code below, everything works as expected:
int buffer = value[0];
output[2 * gid] += buffer;
output[2 * gid + 1] += buffer;
Can someone please help me with the explanation of such behavior?