Incorrect blockAddition result

I am implementing a stream compaction algorithm, more specifically the block addition step, where the work-group prefix sums are adjusted with the group totals. My kernel code looks as follows:

__kernel void blockAddition(__global int* input, __global int* output)
{
	int gid = get_global_id(0);
	int bid = get_group_id(0);
	int tid = get_local_id(0);

	__local float value[1];

	if(tid == 0)
		value[0] = input[bid];


	output[2 * gid] += value[0];
	output[2 * gid + 1] += value[0];
}
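For reference, the intended effect of the block addition step can be sketched on the host in plain C. This is only an illustration under assumptions, not the kernel itself: the function name block_addition and the block_size parameter are hypothetical, and the loop runs sequentially instead of one work-item per pair of elements.

```c
/* Hypothetical host-side sketch of block addition:
   every element of block b gets block_offsets[b] added to it.
   All arithmetic stays in int, so no precision is lost. */
static void block_addition(const int *block_offsets, int *data,
                           int n, int block_size)
{
    for (int i = 0; i < n; ++i) {
        int block = i / block_size;      /* plays the role of get_group_id(0) */
        data[i] += block_offsets[block]; /* integer addition only */
    }
}
```

With two blocks of four elements, data = {0,1,2,3,0,1,2,3} and block_offsets = {0,4} become the increasing sequence 0..7.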


My test data is: output contains the repeating subarray [0,1,2…511,0,1,2…511,0,1,2…511…], and input contains the prefix sums of the subarray totals, shifted right: [0, 262144, 524288, 786432…]. The issue occurs when the number of elements in output exceeds 2^24 (16777216). In that case, starting from index 16777216, instead of an increasing sequence I get a sequence with repetitions:
Expected: [16777214, 16777215, 16777216, 16777217, 16777218, 16777219, 16777220]
Actual: [16777214, 16777215, 16777216, 16777216, 16777218, 16777220, 16777220].
When I introduce a buffer variable and replace the last two lines of the original method with the code below, everything works as expected:

int buffer = value[0];
output[2 * gid] += buffer;
output[2 * gid + 1] += buffer;

Can someone please explain this behavior?

Try putting the __local declaration before int gid = get_global_id(0);

Also, the implicit int-to-float conversion in " value[0] = input[bid]; " may not be a good idea.

buffer is an int ;))

Hi @hterrolle. Thanks for your answer. Changing the local type from float to int really does fix the issue. But I still do not understand why the original code does not work correctly. Placing the local declaration at the top does not change its behavior.
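One likely explanation, sketched here in plain C rather than OpenCL C: when an int is combined with a float, the int operand is converted to float, the sum is computed in float, and the result is converted back to int. A single-precision float has a 24-bit significand, so above 2^24 only every second integer is representable and the sum is rounded. The helper names add_via_float and add_via_int are hypothetical, and the sketch assumes IEEE 754 single precision with the default round-to-nearest-even mode.

```c
/* Mirrors `output[i] += value[0]` with a float offset:
   the int element is converted to float, added, and converted back. */
static int add_via_float(int element, float block_offset)
{
    float sum = (float)element + block_offset; /* rounded to 24 significant bits */
    return (int)sum;
}

/* Mirrors the workaround `int buffer = value[0]`:
   the offset is converted to int once, then the addition is done in int. */
static int add_via_int(int element, float block_offset)
{
    int buffer = (int)block_offset; /* exact while the offset itself is representable */
    return element + buffer;
}
```

Under those assumptions, add_via_float(1, 16777216.0f) gives 16777216 and add_via_float(3, 16777216.0f) gives 16777220, which matches the repeated values in the actual output, while add_via_int returns 16777217 and 16777219.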

Moving __local to the top was just a try. We never know the compiler's processing, restrictions and obligations ;))

It is a type problem. I think a float only has a fixed-size fractional part.
But the rule is to be consistent in the use of types to avoid problems. ;))

In ARM OpenCL, for example, if you do “int X = 1/2”, X will be 0: the result is truncated automatically.
OpenCL does not have that many functions, so it looks like it is very strict about types.

But I am not sure. It is just my thinking. Maybe someone else could answer much better. ;))
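The integer division remark above can be illustrated with a small C sketch; the same rule applies in standard C, not only in one OpenCL implementation. The helper names divide_int and divide_float are hypothetical. Note that integer division truncates toward zero, which only coincides with a floor for non-negative results.

```c
/* Integer division: the fractional part is discarded (truncation toward zero). */
static int divide_int(int a, int b)
{
    return a / b;
}

/* Converting the operands first keeps the fractional part. */
static float divide_float(int a, int b)
{
    return (float)a / (float)b;
}
```

Here divide_int(1, 2) is 0 and divide_int(-1, 2) is also 0 (a true floor would give -1), while divide_float(1, 2) is 0.5.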

I have also made a test, and its result is strange too.

struct my_debug {
int tableau[36];
static int* deb[36] = {0,0,0,0,0,0,

struct my_debIdx {
int tableau[1];
static int* debIdx[1] = {0};

    debug           = cl::Buffer(gContext, CL_MEM_USE_HOST_PTR , 36*sizeof(int), deb, NULL);
    debugIdx        = cl::Buffer(gContext, CL_MEM_USE_HOST_PTR , 1*sizeof(int), debIdx, NULL);

__global int* debug, //
__global int* debugIdx,


debug[0] += 1;

and I never get the same result twice?

so …