Problem debugging kernel

Hi,

I found some strange behaviour while debugging a kernel. I use this:

struct my_debIdx {
    int tableau[1];
};
static my_debIdx debIdx = { {0} };

to retrieve the number of times I set a special value in out[global index].x.

I create the GPU buffer:

debugIdx = cl::Buffer(gContext, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, 1 * sizeof(int), debIdx.tableau, NULL);

I pass the debug buffer to the kernel:

__global int* restrict debugIdx

and each time I set out[global index].x I count it:

(*debugIdx)++;

Then I retrieve the information using:

    gQueue.enqueueReadBuffer(debugIdx, CL_TRUE, 0, 1 * sizeof(int), debIdx.tableau);
    LOGI(" debug0: indIdx value %5d \n", debIdx.tableau[0]);

    // reset the buffer to zero for another use
    debIdx.tableau[0] = 0;
    gQueue.enqueueWriteBuffer(debugIdx, CL_TRUE, 0, 1 * sizeof(int), debIdx.tableau);

I work with cl::NDRange(1024, 1024) and cl::NDRange(2, 2), so the number of work-groups to be processed is (1024 / 2)^2 = 512^2 = 262144 work-groups.

When I work with a small buffer (36 * 36) everything is fine, but with (1024 * 1024) it looks like the counter value is a lot smaller than what I get when I retrieve the data later.

This is what I get from the kernel:
debug0: indIdx value 1601
and this is what I retrieve from the output kernel buffer passed to the CPU:
void Extraction_Point: buf.bufligne Rouge: 25467

So the kernel gives 1601 while extracting the value from the output buffer gives 25467, and 25467 is the correct value, no doubt.

Maybe I made an error somewhere?


Hi,
It looks like __global int* restrict debugIdx is not shared correctly across all the work-groups. I tried with printf and I found the same index number many times. Is that what we call a race condition?


I just did a new test. I changed the way of testing and now I can get all the info from each work-group, and even from each thread inside a work-group.

I just changed the struct to have the size of all the work-groups, with a second dimension for the threads inside each work-group.

Incrementing a single shared index does not work, and I think that is normal: it would need synchronisation between work-groups and threads.

I learned something ;))


Ugh, I HATE race conditions on shared counters. For a global int, try the OpenCL atomic functions atomic_inc or atomic_add. Specifically, atomic_inc(volatile __global int*) can be helpful. If you’re using CL_MEM_USE_HOST_PTR, make sure the buffer is properly mapped; otherwise, CL_MEM_COPY_HOST_PTR (or a plain device buffer) may give better results. Also, consider using local memory per work-group if you’re facing contention on global memory. Make sure to use barriers, and call finish() on the queue before any reads.

  • Use atomic operations for a global counter: declare the counter as volatile __global int* counter and update with atomic_inc((volatile __global int*)counter); (or atomic_add if you need different increments).
  • Alternatively give each work-group its own slot (per-group counters in a global array) and reduce them to one value with atomics or a post-pass on the host.
  • Avoid CL_MEM_USE_HOST_PTR pitfalls: prefer a proper device buffer and a host-visible staging buffer (or call clFinish / wait on the queue) before reading back.
  • For debugging per-thread values, write per-invocation results to an indexed global array (one slot per work-item) instead of trying to atomically count every event — that’s deterministic and easier to validate.

happy coding, good luck :)

~p3n

Thanks p3nGu1nZz, that is the only way.

I found the solution on Stack Overflow but forgot to post it. Here it is.

Here is the host buffer:

struct my_debIdx {
    int tableau[1];
};
static my_debIdx debIdx = { {0} };

Here is the cl::Buffer:

debugIdx = cl::Buffer(gContext, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, 1 * sizeof(int), debIdx.tableau, NULL);

Here is the atomic counter function:

void increase(volatile __global int* counter, ushort A, ushort B) {
    // A and B are unused here; the signature just matches the call site
    atomic_inc(counter);
}

Here is the kernel:

setArg(0, debugIdx); // debug output

__kernel void CounterTest(__global int* restrict counterArg)
{
    volatile __global int* counterPtr = counterArg;
    // call the counter function where you need it
    // (data1 and data2 come from the rest of the kernel)
    increase(counterPtr, data1, data2);
}


Nice work noticing that mismatch! That’s exactly the kind of detail that helps you grow as a kernel developer. Debugging on the GPU often feels like chasing shadows, and the fact that you spotted something unexpected means you’re already learning how to read the system’s behavior.

Even if the numbers don’t line up perfectly, the discovery itself is valuable. Each time you catch one of these quirks, you build intuition for how parallel work really unfolds. That intuition is what makes the difference down the road.

Keep following those instincts; every small insight adds up to a clearer picture of how the machine works.

~p3nGu1nZz

PS, here is some amazing new tech that solves these problems across different hardware vendors. chou.micro2020.pdf