Multiple kernels on a single device (1 GPU)

My code needs to run multiple kernels repeatedly, in order. Here is what I did:
When calling ‘clCreateCommandQueue’, I set ‘cl_command_queue_properties properties’ to ‘0’, or to ‘CL_QUEUE_PROFILING_ENABLE’ when I need timing.
Then, between each ‘clEnqueueNDRangeKernel’ or ‘clEnqueueReadBuffer’ call, I call ‘clEnqueueBarrier(commandqueue)’ as a barrier.

But I am seeing a strange problem that I think is related to the kernels not being executed in order.

Is there anything I am missing here? Thank you very much.

Btw, I create one context, one program, one command queue, and many different kernels from the same program, and run on 1 GPU.

I would suggest using events to accomplish this. The barrier will make sure that all the commands of one iteration have finished, but it does not ensure that they happen in order. (Of course, GPUs today are all in-order, so it shouldn’t matter.)