This is the code snippet where the OpenCL kernels are enqueued:
globalWorkSizeCPU[0] = (size_t)ceil(((float)ncols) / ((float)DIM_LOCAL_WORK_GROUP_X)) * DIM_LOCAL_WORK_GROUP_X;
globalWorkSizeCPU[1] = (size_t)ceil(splittingPoint) * DIM_LOCAL_WORK_GROUP_Y;
offsetCPU[0] = 0;
offsetCPU[1] = 0;
globalWorkSizeGPU[0] = (size_t)ceil(((float)ncols) / ((float)DIM_LOCAL_WORK_GROUP_X)) * DIM_LOCAL_WORK_GROUP_X;
globalWorkSizeGPU[1] = (size_t)ceil(((float)nrows) / ((float)DIM_LOCAL_WORK_GROUP_Y) - splittingPoint) * DIM_LOCAL_WORK_GROUP_Y;
offsetGPU[0] = 0;
offsetGPU[1] = (size_t)ceil(splittingPoint) * DIM_LOCAL_WORK_GROUP_Y;
errcode = clEnqueueNDRangeKernel(clGPUCommandQue, clGPUKernel, 2, offsetGPU, globalWorkSizeGPU, localWorkSize, 0, NULL, NULL);
errcode = clEnqueueNDRangeKernel(clCPUCommandQue, clCPUKernel, 2, offsetCPU, globalWorkSizeCPU, localWorkSize, 0, NULL, NULL);
The problem is that although I enqueue both kernels, the timing results suggest that they do not run in parallel.
The two command queues appear to be executed one after the other.
I am using the Odroid XU-3 board, which has an ARM CPU and a Mali GPU. The two devices belong to different OpenCL platforms.
I also tried reversing the order of the two clEnqueueNDRangeKernel calls, but the result was the same.
Could anyone help me solve this issue?
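To make the row split concrete, here is a minimal standalone sketch of the partitioning arithmetic used in the snippet above. The work-group sizes (32 and 8), the helper names `round_up` and `split_rows`, and the `RowSplit` struct are assumptions for illustration only and are not part of my actual code. Unlike the snippet, it derives the GPU share as the remaining whole work-groups, so the two ranges cannot overlap or run past the padded row count when `splittingPoint` is fractional.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical local work-group sizes; the real values are not shown in the post. */
#define DIM_LOCAL_WORK_GROUP_X 32
#define DIM_LOCAL_WORK_GROUP_Y 8

/* Round n up to the next multiple of group (group must be non-zero). */
static size_t round_up(size_t n, size_t group)
{
    return ((n + group - 1) / group) * group;
}

/* Row ranges handed to each device, in work-items along dimension 1. */
typedef struct {
    size_t cpu_rows;   /* global work size [1] for the CPU queue */
    size_t gpu_rows;   /* global work size [1] for the GPU queue */
    size_t gpu_offset; /* global work offset [1] for the GPU queue */
} RowSplit;

/* splitting_point is measured in work-groups along Y, as in the snippet above. */
static RowSplit split_rows(size_t nrows, float splitting_point)
{
    RowSplit s;
    size_t total_groups = round_up(nrows, DIM_LOCAL_WORK_GROUP_Y) / DIM_LOCAL_WORK_GROUP_Y;
    size_t cpu_groups;

    assert(splitting_point >= 0.0f);

    /* Ceiling of splitting_point, computed without libm. */
    cpu_groups = (size_t)splitting_point;
    if ((float)cpu_groups < splitting_point) {
        cpu_groups++;
    }
    /* Never give the CPU more groups than exist. */
    if (cpu_groups > total_groups) {
        cpu_groups = total_groups;
    }

    s.cpu_rows = cpu_groups * DIM_LOCAL_WORK_GROUP_Y;
    s.gpu_offset = s.cpu_rows;
    /* The GPU takes the remaining whole groups, so both parts sum to the padded total. */
    s.gpu_rows = (total_groups - cpu_groups) * DIM_LOCAL_WORK_GROUP_Y;
    return s;
}
```

For example, with `nrows = 1000` and `splittingPoint = 62.5`, `split_rows` gives 504 rows to the CPU and 496 rows to the GPU starting at offset 504. The formula in the snippet above would round the GPU part up to 504 rows as well, so the GPU range would extend past row 1000.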