Subdividing global Workgroup Size


I am trying to automatically subdivide my global work size (GWS) into smaller pieces using the global work offset.
Here is an example:

size_t szGWS[3] = {1024, 1024, 1};
size_t szLWS[3] = {256, 1, 1};
size_t szGWO[3] = {0, 0, 0};
if (1024 * 1024 * uiWIComplexity > device.AvailableFlops) { // test whether we need to subdivide the problem
  int sub = 3;
  for (int i = 0; i < sub; i++) {
     szGWS[1] = 1024 / sub;      // rows handled by each sub-kernel
     szGWO[1] = 1024 * i / sub;  // first row of this sub-kernel
     clEnqKernel(..., szGWO, szGWS, szLWS, ...); // one enqueue per sub-range
  }
}

I think the indexing inside my kernel works properly, but synchronization fails.
I have a synchronized (in-order) queue, which means all enqueued kernels should synchronize by themselves, correct?

But if I do the following:
(1) copy values from buffer A to B in multiple subkernels
(2) edit values of A in multiple subkernels
(3) edit values of B in multiple subkernels

my data seems corrupted.
Does OpenCL wait for the whole of task (1) to complete before starting (2) and (3), or does it start the first part of (2) or (3) as soon as the first part of (1) is done?

Thanks in advance,

As stated in the OpenCL 1.2 specification, section 5.11:

If the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE property of a command-queue is not set, the commands enqueued to a command-queue execute in order.

For example, if an application calls clEnqueueNDRangeKernel to execute kernel A followed by a clEnqueueNDRangeKernel to execute kernel B, the application can assume that kernel A finishes first and then kernel B is executed.

If the memory objects output by kernel A are inputs to kernel B then kernel B will see the correct data in memory objects produced by execution of kernel A.

Note also that in your code 1024 is not divisible by 3, so clEnqueueNDRangeKernel will fail because the global work size (1024/3=341) is not a multiple of the local work size anymore.
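
To make the divisibility point concrete, here is a minimal sketch of one way to split a single dimension so that every piece stays a multiple of the local work size and no work-items are dropped (with plain integer division, three pieces of 341 rows cover only 1023 of the 1024 rows). The names chunk_t and split_range are hypothetical helpers, not part of OpenCL; the resulting offset and size would be written into szGWO[1] and szGWS[1] before each enqueue.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one sub-launch along a single dimension. */
typedef struct {
    size_t offset; /* global work offset for this sub-launch */
    size_t size;   /* global work size for this sub-launch */
} chunk_t;

/* Split `total` work-items into at most `parts` chunks whose sizes are all
 * multiples of `local`. Assumes `total` is itself a multiple of `local`.
 * Writes the chunks to `out` (capacity `parts`) and returns how many were
 * written; the last chunk absorbs the remainder and may be smaller. */
size_t split_range(size_t total, size_t local, size_t parts, chunk_t *out)
{
    assert(local > 0 && parts > 0 && total % local == 0);

    size_t groups = total / local;              /* work-groups in this dimension */
    size_t per = (groups + parts - 1) / parts;  /* work-groups per chunk, rounded up */
    size_t n = 0;

    for (size_t g = 0; g < groups; g += per) {
        size_t count = (groups - g < per) ? (groups - g) : per;
        out[n].offset = g * local;
        out[n].size = count * local;
        n++;
    }
    return n;
}
```

For the example in the question (1024 rows, local size 1 in that dimension, 3 pieces) this yields pieces of 342, 342 and 340 rows starting at rows 0, 342 and 684.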

Thanks for your reply. That was just a quick example written from memory. In my code I check that the range divides correctly into local work-group sizes.
My synchronization problem has been solved: I had missed an iteration in the for loop.

Why do you need to subdivide your work? The runtime does that automatically. You can submit any global work size and the runtime will run it in sections if needed.

I use this to automatically launch multiple kernels when the problem is larger than the number of FLOPs the GPU can achieve in 2 seconds. This makes sure the Windows watchdog never gets triggered. It works fine, but there is a small problem with my interface: with automatic subdivision of the kernel you have to wait for multiple kernels, and I am passing just one event reference through the interface. But I don’t think this will be a problem in the future.
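
The decision described above could be sketched roughly like this. It is only an illustration under assumptions: launches_needed and all of its parameters are hypothetical names, and the per-work-item cost and the device throughput are estimates the caller has to supply, not values OpenCL reports directly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: estimate how many sub-launches are needed so that
 * each one is expected to finish within `budget_seconds`.
 * All four inputs are caller-supplied estimates (assumptions). */
size_t launches_needed(double work_items, double flops_per_item,
                       double device_flops_per_second, double budget_seconds)
{
    assert(device_flops_per_second > 0.0 && budget_seconds > 0.0);

    /* Estimated run time of the whole problem as a single launch. */
    double seconds = work_items * flops_per_item / device_flops_per_second;

    /* Round up to a whole number of launches, and always launch at least once. */
    size_t n = (size_t)(seconds / budget_seconds);
    if ((double)n * budget_seconds < seconds)
        n++;
    return n == 0 ? 1 : n;
}
```

With a 2 second budget, a problem estimated at about 10.5 seconds would be split into 6 launches. Regarding the single event: on an in-order queue each command finishes before the next one starts, so the event returned by the last sub-launch only completes after all earlier ones have finished, which may be enough for an interface that carries just one event.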