# Parallelizing nested loops

Hi,

I am new to OpenCL, and I am porting some MPI code I have, as I am hoping to see a benefit from using a GPU.

The portion of the code I am having trouble with updates a 2D array, but it does so using a five-deep nested loop.

``````
for (int i = 0; i < L + 1; i++) {
    for (int j = 0; j < L + 1; j++) {
        for (int k = 0; k < L + 1; k++) {
            // some_conditionals
            for (int l = 0; l < L + 1; l++) {
                // some_conditionals
                G = 1.0;
                for (int m = 0; m < L + 1; m++) {
                    // some_conditionals
                    G = some_math;
                } // end m loop

                blah[i][j] += some_math;
            } // end l loop
        } // end k loop
    } // end j loop
} // end i loop

``````

My first reaction was to parallelize the outer two loops (i, j), because then each thread could work on a unique blah[i][j]. But that is still too much work per thread. I am doing this on Windows with an ATI 5870, so I need each batch of kernels to complete within the TDR limit, otherwise Windows will kill the kernel.

In the code, the some_conditionals are based on the indices i, j, k, l, m (e.g., i != m).

To parallelize over i and j, I just use a 2D NDRangeKernel like so:

``````
err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(L + 1, L + 1),
    cl::NDRange(1, 1),
    NULL,
    &event);
checkErr(err, "CommandQueue::enqueueNDRangeKernel()");

``````

I would have liked to use a 3D NDRange kernel (parallelizing i, j, k), but if I do that I need to perform some kind of reduction on blah[i][j], which I don't know how to do yet. Am I on the right track? Any suggestions?
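From what I've read so far, the 3D version would use one work-group per (i, j), with the local dimension running over k and a tree reduction in local memory at the end. Here is a rough sketch of what I think that kernel would look like — all names, the float type, and the some_math/some_conditionals placeholders are illustrative, and I haven't tested it:

```c
// Sketch: one work-group per (i, j); local size (1, 1, L+1), so one
// work-item per k.  Enqueued with global = NDRange(L+1, L+1, L+1),
// local = NDRange(1, 1, L+1), and the local buffer passed from the host
// as cl::Local((L + 1) * sizeof(float)).
__kernel void update_blah(__global float *blah, const int L,
                          __local float *partial)
{
    int i = get_group_id(0);
    int j = get_group_id(1);
    int k = get_local_id(2);

    float acc = 0.0f;
    /* some_conditionals on i, j, k */
    for (int l = 0; l < L + 1; l++) {
        /* some_conditionals */
        float G = 1.0f;
        for (int m = 0; m < L + 1; m++) {
            /* some_conditionals */
            G = some_math;
        }
        acc += some_math; /* this work-item's contribution to blah[i][j] */
    }

    // Tree reduction over k in local memory.  Assumes the local size is a
    // power of two; otherwise pad the unused slots with 0.0f first.
    partial[k] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = (int)get_local_size(2) / 2; s > 0; s >>= 1) {
        if (k < s)
            partial[k] += partial[k + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (k == 0)
        blah[i * (L + 1) + j] += partial[0];
}
```

One caveat I'm aware of: L + 1 can't exceed the device's maximum work-group size (256 on the 5870), so for larger L each work-item would have to sum several k values before the reduction. Is that the right idea?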

I tried breaking up the third loop by running a host loop that queues kernels with an additional argument (k=__).

Even doing that seems to be too much work per kernel: with L > 60 it still triggers the Windows TDR, which kills the kernel.
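Concretely, the per-k launch loop I tried looks roughly like this (the argument index is illustrative):

```cpp
// One 2D launch per value of k; k is passed as an extra kernel argument.
for (int k = 0; k < L + 1; ++k) {
    kernel.setArg(2, k);  // argument index 2 is illustrative
    err = queue.enqueueNDRangeKernel(
        kernel,
        cl::NullRange,
        cl::NDRange(L + 1, L + 1),
        cl::NullRange);   // cl::NullRange lets the runtime pick the local size
    checkErr(err, "CommandQueue::enqueueNDRangeKernel()");
}
queue.finish();
```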

I am learning OpenCL as I go, and my background is MPI.

Thanks!

> But that is still too much work for each thread. I am doing this on Windows with an ATI 5870, so I want the batches of kernels to complete within the TDR, otherwise Windows will kill the kernel.

Assuming you don’t want to change the five-second timeout, an easy fix is to make multiple calls to clEnqueueNDRangeKernel() instead of a single one, using the global_work_offset argument to partition the work into smaller pieces.
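As a sketch (the helper name and chunk size are made up): split the i dimension into bands of at most `chunk` rows, then launch one band at a time, passing the band's start as the global_work_offset.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Compute (offset, count) bands that tile [0, rows) in pieces of at most
// `chunk` rows each; one kernel launch per band keeps every launch short
// enough to finish inside the TDR window.
std::vector<std::pair<std::size_t, std::size_t>> row_bands(std::size_t rows,
                                                           std::size_t chunk)
{
    std::vector<std::pair<std::size_t, std::size_t>> bands;
    for (std::size_t off = 0; off < rows; off += chunk)
        bands.emplace_back(off, std::min(chunk, rows - off));
    return bands;
}

// With the C++ bindings, each band is then launched roughly like:
//
//   for (auto &b : row_bands(L + 1, chunk))
//       queue.enqueueNDRangeKernel(kernel,
//                                  cl::NDRange(b.first, 0),      // global_work_offset
//                                  cl::NDRange(b.second, L + 1), // this band only
//                                  cl::NullRange);
//
// The kernel body is unchanged: get_global_id(0)/get_global_id(1) already
// include the offset.
```

Note that a non-NULL global_work_offset requires OpenCL 1.1 or later (the 5870 supports it); on a 1.0 runtime you would pass the offset as a kernel argument instead.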