# Parallelizing nested loops

Hi,

I am new to OpenCL, and I am porting some MPI code I have, as I am hoping to see a benefit from using a GPU.

The portion of the code I am having trouble with updates a 2D array, but it does so using a five-deep nested loop.

``````
for (int i = 0; i < L + 1; i++) {
    for (int j = 0; j < L + 1; j++) {
        for (int k = 0; k < L + 1; k++) {
            // some_conditionals
            for (int l = 0; l < L + 1; l++) {
                // some_conditionals
                G = 1.0;
                for (int m = 0; m < L + 1; m++) {
                    // some_conditionals
                    G = some_math;
                } // end m loop

                blah[i][j] += some_math;
            } // end l loop
        } // end k loop
    } // end j loop
} // end i loop

``````

My first reaction was to parallelize the outer two loops (i, j), because then each thread could work on a unique blah[i][j]. But that is still too much work per thread. I am doing this on Windows with an ATI 5870, so I need each batch of kernels to complete within the TDR limit, otherwise Windows will kill the kernel.

In the code, the some_conditionals are based on the indices i, j, k, l, m (e.g., i != m).

To parallelize over i and j, I just use a 2D NDRangeKernel like so:

``````
err = queue.enqueueNDRangeKernel(
    kernel,
    cl::NullRange,
    cl::NDRange(L + 1, L + 1),
    cl::NDRange(1, 1),
    NULL,
    &event);
checkErr(err, "CommandQueue::enqueueNDRangeKernel()");

``````

I would have liked to use a 3D NDRange kernel (parallelizing i, j, k), but if I do that I need to perform some kind of reduction on blah[i][j], which I don't know how to do yet. Am I on the right track? Any suggestions?
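From what I've read so far, the 3D version would use one work-group per (i, j), with the local dimension running over k and a tree reduction in local memory at the end. Here is a rough sketch of what I think that kernel would look like — all names, the float type, and the some_math/some_conditionals placeholders are illustrative, and I haven't tested it:

```c
// Sketch: one work-group per (i, j); local size (1, 1, L+1), so one
// work-item per k.  Enqueued with global = NDRange(L+1, L+1, L+1),
// local = NDRange(1, 1, L+1), and the local buffer passed from the host
// as cl::Local((L + 1) * sizeof(float)).
__kernel void update_blah(__global float *blah, const int L,
                          __local float *partial)
{
    int i = get_group_id(0);
    int j = get_group_id(1);
    int k = get_local_id(2);

    float acc = 0.0f;
    /* some_conditionals on i, j, k */
    for (int l = 0; l < L + 1; l++) {
        /* some_conditionals */
        float G = 1.0f;
        for (int m = 0; m < L + 1; m++) {
            /* some_conditionals */
            G = some_math;
        }
        acc += some_math; /* this work-item's contribution to blah[i][j] */
    }

    // Tree reduction over k in local memory.  Assumes the local size is a
    // power of two; otherwise pad the unused slots with 0.0f first.
    partial[k] = acc;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int s = (int)get_local_size(2) / 2; s > 0; s >>= 1) {
        if (k < s)
            partial[k] += partial[k + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (k == 0)
        blah[i * (L + 1) + j] += partial[0];
}
```

One caveat I'm aware of: L + 1 can't exceed the device's maximum work-group size (256 on the 5870), so for larger L each work-item would have to sum several k values before the reduction. Is that the right idea?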

I tried breaking up the third loop by running a host loop that queues kernels with an additional argument (k=__).

Even doing that seems to be too much work per kernel: with L > 60 it still triggers the Windows TDR, which kills the kernel.
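Concretely, the per-k launch loop I tried looks roughly like this (the argument index is illustrative):

```cpp
// One 2D launch per value of k; k is passed as an extra kernel argument.
for (int k = 0; k < L + 1; ++k) {
    kernel.setArg(2, k);  // argument index 2 is illustrative
    err = queue.enqueueNDRangeKernel(
        kernel,
        cl::NullRange,
        cl::NDRange(L + 1, L + 1),
        cl::NullRange);   // cl::NullRange lets the runtime pick the local size
    checkErr(err, "CommandQueue::enqueueNDRangeKernel()");
}
queue.finish();
```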

I am learning OpenCL as I go, and my background is MPI.

Thanks!

> But that is still too much work for each thread. I am doing this on Windows with an ATI 5870, so I want the batches of kernels to complete within the TDR, otherwise Windows will kill the kernel.

Assuming you don’t want to change the five-second timeout, an easy fix is to make multiple calls to clEnqueueNDRangeKernel() instead of a single one, using the global_work_offset argument to partition the work into smaller pieces.
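As a sketch (the helper name and chunk size are made up): split the i dimension into bands of at most `chunk` rows, then launch one band at a time, passing the band's start as the global_work_offset.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Compute (offset, count) bands that tile [0, rows) in pieces of at most
// `chunk` rows each; one kernel launch per band keeps every launch short
// enough to finish inside the TDR window.
std::vector<std::pair<std::size_t, std::size_t>> row_bands(std::size_t rows,
                                                           std::size_t chunk)
{
    std::vector<std::pair<std::size_t, std::size_t>> bands;
    for (std::size_t off = 0; off < rows; off += chunk)
        bands.emplace_back(off, std::min(chunk, rows - off));
    return bands;
}

// With the C++ bindings, each band is then launched roughly like:
//
//   for (auto &b : row_bands(L + 1, chunk))
//       queue.enqueueNDRangeKernel(kernel,
//                                  cl::NDRange(b.first, 0),      // global_work_offset
//                                  cl::NDRange(b.second, L + 1), // this band only
//                                  cl::NullRange);
//
// The kernel body is unchanged: get_global_id(0)/get_global_id(1) already
// include the offset.
```

Note that a non-NULL global_work_offset requires OpenCL 1.1 or later (the 5870 supports it); on a 1.0 runtime you would pass the offset as a kernel argument instead.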