Parallelizing nested loops

Hi,

I am new to OpenCL, and I am porting some MPI code I have, as I am hoping to see a benefit from using a GPU.

The portion of the code I am having trouble with updates a 2D array, but it does so using a 5-deep nested loop.


for (int i = 0; i < L + 1; i++) {
    for (int j = 0; j < L + 1; j++) {
        for (int k = 0; k < L + 1; k++) {
            if (some_conditionals) {
                for (int l = 0; l < L + 1; l++) {
                    if (some_conditionals) {
                        G = 1.0;
                        for (int m = 0; m < L + 1; m++) {
                            if (some_conditionals)
                                G = some_math;
                        } // end m loop

                        blah[i][j] += some_math;
                    }
                } // end l loop
            }
        } // end k loop
    } // end j loop
} // end i loop

My first reaction was to parallelize the two outer loops (i, j), since each thread could then work on a unique blah[i][j]. But that is still too much work per thread. I am doing this on Windows with an ATI 5870, so I need each batch of kernels to complete within the TDR limit; otherwise Windows will kill the kernel.

In the code, the some_conditionals are based on the indices i, j, k, l, m (e.g., i != m).

To parallelize i,j I just use a 2D NDRangeKernel like so:


    err = queue.enqueueNDRangeKernel(
        kernel,
        cl::NullRange,
        cl::NDRange((L+1), (L+1)),
        cl::NDRange(1,1),
        NULL,
        &event);
    checkErr(err, "CommandQueue::enqueueNDRangeKernel()");

I would have liked to use a 3D NDRange kernel (parallelizing i, j, k), but if I do that, I need to perform some kind of reduction on blah[i][j], which I don't know how to do yet. Am I on the right track? Any suggestions?

I also tried breaking up the third loop, running a host-side loop that queues kernels with an additional argument (k = __).

Even then there seems to be too much work in the kernel: with L > 60 it triggers the Windows TDR, which kills the kernel.

I am learning OpenCL as I go, and my background is MPI.

Thanks!

But that is still too much work per thread. I am doing this on Windows with an ATI 5870, so I need each batch of kernels to complete within the TDR limit; otherwise Windows will kill the kernel.

Assuming you don't want to change the 5-second timeout, the easiest fix is to make multiple calls to clEnqueueNDRangeKernel() instead of a single one. You can use the global_work_offset argument to partition the work into smaller pieces.