Running a looped kernel

michaelm · February 12, 2010, 1:27am

Hi all,
I’m relatively new to OpenCL (but experienced with CUDA - porting an application currently, still on nvidia 285 device) and to these forums but I’ve come across an area not addressed by any of the documentation I have (opencl spec, reference card, reference pages, nvidia ocl programming guide, jumpstart guide etc etc) - or at least I haven’t found the 1 line that does apply in the 300 page spec…

In general I’m looking for information regarding running a kernel repeatedly. Ultimately the input data (128MB buffer of raw data) will vary but currently contains zeros - bar a few values to make sure the kernel was reading it properly.
The program flow I’m after is something like:


/*General Initialisation for first pass*/

//setup opencl context command queue kernel program etc...
initCL(&clCommandQueue, &clContext, &clKernel);

//initialise input data - hostInputData has been allocated as 'new cl_uchar[128MB];' and memset/populated with data
cl_mem devInput = clCreateBuffer(clContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, 128MB*sizeof(cl_uchar), hostInputData, &errCode);

//initialise results memory
cl_mem devResults = clCreateBuffer(clContext, CL_MEM_WRITE_ONLY, resultsSize, NULL, &errCode);

size_t globalWorkSize = 262144;
size_t localWorkSize = 512;

/* Main execution loop */

for (u_int32_t i = 0; i < loopCount; i++)
{

//write input data - for first pass this has already been done, but technicality in this instance
//ultimately I will employ double buffering with async copies to hide these transactions but in the spirit of simplicity...
clEnqueueWriteBuffer(clCommandQueue, devInput, CL_TRUE, 0, 128MB, hostInputData, 0, NULL, NULL);

//set kernel args
clSetKernelArg(clKernel, 0, sizeof(cl_mem), &devInput);
clSetKernelArg(clKernel, 1, sizeof(cl_mem), &devResults);

//run kernel
clEnqueueNDRangeKernel(clCommandQueue, clKernel, 1, 0, &globalWorkSize, &localWorkSize, 0, NULL, NULL);

//get results - hostResults is allocated during initialisation as 'new cl_uint[resultsSize];'
clEnqueueReadBuffer(clCommandQueue, hostResults, CL_TRUE, 0, resultsSize, devResults, 0, NULL, NULL);

}

Note: This is only psuedo-code representation

(In the background to this openCL work a second input host buffer will be populated with data and these alternated in the clEnqueueWriteBuffer instruction)

I don’t know of any logical reason why something like the above wouldn’t work however in my test example the first loop shows no problems but the second causes segmentation fault immediately after clEnqueueNDRangeKernel - all pointers and memory have been verified correct and unchanging via %p printf’s.

Any general information on re-running a kernel like this or specific gotchas that may occur in a similar scenario would be gratefully received.

Thanks in advance.

dbs2 · February 12, 2010, 3:52am

I don’t see anything there that should cause a problem. Because you’re calling write/read with CL_TRUE they will block so you are guaranteed that everything will be done before you continue (this is really bad for performance, by the way). You also don’t need to reset the kernel arguments as long as they are not changing.

Whose OpenCL are you using? It looks like a vendor issue to me.

michaelm · February 12, 2010, 4:17am

Thanks for your reply, I have no intention of using input data loads with blocking behaviour - just set that to true to guarantee sequential operation to debug the issue. Thanks for pointing out the kernel args - I had a feeling that you only needed to set them once, but again just wanted to make sure.

I’ve restructured the code such that it performs all the input stages before entering the loop. Then run clEnqueueNDRangeKernel n times. then leave the loop and accumulate the results.

As I had hoped, the correct number of results are obtained after processing and then at ‘some point in time’ afterwards, the segmentation fault occurs (there is no code executed after this query). Adding some cleanup at the end and running through ddd shows the seg fault point occurring at different places each run - leading me to believe it is a time response to an earlier event, perhaps stack corruption, which is causing the problem.

As for vendor, its Nvidia GTX285 with 190.29 drivers on redhat 5.2 64bit (cuda 2.3 also not that it’s of much relevance).

michaelm · February 12, 2010, 4:28am

Sorted it - I had specified &hostResults in the line:

clEnqueueReadBuffer(clCommandQueue, devResults, CL_TRUE, 0, resultsSize, &hostResults, 0, NULL, NULL)
when it should have just been;
clEnqueueReadBuffer(clCommandQueue, devResults, CL_TRUE, 0, resultsSize, hostResults, 0, NULL, NULL)

Strange as to why this only caused problems around the loop a second time - but most likely since the corrupted area of memory wasn’t touched until it came back round.

Thread can be closed now.