Running kernel multiple times with alternating buffers

I’m running a kernel repeatedly, with the results stored in a buffer. The results need a small amount of processing by the host, so I thought I would use an alternating pair of buffers: queue up two kernel runs to start, each with its own buffer, wait for the first one to finish, and process its results while the second kernel run is in progress. It doesn’t seem to work that way. When I call clEnqueueMapBuffer on the first buffer, it makes me wait until the second kernel run has finished. I’ve confirmed this with various timings, and it can be seen in CodeXL:

[Can’t seem to get images to upload!]

Here are some code snippets showing what I’m doing:

// provide kernel with a search buffer, and other arguments (omitted)
m_searchKernel.setArg(0, m_searchBuffer[0]);
m_queue.enqueueNDRangeKernel(m_searchKernel, cl::NullRange, m_globalWorkSize, s_workgroupSize);

// and again with buffer #2
m_searchKernel.setArg(0, m_searchBuffer[1]);
m_queue.enqueueNDRangeKernel(m_searchKernel, cl::NullRange, m_globalWorkSize, s_workgroupSize);

// wait for first kernel to finish
search_results* results = (search_results*) m_queue.enqueueMapBuffer(m_searchBuffer[0], true, CL_MAP_READ, 0, sizeof(search_results));

// process the results

// release the buffer
m_queue.enqueueUnmapMemObject(m_searchBuffer[0], results);

I tried placing a call to m_queue.flush() after each call to enqueueNDRangeKernel, but that didn’t help.

So my question is, how can I get at the results from the first kernel run while the second kernel run is still in progress? Do I have to use separate queues?

Try a non-blocking map and synchronize using events.

cl::Event e;
// non-blocking map: the call returns immediately, and e is signaled once the map has finished
void* ptr = queue.enqueueMapBuffer(result_of_first, CL_FALSE, CL_MAP_READ, 0, size, NULL, &e);
e.wait(); // you can do other host work before this point
// only now is ptr a valid pointer to the data you need

This is only a sketch (queue and size are placeholders), but I hope it’s clear enough.

Ok, I tried that, but it doesn’t seem to be working. Here is a simplified version of my code:

cl::Event events[2];
search_results* results[2];
unsigned buf = 0;
bool FirstLoop = true;

while (true)
{
	// queue up a kernel run
	m_searchKernel.setArg(0, m_searchBuffer[buf]);
	m_queue.enqueueNDRangeKernel(m_searchKernel, cl::NullRange, m_globalWorkSize, s_workgroupSize);

	// non-blocking map; events[buf] is signaled when the map completes
	results[buf] = (search_results*) m_queue.enqueueMapBuffer(m_searchBuffer[buf], CL_FALSE, CL_MAP_READ, 0, sizeof(search_results), 0, &events[buf]);
	buf = (buf + 1) % 2;

	if (!FirstLoop)
	{
		// wait for the results

		// process the results

		// release the buffer
		m_queue.enqueueUnmapMemObject(m_searchBuffer[buf], results[buf]);

		// do some lengthy processing
		Timer t;
		while (t.elapsedMilliseconds() < 30) { Sleep(10); }
	}

	FirstLoop = false;
}

Here’s a screenshot from CodeXL. It shows the two enqueueNDRangeKernel / enqueueMapBuffer command pairs issued back-to-back. The first kernel starts running at pretty much the same time, and the wait on the first map event begins, but that wait is not released until the second kernel has finished!
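
For readers following the index flipping in the loop above, here is a small host-only trace of which buffer is submitted and which is processed on each iteration. It is only an illustration: ping_pong_schedule is a made-up helper, no OpenCL calls are involved, and -1 is an arbitrary marker for "nothing processed yet".

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper: replays the index logic of the loop above.
// Each entry is {buffer submitted, buffer processed}; -1 means that
// nothing is processed on that iteration.
std::vector<std::pair<int, int>> ping_pong_schedule(int iterations) {
    assert(iterations >= 0);
    std::vector<std::pair<int, int>> schedule;
    unsigned buf = 0;
    bool firstLoop = true;
    for (int i = 0; i < iterations; ++i) {
        const int submitted = static_cast<int>(buf);  // kernel + map enqueued on this buffer
        buf = (buf + 1) % 2;                            // flip to the other buffer
        // After the flip, buf names the buffer submitted on the previous iteration.
        const int processed = firstLoop ? -1 : static_cast<int>(buf);
        schedule.emplace_back(submitted, processed);
        firstLoop = false;
    }
    return schedule;
}
```

So from the second iteration on, the host submits a new run on one buffer and then waits on the map event of the other buffer, the one submitted an iteration earlier. That is the point where the wait was expected to return while the newer kernel was still running.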

This is strange. It would be somewhat understandable if the map finished after the first kernel without overlapping the second kernel, but this is much harder to explain. Perhaps some heuristic postpones a device-wide memory barrier to improve GPU utilization. Using multiple queues is probably unavoidable then.

I agree with Salabar regarding multiple queues and I’m pretty sure it is the only way you can do it efficiently.
I suggest multi-buffering with three queues.

It definitely seems like what you are looking to do.

If you have questions about that, post them and someone will answer.

I highly suggest googling first for some other links or presentations that will teach you how to do it.
Otherwise come back here and I will provide some assistance.

This is from a CUDA presentation but is essentially what we are suggesting.

Well, I managed to get it working. Thanks for the help. For those who may be interested, it involved multiple queues (as had been discussed) and non-blocking calls to enqueueMapBuffer, with associated event objects and wait functions. In addition, I attached events to the enqueueNDRangeKernel calls, because I found that otherwise the kernels would run at the same time on the single GPU, in some kind of time-sharing fashion, and I wanted them to run sequentially, one at a time.
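
The actual OpenCL code was not posted, so the following is only a rough host-side model of the dependency chain described above, written in plain standard C++ so it can be run without a device. It is not OpenCL: std::async stands in for a kernel run on its own queue, each std::shared_future plays the role of a cl::Event, and fake_kernel, process and run_double_buffered are made-up names. Each run waits on the previous run's "event" so that runs never overlap, while the host processes the previous buffer in the meantime.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Hypothetical stand-in for one kernel run: writes its "results" into out.
void fake_kernel(int run, std::vector<int>& out) {
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = run * 100 + static_cast<int>(i);
    }
}

// Hypothetical host-side processing of one finished buffer (here, a sum).
long process(const std::vector<int>& results) {
    return std::accumulate(results.begin(), results.end(), 0L);
}

// Launches `runs` runs over two alternating buffers of n elements and
// returns the value the host computed for each run, in run order.
std::vector<long> run_double_buffered(int runs, std::size_t n) {
    assert(runs >= 0);
    std::vector<int> buffers[2] = {std::vector<int>(n), std::vector<int>(n)};
    std::shared_future<void> done[2];   // one "event" per buffer
    std::shared_future<void> previous;  // "event" of the most recently launched run
    std::vector<long> sums;

    for (int run = 0; run < runs; ++run) {
        const int buf = run % 2;
        std::vector<int>& out = buffers[buf];
        // Launch without blocking the host. The run first waits on the
        // previous run's event, so two runs never execute at the same time.
        done[buf] = std::async(std::launch::async, [run, previous, &out] {
            if (previous.valid()) previous.wait();
            fake_kernel(run, out);
        }).share();
        previous = done[buf];

        if (run > 0) {
            // Process the other buffer while the run just launched is in flight.
            const int ready = 1 - buf;
            done[ready].wait();
            sums.push_back(process(buffers[ready]));
        }
    }
    if (runs > 0) {
        // Drain the last run, which no later iteration will pick up.
        const int last = (runs - 1) % 2;
        done[last].wait();
        sums.push_back(process(buffers[last]));
    }
    return sums;
}
```

In the real program the wait inside the task would presumably become an entry in the event wait list passed to enqueueNDRangeKernel, and the host-side wait would be on the event returned by the non-blocking enqueueMapBuffer. That mapping is an assumption based on the description above, not the poster's code.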