Querying kernel progress

I’m trying to display kernel progress by incrementing a value through an int pointer. My kernel is very simple:

```c
__kernel void raycastHeightmap(__read_only image2d_t srcImg,
                               __global int* totalProgress)
{
    /* raycasting work omitted; each work-item increments *totalProgress */
}
```

I execute the kernel n times (64 in the loop below) so that no single launch runs long enough for the Windows watchdog to kill it:

```csharp
int[] mapBakeProgress = new int[1];

// The kernel reads and writes the counter, so the buffer must be ReadWrite, not WriteOnly.
Cl.Mem progressBuffer = Cl.CreateBuffer(_context, Cl.MemFlags.CopyHostPtr | Cl.MemFlags.ReadWrite, (IntPtr)(sizeof(int)), mapBakeProgress, out error);
CheckErr(error, "Cl.CreateBuffer");

// totalProgress is the kernel's second parameter, so its index is 1 (index 0 is srcImg).
error = Cl.SetKernelArg(kernel, 1, (IntPtr)intPtrSize, progressBuffer);
CheckErr(error, "Cl.SetKernelArg");

// Global work size: 128 * 128 * 64 = 1048576 work-items per launch.
IntPtr[] globalWorkSize = new IntPtr[] { (IntPtr)128, (IntPtr)128, (IntPtr)64 };

for (int i = 0; i < 64; i++) {
	error = Cl.EnqueueNDRangeKernel(cmdQueue, kernel, 3, null, globalWorkSize, null, 0, null, out clevent);
	CheckErr(error, "Cl.EnqueueNDRangeKernel");

	error = Cl.Finish(cmdQueue);
	CheckErr(error, "Cl.Finish");

	// Update progress. A blocking read (Cl.Bool.True) guarantees that
	// mapBakeProgress has been filled before it is printed.
	error = Cl.EnqueueReadBuffer(
		cmdQueue, progressBuffer, Cl.Bool.True, IntPtr.Zero,
		new IntPtr(sizeof(int)),
		mapBakeProgress, 0, null, out clevent);
	CheckErr(error, "Cl.EnqueueReadBuffer");
	Console.WriteLine("Progress: " + mapBakeProgress[0]);
}
```

As far as I understand, my progress should increase by 128 × 128 × 64 = 1048576 (the total number of work-items) with every iteration of my host loop. However, I’m getting seemingly random numbers like:

Progress: 588
Progress: 1173
Progress: 1758
Progress: 2352

Even stranger, those numbers change every time I relaunch my application.

What’s wrong with my code and how can I make it properly update the progress value with every kernel execution?
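For reference, here is the arithmetic behind the expected values. The helper names `total_work_items` and `expected_progress` are hypothetical, and the sketch assumes that every work-item adds exactly 1 to the counter. The number of work-items in one launch is the product of the global work sizes passed to `clEnqueueNDRangeKernel`, which is not the same thing as the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-items in one launch: the product of the global work
   sizes passed to clEnqueueNDRangeKernel (work_dim must be at least 1). */
static size_t total_work_items(const size_t *global_size, unsigned dims) {
    assert(dims >= 1);
    size_t total = 1;
    for (unsigned d = 0; d < dims; d++)
        total *= global_size[d];
    return total;
}

/* Counter value expected after `launches` launches, assuming every
   work-item adds exactly 1 and no increment is ever lost. */
static size_t expected_progress(const size_t *global_size, unsigned dims,
                                size_t launches) {
    return total_work_items(global_size, dims) * launches;
}
```

With `{128, 128, 64}`, one launch gives 1048576 and all 64 launches give 67108864. That total is well below `INT_MAX` (2147483647), so the `int` counter itself cannot overflow in this loop.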


The problem is that nothing protects that memory. Each work-item in the kernel does a read/modify/write, and if two work-items read the same value, increment it and write it back, the counter goes up by one instead of two. How many updates get lost depends on timing, which is why your totals are far too small and change from run to run.
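The lost-update effect can be reproduced on the CPU. This is a minimal sketch that assumes POSIX threads stand in for work-items and a C11 `atomic_int` stands in for `totalProgress`; `worker`, `worker_args` and `count_with` are hypothetical names. The path that does a separate load and store usually ends below `nthreads * iters` on a multi-core machine, and by a different amount on each run. The `atomic_fetch_add` path always reaches the exact total. If the kernel uses a plain increment such as `(*totalProgress)++` (the body isn't shown, so this is an assumption), the OpenCL 1.1 built-in `atomic_inc(totalProgress)` is the atomic equivalent.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared counter standing in for the kernel's __global int *totalProgress. */
static atomic_int progress;

typedef struct {
    int iters;          /* increments performed by each thread */
    int use_atomic_add; /* 1: indivisible add, 0: separate read and write */
} worker_args;

static void *worker(void *p) {
    const worker_args *a = p;
    for (int i = 0; i < a->iters; i++) {
        if (a->use_atomic_add) {
            /* One indivisible read-modify-write: no increment can be lost. */
            atomic_fetch_add_explicit(&progress, 1, memory_order_relaxed);
        } else {
            /* Read, then write back value + 1 as two separate steps, like a
               plain increment. Another thread may store in between, and one
               of the two increments is then lost. */
            int seen = atomic_load_explicit(&progress, memory_order_relaxed);
            atomic_store_explicit(&progress, seen + 1, memory_order_relaxed);
        }
    }
    return NULL;
}

/* Runs nthreads threads that each add 1 to the counter iters times and
   returns the final value (nthreads * iters if nothing was lost). */
static int count_with(int nthreads, int iters, int use_atomic_add) {
    pthread_t tids[64];
    worker_args args = { iters, use_atomic_add };
    assert(nthreads > 0 && nthreads <= 64);
    atomic_store(&progress, 0);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tids[t], NULL, worker, &args);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    return atomic_load(&progress);
}
```

The racy path still uses atomic loads and stores, so the program is well-defined C. It only splits the increment into two steps, which is exactly the window in which updates are lost.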

Atomics can be used for accurate accumulation from many work-items, but they can be slow, and you don’t want your progress code to slow down the real work!
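A common way to keep the cost of atomics down is to count privately and publish the result with a single atomic per batch. The sketch below again uses threads as stand-ins for work-items, and `batch_worker`, `batch_args` and `count_batched` are hypothetical names. Each thread does one `atomic_fetch_add` instead of one per item. A kernel-side analogue, offered only as an untested idea, would be to have one work-item per work-group add the group's size with the OpenCL `atomic_add` built-in.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* Shared progress counter (hypothetical stand-in for totalProgress). */
static atomic_int batched_progress;

typedef struct {
    int items; /* items handled by each thread */
} batch_args;

static void *batch_worker(void *p) {
    const batch_args *a = p;
    int local_done = 0; /* private to this thread, so no synchronization */
    for (int i = 0; i < a->items; i++) {
        /* ... the real per-item work would go here ... */
        local_done++;
    }
    /* A single atomic per thread instead of one per item. */
    atomic_fetch_add_explicit(&batched_progress, local_done,
                              memory_order_relaxed);
    return NULL;
}

/* Runs nthreads threads of items_per_thread items each and returns the
   final counter value. */
static int count_batched(int nthreads, int items_per_thread) {
    pthread_t tids[64];
    batch_args args = { items_per_thread };
    assert(nthreads > 0 && nthreads <= 64);
    atomic_store(&batched_progress, 0);
    for (int t = 0; t < nthreads; t++)
        pthread_create(&tids[t], NULL, batch_worker, &args);
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);
    return atomic_load(&batched_progress);
}
```

With 8 threads of 131072 items each, the counter ends at 1048576 (the same total as one launch in the question) after only 8 atomic operations instead of 1048576.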

Why do you need this level of completion measurement? A kernel launch needs to be short (preferably less than a second, and generally much less), so why not just measure how many kernel executions have finished? Your code queues up 64 kernels. Get an event back from each of them, then replace the code that reads the buffer with code that queries the events and counts how many kernels have completed. You only need to check one event at a time; when it completes, move on to the next one. This gives you 64 steps on your progress meter. Don’t forget to release the events when you are done with them, or you will leak memory.
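The event-counting approach can be sketched as follows. To keep it runnable without an OpenCL device, the events are simulated: `sim_event`, `poll_progress` and `run_progress_demo` are hypothetical names, and a kernel "finishes" when a simulated clock reaches its `finish_tick`. The comments show the real OpenCL C calls that would replace the simulated parts. OpenCL.Net wraps this same C API. The sketch also assumes an in-order queue, so launches complete in the order they were queued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_LAUNCHES 64

/* Hypothetical stand-in for cl_event. With the real API the status query is
     cl_int status;
     clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                    sizeof(cl_int), &status, NULL);
   and the kernel is done when status == CL_COMPLETE. */
typedef struct {
    int finish_tick; /* simulated time at which this launch completes */
    bool released;   /* set where real code would call clReleaseEvent */
} sim_event;

static bool sim_event_complete(const sim_event *ev, int now) {
    return now >= ev->finish_tick;
}

/* Checks only the oldest unfinished event; once it has completed, releases
   it and moves on to the next. Returns how many launches have finished. */
static int poll_progress(sim_event *events, int count, int *next, int now) {
    while (*next < count && sim_event_complete(&events[*next], now)) {
        events[*next].released = true; /* clReleaseEvent(events[*next]) */
        (*next)++;
    }
    return *next;
}

/* Simulates queuing every launch up front, then polling until all have
   finished and printing each change. Returns the number of poll rounds. */
static int run_progress_demo(sim_event *events, int count) {
    int next = 0, shown = -1, now = 0;
    for (int i = 0; i < count; i++) {
        events[i].finish_tick = 3 * (i + 1); /* in-order: finish in order */
        events[i].released = false;
    }
    while (next < count) {
        now++; /* real code would sleep briefly or do other work here */
        int done = poll_progress(events, count, &next, now);
        if (done != shown) {
            printf("Progress: %d/%d kernels\n", done, count);
            shown = done;
        }
    }
    return now;
}
```

Instead of polling, the host could also block on each event in turn with `clWaitForEvents(1, &events[i])` and update the display after each call returns. Polling with `clGetEventInfo` keeps the host thread free to do other work, such as UI updates, between checks.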

This makes a lot of sense. Instead, I decided to measure the progress in the host loop that launches the kernels.

The launches should be very quick, though, since they only queue up work. That’s why I suggested events: you could queue all the kernels and then wait on the events to measure progress.

So you’re saying I can enqueue all 64 kernels at once without waiting for the previous ones to complete? Or do I have to call `error = Cl.Finish(cmdQueue);` before enqueueing each kernel in my loop?

It depends on how you created your queue. If it is an in-order queue (the default, i.e. created without `CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE`), each enqueued kernel completes before the next one starts, so you don’t need `Cl.Finish` between launches. This only fails if you have an out-of-order queue. I’m not really sure whether this also holds if you execute a kernel, do a read, do a write and then execute the kernel with the written data; memory transfers might behave a little differently.