Low level behaviour of clEnqueueWriteBuffer() - spec vs impl


Can somebody please clarify for me how clEnqueueWriteBuffer() operates at a hardware/OS level? And how much is in the specification versus implementation specific.

The reason I ask, is that I’m writing a biophysics simulator on my MacOS system and it seems that I only need to clEnqueueWriteBuffer() at the very beginning, then as I read out to that same memory address (in order to perform inter-cell communication) whatever is there (even though it’s modified between timesteps) gets sent back down the pipeline to the GPU on the next call to clEnqueueNDRangeKernel() and I don’t need to call clEnqueueWriteBuffer() again.

This doesn’t seem to me like it’s really according to the spec, it seems like it’s a lucky coincidence of the implementation as the same memory locations are conserved etc. But I’d really like to know the official stance. Ultimately I will be running this on other systems so I will want something that’s pretty OpenCL implementation independent.

Furthermore, this speeds up my code by 100% (it takes half the time). So if this is against the spec, is there an alternative way of specifying my memory/accesses such that I can guarantee this behaviour (such as by using a map buffer perhaps)?

I would definitely appreciate some feedback on this one, especially if there’re still some of the guys who are working on the official spec out there, it’s something I’ve been wondering about for a few months now but nobody I know seems to be able to answer. Feel free to be as technical as you like in your response, it shouldn’t scare me too much.


Are you modifying the values on the CPU side? Then yes according to spec you should be calling enqueueWriteBuffer again. But given the way enqueueWriteBuffer() works there’s no possibility you could get cpu-modified values to the GPU without calling it (ignoring the map/unmap api) I suspect you don’t mean this.

If you’re only modifying the data on the GPU side, then of course the data is persistent, otherwise the api wouldn’t be much use at all! Nor would the gigabytes of high-speed gpu RAM!

BTW this has nothing to do with the low-level behaviour of how memory management might be implemented. e.g. the buffers could be moved from device-to-from-host memory as required, all that the api guarantees is that the memory is available to the kernel when it is executing.

I can’t find explicitly where this is stated in the specification, otoh, ‘memory’ is ‘memory’ is ‘memory’, and it behaves the way you’d expect from any other allocation - it hangs around with it’s last written content until freed.

If you were using the map/unmap then potentially you could modify CPU side and have it appear GPU side without the correct api calls: and this would be out of spec and implementation dependent and would depend on the low-level memory implementation. But there’s no way to do this with the writebuffer api as you never have direct access to the buffers writebuffer writes to.

Thanks Notzed. You’ve given me a reply which corresponds with my own understanding of the spec. The catch is that’s not how it’s working on my computer ! :slight_smile:

I am, as you asked, modifying the data both on the GPU and, in between calls to the kernel, on the CPU. It shouldn’t work, but it is. I guess it’s the Apple implementation of OpenCL that I’m using that’s doing this (I’ve found other non-standard things in their implementation in the past). But thanks for confirming my understanding of the spec, I think I’ll re-enable the call to enqueueWriteBuffer() [I only disabled it as a test] as it makes me very uncomfortable to have such undefined behaviour going on.

If anyone else has any input on why cl_enqueueWriteBuffer() is operating more like a memory map on my mac I’d be interested to hear it!

What spec and driver version do you have?

Are you specifying USE_HOST_POINTER?

It’s OpenCL 1.0 (unfortunately, that’s the one that comes with the OS), the Apple version of the OpenCL framework is 12.3.6 dated 18/05/2011.

The device is an AMD Radeon HD 6750M with driver version 01.00.573 (probably Apple version numbering).

No notzed, I am not using CL_MEM_USE_HOST_PTR (good question though). The only flag I’m using in creating the buffer is CL_MEM_READ_WRITE.

I guess I’m coming around to the view that this is a ‘feature’ of the implementation, but not one that I should rely on. I think my main worry when I found it was that I had been misreading the specification; I’m self-taught in OpenCL so it would be easy to miss something.


Yeah that is truly strange.

Just for the sake of the public record, I’ve been playing with this issue a little more and it seems to be a transitory effect. By that I mean it doesn’t always happen. It’s reproducable, but not always. To be honest that’s even stranger. I suspect it’s something to do with memory being assigned for other purposes occasionally. But anyway, my main point is that if someone comes across this conversation they should definitely continue to re-enqueue the write buffers on each step and not be tempted to speed up the simulation by leaving it out.

Thanks for the help guys.