memory buffer question


I am learning OpenCL and I noticed that most applications I have seen so far create the memory buffer with clCreateBuffer and then use clEnqueueWriteBuffer to get the data in. Now I was wondering: why do they not pass the data straight to clCreateBuffer, but do it in two steps? Is there a performance benefit or something?

Thanks in advance,

kind regards,


Now I was wondering: why do they not pass the data straight to clCreateBuffer, but do it in two steps? Is there a performance benefit or something?

No, there’s no performance benefit in doing this in two steps. You can call clCreateBuffer() with the CL_MEM_COPY_HOST_PTR flag and do both creation and copy in one step.
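For illustration, a minimal sketch of the one-step approach (the context, error handling and host data are assumed to be set up elsewhere):

```c
#include <CL/cl.h>

/* One-step creation + copy: with CL_MEM_COPY_HOST_PTR the host data is
 * copied into the buffer at creation time, so no separate
 * clEnqueueWriteBuffer call is needed. */
cl_mem create_filled_buffer(cl_context ctx, const float *host_data,
                            size_t n, cl_int *err)
{
    return clCreateBuffer(ctx,
                          CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          n * sizeof(float),
                          (void *)host_data, /* copied immediately */
                          err);
}
```

After this call returns, the host array is free to be reused or modified; OpenCL keeps no reference to it.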

Thanks for your reply, that was indeed what I thought but was not sure …

Another question though: when I have the following setup:

kernel1 + command queue1 + buffer1 + buffer2 + event1
kernel2 + command queue1 + buffer2 + buffer3 + wait for event1

By using event1 to synchronise the two kernels, where the second uses the output of the first, we have memory consistency within the device. Is this correct?

That should work. But if you’re using a default in-order queue, then it is also unnecessary.

Event synchronisation becomes necessary when using multiple queues, multiple devices, or out-of-order queues.
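As a sketch of the multi-queue case, the setup above could look like this (kernel1, kernel2, the two queues and the global work size are assumed to be created elsewhere):

```c
#include <CL/cl.h>

/* Sketch: make kernel2 (on queue2) wait for kernel1 (on queue1) via an
 * event - this is the case where event synchronisation is actually
 * required. With a single in-order queue the wait list could be empty. */
void enqueue_dependent_kernels(cl_command_queue queue1,
                               cl_command_queue queue2,
                               cl_kernel kernel1, cl_kernel kernel2,
                               size_t global_size)
{
    cl_event event1;

    /* kernel1 writes buffer2; its completion is signalled by event1. */
    clEnqueueNDRangeKernel(queue1, kernel1, 1, NULL,
                           &global_size, NULL, 0, NULL, &event1);

    /* kernel2 reads buffer2; it will not start until event1 completes. */
    clEnqueueNDRangeKernel(queue2, kernel2, 1, NULL,
                           &global_size, NULL, 1, &event1, NULL);

    clReleaseEvent(event1);
}
```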

I have some questions related to this when 2 devices are used:

1- If he had multiple devices, and kernel1 was run on device1 and kernel2 was to be run on device2: if he uses CL_MEM_COPY_HOST_PTR, does buffer1 get copied to device2 too?

2- The code is waiting for event1 which is tied to kernel1 execution. When kernel1 execution finishes, the resulting buffer2 is guaranteed to be available to device2 right away?

3- After the start of kernel1, but before the finishing of it, a host thread changes contents of the host memory where buffer3 is located. Do we have to enqueue a write before running kernel2?

4- If multiple devices are used with multiple queues, when reading buffer3 back to host memory, does it matter which queue is used? (we wait for kernel2 events to finish).


All but 4 are simple reads of the specification (and the answers are yes, assuming you stick to the rules and do it properly).

For 3, if buffer3 is used by kernel1 then you’re breaking the rules.

For 4, try the archives. From memory it shouldn’t matter for the validity of the results (assuming all the synchronisation is correct), but it may affect performance. Intuitively, the last device to write the data will be the best one to read it from, and I would expect that to be at least a reasonable way to approach it.

It is not very clear to me, or I couldn’t find where this is clearly explained (don’t say the appendix, because I have been there :D)


No, it is only used by kernel2. But the question boils down to exactly when the buffer is copied to the device: after creating the buffer, it might or might not be copied to the device right away, right? So the safest bet would be to use clEnqueueWriteBuffer if the data was changed after creation of the buffer, even though that buffer has not been used yet?

Yes, that makes some sense. But then, for number 2 you said yes, which would imply that OpenCL does not finish kernel execution until the shared buffers are 100% synchronised between devices (or does not start a new operation?). If that is the case, then there would be no performance advantage or penalty to using any device for reading the resulting data. Would you agree?

It is not very clear to me, or I couldn’t find where this is clearly explained (don’t say the appendix, because I have been there :D)
Well, I will tell you to go to the appendix - in fact the very first page of the appendix (A.1) answers most of your questions. Have you really read it?

Also from the spec:

  1. section 5.2 for buffer creation flags, conventions.

  2. section 5.11, second paragraph about ordering of kernels and data movement.

  3. doesn’t actually have enough info to tell me what you’re doing: if you’ve used COPY_HOST_PTR, then whatever ‘buffer’ you’re referring to is explicitly un-referenced by OpenCL - again, section 5.2.1 - so a write is obviously required.

The only way OpenCL can possibly use a buffer you have allocated is when you have USE_HOST_PTR set, and then the specification clearly states the driver might cache it elsewhere, so a write is required there too. If you search for USE_HOST_PTR throughout the spec, it lists all the various conditions.


As above: unless you’re using USE_HOST_PTR, your buffer is nothing to OpenCL after the buffer creation call - and given the API has no ‘wait’ event, it implies COPY_HOST_PTR is synchronous (i.e. immediate).

And if you are using USE_HOST_PTR, then you have to write anyway.
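Either way, once the host has modified its copy of the data after buffer creation, an explicit write is needed to make the change visible to the device. A minimal sketch (the queue, buffer and host array are assumed to exist already):

```c
#include <CL/cl.h>

/* Sketch: push updated host data to an existing device buffer. This
 * applies whether the buffer was created with COPY_HOST_PTR (OpenCL
 * kept no reference to the host array) or USE_HOST_PTR (the driver may
 * have cached the contents elsewhere). */
void update_device_copy(cl_command_queue queue, cl_mem buf,
                        const float *host_data, size_t n)
{
    /* Blocking write: returns once host_data is safe to modify again. */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0,
                         n * sizeof(float), host_data,
                         0, NULL, NULL);
}
```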

It’s not a ‘safest bet’, it’s simply the defined API contract …

I said nothing about kernels finishing execution; your wording is wrong, but I didn’t feel the need to correct it. It simply doesn’t start the operation until it has the data where the operation is going to run, and it doesn’t copy the data until the kernel that writes to it has finished with it - how else could it possibly do it? Bend time?

I don’t work on drivers: I have no idea if there is a performance penalty, but that’s not to say there isn’t one. The API only guarantees the order of execution, and implying anything further is just guessing.