Confused about the memory buffer usage

I’m wondering how to use the memory buffer in the most efficient and correct way. I’ve been reading the specification and searching old threads on this forum, but I’m still confused.
So here are some of my questions:

First, about the flags in clCreateBuffer

According to the spec, although the memory object is created in host memory, the data in it will be cached in device memory during kernel execution.
Q1. When does this caching happen? (When calling clEnqueueNDRangeKernel?)

Q2. What is the definition of “host accessible memory” as mentioned in the spec? Can I simply take it to mean host memory?

Q3. When using this flag alone, where will the memory object be allocated? I suppose it will be in device memory, right?
Q4. When using it together with CL_MEM_ALLOC_HOST_PTR, will the memory object be allocated in host memory?

Second, about clEnqueueWriteBuffer

If what I said is right so far, then the usages mentioned in Q1, Q2 and Q4 should be followed by clEnqueueWriteBuffer to ensure the data is passed to the device, while the usage in Q3 won’t need that.
Q5. Is this correct?

Third, about clSetKernelArg

The spec says this command makes a copy of the argument you pass to the kernel.
Q6. Does that mean your host memory will be copied and transferred to the device as the argument, making clEnqueueWriteBuffer unnecessary even if the memory object is allocated in host memory? Or does it mean the copy is made only so the argument can be reused immediately?

In fact, what I’ve been doing successfully is like this:
BTW, I’m using Intel’s integrated GPU as the device.

I really need to make it clear. Correct me if I’m wrong, please.
Thanks for your help in advance.

I was about to post the very same bunch of questions… I’m a newbie, and it’s a bit hard for me to understand what exactly is meant by “mapped/mappable memory” and the differences between the various CL_MEM_* flags.
My 2 cents on your Q1: when running on an Nvidia GT 530, I see a speedup in execution of my code when using CL_MEM_COPY_HOST_PTR (and the related enqueueWriteBuffer/enqueueReadBuffer) instead of CL_MEM_USE_HOST_PTR. This makes me think that caching in device memory may not happen when using CL_MEM_USE_HOST_PTR, so that the kernel accesses host memory via DMA or some other mechanism, slowing down the whole execution.

You have a good understanding of the way memory is handled.

Q1. It is up to the driver to decide. But it will probably do the mem copy when executing the kernel.
Note also that if your kernel writes to a CL_MEM_USE_HOST_PTR buffer, you will have to call either clEnqueueReadBuffer or clEnqueueMapBuffer to update the host memory from the device cache.
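For instance (a pseudocode-style sketch, not a complete program: queue, buf, host_ptr and size are assumed to exist from earlier setup, and error checking is omitted), the two ways to get the results back to the host look like:

```
cl_int err;

/* Option 1: explicitly copy the device-side cache back into host memory. */
clEnqueueReadBuffer(queue, buf, CL_TRUE /* blocking */,
                    0, size, host_ptr, 0, NULL, NULL);

/* Option 2: map the buffer; for CL_MEM_USE_HOST_PTR the runtime
 * guarantees the mapped region reflects the latest buffer contents. */
void *mapped = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_READ,
                                  0, size, 0, NULL, NULL, &err);
/* ... read results through 'mapped' ... */
clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);
```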

Q2. From the user viewpoint, it is simply memory allocated in host memory. Technically, it is allocated in such a way that accesses from the device are accelerated (pinned, pre-pinned, or shared memory pages, depending on the architecture of the device).

Q3 and Q4. Yes.

Q5. CL_MEM_COPY_HOST_PTR will copy memory from the pointer you provide. So you’ll have to use clEnqueueWriteBuffer or clEnqueueMapBuffer for the cases in Q1 and Q2.
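In host-API terms (a pseudocode-style sketch, not a complete program: ctx, queue, host_data and size are assumed from earlier setup, and error checking is omitted), the difference looks like:

```
cl_int err;

/* (a) CL_MEM_COPY_HOST_PTR: the contents are copied at creation time,
 *     so no initial clEnqueueWriteBuffer is needed. */
cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          size, host_data, &err);

/* (b) No *_HOST_PTR flag: the buffer starts uninitialized and must
 *     be filled explicitly before the kernel runs. */
cl_mem b = clCreateBuffer(ctx, CL_MEM_READ_ONLY, size, NULL, &err);
clEnqueueWriteBuffer(queue, b, CL_TRUE, 0, size, host_data,
                     0, NULL, NULL);
```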

Q6. In your case, the argument is the buffer handle, so only the handle will be copied, not the buffer content. As you point out, this means that you can pass a structure as an argument, then modify this structure afterwards without danger.
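To illustrate the by-value semantics with a plain-C analogy (this is not OpenCL code: set_arg below is a hypothetical stand-in for clSetKernelArg, which likewise stores arg_size bytes of the value you pass):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for clSetKernelArg: it stores a
 * byte-for-byte copy of the argument value at call time. */
static unsigned char stored_arg[64];

static void set_arg(const void *arg_value, size_t arg_size) {
    memcpy(stored_arg, arg_value, arg_size);
}

typedef struct { int x; float y; } params_t;

/* Returns 1 if the stored copy is unaffected by changes made
 * to the caller's variable after the set_arg call. */
int copy_is_independent(void) {
    params_t p = { 42, 1.5f };
    set_arg(&p, sizeof p);  /* value captured here */
    p.x = -1;               /* safe: the copy was already made */

    params_t seen;
    memcpy(&seen, stored_arg, sizeof seen);
    return seen.x == 42;
}
```

The same reasoning explains why passing a cl_mem copies only the handle: the handle value itself is the argument, so clSetKernelArg never touches the buffer contents.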

Looking at the documentation of clCreateBuffer I found for CL_MEM_USE_HOST_PTR:

OpenCL implementations ARE ALLOWED to cache the buffer contents pointed to by host_ptr in device memory. This cached copy can be used when kernels are executed on a device.

Thus it seems that it is left to the implementation to decide whether to cache the data in device memory or not. Based on my experiment with CL_MEM_COPY_HOST_PTR, I’d say that at least on my setup this is not done, since with COPY_HOST_PTR the execution is faster. Now I wonder why one would use USE_HOST_PTR (or any other flag forcing the usage of mapped host-accessible memory) on a GPU, since retrieving data from host-accessible memory during kernel execution seems to result in a performance penalty…

It depends on the latitude you have with pointers. If you write a library, you may have to accept pointers to host memory as arguments. Then CL_MEM_USE_HOST_PTR could be the only choice you have.

Or you may be able to allocate memory yourself but have to return a pointer to host memory to the library’s user. Then CL_MEM_ALLOC_HOST_PTR would be a good choice.

Also, I want to point out that on some systems, especially those with integrated GPUs, CL_MEM_USE_HOST_PTR can be the faster alternative. In many cases, the implementation can avoid ever copying such buffers. The GPU will simply operate out of the same memory the host application allocated. And in an integrated GPU design, this can be done very efficiently.

For these kinds of platform-to-platform optimizations, the different vendors’ optimization guides usually have advice on this topic.

Thanks for the answers. I got the point about integrated GPUs. Still, I don’t get utnapishtim’s comment: if I write a library and receive a pointer to host memory as an argument, I can still copy it into some memory area managed by the library itself and send it to the GPU using COPY_HOST_PTR and a clEnqueueWriteBuffer call. Then I get the computation results back with clEnqueueReadBuffer and put them in an appropriate host memory location whose pointer I return. According to my simple experiments, the overhead of these extra memory manipulations is less than the overhead of frequently accessing host memory during kernel execution (at least with my discrete Nvidia GPU on Linux). Even though I’m aware that these results may vary, I have the feeling that frequently accessing host memory from kernels is a major bottleneck. If the compute time is dominated by number crunching instead, then the various memory access patterns should not influence the result so much.
Am I missing something? Sorry for the dumb questions, but it’s a bit hard for a newbie to disentangle the somewhat contradictory information on this topic that a Google search turns up…

Thank you, utnapishtim
Good to confirm all that. I think I have a better understanding about the usage of memory buffer now.

As kunze pointed out, with an integrated GPU design, which is my case, the vendor’s optimization guide may advise you to use CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR.

Note that to take advantage of “zero-copy sharing” with CL_MEM_USE_HOST_PTR on integrated GPU devices, you generally have to use pointers aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.
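That alignment is queried via clGetDeviceInfo with CL_DEVICE_MEM_BASE_ADDR_ALIGN and is reported in bits. A minimal host-side sketch (plain C with no OpenCL calls; 1024 bits is an assumed stand-in for the queried value, and alloc_for_zero_copy is a made-up helper name):

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* CL_DEVICE_MEM_BASE_ADDR_ALIGN is reported in bits; real code
 * should query it with clGetDeviceInfo for the actual device.
 * 1024 bits (128 bytes) is assumed here for illustration. */
#define ASSUMED_ALIGN_BITS 1024

/* Allocate 'size' bytes aligned suitably for a zero-copy
 * CL_MEM_USE_HOST_PTR buffer. Returns NULL on failure;
 * release with free(). */
void *alloc_for_zero_copy(size_t size) {
    void *p = NULL;
    if (posix_memalign(&p, ASSUMED_ALIGN_BITS / 8, size) != 0)
        return NULL;
    return p;
}

/* Returns 1 if 'p' satisfies the assumed alignment. */
int is_aligned(const void *p) {
    return ((uintptr_t)p % (ASSUMED_ALIGN_BITS / 8)) == 0;
}
```

The pointer returned by alloc_for_zero_copy would then be passed as host_ptr to clCreateBuffer with CL_MEM_USE_HOST_PTR.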

utnapishtim: “Note that to take advantage of “zero-copy sharing” with CL_MEM_USE_HOST_PTR on integrated GPU devices, you generally have to use pointers aligned to CL_DEVICE_MEM_BASE_ADDR_ALIGN.”

Yes, I forgot to mention that. Thanks.