Pinned Memory Again

oneofthose · March 26, 2014, 3:54pm

Dear all,

I’d like to clarify the pinned memory issue for myself, once and for all.
The specification is both vague and overly complicated, so I have
a number of questions that I’d like to get out of the way.

The background of the question is: I’d like to create CUDA pinned
memory semantics in OpenCL.

Pinned memory is host memory allocated in a special way (page-locked),
with properties that can make transfers between host and device
faster than usual, in either direction.

In CUDA the API is really simple. We can do (pseudo-code):

ph = pinned_alloc_host(200);
d = alloc_device(200);
copy(d, ph, 200);

In OpenCL this does not exist (unfortunately). However, there is
something that might give similar behavior in terms of performance.
From here on, everything becomes unclear and I’d like you to
correct me or reaffirm my conclusions:

[ul]
[li]in OpenCL we can allocate a buffer on a device that has a
corresponding block of memory on the host (these are the
CL_MEM_ALLOC_HOST_PTR and CL_MEM_USE_HOST_PTR flags)[/li]
[li]CL_MEM_ALLOC_HOST_PTR will allocate this corresponding
block of memory on the host[/li]
[li]CL_MEM_USE_HOST_PTR will use an existing block of
memory on the host[/li]
[li]we get access to this block of memory on the host by
calling clEnqueueMapBuffer()[/li]
[li]under certain circumstances, this block of memory will
behave similarly to CUDA pinned memory in terms of
performance[/li]
[li]to achieve this kind of performance, the call
clEnqueueUnmapMemObject() must be used. It takes as
arguments the pointer returned by clEnqueueMapBuffer(), which
represents the block of memory on the host, and the original
cl_mem object that represents the buffer on the device,
created in the original clCreateBuffer() call[/li]
[/ul]

Is this correct, so far? Here are a few more questions:

[ul]
[li]can unrelated host and device memory blocks be transferred,
that were not created from the matching clCreateBuffer() and
clEnqueueMapBuffer() calls?[/li]
[li]how does clEnqueueReadBuffer() come into play here? can the
pointer obtained from clEnqueueMapBuffer() be used in
clEnqueueReadBuffer() or clEnqueueWriteBuffer()?[/li]
[/ul]

Thanks for reading
Sebastian

Dithermaster · March 26, 2014, 6:41pm

You can use any host memory with clEnqueueReadBuffer/clEnqueueWriteBuffer. On NVIDIA hardware, the operations will go faster if the source or destination memory was allocated as pinned memory (using clCreateBuffer with CL_MEM_ALLOC_HOST_PTR). NVIDIA also says that this is the only way the operation can participate in overlapped copy and compute (which additionally requires multiple command queues). Check the NVIDIA overlap copy/compute example, which shows how to allocate pinned memory. The NVIDIA OpenCL programming guide also discusses how to do it.

With AMD and Intel, there is no read/write buffer advantage using pinned memory as your source/destination that I know of. For AMD discrete GPUs, the fastest DMA is achieved using clEnqueueMapBuffer. For AMD APU and Intel HD Graphics, you can get zero-copy (instant) mapping of device buffers if you use clEnqueueMapBuffer (and use the right allocation flags; check the respective vendor programming guides).

Finally, both NVIDIA and AMD discrete GPUs have ways of accessing host memory from a kernel, which effectively combines the copy with the compute (the kernel runs slower but there is no copy operation).

oneofthose · March 27, 2014, 3:24am

Dithermaster, thanks a lot for answering! Your advice was great.
From the oclCopyComputeOverlap sample I get the following:

devBuf = clCreateBuffer();
hostBufPinned = clCreateBuffer(CL_MEM_READ_WRITE |
CL_MEM_ALLOC_HOST_PTR);
ptrHostBuf = clEnqueueMapBuffer(hostBufPinned);
// use ptrHostBuf like regular pointer
clEnqueueWriteBuffer(devBuf, ptrHostBuf);

It is nice that this works but I wonder if this was intended by the
OpenCL spec. clEnqueueUnmapMemObject is pointless in this
example.

The entire mapping business makes a lot more sense with APU and
Intel HD Graphics (due to zero-copy). For discrete cards, it is still
unclear to me where memory is allocated and when it is
transferred. And I suspect it differs between implementers.

Do you know of similar code examples from Intel and AMD? Code
seems to be the only thing we can rely on since the specification
is so vague. I think this is a big disadvantage. The specification
should be clear enough to not allow major differences in functionality
across implementers.

Sebastian

Dithermaster · March 27, 2014, 8:37am

The pinned memory read/write thing is unique to NVIDIA. Check on the Intel and AMD sites for their best practices / programming guidelines. A good order of operations is to understand OpenCL based on the spec and books, get something working, and then look to the vendor guidelines for optimization techniques.