Pinned memory with offset in OpenCL

Hey, so I’m reading data from a file using mmap() as follows:

unsigned char* mapped;
mapped = mmap(0,size,PROT_READ,MAP_PRIVATE,input,0);

Then I create a host buffer and a device buffer for pinned memory:

cl_mem pinned_buffer_input = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, size, mapped, NULL);
cl_mem buffer_input = clCreateBuffer(context, CL_MEM_READ_ONLY, input_size, NULL, NULL);

Within a for loop I am:

- mapping the buffer:

void *pinnedMemory = clEnqueueMapBuffer(cmd_queue, pinned_buffer_input, CL_TRUE, CL_MAP_WRITE, header[3]+b*input_size, input_size_cur, 0, NULL, &ev, NULL);

- enqueuing the buffer:

clEnqueueWriteBuffer(cmd_queue, buffer_input, CL_FALSE, 0, input_size_cur, pinnedMemory, 0, NULL, &ev);

- unmapping the object:

clEnqueueUnmapMemObject(cmd_queue, pinned_buffer_input, pinnedMemory, 0, NULL, &ev);


Here mapped contains the whole file and is size bytes long. What I want is to send the data in blocks of input_size bytes (or input_size_cur, which is the same thing for simplicity). So the offset is header[3]+b*input_size, where b is incremented in the loop, but the wrong data gets copied.
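To make the offset arithmetic concrete, here is a minimal sketch of how the byte range of block b could be computed, including a shorter final block. The names block_range_at, header_bytes (standing in for header[3]) and block_size (standing in for input_size) are illustrative assumptions, not part of the original program.

```c
#include <stddef.h>

/* Byte range of one block inside the mapped file. */
struct block_range {
    size_t offset; /* bytes from the start of the file */
    size_t length; /* bytes in this block; 0 if the index is past the end */
};

/* header_bytes plays the role of header[3] and block_size the role of
   input_size; both names are hypothetical. */
struct block_range block_range_at(size_t file_size, size_t header_bytes,
                                  size_t block_size, size_t b)
{
    struct block_range r = {0, 0};

    if (block_size == 0 || header_bytes >= file_size)
        return r;

    size_t payload = file_size - header_bytes;

    /* Checking b first keeps b * block_size from overflowing. */
    if (b > payload / block_size)
        return r;

    size_t start = b * block_size;
    if (start >= payload)
        return r;

    size_t remaining = payload - start;
    r.offset = header_bytes + start;
    r.length = remaining < block_size ? remaining : block_size;
    return r;
}
```

The resulting pair is what would go into the offset and size arguments of clEnqueueMapBuffer(). A map whose offset plus size runs past the end of the buffer is rejected with CL_INVALID_VALUE, so the last block needs the shorter length.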

If I don’t initialize pinned_buffer_input with mapped, I can get a pointer to the host buffer with clEnqueueMapBuffer() and copy the data from mapped into it:

memcpy(pinnedMemory, mapped+header[3]+b*input_size, input_size_cur);

That works, but I want to avoid the memcpy because it sits inside the for loop and causes huge delays in my program. To solve this I wanted to use the offset parameter of clEnqueueMapBuffer(), but that gives the wrong data.

With CL_MEM_COPY_HOST_PTR instead of CL_MEM_ALLOC_HOST_PTR the result is correct but it takes ages to create pinned_buffer_input.


A few comments here:

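1: The “mapped” parameter to your first clCreateBuffer is probably being ignored, or the call is failing outright. The host_ptr parameter is only used if the flags contain CL_MEM_USE_HOST_PTR or CL_MEM_COPY_HOST_PTR; per the OpenCL spec, passing a non-NULL host_ptr without one of those flags is an error (CL_INVALID_HOST_PTR), which you would not notice because you pass NULL for errcode_ret. Either way the buffer never contains the file data, which might explain why you get incorrect data in some cases.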
2: I don’t see a lot of people using an explicit “host buffer” and “device buffer” like you’ve set up here. I would just have one buffer (for the device), and then manage how you share between the host CPU and the device. Which leads me to my next point.
3: The most efficient way to do what you are trying to do often depends on the architecture of the device you are using. If you have a discrete GPU, the right advice might be to explicitly copy to the device in your loop using something like clEnqueueWriteBuffer. If you have an integrated GPU, the right answer might be to avoid copies and just use the CL_MEM_USE_HOST_PTR flag when you create the buffer. You’ll have to think a bit about how host and device accesses are synchronized in this case.
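As a rough sketch of the "one device buffer, explicit copy in the loop" idea, the loop below walks the mapped file block by block and hands each block to a callback as a pointer straight into the mapping, so there is no intermediate memcpy. The names stream_blocks, block_sink, sum_sink and header_bytes are hypothetical; in the real program the sink would presumably wrap a clEnqueueWriteBuffer call like the one shown in the comment. The sketch assumes the mapping stays valid until each write has completed, which a blocking write guarantees.

```c
#include <stddef.h>

/* Receives a pointer straight into the mapping; returns 0 on success. */
typedef int (*block_sink)(const unsigned char *src, size_t len, void *user);

/* Walk the payload that follows the header in fixed-size blocks and pass
   each block to `sink` without copying it. Returns the number of blocks
   delivered, or -1 on bad arguments or if the sink reports a failure. */
long stream_blocks(const unsigned char *mapped, size_t file_size,
                   size_t header_bytes, size_t block_size,
                   block_sink sink, void *user)
{
    if (mapped == NULL || sink == NULL || block_size == 0 ||
        header_bytes > file_size)
        return -1;

    const unsigned char *src = mapped + header_bytes;
    size_t remaining = file_size - header_bytes;
    long blocks = 0;

    while (remaining > 0) {
        size_t len = remaining < block_size ? remaining : block_size;

        /* In the OpenCL program the sink could be a blocking write:
           clEnqueueWriteBuffer(cmd_queue, buffer_input, CL_TRUE, 0,
                                len, src, 0, NULL, NULL); */
        if (sink(src, len, user) != 0)
            return -1;

        src += len;
        remaining -= len;
        blocks++;
    }
    return blocks;
}

/* Stand-in sink for trying the loop without a device: it only adds up
   the bytes it is given. */
struct byte_sum {
    unsigned long total;
    size_t bytes;
};

int sum_sink(const unsigned char *src, size_t len, void *user)
{
    struct byte_sum *acc = user;
    for (size_t i = 0; i < len; i++)
        acc->total += src[i];
    acc->bytes += len;
    return 0;
}
```

Whether this is faster than staging through a pinned buffer depends on the device and driver, so it is worth timing both on the actual hardware.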

Hope this helps.