Hi Daniel,
This thread is quite long and I’ve tried to read the back & forth between you and David. Let me try to answer the question you raised in the original post which is - why does a command-queue need to be associated with a device and why do memory objects have no device affinity?
A command-queue is associated with a device because this is the mechanism that is used to dispatch commands to be executed on a device. This way, an application has complete control over what work is done on what device. For CPU OpenCL device, the command-queue typically maps to a pthread work-queue or a dispatch queue in GCD on Mac OS X. For GPUs there is a command-processor in front which processes commands from the command-queue and queues them to appropriate blocks. As to the reason why command-queues support both in-order & out-of-order queues this is because most GPUs today only support in-order queues. In addition, an in-order queue is very easy to understand and use for most developers. However, it is certainly possible that a device maybe able to process reading, writing, copying memory objects in parallel with executing kernels. This is certainly possible if the device has one or more DMA engines to do the read/write/copy operations. This is why the spec allows read, write & map operations to be blocking or non-blocking irrespective of the command-queue order.
Why are out or order queues supported (optionally)?
-
Provides more control in maximizing performance to developer by allowing the developer to specify the dependencies which control when a command can be executed.
-
Really needed if you are enqueuing data-parallel (clEnqueueNDRangeKernel) and task-parallel (clEnqueueTask) kernels.
NOTE: As far as your comment on being surprised that clEnqueueNDRangeKernel on some implementations is blocking, I would suggest that you file a bug and work with the vendor in question to resolve this issue. It is certainly the intent of the spec and I know more than one implementation where this is not the case.
Now onto the question of why memory objects do not have a device affinity associated with them. This was a very long discussion in the working group and I was one of the main proponents of not specifying a device affinity. As defined in CL, memory objects are associated with the context and therefore can be used by any device associated with the context. By doing this, an application does not have to worry about managing memory objects across devices. The device memory can be viewed as just a cache where only memory objects needed by a command(s) executing on a device need to be allocated. This way, the actual amount of physical memory available on the device does not limit how many memory object you can create - it only limits the amount of memory needed by memory objects used by a command. For example, the host memory can be used as a backing store for memory objects instead. In addition, managing memory objects across devices becomes simpler in the sense that all the application has to do is ensure correct event dependencies between commands that are using the same memory object and have been enqueued to queue A on device A and queue B on device B and the OpenCL implementation figures out how best to transfer data from the device which has the latest (dirty) copy to the device that needs the latest copy. For example, some devices may be able to do a direct PCIe to PCIe transfer which a user may not be able to use. The user now does not have to worry about making appropriate copies depending on which devices it plans to use the memory object. We move this responsibility to the OpenCL implementation. Since the device caches the data for the memory object, enqueuing a command to a device allows you (the app) to determine which device will have the latest bits.
In the example you give in the first post
cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, 0, 0 );
clEnqueueWriteBuffer( queue, buf, true, 0, size, ptr, 0, 0, 0);
you ask the question which device should the command queue be associated with. In this case, it should be the device where kernels that use this memory object are going to be enqueued for execution. As far as copying data to the host, I recommend that you use clEnqueueMap to map the region you want to read or write instead of copying.
I will stop here as the response is already too long and hopefully I was able to answer a few of your questions.