Device affinity for command queues and buffers seems at odds

Ummm… no, they form queues and this strange out-of-order concept is employed to allow independent commands to run concurrently. I’m suggesting:

  1. That you call it “a DAG comprised of commands that have dependencies” rather than “queues comprised of commands and events”. The specification and API should reflect this.
  2. That the DAG be bound to the context rather than bound to a specific device. I would go as far as to say that the DAG is a singlton for the context (after all, since the DAG can have multiple roots, what would multipe DAGs give you over a single one?)
  3. That queues of various sorts (FIFO, Priority, etc) can be implemented in terms of the DAG as a separate layer and do not form part of the OpenCL spec.

On motivation is to allow some level of extensibility in scheduling commands for execution. Currently there are two protocols (in-order and out-of-order) munged into the one API. Taking a more layered approach allows the API to remain clean and invariant as new scheduing policies are devised by the OpenCL community - the community would own the upper layers and they need not be standardised.

Perhaps to give an example, how could I possibly implement a priority-queue (a common concept in work queuing APIs) for OpenCL commands with the current OpenCL spec?

Cheers,

Dan

The Academic Contributor Members is “for accredited academic institutions only”, which I am not. The next level up is a Contributor Membership which would set me back USD$10,000 annually. This is beyond the reach of most small businesses, including mine. So, I guess the best I can do is make noise here and hope to influence someone.

I wonder how many would-be-contributors are excluded merely because of the high membership fees…

Ummm… no, they form queues and this strange out-of-order concept is employed to allow independent commands to run concurrently.

Remember that most commands need to be executed in a particular device, hence the convenience of having queues tied to particular devices.

In-order queues are great while you are prototyping your application because they are very easy to use. Once you have the algorithm running and producing the desired results, then you can enable out-of-order execution, set up the command dependencies accordingly and get some performance improvements. There’s nothing strange in OoO queues IMO.

how could I possibly implement a priority-queue (a common concept in work queuing APIs) for OpenCL commands with the current OpenCL spec?

Supporting priorities on GPUs is much more involved than you may think, and this has nothing to do with the architecture of OpenCL.

I wonder how many would-be-contributors are excluded merely because of the high membership fees

Many. On the other hand, too many cooks spoil the broth. It’s difficult to find a balance.

I haven’t suggested that the idea of in-order queues be dropped at all, nor that the API does not let you specify which device to execute a command on. I do think, however, that “convenience” should be built into higher layers.

I find it strange to define a DAG in terms of a queue, whilst I find it natural to define a queue in terms of a DAG. OpenCL is doing the former while I suggest moving to the latter.

I’m not sure that best way of striking that balance is to just exclude those without deep pockets. I don’t mind the W3C model where concessions on fees are made base on the annual turnover of the member.

Cheers,

Dan

Hi Daniel,

This thread is quite long and I’ve tried to read the back & forth between you and David. Let me try to answer the question you raised in the original post which is - why does a command-queue need to be associated with a device and why do memory objects have no device affinity?

A command-queue is associated with a device because this is the mechanism that is used to dispatch commands to be executed on a device. This way, an application has complete control over what work is done on what device. For CPU OpenCL device, the command-queue typically maps to a pthread work-queue or a dispatch queue in GCD on Mac OS X. For GPUs there is a command-processor in front which processes commands from the command-queue and queues them to appropriate blocks. As to the reason why command-queues support both in-order & out-of-order queues this is because most GPUs today only support in-order queues. In addition, an in-order queue is very easy to understand and use for most developers. However, it is certainly possible that a device maybe able to process reading, writing, copying memory objects in parallel with executing kernels. This is certainly possible if the device has one or more DMA engines to do the read/write/copy operations. This is why the spec allows read, write & map operations to be blocking or non-blocking irrespective of the command-queue order.

Why are out or order queues supported (optionally)?

  1. Provides more control in maximizing performance to developer by allowing the developer to specify the dependencies which control when a command can be executed.

  2. Really needed if you are enqueuing data-parallel (clEnqueueNDRangeKernel) and task-parallel (clEnqueueTask) kernels.

NOTE: As far as your comment on being surprised that clEnqueueNDRangeKernel on some implementations is blocking, I would suggest that you file a bug and work with the vendor in question to resolve this issue. It is certainly the intent of the spec and I know more than one implementation where this is not the case.

Now onto the question of why memory objects do not have a device affinity associated with them. This was a very long discussion in the working group and I was one of the main proponents of not specifying a device affinity. As defined in CL, memory objects are associated with the context and therefore can be used by any device associated with the context. By doing this, an application does not have to worry about managing memory objects across devices. The device memory can be viewed as just a cache where only memory objects needed by a command(s) executing on a device need to be allocated. This way, the actual amount of physical memory available on the device does not limit how many memory object you can create - it only limits the amount of memory needed by memory objects used by a command. For example, the host memory can be used as a backing store for memory objects instead. In addition, managing memory objects across devices becomes simpler in the sense that all the application has to do is ensure correct event dependencies between commands that are using the same memory object and have been enqueued to queue A on device A and queue B on device B and the OpenCL implementation figures out how best to transfer data from the device which has the latest (dirty) copy to the device that needs the latest copy. For example, some devices may be able to do a direct PCIe to PCIe transfer which a user may not be able to use. The user now does not have to worry about making appropriate copies depending on which devices it plans to use the memory object. We move this responsibility to the OpenCL implementation. Since the device caches the data for the memory object, enqueuing a command to a device allows you (the app) to determine which device will have the latest bits.

In the example you give in the first post

cl_mem buf = clCreateBuffer( context, CL_MEM_READ_ONLY, size, 0, 0 );
clEnqueueWriteBuffer( queue, buf, true, 0, size, ptr, 0, 0, 0);

you ask the question which device should the command queue be associated with. In this case, it should be the device where kernels that use this memory object are going to be enqueued for execution. As far as copying data to the host, I recommend that you use clEnqueueMap to map the region you want to read or write instead of copying.

I will stop here as the response is already too long and hopefully I was able to answer a few of your questions.

Hello Affie,

Your explanation certainly covers many of the motivations behind the design, however, I still think it is very confusing.

Given your explanation, I find it very strange that there is no mechanism to transfer the data from a cl_mem object to the host without specifying a device. After all, the cl_mem is global to the context and the OpenCL runtime knows where the current version of the data is stored - can’t it just give me the data?

Let’s take a contrived example to try and show up potential problems - say I have two devices, A and B. I create a cl_mem object and enqueue a command on device A’s command queue to copy data from the host to the cl_mem object. The data ends up in A’s memory, but not in B’s. I then enqueue a command on device B’s queue to read the cl_mem object back into the host’s memory. What happens? I expect that the OpenCL runtime will first copy the cl_mem data from A to B and then copy it from B to the host. Clearly it would have been more efficient to just copy directly from A… If the OpenCL runtime is smart enough and able to do the copy directly from A to the host (even though the command was queued on B’s command queue) then we can deduce a couple of things:

  1. It is irrelevant which command queue you enqueue commands that transfer data to the host.
  2. Not all commands enqueued on a given command queue actually execute on the command queue’s device.

This way, the actual amount of physical memory available on the device does not limit how many memory object you can create - it only limits the amount of memory needed by memory objects used by a command. For example, the host memory can be used as a backing store for memory objects instead.

I do like this philosophy (out of interest, do any OpenCL implementations you know of work this way in practice?). However, this scheme seems contrary to the statement that, “Since the device caches the data for the memory object, enqueuing a command to a device allows you (the app) to determine which device will have the latest bits.” There is no determinism if the OpenCL runtime is able to shuffle cl_mem objects around between devices or even out to host memory. If you truly think of device memory as a cache, then you have to expect evictions, cache misses and the lack of determinism that follows.

Let’s take another contrived example. Say I have want to run two different algorithms on the same data and I happen to have two devices in my context. What I want to do is transfer the data from the host to the context and then execute two different kernels on two different devices. Which command queue should I use for the memory transfer? Does it matter? I expect the answer to be that it doesn’t matter - you could use either command queue and the result should be the same. Kinda makes you think that being forced to pick a device is wrong…

I can understand that specifying which device a cl_mem object is expected to be used on when copying data from the host can lead to better performance as eager caching of that data can happen on the device. However, I would consider this merely a hint and should be optional in the API rather than requiring it to be specified. In the example above (one cl_mem object, two kernels), I would hint that both devices are expecting to use the cl_mem object, so eager caching on both devices may follow. How would you achieve this eager caching on multiple (lets say n out of m) devices with the current API?

Cheers,

Dan

All the calls that take context are blocking. Non blocking calls needs to go “enqueued on a queue” which is associated with a particular device. Suppose to want to issue a read command but dont want to wait for it so that if the device has dma capability it starts the transfer while you are doing other stuff on the main thread. By the time you are done with other stuff and really needs the data it might already be there. You need an event back from read command so that you can query if the read has already finished or not and again event are associated with particular queues and NOT context. Only enqueueing calls return event and they all take queue on which to enqueue and associate event with that queue. If you make the read command per context and not per device (queue), it has to be blocking and you cant do anything on main thread until this call is done.

Regarding your second contrived example …
I assume you are saying you “read” from this mem object in your kernel on both devices. If you use CL_MEM_COPY_HOST_PTR when creating mem object then yes, both devices will get the same data when respective kernels execute on two devices. But if you issue a non blocking write on one device then its applications responsibility to make sure that it take care of cross queue dependency meaning that it get the event back from clEnqueueWrite on one device and pass it in the event list to wait to clEnqueueNDRangeKernel on other deivce … this give great flexibility in terms of both read/write on one device and execution on one or both device becoming asynchronous freeing up cpu for more useful work … giving underlying scheduler more freedom to schedule anyway it wants for best efficiency as long as it keeps the dependency given by cross queue events. All this is possible because you make read/write “enqueued” or non-blocking and the reason it was possible is you enqueue read/write on a queue/device … if you do it on a context these operation will be blocking as, even though mem object knows who the current dirty owner is, context has no idea “queue” to issue a read/async dma from it.

Let’s change the scenario a little - let’s say I have N devices and want to make a buffer available to M of those devices, where 1 < M < N. How, in the OpenCL API, can I ensure that all M (and only those M devices) devices have the buffer pushed to it eagerly? If I use CL_MEM_COPY_HOST_PTR then it seems that an OpenCL implementation will probably push it to all N devices, which is inefficient when M << N. If I use asynchronous commands then I am forced to either perform the copy M times or perform the copy once (for one particular device) and leave the copying of buffers from one device to another up to OpenCL - no doubt this would hurt performance as an OpenCL implementation would copy on demand rather than preemptively.

Perhaps to state my opinion more clearly - memory buffer operations in the context of a single device are not (in general) useful. My opinion is that the command queue should never have been associated with a particular device. Commands that pertain to a particular device (such as executing a kernel) should take the target device in the API call to enqueue the command. Some commands would benefit from a “hint” as to which device(s) to target - for example, a command to copy from the host to OpenCL could provide a list of devices (say, my M out of N above) that OpenCL can use guide the way resources are allocated, caches preemptively populated, etc.

Yes I know. The tail end of you sentence is exactly the problem I am talking about.

Right - point in context. Even though the context knows where the buffer is, I can’t just ask the context to give it to me without nominating some device’s queue to perform the copy or mapping to the host. That makes absolutely no sense and is the exact scenario that started this thread.

Cheers,

Dan

I just wanted to support the ideas exposed by Dan.

The statements below express personal opinions.

For both symmetry and practicality (see the scenarios he exposed), the link between the command queue and the device should be revised, either to be weakened or removed.

Devices should be specified, if required, when commands/kernels are queued.

Solution 1: Allow NULL as the queue parameter in clEnqueueReadBuffer(), clEnqueueWriteBuffer(), etc.
Solution 2: Have functions clReadBuffer(), clWriteBuffer(), etc.
Solution 3: Use a dummy queue, perhaps on a CPU device that is otherwise unused.

In any case, I do assume that using a queue on a different device should still work correctly. If that device is busy, then I would expect the SDK to silently wait until it is not.

I guess it depends on what you think is the correct behavior. If I am trying to minimise latency or have deterministic behavior then silently waiting is behavior that I would need to avoid - but there is no way to avoid it with the current OpenCL API.

Cheers,

Dan