clEnqueueCopyBuffer and multiple devices


I am struggling again with the relationship (or lack thereof) of buffers and devices.

Assume I have a buffer that I previously wrote to one device. I’d like to copy this buffer to another device (both are GPUs). Which queue should I use in the call to clEnqueueCopyBuffer so I can be sure that the data is present on the second device afterwards?



As discussed in the previous thread, buffer objects are associated with contexts, not with devices. You do not need to “move” buffer objects from one device to another.

Yes, I do. I am programming with multiple devices.

I have a consistency model in place which is different from what OpenCL enforces.
I have a data management engine that ensures data prefetching is done as soon as it is determined where an operation will run and that not-needed buffers are freed when I run out of space on a device.
I have a scheduler in place that queues operations to devices in a HEFT (Heterogeneous Earliest Finish Time) model.

I depend on knowing where the data is at any given moment. I do load balancing on my own and take data availability and movement time into account. Therefore I use one buffer per piece of data per device, and consequently I need to be able to copy between the devices.

If you could just answer my question – which queue should I take in order to make sure that the data is actually copied to the correct device? The one on which the data already resides, or the one to which the data shall be copied?

It seems there’s a miscommunication between us.

Even in the presence of multiple devices, OpenCL associates each buffer with a context, not with a particular device. In particular, there’s no way to query in which device a particular buffer object resides – in fact it may be in multiple devices at once if for example all devices are reading from it and none is writing.

Hence, your question of “which queue should I take in order to make sure that the data is actually copied to the correct device?” has no meaning according to OpenCL. At best you may get answers of what particular implementations of OpenCL in particular devices will do.
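To make this concrete, here is a minimal C sketch (hypothetical names, error handling mostly elided) of how clEnqueueCopyBuffer is actually parameterized: it names two buffers and one queue, all belonging to the same context. The queue only determines which device executes the copy command; it does not pin the data to that device afterwards.

```c
/* Sketch: copying between two buffers of one context.
 * Assumes ctx and queue already exist; every cl_int return value
 * should be checked in real code. Requires an OpenCL runtime. */
#include <CL/cl.h>

void copy_between_buffers(cl_context ctx, cl_command_queue queue,
                          size_t nbytes)
{
    cl_int err;
    cl_mem src = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);
    cl_mem dst = clCreateBuffer(ctx, CL_MEM_READ_WRITE, nbytes, NULL, &err);

    /* Any queue of the context may be used; the implementation is free
     * to migrate the underlying storage between devices as needed. */
    cl_event copied;
    err = clEnqueueCopyBuffer(queue, src, dst,
                              0 /* src_offset */, 0 /* dst_offset */,
                              nbytes, 0, NULL, &copied);
    clWaitForEvents(1, &copied);

    clReleaseEvent(copied);
    clReleaseMemObject(src);
    clReleaseMemObject(dst);
}
```

Note that there is no device parameter anywhere in the call – the only device-related choice is which queue the command is enqueued on.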


I do understand your point. But, given the above scenario, there has to be guidance on which queue to use. If the guidance says it does not matter – that’s fine. But this is a clearly phrased question and there must be a clear answer.

Please have a look at the following code. It is written in C# but it maps one-to-one onto OpenCL calls, and the mapping should be clear. There is a short summary beneath.

    static class Program {

        static readonly String Kernel = "kernel void Kernel(global float * b) { b[get_global_id(0)]++; }";

        static readonly Int64 Count = 1024 * 1024 * 64;

        static void Main(String[] arguments) {
            Platform platform = Platform.GetPlatforms()[0];
            Device[] devices = platform.GetDevices(DeviceType.Gpu);
            Device device0 = devices[0];
            Device device1 = devices[1];
            Context context = new Context(devices);
            CommandQueue queue0 = new CommandQueue(context, device0, CommandQueueFlags.None);
            CommandQueue queue1 = new CommandQueue(context, device1, CommandQueueFlags.None);
            Single[] array = new Single[Count];
            Buffer buffer = context.CreateBuffer((UInt64)System.Buffer.ByteLength(array), BufferFlags.ReadWrite);
            CompiledProgram program = context.Compile(Kernel, null);
            Kernel kernel = program.CreateKernel("Kernel");
            kernel.SetGlobalArgument(0, buffer);
            queue1.WriteBuffer(array, buffer);
            EventObject e1 = queue0.StartKernel(kernel, new Int64[] { Count / sizeof(Single) }, null);
            EventObject e2 = queue1.StartKernel(kernel, new Int64[] { Count / sizeof(Single) }, null, e1);
            EventObject e3 = queue1.StartReadBuffer(buffer, array, e2);
        }
    }

In short, it does:
- Create a context over two devices
- Create a kernel which, when run, increments a bunch of floats
- Create a buffer that holds the floats
- Move data to the buffer
- Run the kernel on the first device
- Run the kernel on the second device
- Read back the results into host memory
I’d expect this to increment all values by two. It does, however, only increment them by one. So either one of the two kernel calls does not get the correct buffer, or the read at the end returns stale data.

Can you explain this according to the OpenCL 1.1 specification?

I do understand your point. But, given the above scenario, there has to be guidance on which queue to use. If the guidance says it does not matter – that’s fine. But this is a clearly phrased question and there must be a clear answer.

It does not matter which queue you use. In both cases the memory object will be made available to all devices in the context – because memory objects are associated with contexts, not with devices.

Your example application should indeed increase the contents of the buffer by two. Either there’s a bug in the application that I haven’t noticed, or there’s a bug in the C# bindings or in the OpenCL implementation that you are using.

Is it possible that “queue1.WriteBuffer(array, buffer);” is performing a non-blocking write? Or that any of the operations is returning an error?
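For reference, in the C API the blocking/non-blocking distinction is an explicit flag, and every enqueue call returns an error code. A sketch (assuming queue and buffer already exist; requires an OpenCL runtime):

```c
/* Sketch: a blocking write plus explicit error checking.
 * With CL_TRUE, clEnqueueWriteBuffer returns only once the host
 * array has been consumed; with CL_FALSE the call can return while
 * the transfer is still in flight, so the array must stay untouched
 * until the associated event completes. */
#include <CL/cl.h>
#include <stdio.h>

int write_blocking(cl_command_queue queue, cl_mem buffer,
                   const float *host, size_t nbytes)
{
    cl_int err = clEnqueueWriteBuffer(queue, buffer, CL_TRUE /* blocking */,
                                      0 /* offset */, nbytes, host,
                                      0, NULL, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clEnqueueWriteBuffer failed: %d\n", err);
        return -1;
    }
    return 0;
}
```

If the C# binding maps WriteBuffer to a non-blocking call without waiting on the event, that alone could explain reading stale data.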

The spec defines memory consistency between devices in terms of “synchronization points”. If you search the spec for the words “synchronization point” you will find all the related language. As you will find, event dependencies act as synchronization points so your application should work as intended.
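The pattern the application follows, expressed directly against the C API (hypothetical wrapper-free names; both queues belong to the same context, and an OpenCL runtime is required):

```c
/* Sketch: event dependencies as synchronization points (OpenCL 1.1).
 * The kernel runs first on queue0 (device 0); its completion event is
 * put in queue1's wait list, which the spec treats as a synchronization
 * point, so device 1 must observe device 0's writes to the buffer. */
#include <CL/cl.h>

void run_on_both_devices(cl_command_queue queue0, cl_command_queue queue1,
                         cl_kernel kernel, cl_mem buffer,
                         float *host, size_t count)
{
    size_t global = count;
    cl_event e1, e2, e3;

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buffer);

    clEnqueueNDRangeKernel(queue0, kernel, 1, NULL, &global, NULL,
                           0, NULL, &e1);
    /* e1 in the wait list is the synchronization point between devices. */
    clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, &global, NULL,
                           1, &e1, &e2);
    /* Non-blocking read, ordered after the second kernel via e2. */
    clEnqueueReadBuffer(queue1, buffer, CL_FALSE, 0,
                        count * sizeof(float), host, 1, &e2, &e3);
    clWaitForEvents(1, &e3);

    clReleaseEvent(e1); clReleaseEvent(e2); clReleaseEvent(e3);
}
```

Per the spec, after clWaitForEvents returns, host should contain each value incremented twice.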

Thanks a lot. I re-read the details about synchronization points and I came to the conclusion that there must be a bug in the ATI implementation.

Out of curiosity I implemented the same functionality using a different OpenCL wrapper (Cloo). To my complete surprise, it worked. I then migrated, step by step, the working version toward my original version that suffers from this bug. I realized that, by accident, the Cloo version also used the CPU device in addition to the two GPU devices. That is, the only difference was that the CPU device was also part of the context.

I come to the conclusion that the ATI OpenCL implementation contains a bug that compromises buffer consistency across multiple GPU devices when only GPU devices are present in the context.

Where can I report this issue?

Google showed this: … entercat=y