EnqueueWriteBuffer for multiple Devices


In a multi-GPU environment, I am running into problems with the enqueueWriteBuffer method. The situation is as follows:

In the method that prepares the data, I create one buffer per device in my context. (Context is a class of mine that creates the OpenCL context and one CommandQueue per device; device is the device ID returned by that class.) I am only posting the parts of the code that I think cause the problem.

cl::vector<cl::Buffer*> overlap_regions;
for (device = 0; device < participatingDevices; device++) {
	// one read-only buffer holding 2 * overlap_range elements of type T per device
	overlap_regions[device] = new cl::Buffer(this->context.getOpenCLContext(),
			CL_MEM_READ_ONLY, sizeof(T) * overlap_range * 2, NULL, &err);
}

This only allocates the device memory.
After the for loop, I create the data I want to pass to the devices through the buffers above. I use an array of size 2 * overlap_range * participatingDevices * sizeof(T). This array is supposed to be split, since each device needs only part of the data: the first 2 * overlap_range elements are needed on the first device, the next 2 * overlap_range elements on the second device, and so on.
So I call enqueueWriteBuffer for each device as follows:

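for (device = 0; device < participatingDevices; device++) {
	// number of bytes to copy to each device
	size_t size = 2 * overlap_range * sizeof(T);
	// offset of this device's slice in the host array, computed in bytes
	offset = device * 2 * overlap_range * sizeof(T);
	// non-blocking write from the host array into this device's buffer
	err = this->context.getCommandQueue(device).enqueueWriteBuffer(
			*overlap_regions[device], CL_FALSE, 0, size,
			(void*) (pOverlap_region + offset), NULL, NULL);
	// (check of err omitted here)
}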
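
One thing I have not ruled out yet is the host pointer arithmetic in that call. The sketch below is only an illustration and rests on an assumption: that pOverlap_region is declared as T* (its declaration is not shown above). In that case pointer addition already counts in elements, so adding an offset computed in bytes would skip sizeof(T) times too far for every device except the first. The helper names sliceStart, sliceBytes and elementsSkippedByByteOffset are hypothetical and not part of my code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: pointer to the first element of the slice that
// belongs to `device`. Assumes the host array is a T*; arithmetic on T*
// counts elements, so the offset is NOT multiplied by sizeof(T).
template <typename T>
const T* sliceStart(const T* base, std::size_t device, std::size_t overlap_range) {
    assert(base != nullptr);
    const std::size_t elementOffset = device * 2 * overlap_range;
    return base + elementOffset;
}

// Hypothetical helper: size of one slice in bytes. The size argument of
// enqueueWriteBuffer is a byte count, so sizeof(T) belongs here.
template <typename T>
std::size_t sliceBytes(std::size_t overlap_range) {
    return 2 * overlap_range * sizeof(T);
}

// For comparison only: the number of ELEMENTS that get skipped when a
// byte offset is added to a T* (as in `pOverlap_region + offset`).
template <typename T>
std::size_t elementsSkippedByByteOffset(std::size_t device, std::size_t overlap_range) {
    return device * 2 * overlap_range * sizeof(T);
}
```

Under that assumption, the host pointer passed to enqueueWriteBuffer would be sliceStart(pOverlap_region, device, overlap_range) and the size would be sliceBytes<T>(overlap_range). For device 0 both variants give the same address, which would fit the observation that only the first device gets correct data. If pOverlap_region is actually a char* or unsigned char*, the byte offset is already correct and this is not the cause.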

Every enqueueWriteBuffer call returns CL_SUCCESS (the check is in my code, but I left it out here).
The executeKernel(device) method, which is called afterwards, actually runs the kernel on the given device. The buffers created above are set as a kernel argument as follows (the other arguments are omitted):

err |= kernel.setArg(3, *(this->overlap_regions[device]));

When I run the program after compilation, it works correctly for one device. But when I use two or more devices, it seems that the enqueueWriteBuffer calls have no effect for the second and following devices. The calculation on the first device is still correct.
I also tried making enqueueWriteBuffer blocking with the CL_TRUE flag, and waiting for the CommandQueue to finish after the call. Neither helped.
I cannot figure out what causes the problem, and I can give additional information if needed. The behaviour has only been tested on an NVIDIA Tesla platform, since it is the only system with multiple devices (4) that I have access to. I appreciate any hints or help…