Running the same kernel on multiple devices

What is the general technique when you want multiple devices (in the same context) to run the same kernel on the same memory? How do you split the workload? (For example, I want the first device to calculate the first half of the job and the second device the second half, in the same memory.)

The global_work_offset parameter of clEnqueueNDRangeKernel would be really handy for that, but it currently isn’t supported.

Thank you

The general approach should be to create separate cl_mem objects for each device, and dynamically size them based on the devices’ performance. I would be a bit wary of using one cl_mem object and accessing it from two devices at once. While the CL model should allow sharing one cl_mem as long as you don’t write the same data from both devices, you may end up copying all the data to both devices, because CL has no way of knowing how much of the data you are going to touch. Unfortunately this splitting isn’t automated, and if you have a heterogeneous mix of devices you’ll need to adjust how much each one processes to get the best throughput.

I was afraid of that. I assume this also applies to devices linked with SLI or Crossfire?

For my problem, the logical solution requires the whole memory to be readable inside the kernel. It’s no fun having to complicate things…
On what basis should one decide how to slice the task and distribute the slices among the devices? Should it be a function of device type, number of compute units, and clock frequency?

global_work_offset :cry:

Thank you

I honestly have no idea how SLI and Crossfire work. For rendering, my impression was that they just alternate frames, which means all textures are replicated across both cards. I don’t know what they do for compute.

You can get a rough estimate of how to divide up the work based on the frequency and number of compute units for each device. However, the best thing to do is just divide it up as 1/n, measure the time over a few iterations, and re-balance.

I suspect the global offset will arrive in a future version, but I don’t know when. Until then you can always pass an int4 to your kernel and then use:
idx = get_global_id(0) + my_offset.x;
and the like. This adds an extra add instruction, but it gets you the global offset. You then just need to enqueue, for each device, a global size equal to the portion it should process, and you’ll have the same thing.

Thanks for the hints.
I was thinking the same offset trick. I just hope OpenCL is smart enough to manage 1 read/write buffer on all devices. I promise every device will write on their own buffer spots.

OpenCL isn’t smart enough to manage one read/write buffer across multiple devices. All you’re guaranteed (and I’m not even sure of that, since this behaviour might fall into the undefined category) is that if you enqueue a write and then do a finish, you should read the same data on all devices that read the memory object. If you write back, there’s no telling what you’ll get: the implementation might take the copy on device A and ignore the copy on device B, so any writes done on device B to a memory object shared by the two would be lost. You are probably okay as long as you are just reading (although you may use more memory), but you have to call finish after loading data into the cl_mem, as there is no implicit synchronization between devices. This behavior may also differ between platforms, but calling clFinish() should be reasonably safe.

Since I manage the quasi-splitting (offsetting) myself, it’s no problem for me to enqueue the read multiple times - one read per command queue: first the part only A was writing, then the part B was writing - if that method guarantees me correct data.
If that doesn’t do the job, I’ll just make separate output buffers for each device.

If you have two devices writing to the same cl_mem object at the same time, regardless of whether they write to unique sections, there is no guarantee in OpenCL that you will get the merged results in the end. You will have to use two cl_mem objects for writing. Reading should work better, but, again, this is pretty much “undefined” behavior if I understand correctly.

So the proper way for devices to get correct read-only buffer data would be to use clEnqueueWrite and clFinish on all command queues (all devices) that use the same cl_mem object.
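Written out, that sequence might look like the following host-side pseudocode (a sketch only: error handling is omitted, it is not a complete program, and `ctx`, `buf`, `queue_a`, `queue_b`, `host_data`, `nbytes` are assumed names):

```
/* One shared read-only buffer, written once, then synchronized before
 * either device reads it. */
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, nbytes, NULL, &err);

/* Write through one queue... */
clEnqueueWriteBuffer(queue_a, buf, CL_TRUE, 0, nbytes, host_data,
                     0, NULL, NULL);

/* ...then finish on every queue that will use the buffer, so there is an
 * explicit synchronization point before any device touches it. */
clFinish(queue_a);
clFinish(queue_b);

/* Now kernels that only read buf can be enqueued on both queues. */
```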

On a more philosophical note: would it be possible to clEnqueueWrite and clFinish on only one command queue, and have OpenCL automatically prepare the data for sharing across every device in the context?

Contexts are, at the moment, of little use if memory sharing among devices in the same context is this hard.

The proper way for devices to get a correct read-only buffer that they can use at the same time is to have two buffers. CL makes no guarantees about the consistency of data used by multiple devices at the same time. It will, however, ensure that if you use data on one device in a context and then on another device after the first one is done, the data will be up-to-date. So if you want to share data between two devices at the same time you should duplicate it, but you can try accessing one buffer from both at once. The latter is not part of the spec, so the behavior (while probably mostly what you expect) is, as far as I know, undefined. The details should be in the CL spec, but basically the only consistency guarantee is that non-simultaneous uses of data in a context will always see the most recent data on the device being used.

I’ve run into one more obstacle in the specification, even when trying to implement a “safe” data-transfer method with multiple buffers.
Since clCreateKernel returns one object for all devices the program has been built for, it’s impossible to use clSetKernelArg with different buffers for different devices. This forces one to make multiple cl_program objects (one for each device), build each program for its device, and create separate kernels. Ugly.
clSetKernelArg could take an optional cl_device_id parameter, since one kernel object for all devices limits what you can do with the kernel, as in this case.

However, nVidia’s OpenCL SDK offers a multi-GPU example, with the following (to me, illogical) solution to the problem:

    for(unsigned int i = 0; i < ciDeviceCount; ++i )
    {
        workSize[i] = ...;

        // Input buffer
        d_Data[i] = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, workSize[i] * sizeof(float), NULL, &ciErrNum);

        // Copy data from host to device
        ciErrNum = clEnqueueCopyBuffer(commandQueue[i], h_DataBuffer, d_Data[i], workOffset[i] * sizeof(float), 0, workSize[i] * sizeof(float), 0, NULL, NULL);

        // Output buffer
        d_Result[i] = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, ACCUM_N * sizeof(float), NULL, &ciErrNum);

        // Create kernel
        reduceKernel[i] = clCreateKernel(cpProgram, "reduce", &ciErrNum);

        // Set the args values and check for errors
        ciErrNum |= clSetKernelArg(reduceKernel[i], 0, sizeof(cl_mem), &d_Result[i]);
        ciErrNum |= clSetKernelArg(reduceKernel[i], 1, sizeof(cl_mem), &d_Data[i]);
        ciErrNum |= clSetKernelArg(reduceKernel[i], 2, sizeof(int), &workSize[i]);

        workOffset[i + 1] = ...;
    }

The use of reduceKernel[i] and clSetKernelArg in the example makes no sense to me.

You can create multiple kernels from the same program and set their arguments differently instead of duplicating the program.

However, you should be able to enqueue a kernel with one set of arguments, change the arguments, and enqueue it again. The runtime should take care of keeping track of which arguments each enqueued kernel should use. Given that, you wouldn’t even need to create multiple kernels unless it’s more convenient for your program.

So clSetKernelArg followed by enqueueing the kernel on a specific device tells OpenCL which device should get the data set by clSetKernelArg… interesting, though illogical and never mentioned in the spec.

More interestingly, this example sets the kernel args after creating instances of the same kernel (reduceKernel[i]), without enqueueing directly after setting the args. The reduceKernel[i] are all the same kernel, since clCreateKernel has no device argument… Makes no sense. Am I missing something?

A kernel is defined to run on any of the devices for which it was built. You can then enqueue it on multiple devices. Each time you do that you specify the device to which you want to enqueue it via the command queue passed in to clEnqueueNDRangeKernel.

Think of cl_kernel as keeping track of which arguments are set for an instance of a kernel in a program. You can have as many of these argument sets (cl_kernels) as you want.

The arguments for a kernel stay the same until you change them. When you enqueue a kernel it is the runtime’s responsibility to keep track of which arguments were used for that enqueuing so after you have done that you can change them. It’s not complicated, but it is a bit unclear that you can change the arguments after enqueuing the kernel.

I know, you’re not following me. To repeat myself one more time: nVidia’s example contradicts what we said.

Contradiction: reduceKernel[i] = clCreateKernel(cpProgram, "reduce", &ciErrNum);

Contradiction: clSetKernelArg gets called in a loop, without any enqueue function calls afterwards.
After that loop comes a loop of clEnqueueNDRangeKernel calls, one for each device and kernel.

What Nvidia is doing is fine. They appear to want to have N kernels with N different arguments so they can run them all at once. You’re right that I’m not following you here. I can build a program for all devices in a context and then enqueue a kernel from that program on any device I want. I can also create as many kernels as I want. For example, if I wanted to have different arguments (as they appear to) then I could either create multiple kernels with different arguments or set-arg/enqueue multiple times. There’s nothing wrong with what Nvidia is doing, and, in fact, if you want to quickly re-enqueue the kernel with the same arguments this is a fine way to do it. My point is just that it is not necessary. If you build the program for all devices in a context, assuming it builds without errors, you can then execute kernels created from that program on any of the devices.

Does that make sense? If not then I’m definitely not following you. :slight_smile:

Thanks, I see what you mean… Calling clCreateKernel multiple times for the same kernel gives you different cl_kernel instances. With different instances, you can set args without enqueueing directly afterwards. Multiple instances of the same thing make some sense at first glance, but on a second look none at all :). The spec should also state that creating a kernel with the same name multiple times returns different cl_kernel instances, or maybe this is implementation-defined…
I’ll stick to one kernel, and let OpenCL keep track of arg setting and queuing.

I haven’t read the spec for a while (and remember that it’s aimed at implementors, not users), but you can assume that on all implementations calling clCreateKernel multiple times will safely create multiple kernels. I would suggest you use whatever is most comfortable for your implementation, as the overhead of dealing with multiple kernels should be negligible compared to the actual enqueuing and computation.

Just to make sure I’ve understood this correctly.

I have a class that encapsulates a set of kernels, and I want each instance of this class to have its own set of parameters sent to each kernel invocation. I have done this by having a static cl_program in the class, and instance variables for the cl_kernels. The first time an instance is created, the constructor compiles the .cl source and the static cl_program is initialized. Each instance calls clCreateKernel to get local copies of the kernel objects, and then clSetKernelArg to set the kernel arguments for this particular instance.

Is this a reasonable way to achieve what I want?

Yes, it’s reasonable; nVidia’s example confirms it.