Concurrent kernel execution and data transfer on multi-GPU systems

I am executing two kernels concurrently on two NVIDIA C2070 GPUs. To make sure that they are actually running concurrently, I’m using OpenCL events to time the kernel execution, and the wall clock (including barriers) to time the total execution time. In general this is reasonable, say: TotalTime = KernelTime + BarrierTime (where TotalTime is timed using the wall clock and KernelTime is timed using OpenCL events).

Now, if I transfer data from GPU to GPU (using clEnqueueCopyBuffer), the total execution time measured with the wall clock should be close to TotalTime = KernelTime + CopyTime + BarrierTime (where KernelTime and CopyTime are timed using OpenCL events). But it is not; in fact it is much greater than the sum of the times measured using OpenCL events.

At this point I’m kind of clueless about this behaviour. Has anybody ever used a transfer from GPU to GPU? I’m guessing that that is what is giving me issues. Any ideas are welcome.

Thanks,

TotalTime = KernelTime + BarrierTime (where TotalTime is timed using the wall clock and KernelTime is timed using OpenCL events).

How did you measure wall clock time? Did you take into account that clEnqueueXxx() calls are asynchronous?

What do you mean by BarrierTime? How did you measure it?

I used gettimeofday() to measure the wall clock time. And yes, I know that clEnqueueXxx() calls are asynchronous; that’s why I use barriers, to guarantee that when I take the wall time at the end, the OpenCL function calls have actually finished.

In a nutshell, my code is:


// Barriers for queues
clFinish(queue[0]);
clFinish(queue[1]);
// Start timing
gettimeofday(&start, NULL);
clEnqueueNDRangeKernel(queue[0], Kernel[0], ... );
// Push the kernel to the device for execution
clFlush(queue[0]);
clEnqueueNDRangeKernel(queue[1], Kernel[1], ... );
// Push the kernel to the device for execution
clFlush(queue[1]);
// Barriers for queues
clFinish(queue[0]);
clFinish(queue[1]);
// Stop timing
gettimeofday(&stop, NULL);
// Print the elapsed time in seconds
printf("%f\n", (stop.tv_sec - start.tv_sec) + (stop.tv_usec - start.tv_usec) / 1e6);

By barrier time I mean the time that the execution of clFinish() and clFlush() takes, which is approximately between 200 and 300 microseconds. (In the above code, queue[0] and queue[1] are attached to two different GPUs in my system.)

And so, as I said, if I execute the above code the results are reasonable. But when I include a transfer from queue[0] to queue[1], the wall time increases far beyond the sum of the event times, which to me is unreasonable.

Thanks,

TotalTime = KernelTime + BarrierTime

This is not a valid time measurement. What you call BarrierTime is the time the CPU spends waiting for the GPU to finish executing an NDRange, which is not very useful: it will merely be some value in the range from zero to KernelTime. The equation above is fundamentally flawed.

I would recommend simply using the values you get from clGetEventProfilingInfo().

Thanks, David, for your reply. I understand your point, although I do not totally agree. I am not actually interested in measuring the barrier time; what I believe is that the time measured with the wall clock should be approximately equal to the kernel time. Do you agree with that?

Finally, I need the wall clock to verify that both kernels are actually executed concurrently, i.e. TotalTime ~ KernelTime; if they are executed sequentially instead, then TotalTime ~ 2 * KernelTime, correct?

Given this, what would be your suggestion to verify that both kernels are actually executed concurrently (in parallel)?

Thanks.

It would help if you included more detail in your code: ‘when I include the transfer …’ is pretty meaningless, particularly if your code is still littered with clFinish(), which can easily skew the results. clFinish() is not a barrier, either; “barrier” is fairly specific terminology, so it would help to use it with its proper meaning.

Anyway, clFinish() waits for the entire process to finish on the first queue before you wait for the second; it doesn’t really make sense for timing on multiple queues. I haven’t tried anything like this, but my first approach would be to create an event dependency graph so that everything executes in the right order, and wait for the last event in the graph to complete.

It would help if you included more detail in your code: ‘when I include the transfer …’ is pretty meaningless, particularly if your code is still littered with clFinish(), which can easily skew the results.

What do you mean by “can easily skew the results”, and how would that happen?

Anyway, clFinish() waits for the entire process to finish on the first queue before you wait for the second; it doesn’t really make sense for timing on multiple queues.

Agreed, but clFlush(queue[1]) pushes the second kernel to the other GPU for execution, so at that point both kernels are executing. clFinish(queue[0]) is then waiting for the first kernel to be done, but remember that the two kernels are executing on different GPUs; also, clFinish() only blocks execution on the CPU, and the GPUs are free to continue executing whatever they are executing. So I do think it makes sense for timing multiple queues.

I haven’t tried anything like this but my first approach would be to create an event dependency graph so that everything executes in the right order, and wait for the last event in the graph to complete.

In this case there are no dependencies; both kernels are independent and need to be executed concurrently. As I said in my previous post, I’m using the wall clock to verify that both kernels are executed concurrently in parallel, because that cannot be determined using OpenCL events, or am I wrong?

Thanks.

It was a polite way of saying that it won’t give you anything useful.

Again, without more detail, conjecture isn’t much use.

In this case there are no dependencies; both kernels are independent and need to be executed concurrently. As I said in my previous post, I’m using the wall clock to verify that both kernels are executed concurrently in parallel, because that cannot be determined using OpenCL events, or am I wrong?

Thanks.

There is still a dependency for your timing: i.e. that both kernels are complete.

And what about this ‘transfer from queue[0] to queue[1]’ you mentioned but never clarified? Does that not add a dependency?

There is still a dependency for your timing: i.e. that both kernels are complete.

And what about this ‘transfer from queue[0] to queue[1]’ you mentioned but never clarified? Does that not add a dependency?

The code without clFinish and clFlush is as follows:


queue[0] = clCreateCommandQueue(context, device[0], QUEUE_OPTS, &result);
queue[1] = clCreateCommandQueue(context, device[1], QUEUE_OPTS, &result);
// Kernel creation and argument passing
// kernel[0] and kernel[1] "are assigned to" queue[0] and queue[1] respectively by assigning arguments properly
// buffers with index 0 were created for device[0]
kernel[0] = clCreateKernel(OpenCLProgram, kernel_name, &err);
err = clSetKernelArg(kernel[0], 0, sizeof(cl_mem), (void*)&(coefx[0]));
err = clSetKernelArg(kernel[0], 1, sizeof(cl_mem), (void*)&(coefy[0]));
err = clSetKernelArg(kernel[0], 2, sizeof(cl_mem), (void*)&(res[0]));
// buffers with index 1 were created for device[1]
kernel[1] = clCreateKernel(OpenCLProgram, kernel_name, &err);
err = clSetKernelArg(kernel[1], 0, sizeof(cl_mem), (void*)&(coefx[1]));
err = clSetKernelArg(kernel[1], 1, sizeof(cl_mem), (void*)&(coefy[1]));
err = clSetKernelArg(kernel[1], 2, sizeof(cl_mem), (void*)&(res[1]));

size = Ntotal * sizeof(float);
offset = size;
err = clEnqueueNDRangeKernel(queue[0], kernel[0], 1, NULL, WSize, GSize, 0, NULL, &event0);
err = clEnqueueNDRangeKernel(queue[1], kernel[1], 1, NULL, WSize, GSize, 0, NULL, &event1);
// res[0] and res[1] were created for device[0] and device[1] respectively
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &event1, &eventCP);

The main idea is that the work is partitioned across two GPUs, and each kernel generates half of the res array. After the execution of kernel[1] in queue[1], I need to transfer its half of res into res[0] on the first GPU.

I need to measure the total execution time of this code to verify that the total time is indeed approximately KernelTime + TransferTime. That’s why I’m using the wall clock time.

Thanks for your help.

Thanks - this makes it much clearer what you want to calculate.

I haven’t done this, so I defer to others if they have, but my first attempt would be to try something like:


+ clFinish() on both queues
+ start = ...gettime()

size = Ntotal * sizeof(float);
offset = size;
err = clEnqueueNDRangeKernel(queue[0], kernel[0], 1, NULL, WSize, GSize, 0, NULL, &event0);
err = clEnqueueNDRangeKernel(queue[1], kernel[1], 1, NULL, WSize, GSize, 0, NULL, &event1);
// res[0] and res[1] were created in queue[0] and queue[1] respectively
err = clEnqueueCopyBuffer(queue[1], res[1], res[0], 0, offset, size, 1, &event1, &eventCP);

+ ... clWaitForEvents on eventCP and event0
+ finish = ...gettime()

Presumably you can also get useful information from the profiling info on the events (e.g. absolute timestamps), but I find that pretty clumsy to use. The NVIDIA profiler should also show overlapped execution.