Speed issues on clEnqueueNDRangeKernel with empty kernal


I’m trying to move some calcs onto OpenCL, but just calling empty kernels is causing quite a slowdown…

The initial serial calcs are effectively in a loop, calculating flows/pressure every xth of a second in a fluid network. All the flows have 2 associated pressure nodes and each node has 1 or more associated flows.

the basic structure is : (all on CPU)…

Set Initial Conditions at time t = 0.0
for i = 0 to number pipes
calc flows between 2 nodes (pipe)
for i = 0 to number valves
calc flows between 2 nodes (valve)
for i = 0 to number nodes
calc new pressures given flow for dT secs
t += dT
until bored

Obviously, I want to make these calcs parallel so I can do all ‘n’ associated calcs in one call.

I have created some kernels (one pipe flow, one valve flow and one pressure node) and setup args and buffers etc using clCreateBuffer, clSetKernelArg and clEnqueueWriteBuffer, but changing the above code to call empty kernels results in a slow down compared to doing the full serial calcs.

I’m effectively performing, after copying data around:

Setup Data for Initial Conditions at time t = 0.0

clEnqueueWriteBuffer(s) of all data required on device.

clEnqueueNDRangeKernel( pipe_kernel )
clEnqueueNDRangeKernel( valve_kernel )
clEnqueueNDRangeKernel( pressure_kernel )
t += dT
until bored

But this loop exhibits a considerable slowdown even though the kernels are empty.

The actual kernels are listed below, but does anyone know why clEnqueueNDRangeKernel calls should be so slow? I would have assumed that once the data had been copied accross to the device, clEnqueueNDRangeKernel just operated on the data, so there wouldn’t be a slowdown.

I’m using the Intel SDK and developing the code on the CPU.
The testing envirmoment I’m working on consists of 1 valve, 18 nodes and 16 pipes.
All local workgroup sizes set to 16 and all global sizes set to an integer multiple of 16.

Anyone got any ideas why calling clEnqueueNDRangeKernel on empty kernals should be so slow? I’ve not even got round to optimising the calculations or getting the results back onto the host…

Note: The kernels have been gradually reduced to empty after remming out code trying to find where the speed bottleneck was :slight_smile:


__kernel void k_A( __global double* dGlobalA
,__global double* dGlobalB
,__global double* dGlobalC
,__constant int* i_pA
,__constant int* i_pB
,__constant int* b_pC
,__constant int* b_pD
,__constant double* d_pE
,__constant double* d_pF
,__constant double* d_pG
,__constant double* d_pH
,__constant double* d_pI
,__constant double* d_pJ
,__global double* dGlobalD
,__global double* dGlobalE
, double d_pK
, double d_pL
, double d_pM
, double d_pN
, int iNumD )
long iBase = get_global_id(0) ;

if (iBase >= iNumD )
	return ;


__kernel void k_B( __global double* dGlobalA
,__constant double* d_rA
,__constant int* i_rB
,__constant int* i_rC
,__constant int* i_rD
,__global double* dGlobalF
,__global int* iGlobalG
,__constant double* d_rE
,__constant double* d_rF
,__constant double* d_rG
,__global double* dGlobalH
, double d_rH
, double d_rI
, int iNumF )

long iBase = get_global_id(0) ;

if (iBase >= iNumF )
	return ;


__kernel void k_C(__global double* dGlobalA
, __constant double* dGlobalI
, __global double* dGlobalJ
, __global double* dGlobalB
, __global double* dGlobalC
, __constant double* dGlobalK
, __constant int* i_nA
, __global double* dGlobalD
, __global double* dGlobalH
, __constant int* i_nB
, __constant int* i_nC
, __constant int* i_nD
, __constant int* i_nE
, __constant int* i_nF
, __constant int* i_nG
, __constant int* i_nH
, __constant int* i_nI
, __constant int* i_nJ
, __constant int* i_nK
, __constant int* i_nL
, __constant int* i_nM
, __constant int* i_nN
, __constant int* i_nO
, __constant int* i_nP
, __constant int* i_nQ
, __constant int* i_nR
, __constant int* i_nS
, __constant int* i_nT
, __constant int* i_nU
, __constant double* d_nA
, __constant double* d_nB
, __constant double* d_nC
, __constant double* d_nD
, __constant double* d_nE
, __constant double* d_nF
, __constant double* d_nG
, __constant double* d_nH
, __constant double* d_nI
, __constant double* d_nJ
, double d_nK
, double d_nL
, int iNumA )

long iBase = get_global_id(0) ;

if (iBase >= iNumA )
	return ;


You’re missing all the vital information, how big the problem is, how many iterations you’re actually timing, etc. Small problems up to some ‘n’ wont be faster as opencl has higher fixed overheads.

But one thing I can say for sure is that calling clFinish after every loop is about the worst thing you can do if you don’t need it. It’s akin to waiting for a production line to complete a whole car before even digging the iron ore out to start building the next one …

You’re timing the latency of each call: and opencl uses queues specifically to hide that (unavoidable due to physics) latency.


maybe I’m missing the point of clFinish, (and maybe OpenCL…)

Basically I’m moding a section of code running an iteractive simulation of a pipe network where users can modify the positions of valves etc in the network, either by manually opening/closing them, or adjusting controller setpoints etc. The simulation should run in real time ( ie if it takes 1 second fo a valve to close, the user should see it close in 1 second ) and the entire simulation can run indefinately. The networks can be as simple as 2 nodes with a valve, or contain thousands of nodes, pipes and valves.

In the existing code the main program/host has a variable ‘dTime’, which is how long the simulation has been running, and this is incremented by small value ‘dTimeInc’ each time round the loop.
At the end of each loop the system calculates all the variables for the new dTime += dTimeInc, and these become the starting point for the next loop.

Lets say that I start at t=0.0s with a timeinc of 0.1 seconds and I want the host to log the results of calculations every 1.0s.

On initialisation I create the kernels, set all the args and enqueue all the write buffers to get the data over to the device.

At t= 0.0 I enqeue the 3 kernels to calc the system conditons will be at t = 0.1s
On the next loop I enqeue the 3 kernels to calc the system conditons will be at t = 0.2s
At t = 0.9s I enqeue the 3 kernels to calc the system conditons will be at t = 1.0s
At t = 1.0s I need to get the data for t = 1.0s back onto the host

I assume that at this point I need to ensure the queue is flushed, else the results retrieved could be for one of the other calcs performed (maybe the calc for t = 1.0s is still in the queue…)
Am I correct in this assumption?

If I don’t clFlush the queue and instead just enqeue a readbuffer call at t = 1.0s, how can I be sure that the results of the read will = the results for 1.0s ?

Even if a readbuffer call would get the correct data, I need that data back in the host ‘instantly’ i.e. in the same loop/calc cycle of the host code. I can’t afford for the host to have moved on to t = 2.0s before the queued readbuffer request for data for t = 1.0s is finally processed and copied back onto the host…

Using the example I posted before, my code should perform the following:

t = 0.0

Main code
 calls fnCalcOpenCL()
 if t = 1.0s (or integer multiple thereof)
get values from OpenCl for current value of t
// <Synch point>
t += dt

Display OpenCl calced values for t
until false

calcs flows based on pressures
calc pressures based on net-flow in divided by time increment

At the <Synch point> I absolutely need the values calced by the opencl device and retrieved by the host to be valid for the time ‘t’ stored by the host program.
How can I ensure this is the case without clFlushing the queue to get the latest data?

I’m going to investigate kernel profiling to find out how long it takes to process the waits associated with an enqueueNDrange call as if it takes longer to queue a series of tasks to empty kernels on the opencl device than it would to simply do the calcs on the CPU, I’m barking up the wrong tree.


Any read will do an implicit flush, and it will also execute in order and to completion before the next job on the queue runs (they are not called a ‘queue’ just because it’s a cool word). You will of course always get the data you expect - otherwise how could anyone do any useful work with this api? If you do a blocking read it will also do an implicit finish(), but you don’t want to do this on the main processing queue or main thread. It’s all in the specification and programming guides …

flush() is only required if you’re doing asynchronous processing (i.e. you never implicitly or explicitly wait on that queue), and finish() is really only a debugging tool/utility for helping with opengl interop.

It’s this cpu<>gpu synchronisation that is slowing it down. You need to leverage the asynchronous nature of the api and pipeline the requests.

e.g. assuming the result of one step is the input to the next, you could first copy the results on-device to a rotating set of buffers, and then use a separate queue to download the results to the cpu. This can occur whilst the gpu is working on the next problem (you use events, and possibly threads, to wait for the results asynchronously and to synchronise the queues). At least double-buffer so the gpu can be working on the next problem while the data is being moved around, etc. The only ‘issue’ is that you have a one-time-step latency, but you should be able to do hundreds per second and it should be faster than real-time enough not to matter.

But having said all that …

Your new description doesn’t match your old one at all. Now you’re only synchronising once per second, not once per iteration.

That totally changes the picture - the overheads of synchronisation are now basically zero, and there’s really no point worrying about it.

Your test case of overheads for empty kernels should be more like this:

setup inputs for this second
for time = now to now + 1s step x
enqueue kernels for one iteration
read results with a blocking read.

If you really want you could always bung in a flush there, but it will likely make no appreciable difference. Absolutely do not do a finish within the loop.

Kernel profiling will by definition not be useful for what you want - it only times the kernel. You can use event profiling though to measure queue overheads, but if you do an internets search you can save yourself the hassle: it’s significant and people whine about it all the time.

BTW if you’re only talking thousands of inputs, the gpu might not be a good fit as it’s just not much work vs the overheads.

Thanks, that’s cleared it up a lot. I appreciate the help and the clarity of your explanation.