I want to chain together multiple kernels. Is it better to call the kernel functions from within a kernel, or to launch them from the host?
Pseudo code below:
Kernel calling kernel:
__kernel void vsubtract(__global float *a, __global float *b, __global float *c, const unsigned int count, unsigned int red)
{
    int i = get_global_id(0);
    if (i < count) {
        a[i] = b[i] - a[i];
        c[i] = a[i];
        a[i] = a[i] * a[i];
    }
    // call reduction kernel
    reduction(a, count, red);
}
Or host calling kernels:
vsubtract(cl::EnqueueArgs(queue, cl::NDRange(count), cl::NDRange(local)), d_a, d_b, d_c, count, red);
queue.enqueueReadBuffer(d_a, CL_TRUE, 0, sizeof(float) * LENGTH, &vector_a);
reduction(cl::EnqueueArgs(queue, cl::NDRange(count), cl::NDRange(local)), d_a, count, red);
I would assume it is faster to have the kernel call the other kernels, to avoid the additional data transfer between the host and the device.
Are there any issues that I need to be aware of if I have kernels calling kernels?
In most cases, your intuition is correct: Calling the two kernels as functions from within a unified kernel is usually preferable. There are a few things to consider when doing this:
- Sizes of the global and local NDRanges. It looks like they are the same in your case, which makes it easy. If they’re not the same, then you need to find a way to map the unified kernel’s NDRange sizes to the component kernels’ NDRange sizes.
- Cross-kernel dependencies. If your second kernel depends on the results of the first kernel, you need to make sure that dependency is satisfied. In your example, it looks like the first kernel produces a[i] and c[i]. As long as the second kernel only needs the values computed by the same work-item, you’re OK. However, I’m a bit nervous given that the name of your second kernel is “reduction.” Typically, a reduction will want to see the results of multiple work-items’ computation (a[i-1], for example), and a barrier only synchronizes work-items within a single work-group, not across the whole NDRange. Watch out for this one.
- Skipping memory references. If your second kernel only depends on the values the first kernel produces for the same work-item (a[i] and c[i] here), and nothing else needs these values, consider just passing them into the function as floats and skip writing them out to memory. It might mean rethinking the call signature for the second kernel, but it might result in improved performance (if the compiler is not already doing this for you).
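To make the last two points concrete, here is a rough sketch in plain C++ that models the fused kernel on the CPU. It is not OpenCL code: each loop iteration stands in for one work-item, and a local vector stands in for __local memory. The names subtract_and_square, fused_group and sum_squared_diff are made up for illustration, and I am assuming your reduction is a plain sum. The element-wise step returns its result as a float instead of writing it back to a[i], and the reduction only starts once every work-item in the group has finished that step.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical element-wise helper standing in for the body of vsubtract.
// It returns the squared difference by value instead of storing it in a[i],
// and writes the plain difference to c_out.
inline float subtract_and_square(float a, float b, float& c_out) {
    const float diff = b - a;
    c_out = diff;
    return diff * diff;
}

// CPU model of one work-group covering elements [begin, end).
// Phase 1 is element-wise. Phase 2 (the reduction) reads values produced by
// other work-items, so it may only start after phase 1 has finished for the
// whole group.
float fused_group(const std::vector<float>& a, const std::vector<float>& b,
                  std::vector<float>& c, std::size_t begin, std::size_t end) {
    std::vector<float> scratch(end - begin);  // stands in for __local memory
    for (std::size_t i = begin; i < end; ++i) {
        scratch[i - begin] = subtract_and_square(a[i], b[i], c[i]);
    }
    // In a real kernel, barrier(CLK_LOCAL_MEM_FENCE) would be needed here.
    float partial = 0.0f;
    for (float v : scratch) {
        partial += v;
    }
    return partial;
}

// Combines the per-group partial sums. On a device this last step needs a
// second kernel launch or a short host loop, because separate work-groups
// cannot synchronize with each other inside one launch.
float sum_squared_diff(const std::vector<float>& a, const std::vector<float>& b,
                       std::vector<float>& c, std::size_t group_size) {
    if (group_size == 0) {
        return 0.0f;  // nothing sensible to do with an empty group size
    }
    float total = 0.0f;
    for (std::size_t begin = 0; begin < a.size(); begin += group_size) {
        const std::size_t end = std::min(begin + group_size, a.size());
        total += fused_group(a, b, c, begin, end);
    }
    return total;
}
```

In this model a is never overwritten, which is the "skip the memory reference" idea. Whether that is acceptable depends on whether anything else needs the squared values in a afterwards. Also note that the final combine across groups is exactly the part that cannot be folded into a single launch without some extra mechanism.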