Apply multiple kernels "efficiently"

Hi,

I have been wondering what the most efficient way is to apply more than one kernel to a dataset. Take this case: say I want to apply a two-pass blur (assume a separate kernel script for each blur pass, i.e. 2 scripts in this case) to a 2D image (image2d_t), where (1) I apply the first-pass blur from the first kernel script and (2) apply a second-pass blur to that blurred image data, using a different kernel script. One very inefficient way to do this, which I thought about, is to literally duplicate every line of code that requires a kernel input, like this:

// paths of the two kernel source files
std::string kernel_1("first_kernel.cl");
std::string kernel_2("second_kernel.cl");

// one program per kernel file (file_as_string is pseudocode for the loaded source text)
cl::Program program1(context, kernel_1.file_as_string, true, &error);
cl::Program program2(context, kernel_2.file_as_string, true, &error);

// image data to be fed into the kernel
unsigned char *image = .....;

// create buffers etc.
.....

// command queue
cl::CommandQueue queue(context, device);

// run first kernel script and read its output back to the host
// .....(run first kernel code goes here)
unsigned char *first_output = ....;

// upload that output again and feed it into the second kernel
// .....(run second kernel code goes here)
unsigned char *second_output = ....;

However, this will not scale well, especially when multiple kernels are applied to larger images. Every time we apply a kernel to the image, we need to transfer the data from the host to the graphics card and back, so the transfer time grows with every kernel added (it doubles with two kernels).

So my Q is: is there a more performant way around this? How can I apply multiple kernels, without having to transfer data between the host<–>device every time I run a kernel?
I hope I explained it clearly.

Don’t know if this is still needed…

Are the blur kernels executing on the same device?

This would make the processing easier, without having to transfer data between devices. If not, the result of the first blur will have to be migrated (transferred/copied) to the other device before the second blur executes (see the info on “clEnqueueMigrateMemObjects”). This is probably not desired.

Are the blur kernels operating on the data in-place?

Consider executing the second blur after the first one has finished with the same image/buffer.

Or are the blur kernels putting the results into a second image/buffer?

Consider executing the second blur on the destination image/buffer of the first blur, and placing its results in the original source image/buffer. This would require giving both images/buffers read and write access.
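A minimal sketch of that ping-pong data flow, written in plain C++ so it runs without an OpenCL device. The Image struct, the 3-tap box blur, and the function names blur_horizontal, blur_vertical and two_pass_blur are all hypothetical stand-ins for the poster's device images and kernels, which the thread does not show. The point is only the order of reads and writes: pass one reads A and writes B, pass two reads B and writes A, and nothing is copied in between.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a device image: row-major, one float per pixel.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<float> pixels;

    Image(std::size_t w, std::size_t h, float fill = 0.0f)
        : width(w), height(h), pixels(w * h, fill) {}

    float at(std::size_t x, std::size_t y) const { return pixels[y * width + x]; }
    float& at(std::size_t x, std::size_t y) { return pixels[y * width + x]; }
};

// Stand-in for the first kernel: 3-tap horizontal box blur.
// Reads only from src, writes only to dst; edge pixels are clamped.
void blur_horizontal(const Image& src, Image& dst) {
    assert(src.width == dst.width && src.height == dst.height);
    for (std::size_t y = 0; y < src.height; ++y) {
        for (std::size_t x = 0; x < src.width; ++x) {
            std::size_t left = (x == 0) ? 0 : x - 1;
            std::size_t right = std::min(x + 1, src.width - 1);
            dst.at(x, y) = (src.at(left, y) + src.at(x, y) + src.at(right, y)) / 3.0f;
        }
    }
}

// Stand-in for the second kernel: 3-tap vertical box blur.
// Reads only from src, writes only to dst; edge pixels are clamped.
void blur_vertical(const Image& src, Image& dst) {
    assert(src.width == dst.width && src.height == dst.height);
    for (std::size_t y = 0; y < src.height; ++y) {
        for (std::size_t x = 0; x < src.width; ++x) {
            std::size_t up = (y == 0) ? 0 : y - 1;
            std::size_t down = std::min(y + 1, src.height - 1);
            dst.at(x, y) = (src.at(x, up) + src.at(x, y) + src.at(x, down)) / 3.0f;
        }
    }
}

// Ping-pong: pass 1 goes a -> scratch, pass 2 goes scratch -> a.
// The final result ends up in `a`; no intermediate copy is made.
void two_pass_blur(Image& a, Image& scratch) {
    blur_horizontal(a, scratch);
    blur_vertical(scratch, a);
}
```

On the device, the same pattern would mean creating both images with CL_MEM_READ_WRITE, enqueuing both kernels on one in-order command queue, uploading once before the first kernel, and reading back once after the second.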

Additionally, the blur kernels, using distinct names, could be placed into a single “.cl” file for the “cl::Program” object.

Good luck.

Thanks for replying.

Are the blur kernels operating on the data in-place?

Consider executing the second blur after the first one has finished with the same image/buffer.

Yeah, I actually found this to be much faster and better a few months ago when I was working on the kernels. I apply the first pass of the blur kernel to the image, store the result in a single buffer, then read from that same buffer and apply the second-pass blur. All the blur code for both passes is contained within the same kernel. The speedup from this technique was huge.