I have been wondering what the optimal way is to apply more than one kernel to a dataset. Take this case: say I want to apply a two-pass blur (assume a separate kernel script for each pass, i.e. 2 scripts in this case) to a 2D image (image2d_t), where (1) I apply the first-pass blur from the first kernel script, then (2) apply the second-pass blur to that blurred image data using the second kernel script. One very inefficient way to do this, which I thought about, is to literally duplicate all the lines of code that require a kernel input, like this:
std::string kernel_1("first_kernel.cl");
std::string kernel_2("second_kernel.cl");

cl::Program program1(context, kernel_1.file_as_string, true, &error); // pseudocode: file contents as a string
cl::Program program2(context, kernel_2.file_as_string, true, &error);

// image data to be fed into the kernel
unsigned char *image = .....;

// create buffers etc.
.....

// command queue
cl::CommandQueue queue(context, device);

// run first kernel script and get the data output
// .....(run first kernel code goes here)
unsigned char *first_output = .....;

// feed the output into the second kernel
// .....(run second kernel code goes here)
unsigned char *second_output = .....;
However, this will not scale well, especially when multiple kernels are applied to larger images: every time we want to apply a kernel to the image, we need to transfer the data from the host to the graphics card and back, so the transfer time doubles here and keeps growing with every kernel that is added.
So my question is: is there a more performant way around this? How can I apply multiple kernels without having to transfer the data between host and device every time I run a kernel?
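To make it concrete, what I imagine (I'm not sure this is the right way, and the kernel names and buffer sizes below are just placeholders) is chaining the two kernels through a device-side buffer in the same command queue, so the intermediate result never comes back to the host:

```cpp
// Sketch only, continuing the snippet above (context, queue, program1,
// program2, image, width, height assumed to exist). The intermediate
// buffer d_tmp stays on the device, so there is one upload and one
// download in total, regardless of how many kernels are chained.
size_t imageBytes = width * height * 4; // assuming RGBA, 1 byte per channel

cl::Buffer d_in (context, CL_MEM_READ_ONLY,  imageBytes);
cl::Buffer d_tmp(context, CL_MEM_READ_WRITE, imageBytes); // never read back
cl::Buffer d_out(context, CL_MEM_WRITE_ONLY, imageBytes);

// host -> device, once
queue.enqueueWriteBuffer(d_in, CL_FALSE, 0, imageBytes, image);

cl::Kernel blur1(program1, "blur_pass1"); // placeholder kernel name
blur1.setArg(0, d_in);
blur1.setArg(1, d_tmp);
queue.enqueueNDRangeKernel(blur1, cl::NullRange, cl::NDRange(width, height));

cl::Kernel blur2(program2, "blur_pass2"); // placeholder kernel name
blur2.setArg(0, d_tmp); // reuse the device buffer directly, no round trip
blur2.setArg(1, d_out);
queue.enqueueNDRangeKernel(blur2, cl::NullRange, cl::NDRange(width, height));

// device -> host, once
std::vector<unsigned char> second_output(imageBytes);
queue.enqueueReadBuffer(d_out, CL_TRUE, 0, imageBytes, second_output.data());
```

Is something along these lines the intended approach, or is there a better mechanism for this?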
Hope I explained it clearly.