I have been wondering what the optimal way is to apply more than one kernel to a dataset. Take this case: say I want to apply a two-pass blur (assume a separate kernel script for each pass, i.e. 2 scripts in this case) to a 2D image (image2d_t), where (1) I apply the first-pass blur from the first kernel script, then (2) apply the second-pass blur to that blurred image data using the second kernel script. One very inefficient way to do this, which I thought about, is to literally duplicate all the lines of code that require a kernel input, like this:
std::string kernel_1("first_kernel.cl");
std::string kernel_2("second_kernel.cl");

cl::Program program1(context, kernel_1.file_as_string, true, &error); // pseudocode: file contents as a string
cl::Program program2(context, kernel_2.file_as_string, true, &error);

// image data to be fed into the kernel
unsigned char *image = .....;

// create buffers etc.
.....

// command queue
cl::CommandQueue queue(context, device);

// run first kernel script and get the data output
// .....(run first kernel code goes here)
unsigned char *first_output = .....;

// feed the output into the second kernel
// .....(run second kernel code goes here)
unsigned char *second_output = .....;
However, this will not scale well, especially when multiple kernels are applied to larger images: every time we want to apply a kernel to the image, we need to transfer the data from the host to the graphics card and back, so the transfer time doubles here and keeps growing with every kernel that is added.
So my question is: is there a more performant way around this? How can I apply multiple kernels without having to transfer the data between host and device every time I run a kernel?
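To make it concrete, what I imagine (I'm not sure this is the right way, and the kernel names and buffer sizes below are just placeholders) is chaining the two kernels through a device-side buffer in the same command queue, so the intermediate result never comes back to the host:

```cpp
// Sketch only, continuing the snippet above (context, queue, program1,
// program2, image, width, height assumed to exist). The intermediate
// buffer d_tmp stays on the device, so there is one upload and one
// download in total, regardless of how many kernels are chained.
size_t imageBytes = width * height * 4; // assuming RGBA, 1 byte per channel

cl::Buffer d_in (context, CL_MEM_READ_ONLY,  imageBytes);
cl::Buffer d_tmp(context, CL_MEM_READ_WRITE, imageBytes); // never read back
cl::Buffer d_out(context, CL_MEM_WRITE_ONLY, imageBytes);

// host -> device, once
queue.enqueueWriteBuffer(d_in, CL_FALSE, 0, imageBytes, image);

cl::Kernel blur1(program1, "blur_pass1"); // placeholder kernel name
blur1.setArg(0, d_in);
blur1.setArg(1, d_tmp);
queue.enqueueNDRangeKernel(blur1, cl::NullRange, cl::NDRange(width, height));

cl::Kernel blur2(program2, "blur_pass2"); // placeholder kernel name
blur2.setArg(0, d_tmp); // reuse the device buffer directly, no round trip
blur2.setArg(1, d_out);
queue.enqueueNDRangeKernel(blur2, cl::NullRange, cl::NDRange(width, height));

// device -> host, once
std::vector<unsigned char> second_output(imageBytes);
queue.enqueueReadBuffer(d_out, CL_TRUE, 0, imageBytes, second_output.data());
```

Is something along these lines the intended approach, or is there a better mechanism for this?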
Hope I explained it clearly.