Apply multiple kernels "efficiently"

Hi,

I have been wondering what the most efficient way is to apply more than one kernel to a dataset. Take this case: I want to apply a two-pass blur to a 2D image (image2d_t), with a separate kernel script for each pass (i.e. 2 scripts in this case): (1) apply the first-pass blur from the first kernel script, then (2) apply the second-pass blur to that blurred image data, using a different kernel script. One very inefficient way to do this, which I thought about, is to literally duplicate every piece of code that requires a kernel input, like this:

std::string kernel_1("first_kernel.cl");
std::string kernel_2("second_kernel.cl");

cl::Program program1(context, kernel_1.file_as_string, true, &error);
cl::Program program2(context, kernel_2.file_as_string, true, &error);

//image data to be fed into kernel
unsigned char *image=.....;

//create buffers etc.
.....

//command queue
cl::CommandQueue queue(context, device);

//run first kernel script and get data output
//.....(run first kernel code goes here)
unsigned char *first_output=....;

//feed output into second kernel
//.....(run second kernel code goes here)
unsigned char *second_output=....;

However, this will not scale well, especially when multiple kernels are applied to larger images. Every time we want to apply a kernel to the image, we need to transfer the data from host->graphics card and back, so the transfer time grows with every kernel added (two kernels means twice the transfers).
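To make the scaling concern concrete, here is a rough sketch of the transfer counts involved. It is only a cost model, not OpenCL code: the function names are hypothetical, and it assumes each transfer copies the whole image and that every image in the chain has the same size.

```cpp
#include <cstddef>

// Hypothetical cost model (not part of the OpenCL API).
// Counts full-image copies across the host<->device bus.

// Approach from the post: every pass uploads its input and
// downloads its output, so each pass costs two transfers.
std::size_t transfers_round_trip(std::size_t passes) {
    return 2 * passes;
}

// What I would like instead: upload once, keep the intermediate
// images on the device, and download only the final result.
std::size_t transfers_device_resident(std::size_t passes) {
    return passes == 0 ? 0 : 2;
}

// Total bytes moved, assuming `bytes_per_pixel` bytes per pixel
// and a full-image copy on every transfer.
std::size_t bytes_moved(std::size_t transfers, std::size_t width,
                        std::size_t height, std::size_t bytes_per_pixel) {
    return transfers * width * height * bytes_per_pixel;
}
```

For the two-pass blur on a 1920x1080 image with 4 bytes per pixel, bytes_moved(transfers_round_trip(2), 1920, 1080, 4) gives 33177600 bytes, while bytes_moved(transfers_device_resident(2), 1920, 1080, 4) gives 16588800 bytes, and the gap widens with every extra pass.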

So my question is: is there a more performant way to do this? How can I apply multiple kernels without having to transfer data between the host<->device every time I run a kernel?
Hope I explained it clearly.