OpenCL should do this automatically. Simply create your cl_mem objects, write the initial data into them, and then enqueue your kernels in the order you want them executed. The runtime will try to keep the data on the device for as long as possible, so as long as all the data fits in device memory, you should get the best performance. If, for example, the data for kernel A fits all at once, but kernel B requires other data that does not fit alongside kernel A’s data, then the runtime will have to page data on and off the device.
My advice is to allocate your memory objects without CL_MEM_USE_HOST_PTR (that flag may incur extra work to keep the host pointer synchronized) and then just enqueue your kernels. As long as you don’t issue a clEnqueueReadBuffer or clEnqueueWriteBuffer, the data should stay on the card. However, if your command queue is out-of-order, make sure you use events to enforce the order in which your kernels execute. (If it’s in-order, you can just enqueue them in the order you want.)