When is data transferred for clEnqueueNDRangeKernel?

I call clEnqueueNDRangeKernel many times with the same kernel on the same queue, and each time only a few of the parameters change.
I think only the changed parameters need to be transferred to the queue. Is that true?
I used to think it was clSetKernelArg that transferred the data, but I was told it is not.
Yet before I call clEnqueueNDRangeKernel, no other command looks more suspicious than clSetKernelArg.
Can anyone tell me what really happens?

My mental model is that clSetKernelArg just changes data on the host side, and clEnqueueNDRangeKernel is what sends it to the GPU queue. My assumption is that a copy of the parameters is made during that call. So you should be able to change only the few parameters that differ and enqueue the kernel again. If you are worried that argument changes after the enqueue somehow affect the already enqueued kernel, you can write a small test kernel to prove it one way or the other.

Thank you, Dithermaster.
What I am worried about is how long the bus is occupied.
If unchanged parameters still need to be transferred again and again, I will change my code.
My program gets the right results, but the run time does not change when I use 2 devices, and I am looking for the reason.

The maximum size of all parameters has to be less than 1K or 2K (I’m not going to check the spec, but it’s something like that). The setup for such a transfer costs more than the transfer itself. I’d be surprised if you could even measure a difference in launch time between a kernel with many parameters and one with few. Maybe. In any case, measure before assuming you need to optimize this. If kernel launch time is a big part of your runtime, maybe you’re launching too many short-running kernels and should figure out how to make each one run longer.

Thank you, Dithermaster.
Sorry for my poor English. I think I’d better present some code:

// the kernel has 4 arguments
size_t global_size[1];
global_size[0] = 10000;
for (j = 0; j < 10; j++) {
    cl_mem buff0 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,…);
    cl_mem buff1 = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,…);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buff0);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &buff1);
    for (i = 0; i < 10; i++) {
        cl_mem buff2 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,…);
        cl_mem buff3 = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,…);
        clSetKernelArg(kernel, 2, sizeof(cl_mem), &buff2);
        clSetKernelArg(kernel, 3, sizeof(cl_mem), &buff3);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_size, NULL, 0, NULL, NULL);
    }
    clEnqueueReadBuffer(queue, buff1, CL_TRUE, 0,…);
}

I think buff0 and buff1 are transferred to the device 10 times, and buff2 and buff3 are transferred to the device 100 times.
Is that right?

Your counts seem right. However, unless the contents of the buffers change for every use, there is no need to transfer them over and over; just leave them on the device and use them repeatedly. Since your pseudocode doesn’t show where the contents of the buffers come from, I can’t tell if they are unique for every kernel run.
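
To make the counting concrete, here is a rough sketch in plain C (no OpenCL calls) that just counts how many times a buffer is created with CL_MEM_COPY_HOST_PTR under the loop structure in the post, compared with creating each buffer once up front. The function names are made up for illustration, and the "hoisted" case assumes the contents of all four buffers never change, which the pseudocode does not tell us.

```c
#include <assert.h>

/* Counts buffer creations with CL_MEM_COPY_HOST_PTR for the loop structure
   in the post: two buffers per outer iteration (buff0, buff1) and two per
   inner iteration (buff2, buff3). Each creation copies host data into a
   fresh buffer, so this is also the number of host-side copies requested. */
static long uploads_as_posted(long outer, long inner)
{
    assert(outer >= 0 && inner >= 0);
    long per_outer = 2;   /* buff0, buff1 */
    long per_inner = 2;   /* buff2, buff3 */
    return outer * per_outer + outer * inner * per_inner;
}

/* Hypothetical variant: ASSUMES none of the four buffers change contents,
   so each is created once before the loops and reused by every launch. */
static long uploads_if_hoisted(void)
{
    return 4;
}
```

With the loop bounds from the post, uploads_as_posted(10, 10) comes to 20 + 200 = 220 buffer creations, against 4 if everything could be hoisted. Real code will likely land somewhere in between, depending on which buffers actually change between launches.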

I understand, Dithermaster. I’ll take care of the unchanged buffers.
That was why I thought clSetKernelArg transferred the data: the number of calls matches the number of transfers.