The overhead of calling clEnqueueNDRangeKernel should be fairly small.
I suspect the problem is the data transfer. If you create the buffer with clCreateBuffer and CL_MEM_COPY_HOST_PTR, the runtime may defer the copy to the device until you call clEnqueueNDRangeKernel, because a buffer belongs to a context and only at that point does the runtime know which device will use the data. Try copying the data explicitly with clEnqueueWriteBuffer before launching the kernel and see if it makes a difference.
There may also be some extra overhead associated with the first launch of a kernel. You should measure several kernel launches and then average the results.
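As a rough sketch of that kind of measurement (plain C, no OpenCL calls), the helpers below time a callback several times and average the samples while skipping the first ones. The names `launch_fn`, `time_launches`, `mean_after_warmup`, and `count_launch` are made up for illustration. In real host code the callback would enqueue the kernel and then call clFinish, so that each sample covers a completed launch rather than just the enqueue.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one unit of work to be timed. In OpenCL host
   code this would enqueue the kernel and then block until it has finished. */
typedef void (*launch_fn)(void *user_data);

/* Wall-clock time in seconds, using C11 timespec_get. */
double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Calls `launch` `count` times and stores the duration of each call. */
void time_launches(launch_fn launch, void *user_data,
                   double *samples, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        double start = now_seconds();
        launch(user_data);
        samples[i] = now_seconds() - start;
    }
}

/* Mean of samples[skip..count-1]; skipping the first entries leaves the
   warm-up launches out of the average. */
double mean_after_warmup(const double *samples, size_t count, size_t skip)
{
    assert(skip < count);
    double sum = 0.0;
    for (size_t i = skip; i < count; ++i)
        sum += samples[i];
    return sum / (double)(count - skip);
}

/* Stand-in for a real launch: it only counts how often it was called. */
void count_launch(void *user_data)
{
    ++*(int *)user_data;
}
```

With samples from, say, 10 launches, `mean_after_warmup(samples, 10, 1)` gives the average without the first launch, and comparing it with `samples[0]` shows how much extra the first launch costs.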
The first call to enqueueNDRangeKernel() carries a noticeable overhead that scales with the size of the buffers passed as kernel arguments, even if those buffers have already been written to the device, at least with Apple’s implementation in Snow Leopard. From what I’ve found, this can be alleviated by invoking a dummy kernel (i.e., one with no instructions) with the same arguments before running the actual kernel. Alternatively, just invoke the original kernel repeatedly, although of course this will take longer.