Overhead caused by multi-device usage with dynamic workload distribution

I implemented an OpenCL application that performs matrix-vector multiplications with dynamic workload distribution.
The matrix and the vector are split into chunks, which are enqueued into a work queue.
For every OpenCL device a thread is started on the host that removes a chunk from the queue, sends the data to the attached OpenCL device and receives the result.
I ran some tests and got unexpected results.

The available OpenCL devices are:
2x GPU: Nvidia Tesla K20c
1x CPU: Intel E5 1620 v2 (which is also the host device)

In the first test I used only the CPU.
I split the workload into 4 chunks and let the CPU process all of them, measuring how long the CPU (as an OpenCL device) needs for a single chunk.
The result: it takes 13 ms to process one of the four chunks.

In the second test I used both GPUs as well as the CPU. Again I split the workload into 4 chunks.
The CPU processed two chunks and each GPU processed one.
This time the CPU took 32 ms to process a single chunk.

I figured out that the enqueueWriteBuffer calls that send the chunks to the GPUs cause the overhead of 19 ms (32 ms - 13 ms).
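For attribution I used OpenCL profiling events around the write; a minimal sketch (assuming `ctx`, `dev`, `buf`, `src` and `bytes` already exist; OpenCL 1.x API):

```c
/* Time a single chunk transfer with OpenCL profiling events. */
cl_int err;
cl_command_queue q = clCreateCommandQueue(ctx, dev,
                                          CL_QUEUE_PROFILING_ENABLE, &err);

cl_event evt;
clEnqueueWriteBuffer(q, buf, CL_FALSE, 0, bytes, src, 0, NULL, &evt);
clWaitForEvents(1, &evt);

cl_ulong t_start, t_end;
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(t_end), &t_end, NULL);
printf("write took %.3f ms\n", (t_end - t_start) * 1e-6);
clReleaseEvent(evt);
```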

Now I am trying to understand where this 19 ms overhead comes from. I use pinned memory to copy the data asynchronously to the GPUs.
My guess is that the OpenCL devices have to share the memory bus, so the CPU cannot access the memory without contention.
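My transfer path looks roughly like this sketch (`ctx`, `q` and `bytes` are placeholders; `CL_MEM_ALLOC_HOST_PTR` requests pinned host memory so the write can run as an asynchronous DMA):

```c
/* Sketch of the pinned-memory transfer path. */
cl_int err;
cl_event evt;

/* Pinned staging buffer on the host side. */
cl_mem staging = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                bytes, NULL, &err);
float *host = (float *)clEnqueueMapBuffer(q, staging, CL_TRUE, CL_MAP_WRITE,
                                          0, bytes, 0, NULL, NULL, &err);
/* ... fill host[] with the chunk ... */

/* Device buffer; the non-blocking write returns immediately and the DMA
 * engine copies from the pinned region in the background. */
cl_mem dev_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, bytes, host, 0, NULL, &evt);

/* Only after the transfer completes:
 * clEnqueueUnmapMemObject(q, staging, host, 0, NULL, NULL); */
```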

Has anybody encountered a similar problem, can explain the overhead, or knows where to find information regarding this problem?