Blocking memory transfers faster?

Hi,

I recently discovered something in my application that seems weird to me. I am using the C++ wrapper for OpenCL on Lion with an AMD graphics card. When I set the blocking flag in my cl::CommandQueue::enqueueWriteBuffer call to true, all my memory transfers are about an order of magnitude faster than when the call is non-blocking. (I measured this with the corresponding cl::Events, using CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START.) Could it be that setting the call to non-blocking makes the queue try to transfer everything at once instead of in order?
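
To make the measurement concrete, here is a minimal sketch of the interval arithmetic, assuming the four profiling timestamps (in nanoseconds) have already been read from a cl::Event. The names EventTimes, Breakdown, ns_to_ms and break_down are hypothetical helpers for illustration only, not part of OpenCL. Note that END - START covers only the execution of the command itself, whereas QUEUED to END also includes the time the command spent waiting in the queue.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the four timestamps (nanoseconds) that
// cl::Event::getProfilingInfo<CL_PROFILING_COMMAND_*>() reports for
// a single command. The values are assumed to be read elsewhere.
struct EventTimes {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t submit;  // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// Hypothetical result type: each interval in milliseconds.
struct Breakdown {
    double queued_to_submit_ms;  // host-side enqueue until submission to the device
    double submit_to_start_ms;   // waiting on the device before execution begins
    double start_to_end_ms;      // execution only (the END - START figure)
    double queued_to_end_ms;     // full lifetime of the command
};

inline double ns_to_ms(std::uint64_t ns) {
    return static_cast<double>(ns) / 1.0e6;
}

inline Breakdown break_down(const EventTimes& t) {
    // The profiling timestamps of one command are expected to be ordered.
    assert(t.queued <= t.submit && t.submit <= t.start && t.start <= t.end);
    Breakdown b;
    b.queued_to_submit_ms = ns_to_ms(t.submit - t.queued);
    b.submit_to_start_ms  = ns_to_ms(t.start - t.submit);
    b.start_to_end_ms     = ns_to_ms(t.end - t.start);
    b.queued_to_end_ms    = ns_to_ms(t.end - t.queued);
    return b;
}
```

Comparing start_to_end_ms with queued_to_end_ms for the same transfer shows how much of the total time falls outside the END - START window.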

Funnily enough, when I use my CPU timer, running from the start of my first CL call until after my cl::CommandQueue::finish call, it tells me that the non-blocking version is actually faster.
I am now unsure which timer to trust.

Has anybody had a similar issue?

Thank you
Herbert