register to global memory performance mystery


In my code, I have an private array:

__private float foo[10][2];

To make sure it stays in the registers and out of high-latency local memory, I use array offsets that are computed at compile-time. When I’m finished filling it with computed data, I want to transfer the array to thread-specific offset in global memory.

If I do the transfer this way:

__local float bar = foo[0][0];
*(output_data + get_global_id(0)) = test;

then it’s nice and fast. But if I do the transfer this way:

*(output_data + get_global_id(0)) = foo[0][0];

then it’s horribly slow. Really s-l-o-w.

I’ve gone through the various docs multiple times, but I still can’t figure out why this is. If it was a problem with global memory coalescing, wouldn’t it manifest itself in both examples? Can anyone enlighten me?