In my code, I have an private array:
__private float foo;
To make sure it stays in the registers and out of high-latency local memory, I use array offsets that are computed at compile-time. When I’m finished filling it with computed data, I want to transfer the array to thread-specific offset in global memory.
If I do the transfer this way:
__local float bar = foo; *(output_data + get_global_id(0)) = test;
then it’s nice and fast. But if I do the transfer this way:
*(output_data + get_global_id(0)) = foo;
then it’s horribly slow. Really s-l-o-w.
I’ve gone through the various docs multiple times, but I still can’t figure out why this is. If it was a problem with global memory coalescing, wouldn’t it manifest itself in both examples? Can anyone enlighten me?