Using the prefetch command


I’ve been trying to use prefetch to improve my performance, but haven’t seen any impact one way or another. I wonder if I’m using the command the correct way. I haven’t been able to find any code samples that show its correct use.

My code uses a loop to read chunks of data from global memory into local memory and then process them. At the top of each iteration I call barrier() to synchronize the threads, then async_work_group_copy() followed by wait_group_events() to transfer the chunk to local memory. Right after that, I issue a prefetch() for the next chunk of global memory and then process the data in local memory. I expected the next transfer to local memory, at the top of the following iteration, to be faster, but as I said, I don’t see any performance payoff.
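For reference, the loop described above can be sketched roughly as follows. This is only an illustration of the pattern, not the actual code: the kernel name process_chunks, the chunk size CHUNK, the float element type, and the summing step that stands in for the real processing are all assumptions. The block guarded by #ifndef __OPENCL_VERSION__ is a set of single-work-item stand-ins that lets a plain C compiler check the file; an OpenCL compiler skips it and uses the real built-ins.

```c
/* Sketch only. CHUNK, process_chunks and the summing step are hypothetical. */

#ifndef __OPENCL_VERSION__
/* Host-side stand-ins so a plain C compiler can check this file.
   They model a single work-item; an OpenCL compiler skips this block. */
#include <stddef.h>
#include <string.h>
#define __kernel
#define __global
#define __local
#define CLK_LOCAL_MEM_FENCE 1
typedef int event_t;
static size_t get_local_id(unsigned int dim)   { (void)dim; return 0; }
static size_t get_local_size(unsigned int dim) { (void)dim; return 1; }
static size_t get_group_id(unsigned int dim)   { (void)dim; return 0; }
static void barrier(int flags) { (void)flags; }
static event_t async_work_group_copy(float *dst, const float *src,
                                     size_t n, event_t ev)
{ memcpy(dst, src, n * sizeof(float)); return ev; }
static void wait_group_events(int n, event_t *evs) { (void)n; (void)evs; }
static void prefetch(const float *p, size_t n) { (void)p; (void)n; }
#endif

#define CHUNK 256  /* hypothetical chunk size, in floats */

/* Each work-group walks num_chunks consecutive chunks of `in`. */
__kernel void process_chunks(__global const float *in,
                             __global float *out,
                             __local float *tile,
                             int num_chunks)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);
    size_t first = get_group_id(0) * (size_t)num_chunks;
    float acc = 0.0f;

    for (int c = 0; c < num_chunks; ++c) {
        __global const float *src = in + (first + (size_t)c) * CHUNK;

        /* Nobody may still be reading the previous tile. */
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Copy the current chunk to local memory. Waiting immediately
           makes this behave like a synchronous copy. */
        event_t ev = async_work_group_copy(tile, src, CHUNK, 0);
        wait_group_events(1, &ev);

        /* Hint that the next chunk will be needed soon. The
           implementation is free to ignore this. */
        if (c + 1 < num_chunks)
            prefetch(src + CHUNK, CHUNK);

        /* Placeholder processing: strided sum over the tile. */
        for (size_t i = lid; i < CHUNK; i += lsz)
            acc += tile[i];
    }

    out[get_group_id(0) * lsz + lid] = acc;
}
```

Note that prefetch() is only a hint, so a correct kernel should produce the same results with or without that call; only the timing could differ.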

Am I misunderstanding how to use prefetch()? Can anyone point me to the correct usage?

BTW, I’m using a compute capability 1.1 card.

Many thanks!

My understanding is that prefetch() and async_work_group_copy() aren’t really supported in hardware on most (if any) GPUs. They might give great performance on something like a Cell system, which has good DMA support. My guess is that prefetch() is just a no-op on most systems today and that async_work_group_copy() does a non-async copy in reality. So I would not expect better performance today, but that might change in the future.

OK, good to know. It does seem that the async copy does the copy in a coalesced manner – that’s what the profiler says – but since I have a wait right after it, I’ve been using it as if it was a synchronous copy.