How many transaction is required for one thread access a vector4 variable?

missyou · January 18, 2019, 8:18am

From the Nvidia programming guide, if a half-wrap of 16 thread access a consecutive global memory, then it may coalesce into one transaction. However, I’m wondering if one thread needs to access a consecutive global memory, would it coalesce into one transaction? I’ve tried to search on google but it seems like no one discuss this before. Here is an example.

kernel void opencl_test(global double* src, global uint * offset){
  uint gpu_global_id=get_global_id(0);
  //Cast a double-type variable into a double4 vector
  double4* src_vec=src+offset[gpu_global_id];
  //Do some operations on src_vec.
}

In my application, it is hard to know how would each thread access the global memory. It is only safe to assume the thread will access a consecutive memory, so that’s why I would like to do such optimization, but I’m not sure if the memory access in this example would be coalesced? If not, is there any way to optimize the code? thanks in advance!

Dithermaster · January 19, 2019, 2:06pm

Coalescing happens with many different size reads. A helpful mental model is to think of the reads from the first work item, and then if the reads from the second work item are adjacent, they will coalesce, limited by alignment and the maximum read size of the SM. Keep considering work items until you’re reached the maximum read width. So (for example) if the read size is 64 bytes, 64 work items reading bytes 0-63 would coalesce, as would 8 work items reading 8 bytes each (e.g., a float2).

Note that there is only one read in your example, the one from offset, which is based on the global_id. So this will be coalesced. However, reads from the computed src_vec address may not be depending on values found in offset array.

Also, your comment says "Do some operations on src_vec " – if these are read operations, you’d be best off reading once and re-using that value.

missyou · January 21, 2019, 1:36pm

[QUOTE=Dithermaster;44163]Coalescing happens with many different size reads. A helpful mental model is to think of the reads from the first work item, and then if the reads from the second work item are adjacent, they will coalesce, limited by alignment and the maximum read size of the SM. Keep considering work items until you’re reached the maximum read width. So (for example) if the read size is 64 bytes, 64 work items reading bytes 0-63 would coalesce, as would 8 work items reading 8 bytes each (e.g., a float2).

Note that there is only one read in your example, the one from offset, which is based on the global_id. So this will be coalesced. However, reads from the computed src_vec address may not be depending on values found in offset array.

Also, your comment says "Do some operations on src_vec " – if these are read operations, you’d be best off reading once and re-using that value.[/QUOTE]

Thanks for your reply. As your said “if the read size is 64 bytes, 64 work items reading bytes 0-63 would coalesce, as would 8 work items reading 8 bytes each”, do you imply that if 1 work items read 64 bytes them only 1 transaction is required? I might improperly use the terminology “coalesce” , because there is no second worker to read the adjacent memory, one worker is trying to ultilize all the read bandwidth, so I’m curious in that would there be any performance benifit in using double4 type(or even double8 for 64 bype) than the regular double type? Although Nvidia said there is no benifit in performance but compatibility. However, if there is 8 transactions for a worker to read a consecutive 8 double values, compared with 1 transactions to read a double8 values, I will wonder the actual performance would be different.

Dithermaster · January 23, 2019, 10:31pm

Right, coalesce only applies to using a single read to service multiple work items. If you’re able to read the full read width in a single work item you aren’t going to coalesce but you may be able to use full read bandwidth. My choice of 64 bytes was just an example; check your own hardware for what your read width is. You’d need to use a vector type to get that wide (e.g., int16, long8, float16).