I’m currently writing a small project in OpenCL, and I’m trying to find out what really makes memory coalescing matter. Every book on GPGPU programming says it’s how GPGPUs should be programmed, but not why the hardware prefers it.
So is it some special hardware component which merges data transfers? Or is it simply to better utilize the cache? Or is it something completely different?
Checking the address requests of neighboring work-items must be easier than checking a single work-item’s future address requests, since the latter would need extra processing. Fetching all future data for one work-item would also require more internal memory and increase register pressure. My “guess” is that there is a piece of hardware, maybe something like a “sorting network”, that reorders all memory requests in a compute unit much faster than a software implementation could.
I’m not a GPU architecture expert, but on CPUs, there are at least two reasons to prefer accessing neighbouring memory addresses:
[ol]
[li]Data is fetched from memory in large chunks. If you access neighbouring data, you use the whole chunk, whereas if you access scattered data, some of that work is wasted.[/li]
[li]The memory subsystem has a very high latency. One way to hide this latency is for hardware to speculate on future memory accesses. For this, you need a predictable memory access pattern, and the default one assumed by hardware is that you’re reading every byte from the first to the last in a memory range. If you do not match this assumption, then you will pay the full memory latency cost.[/li]
[/ol]
Can anyone share a link to a paper / website giving guidelines and methods to achieve this on some specific hardware? I realize this is very hardware dependent, but any example would be a welcome starting point. Does alignment matter? Does stride matter? How does the cache structure influence this? Does coalesced access apply to 2D and 3D processing (e.g. images stored in texture memory)?
Memory banks and memory channels are interleaved in the address space. Otherwise, all accesses to the first n bytes would be serialized on a single bank/channel, giving bad performance, until addresses beyond n were used.
Because of interleaving, it may be better to use a small prime number as the stride. That way the work-items spread across more channels and banks. When you pick a stride that is a big power-of-two value, some banks and channels are never used, hence, low performance.
[QUOTE=Tugrul;42786]Memory banks and memory channels are interleaved in the address space. Otherwise, all accesses to the first n bytes would be serialized on a single bank/channel, giving bad performance, until addresses beyond n were used.
Because of interleaving, it may be better to use a small prime number as the stride. That way the work-items spread across more channels and banks. When you pick a stride that is a big power-of-two value, some banks and channels are never used, hence, low performance.[/QUOTE]
Can you explain this in more detail? I am having this problem, and my algorithm shows poor performance when running in NDRange mode. Also, if possible, how should one handle this when using structures instead of simple float/integer buffers? Thank you very much.
Consider a CPU with dual-channel memory. In this case memory addresses are distributed by interleaving: address 100 is located at channel 1, 108 is at channel 2, 116 at channel 1, etc. Imagine you iterate over an array of structures consisting of two doubles and only use one component of them. This way you only ever use one channel, effectively halving your memory bandwidth. This is called a channel conflict. Because GPU RAM generally has a very wide bus, this effect is radically more impactful there. To mitigate it, your structures should not be power-of-two sized (i.e. use padding). I’m not positive this is what constitutes “coalesced” though. The optimal memory access pattern is when a subgroup picks up a contiguous chunk of memory (addresses should probably be in ascending order too).
Are you telling me not to use structures whose size is a power of 2? That is what is usually suggested. Also, I am mainly programming for FPGAs, and I usually read that memory should be allocated with 64-byte alignment. I am trying to optimize my code, but I find it hard to understand which programming style/paradigm to follow, since there is not a lot of material online. Could someone help me figure out this problem? Thanks
Your data can be both 64-byte aligned and not power-of-two sized. By far the simplest paradigm, which will resolve most of your memory-related problems, is to use multiple arrays of basic data types instead of a single array of structures: https://en.wikipedia.org/wiki/AOS_and_SOA