Just in case some clarification is needed: each workitem needs to access two points, one at index ‘i’ and another at index ‘j’. Reading the ‘i’ point is coalesced because ‘i’ is the first coordinate in the 2D kernel, so consecutive workitems access consecutive floats in the array. However, reading the ‘j’ point isn’t coalesced, because consecutive workitems share the same second coordinate (it only changes once the first coordinate wraps around).
So, consecutive workitems access the same array element when reading the ‘j’ point. I’d have assumed that consecutive workitems reading the same element would count as a kind of coalesced access, but the GPU hardware doesn’t seem to agree with me: if I modify the kernel so that the ‘j’ read is a fake coalesced one, performance increases by 20%.
Again, I lack a kernel profiler in my OS, so all of this is by trial and error.
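Without a profiler, one way to at least sanity-check the reasoning above is a small CPU-side model of which array indices a run of consecutive workitems touches. This is only a sketch in plain Python, not OpenCL code. The sizes `NUM_POINTS` and `GROUP` and both function names are made up for illustration, and it assumes workitems are linearized with the first coordinate varying fastest.

```python
# CPU-side model of the access pattern; this is NOT OpenCL code.
# Assumptions (hypothetical values): 8 points, and 4 consecutive
# workitems that issue their memory reads together.
NUM_POINTS = 8
GROUP = 4


def linear_to_ij(linear_id, num_points=NUM_POINTS):
    """Map a linear workitem id to (i, j), with i (first coordinate) varying fastest."""
    return linear_id % num_points, linear_id // num_points


def addresses_read(first_id, group=GROUP):
    """Return the 'i' indices and 'j' indices read by `group` consecutive workitems."""
    pairs = [linear_to_ij(first_id + k) for k in range(group)]
    return [i for i, _ in pairs], [j for _, j in pairs]


i_reads, j_reads = addresses_read(first_id=8)
print("i indices:", i_reads)  # consecutive elements: [0, 1, 2, 3]
print("j indices:", j_reads)  # the same element every time: [1, 1, 1, 1]
```

The model only shows the index pattern. Whether a same-address read is served as a single broadcast or as several transactions depends on the hardware, which is exactly what the 20% difference hints at.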
It depends. If the array rarely changes and you need to access it a lot, make a copy of it in i,j (instead of j,i) order. Alternatively, store it in an OpenCL image, which has more even access times in Y vs. X. Or learn how to use local memory (the memory shared within a work group) so that a work group teams up to do coalesced reads of the j data that its other work items will need. These are just some ideas; without knowing more about your problem they are only guesses.
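To illustrate the local memory idea, here is a rough CPU-side model in plain Python (not OpenCL code) of one work group staging its j data. It assumes a 1D array of points read by a 2D kernel. The work-group shape `LX` by `LY`, the function names, and the `tile` list standing in for a `__local` array are all hypothetical. The sketch also assumes `LX >= LY`, so that the first `LY` work items of the row with local j index 0 can each load one j point.

```python
# CPU-side model of staging j data in local memory; NOT OpenCL code.
# Hypothetical work-group shape: LX work items along i, LY along j.
LX, LY = 8, 4


def naive_j_loads(group_j):
    """Every work item fetches its own j point straight from global memory."""
    loads = []
    for ly in range(LY):
        for lx in range(LX):
            loads.append(group_j * LY + ly)  # same index for a whole row of work items
    return loads


def tiled_j_loads(points, group_j):
    """The first LY work items stage the group's j points in a local tile."""
    global_loads = []
    tile = [None] * LY                # stands in for a __local array
    for lx in range(LY):              # requires LX >= LY
        idx = group_j * LY + lx       # consecutive work items, consecutive addresses
        tile[lx] = points[idx]
        global_loads.append(idx)
    # A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here,
    # before any work item reads the tile.
    values = [[tile[ly] for _ in range(LX)] for ly in range(LY)]
    return global_loads, values


points = [1.5 * k for k in range(16)]
naive = naive_j_loads(group_j=2)
staged, values = tiled_j_loads(points, group_j=2)
print(len(naive), "global j reads without the tile")  # 32
print(len(staged), "global j reads with the tile")    # 4
```

In this model the work group issues one coalesced run of `LY` global reads for the j data instead of `LX * LY` reads, and every work item then picks up its j point from the tile. Whether that pays off on a given GPU is something only measurement can tell.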
The array is 1D (it’s just an array of vertices). The kernel is 2D. It’s a numpoints^2 problem (there must be a workitem for every possible pair of vertices from the 1D array). So, each workitem reads the ‘i’ vertex (coalesced because ‘i’ is the first coordinate in the kernel) and also reads the ‘j’ vertex (uncoalesced because consecutive workitems have the same second coordinate).