Does vload4 have any advantage over four individual buffer accesses for a local memory buffer?
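__local int FOO[4];  // any local buffer holding at least four ints
// case 1: one vector load of FOO[0..3]
int4 pixel = vload4(0, FOO);
// case 2: four scalar loads of the same elements
pixel.x = FOO[0];
pixel.y = FOO[1];
pixel.z = FOO[2];
pixel.w = FOO[3];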
Also, does vload4 execute in one kernel clock cycle (assuming no bank conflicts)?
A compiler could theoretically tell that case 1 and case 2 are essentially the same. I have seen compilers do this in similar cases, but I can’t speak for all compilers. As such, I typically prefer the vload over separate loads so that I’m not relying on compiler tricks.
As to your second question, nothing in the spec makes clock-level performance guarantees about any operation. Implementation by carrier pigeon would be completely legal. If you have questions about the behavior on a specific platform, I suggest you talk to the hardware vendor of the device you are using.
Thanks kunze. Now, what about bank conflicts? If work item one issues memory reads from address 0 to address 4, and the next work item reads from address 1 to address 5, then the individual reads would not exhibit bank conflicts. However, if vload is used, then it is possible that vload #1 would conflict with vload #2.
Again, the answer here would be architecture dependent. But for the architecture I use, one memory access with four lanes trying to access the same bank is no worse than four memory accesses with no bank conflicts. But this should be something that’s pretty easy to verify empirically on whatever you’re using.
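For anyone who wants to check this on their own device, here is a rough sketch of two kernels that differ only in how they read an overlapping four-int window from local memory. The kernel names, the helper stage(), the TILE size, and the buffer layout are all my own assumptions for illustration; nothing here comes from a vendor sample.

```c
#define TILE 256  /* assumed work-group size; launch with local size TILE */

/* Each work item stages one int; the first three also stage the 3-int halo.
   Assumes no global offset and that src holds global_size + 3 ints. */
void stage(__local int *tile, __global const int *src)
{
    const size_t lid = get_local_id(0);
    const size_t gid = get_global_id(0);

    tile[lid] = src[gid];
    if (lid < 3)
        tile[TILE + lid] = src[gid + TILE];
    barrier(CLK_LOCAL_MEM_FENCE);  /* reached by every work item in the group */
}

/* Variant A: four scalar local loads per work item. */
__kernel void read_scalar(__global const int *src, __global int *dst)
{
    __local int tile[TILE + 3];
    const size_t lid = get_local_id(0);

    stage(tile, src);

    int4 p;
    p.x = tile[lid];
    p.y = tile[lid + 1];
    p.z = tile[lid + 2];
    p.w = tile[lid + 3];
    dst[get_global_id(0)] = p.x + p.y + p.z + p.w;
}

/* Variant B: one vload4 per work item, starting at the work item's own index.
   vload4 only needs int alignment, so the unaligned start is legal. */
__kernel void read_vload4(__global const int *src, __global int *dst)
{
    __local int tile[TILE + 3];
    const size_t lid = get_local_id(0);

    stage(tile, src);

    const int4 p = vload4(0, tile + lid);
    dst[get_global_id(0)] = p.x + p.y + p.z + p.w;
}
```

Both kernels should write identical results, so the host can compare outputs first and then time each one, for example with event profiling (a queue created with CL_QUEUE_PROFILING_ENABLE, then clGetEventProfilingInfo). A single window read is tiny, so you would probably need a large global size or several repetitions to see a difference, and the compiler may well generate the same code for both.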
I tried this out on an HD 7700 series GPU: the best performance came from individual loads, not vloadn.
With that amount of overlapped reads (work items re-reading the same memory other work items just read) this is a good candidate for workgroup shared local memory. Make those global memory reads just once, then read them as much as you need inside the work items. That will be faster than either individual loads or vloadn. You can code this yourself or use async_work_group_copy.
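A minimal sketch of the async_work_group_copy route, assuming each work item needs a window of four consecutive ints starting at its own index. The kernel name window_sum, the TILE size, and the requirement that src holds three extra ints past the end are my assumptions, not something from the posts above.

```c
#define TILE 256  /* assumed work-group size; launch with local size TILE */

__kernel void window_sum(__global const int *src, __global int *dst)
{
    /* One tile per work-group, plus a 3-int halo for the last windows. */
    __local int tile[TILE + 3];

    const size_t lid  = get_local_id(0);
    const size_t base = get_group_id(0) * TILE;

    /* Every work item in the group must make this call with the same
       arguments. Assumes src holds at least global_size + 3 ints. */
    event_t ev = async_work_group_copy(tile, src + base, TILE + 3, 0);
    wait_group_events(1, &ev);

    /* Global memory was read once per group; the overlapping reads
       below are served from local memory. */
    dst[base + lid] = tile[lid] + tile[lid + 1] + tile[lid + 2] + tile[lid + 3];
}
```

Whether this beats a hand-written copy loop depends on the device, so it is worth timing both.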