And I need to put a subregion of the matrix into local memory.
For the time being I manually calculate how many element copy operations each work-item within a workgroup should do. The code is not very simple, and it will become much more complex once the dimension count of the “matrix” becomes variable (more than 2).
But I know the initial offset, the number of contiguous regions I need to copy, and the “distance” in the global buffer between these regions. Can I somehow use the async_work_group_strided_copy function efficiently here instead of manual calculations?
async_work_group_strided_copy is useful when you have an array of structures (AoS) and you want to transform it into a structure of arrays (SoA), or more specifically, when you have an array of structures and want to extract one of the struct fields.
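To make the AoS case concrete, here is a minimal host-side C sketch of what the global-to-local form of the strided copy does: element i of the destination comes from element i * stride of the source. The Particle struct and the helper names strided_gather and extract_mass are made up for illustration; in a real kernel the loop would be a single async_work_group_strided_copy call with float as the gentype.

```c
#include <stddef.h>

/* Hypothetical struct, used only for illustration. */
typedef struct {
    float x;
    float y;
    float mass;
} Particle;

/* Emulates the global-to-local strided copy: dst[i] = src[i * src_stride].
   The stride is counted in elements, not bytes. */
static void strided_gather(float *dst, const float *src,
                           size_t num_elements, size_t src_stride)
{
    for (size_t i = 0; i < num_elements; ++i)
        dst[i] = src[i * src_stride];
}

/* Pulls the 'mass' field of n particles into a packed array (AoS -> one SoA column). */
static void extract_mass(float *dst, const Particle *particles, size_t n)
{
    /* View the struct array as floats and start at the first 'mass' field. */
    const float *base = (const float *)particles
                      + offsetof(Particle, mass) / sizeof(float);
    strided_gather(dst, base, n, sizeof(Particle) / sizeof(float));
}
```

Note that exactly one element is taken per stride, which is why this maps well to picking out a struct field and poorly to copying several adjacent values per row.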
In your example, the “width” of your sub-matrix would need to be a builtin CL type, like an int, or a float4.
If you want to do a rectangular copy, I recommend executing async_work_group_copy() in a loop. Each iteration of the loop would copy one row of the sub-matrix into local memory. The number of iterations of the loop would match the height of the sub-matrix.
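As a rough sketch of that loop, emulated in plain C on the host: each memcpy below stands in for one async_work_group_copy() call (followed, in a kernel, by wait_group_events before the tile is used). The function name copy_rect_by_rows and the parameters offset and pitch are my own naming, assuming a row-major buffer whose rows are pitch elements apart.

```c
#include <stddef.h>
#include <string.h>

/* Copies a height x width rectangle out of a row-major buffer.
   'offset' is the index of the rectangle's first element,
   'pitch' is the distance in elements between consecutive rows. */
static void copy_rect_by_rows(float *tile, const float *global_buf,
                              size_t offset, size_t pitch,
                              size_t width, size_t height)
{
    for (size_t row = 0; row < height; ++row) {
        /* In a kernel, this line would be one async_work_group_copy()
           of 'width' elements into local memory. */
        memcpy(tile + row * width,
               global_buf + offset + row * pitch,
               width * sizeof(float));
    }
}
```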
Thanks a lot! My submatrix width is variable; right now it is 5 in one kernel and 6 in another. I don’t think there are built-in types with such a width.
I already tried using async_work_group_copy in a loop. It is slow. I guess that is because the width is much smaller than the local work size, so a lot of work-items are just doing nothing. I end up with several times more wavefront memory requests than when I organize the load manually.
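For reference, one common way to organize such a manual load (not necessarily the exact code used here) is to flatten the rectangle into a single index range and let each work-item stride through it, so that no work-item idles just because a row is short. Below is a minimal host-side C emulation of that idea. The names are hypothetical: local_id and local_size stand in for get_local_id(0) and get_local_size(0), and the outer loop in copy_rect_flat_group replaces the work-items running in parallel.

```c
#include <stddef.h>

/* Work done by ONE work-item: it handles flat indices
   local_id, local_id + local_size, local_id + 2 * local_size, ...
   Returns how many elements this work-item copied. */
static size_t copy_rect_flat_item(float *tile, const float *global_buf,
                                  size_t offset, size_t pitch,
                                  size_t width, size_t height,
                                  size_t local_id, size_t local_size)
{
    size_t copied = 0;
    for (size_t i = local_id; i < width * height; i += local_size) {
        size_t row = i / width;   /* which contiguous region */
        size_t col = i % width;   /* position inside that region */
        tile[i] = global_buf[offset + row * pitch + col];
        ++copied;
    }
    return copied;
}

/* Host-side emulation of the whole work-group. In a real kernel the
   work-items run concurrently and a barrier(CLK_LOCAL_MEM_FENCE) is
   needed before anyone reads the tile. */
static void copy_rect_flat_group(float *tile, const float *global_buf,
                                 size_t offset, size_t pitch,
                                 size_t width, size_t height,
                                 size_t local_size)
{
    for (size_t lid = 0; lid < local_size; ++lid)
        copy_rect_flat_item(tile, global_buf, offset, pitch,
                            width, height, lid, local_size);
}
```

The same flat index extends to more than two dimensions by repeating the divide/modulo step once per extra dimension, which keeps the per-work-item loop unchanged.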
Thanks again, I am now confident that I am using the best approach.
async_work_group_strided_copy can’t even claim to be transposing across a workgroup, because it is a global <-> local memory copy (thus just deferring the transpose to when the read from local memory happens), as opposed to a copy into private kernel variables. Unfortunately it also isn’t especially useful for rectangular copies, because it can only extract one gentype per row, so your rectangle can be at most 16 elements wide. This is a fixed-stride gather/scatter function (perhaps better called a pack/unpack function?), which limits its utility.