Using non-square rectangular blocking for a matrix multiplication kernel

Bradley_M_Small · October 31, 2013, 2:09pm

I have been working with a kernel that does matrix multiplication.

The kernel is very much like the common examples on matrix multiplication (can’t post a URL to it yet).

It uses a 16 x 16 block size. I have read that one could use rectangular block sizes, but that always seems to be “an exercise left to the reader”.

When I try them I am routinely getting -5 errors, so I know I am going somewhere I shouldn’t.

I assume I am not quite understanding how I should be indexing the LOCAL (shared) memory. I am also not sure whether the block shape applies only to the output matrix, or to one or both of the input matrices as well.

Can someone point me to a reference that might help me, or an example of a matrix multiplication that does in fact use rectangular blocking?

Thanks.

Bradley_M_Small · November 5, 2013, 1:36pm

OK, figured it out. For what I did, the two block dimensions had to be evenly divisible one by the other, and at least in the first case I tried, the width had to be greater than or equal to the height.
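For anyone landing here later, below is a rough sketch of one tiling scheme in which constraints like these show up. It is a CPU emulation in plain Python, not the kernel from this thread, and the names (tiled_matmul, naive_matmul, bh, bw) are made up for illustration. It assumes the work-group shape (bh rows by bw columns) matches the output tile, and that the shared dimension is stepped in chunks of bw. Under those assumptions the A tile is bh x bw (one load per work-item), while the B tile is bw x bw, so each work-item has to load bw / bh elements of it. That only works out cleanly when bw is a multiple of bh, which also implies bw >= bh.

```python
def naive_matmul(A, B):
    """Reference result: plain triple loop, no tiling."""
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def tiled_matmul(A, B, bh=8, bw=16):
    """Emulate a kernel whose work-groups are bh rows by bw columns.

    Assumptions of this sketch (not taken from the original kernel):
    the shared dimension is tiled in steps of bw, bw is a multiple of bh,
    and all matrix dimensions are multiples of the tile sizes.
    """
    m, k, n = len(A), len(B), len(B[0])
    if bw % bh != 0:
        raise ValueError("bw must be a multiple of bh in this scheme")
    if m % bh or n % bw or k % bw:
        raise ValueError("matrix sizes must be multiples of the tile sizes")

    C = [[0] * n for _ in range(m)]
    loads_per_item = bw // bh  # B-tile rows each work-item must fetch

    for gr in range(0, m, bh):          # work-group origin, output row
        for gc in range(0, n, bw):      # work-group origin, output column
            acc = [[0] * bw for _ in range(bh)]
            for k0 in range(0, k, bw):  # walk along the shared dimension
                # "Local memory" tiles; note the two different shapes.
                a_tile = [[0] * bw for _ in range(bh)]  # bh x bw
                b_tile = [[0] * bw for _ in range(bw)]  # bw x bw
                for ly in range(bh):        # local row id of the work-item
                    for lx in range(bw):    # local column id
                        a_tile[ly][lx] = A[gr + ly][k0 + lx]
                        for s in range(loads_per_item):
                            row = ly + s * bh
                            b_tile[row][lx] = B[k0 + row][gc + lx]
                # A real kernel needs a barrier here, before any tile is read.
                for ly in range(bh):
                    for lx in range(bw):
                        for kk in range(bw):
                            acc[ly][lx] += a_tile[ly][kk] * b_tile[kk][lx]
                # ...and another barrier here, before the tiles are overwritten.
            for ly in range(bh):
                for lx in range(bw):
                    C[gr + ly][gc + lx] = acc[ly][lx]
    return C


if __name__ == "__main__":
    A = [[(i + 2 * j) % 7 for j in range(32)] for i in range(16)]
    B = [[(3 * i + j) % 5 for j in range(32)] for i in range(32)]
    print(tiled_matmul(A, B, bh=8, bw=16) == naive_matmul(A, B))
```

If the B tile were indexed as though it had the same shape as the work-group, the reads would run past the end of the local array, which is the sort of out-of-bounds access that tends to show up as a runtime error on a real device.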

lisphacker · November 8, 2013, 7:17am

See Volkov’s paper on matrix multiplication in CUDA