AMD’s MatrixMultiplication example uses shared (i.e., __local) memory to reduce the number of fetch operations. In doing so, the code implicitly assumes that a work-group of blocksize x blocksize covers a contiguous block of the input matrices. More precisely, it assumes that within a work-group, if the work-item with get_local_id() = (0,0) has get_global_id() = (m, n), then the work-item with get_local_id() = (a, b) must have get_global_id() = (m+a, n+b). All of the requirements about the divisibility of the global dimensions by the work-group dimensions hint at this layout, but as far as I can tell the spec doesn’t come out and require that work-groups be assigned as contiguous blocks. Am I missing something, or is this missing from the spec? Knowing how work-groups are laid out would enable optimizations like the one in the MatrixMultiplication example.
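To make the assumption concrete, here is a small sketch (in Python, purely for illustration; the blocksize and global dimensions are made up) of the layout the example seems to rely on: each work-group's global IDs are its group ID scaled by the work-group size, plus the local ID, so a work-group tiles a contiguous block of the index space.

```python
BLOCKSIZE = 4          # assumed work-group edge length
GLOBAL_DIMS = (8, 8)   # assumed global dims, divisible by BLOCKSIZE

def assumed_global_id(group_id, local_id):
    """Global ID under the assumed contiguous layout:
    the group's origin plus the local offset."""
    return (group_id[0] * BLOCKSIZE + local_id[0],
            group_id[1] * BLOCKSIZE + local_id[1])

# Under this layout, if local (0,0) maps to global (m, n), then
# local (a, b) maps to global (m+a, n+b) within the same work-group.
m, n = assumed_global_id((1, 1), (0, 0))
for a in range(BLOCKSIZE):
    for b in range(BLOCKSIZE):
        assert assumed_global_id((1, 1), (a, b)) == (m + a, n + b)
```

If the runtime were free to assign work-items to work-groups in some other order, the __local tiling in the example would read the wrong matrix elements, which is exactly why the guarantee matters.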
Going one step further, knowing how work-groups are assigned to SIMD engines is also important. While I don’t have an example in mind that needs the absolute SIMD engine ID, knowing that two work-groups will execute on the same SIMD engine (at different times, obviously) would be very useful for some code I’m working on.