Has anyone seen a performance advantage from using async_work_group_copy() over a regular coalesced
read of global memory?

I tested this method on an AMD GCN 1.0 card and found no difference.

See this AMD thread for more details:P
Judging from the intermediate representation of the Fiji kernel, the generated ISA should be nearly identical. That might not be the case for global-memory-to-global-memory transfers, though.

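#define ASYNC_COPY
#define DATATYPE float
#define IDXTYPE int
#define OFFSET 1
__attribute__((reqd_work_group_size(256, 1, 1)))
__kernel void read_linear_uncached(__global DATATYPE *input, __global DATATYPE *output)
{
    event_t evt;
    IDXTYPE gid = get_global_id(0);
    IDXTYPE index = gid;
    local DATATYPE scratch[256*2];  // two 256-element halves for double buffering
    // Initialize both halves so the compute loop never reads uninitialized local memory.
    scratch[get_local_id(0)] = (DATATYPE)(0.0f);
    scratch[get_local_id(0) + 256] = (DATATYPE)(0.0f);
    for (int i=0; i < 32; ++i) {
        uint flipBuffer = i&1;  // 0 or 1; i&2 would index past the end of scratch
#ifndef ASYNC_COPY
        scratch[get_local_id(0) + flipBuffer*256] = input[index];
#else
        // Every work-item must pass identical arguments, so use the group's base index.
        evt = async_work_group_copy(scratch + flipBuffer*256,
                                    input + (index - (IDXTYPE)get_local_id(0)),
                                    256, 0);
#endif
        index += OFFSET;
        // Compute on the other half while the copy is in flight.
        for (int k=0; k < 100; ++k)
            scratch[get_local_id(0)+(flipBuffer^1)*256] += pow(scratch[get_local_id(0)+(flipBuffer^1)*256], 2.0f);
#ifdef ASYNC_COPY
        wait_group_events(1, &evt); // waits until the copy has finished.
#endif
    }
    output[gid] = scratch[get_local_id(0)];
}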

Thanks for looking at the ISA, Salabar. So it looks like this is just a convenience method on AMD GCN. Good to know :)
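To make "convenience method" a bit more concrete, here is a rough host-side model in plain C of what such a copy could boil down to if the compiler simply expands it into per-work-item strided loads. This is only a sketch under that assumption, not AMD's actual lowering; `copy_slice` and `emulated_group_copy` are hypothetical names made up for illustration, and the outer loop over `lid` stands in for the work-items that would run in parallel on the device.

```c
#include <stddef.h>

#define WG_SIZE 256  /* work-group size, matching reqd_work_group_size(256, 1, 1) */

/* Hypothetical helper: the share of the copy done by one work-item `lid`.
   It copies elements lid, lid + WG_SIZE, lid + 2*WG_SIZE, ... so that
   neighbouring work-items touch neighbouring addresses (coalesced pattern). */
static void copy_slice(float *dst, const float *src, size_t n, size_t lid)
{
    for (size_t i = lid; i < n; i += WG_SIZE)
        dst[i] = src[i];
}

/* Model of the whole work-group: every work-item calls copy_slice with the
   SAME dst, src and n. The sequential loop emulates parallel execution. */
static void emulated_group_copy(float *dst, const float *src, size_t n)
{
    for (size_t lid = 0; lid < WG_SIZE; ++lid)
        copy_slice(dst, src, n, lid);
    /* On a device, a local-memory barrier would follow here before any
       work-item reads elements written by another work-item. */
}
```

Calling `emulated_group_copy(dst, src, n)` leaves `dst[0..n-1]` equal to `src[0..n-1]`, also when `n` is not a multiple of 256. If the built-in really expands to something like this, each work-item ends up issuing the same kind of load as `scratch[lid] = input[index]`, which would be consistent with the identical timings reported above.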