Data Feedback


I’m new to OpenCL and am trying to port the source code below to Nvidia GPU Quadro FX1700 using OpenCL. There are data feedback (i.e. alpha_t[s]=new_alpha_t[s]) in the nested loops so that the intermediate results are used in the subsequent computations. How do I achieve data feedback in the kernel? I used a global work size of 752 and local work size of 8 for my kernel. In addition, I perform a loop unrolling in the innermost loop (i.e. z) to achieve sum[0], sum[1], sum[2] and sum[3].

sm_lut[4][8] = {{1,5,5,7,6,3,0,5},

int s,m,z;
int alpha_t[8]={0};
int new_alpha_t[8];

for (m=0; m<752; m++) 
    for (s=0; s<8; s++) 
        int sum[4];

        for (z=0; z<4; z++)
            int sm1;
            sm1 = sm_lut[z][s];
            sum[z] = alpha_t[sm1];

    for (s=0; s<8; s++)

Thanks in advance for your help.

copy it back to host

If I understand you correctly, you want to feed data back between separate work-groups within one kernel execution. You can do this, but OpenCL does not allow synchronization between work-groups. This means you need to ensure at the algorithm level that you do not have any data races, and then you can do it through global memory. If you need synchronization (e.g., all work-groups are done before the next iteration) you have to do it through multiple kernel executions with the results stored to global memory.

I need synchronization (e.g., all work-groups are done before the next iteration) due to the innermost loop (ie. z) using alpha_t from the previous loop to compute the sum values in subsequent loops.

I used a global worksize of 6016 (= 752 * 8) and local worksize of 8 for my kernel. The data size of _beta0 - _beta3 is 6016 and they were computed from another kernel. However I am unable to get the correct results using the kernel below:

__kernel void forward(__global int *_dResult,
                      __global int *_beta0,
                      __global int *_beta1,
                      __global int *_beta2,
                      __global int *_beta3)
    const int sm_lut[32] = {0, 1, 6, 7, 2, 3, 4, 5,
                            5, 4, 3, 2, 7, 6, 1, 0,
                            1, 0, 7, 6, 3, 2, 5, 4,
                            4, 5, 2, 3, 6, 7, 0, 1

    volatile int alpha_t[8];
    volatile int new_alpha_t[8];
    int sum[4] = {0};

    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int idx = lid << 2;
    int size = get_global_size(0) / get_local_size(0);

    for (int i = 0; i < 8; i++)
        new_alpha_t[i] = alpha_t[i] = 0;

    for (int t = 0; t < size; t++)
        sum[0] = alpha_t[s_to_sm1[idx]]     + _beta0[gid];
        sum[1] = alpha_t[s_to_sm1[idx + 1]] + _beta1[gid];
        sum[2] = alpha_t[s_to_sm1[idx + 2]] + _beta2[gid];
        sum[3] = alpha_t[s_to_sm1[idx + 3]] + _beta3[gid];

        new_alpha_t[lid] = max4(sum[0], sum[1], sum[2], sum[3]);
        alpha_t[lid] = new_alpha_t[lid];
    _dResult[gid] = alpha_t[lid];

Could u pls let me know how i could compute the (6016) results through multiple kernel executions?

Thanks for your help.

If you’re trying to synchronize across work-groups (which is not allowed in OpenCL except via multiple kernel executions) you just write out your new alpha_t values to global memory when you’re done with one execution, and then read them in with the next execution. That way you know that before the second kernel execution starts all the values from the first kernel execution have been written out. Ideally you would re-factor your algorithm to only need synchronization within a work-group and you wouldn’t have to incur that overhead. (If your kernel isn’t doing a lot of work on an iteration, the overhead of starting it can be a substantial amount of the total execution.)

Got it. Thanks for your help. :lol: