Printing values from 1st workgroup but not subsequent ones

I did a simple parallel programming project to do same thing as in cuda. I know there is a amd maintained cuda to opencl conversion project in its public repository of rocm, but i wanted to it manually by myself to enforce my understanding.
Here is the cuda kernel function, lets say particular gtx card 128 max block size at most then I do something like:

add<<<(N+127)/128, 128>>> (dev_a, dev_b, dev_c);

With that I can print out the values from each of the thread.
If I exceed disregarding the max block size i.e. add<<<1, 1024>>> when block size is 128 then i found nothing returns only 0 even with the threads that fall withing the 128 limit.

I did something similar in open cl:

kernel void k_get_local_id(global uint *l_local_id, global uint *l_local_size)
       l_local_id[get_local_id(0)] = get_local_id(0);
       l_local_size[get_local_id(0)] = get_local_size(0);

so that l_local_id will return the thread ID for current work-group and l_local_size will return the size of workgroup.

With vega56 returning max global workgroup size of 256, I deliberately set global workitem as 2048 an local as 128 twice less than vega56’s.

No. of platforms available: 1
Platform 0: 531901744.
No. of devices available: 1
[0000]                           CL_DEVICE_NAME:                         gfx900.
[0000]                         CL_DEVICE_VENDOR:   Advanced Micro Devices, Inc..
[0000]                        CL_DEVICE_VERSION:                    OpenCL 2.0 .
[0000]                        CL_DRIVER_VERSION:             3098.0 (HSA1.1,LC).
[0000]               CL_DEVICE_GLOBAL_MEM_SIZE: 4278190080 (ff000000).
[0000]                CL_DEVICE_LOCAL_MEM_SIZE: 00065536 (00010000).
[-030]                CL_DEVICE_LOCAL_MEM_TYPE: 00 (00).
[0000]           CL_DEVICE_MAX_CLOCK_FREQUENCY: 1622 (0656).
[0000]             CL_DEVICE_MAX_COMPUTE_UNITS: 0056 (0038).
[0000]           CL_DEVICE_MAX_WORK_GROUP_SIZE: 0256 (0100).
[0000]      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 0003 (0003).
[-030]           CL_DEVICE_MAX_WORK_ITEM_SIZES: 542613872 (2057a170).
[-030]                          CL_DEVICE_TYPE: 00 (00).
#define G_NWITEMS 2048
#define L_NWITEMS 128

here are the rest of the relevant code although partial:

   cl_uint *int_global_id, *int_global_size;
    int_global_id  = (cl_uint *) clEnqueueMapBuffer( queue, global_id_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
    int_global_size  = (cl_uint *) clEnqueueMapBuffer( queue, global_size_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );

    size_t global_work_size = G_NWITEMS;
    size_t local_work_size = L_NWITEMS;
    clSetKernelArg(kernel2, 0, sizeof(local_id_buffer), (void*) &local_id_buffer);
    clSetKernelArg(kernel2, 1, sizeof(local_size_buffer), (void*) &local_size_buffer);
    clEnqueueNDRangeKernel( queue, kernel2,  1,  NULL, &global_work_size, &local_work_size, 0,  NULL, NULL);
    clFinish( queue );

    cl_uint *int_local_id, *int_local_size;
    int_local_id  = (cl_uint *) clEnqueueMapBuffer( queue, local_id_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
    int_local_size  = (cl_uint *) clEnqueueMapBuffer( queue, local_size_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );

    inc =32;
    for(i=0; i < G_NWITEMS; i+=inc)
        printf("\n%2d: local_id: 0x%08x. local_size: 0x%08x", i, int_local_id[i], int_local_size[i]);


I am not sure what I did wrong but the resulting output will printout the values (local thread ID and local workgroup size) for first group only and not the subsequent ones.

 0: local_id: 0x00000000. local_size: 0x00000080
32: local_id: 0x00000020. local_size: 0x00000080
64: local_id: 0x00000040. local_size: 0x00000080
96: local_id: 0x00000060. local_size: 0x00000080
128: local_id: 0xbebebebe. local_size: 0xbebebebe
160: local_id: 0xbebebebe. local_size: 0xbebebebe
192: local_id: 0xbebebebe. local_size: 0xbebebebe
224: local_id: 0xbebebebe. local_size: 0xbebebebe

Can you take a look at this snippet to be sure it is doing what you’d like it to do?

       l_local_id[get_local_id(0)] = get_local_id(0);
       l_local_size[get_local_id(0)] = get_local_size(0);

Remember that get_local_id returns the ID within this work group. Since you have 128 items in your work group, the local ID for all of your work items be between 0 and 127 (inclusive), and nothing will be written to indices 128 or greater.

If you use get_global_id instead, which returns the global ID within the ND-Range instead of the local ID within the work group, do you get the results you’d expect?