Printing values from 1st workgroup but not subsequent ones

I did a simple parallel programming project to do same thing as in cuda. I know there is a amd maintained cuda to opencl conversion project in its public repository of rocm, but i wanted to it manually by myself to enforce my understanding.
Here is the cuda kernel function, lets say particular gtx card 128 max block size at most then I do something like:

add<<<(N+127)/128, 128>>> (dev_a, dev_b, dev_c);

With that I can print out the values from each of the thread.
If I exceed disregarding the max block size i.e. add<<<1, 1024>>> when block size is 128 then i found nothing returns only 0 even with the threads that fall withing the 128 limit.

I did something similar in open cl:

kernel void k_get_local_id(global uint *l_local_id, global uint *l_local_size)
{
       l_local_id[get_local_id(0)] = get_local_id(0);
       l_local_size[get_local_id(0)] = get_local_size(0);
}

so that l_local_id will return the thread ID for current work-group and l_local_size will return the size of workgroup.

With vega56 returning max global workgroup size of 256, I deliberately set global workitem as 2048 an local as 128 twice less than vega56’s.

No. of platforms available: 1
Platform 0: 531901744.
No. of devices available: 1
[0000]                           CL_DEVICE_NAME:                         gfx900.
[0000]                         CL_DEVICE_VENDOR:   Advanced Micro Devices, Inc..
[0000]                        CL_DEVICE_VERSION:                    OpenCL 2.0 .
[0000]                        CL_DRIVER_VERSION:             3098.0 (HSA1.1,LC).
[0000]               CL_DEVICE_GLOBAL_MEM_SIZE: 4278190080 (ff000000).
[0000]                CL_DEVICE_LOCAL_MEM_SIZE: 00065536 (00010000).
[-030]                CL_DEVICE_LOCAL_MEM_TYPE: 00 (00).
[0000]           CL_DEVICE_MAX_CLOCK_FREQUENCY: 1622 (0656).
[0000]             CL_DEVICE_MAX_COMPUTE_UNITS: 0056 (0038).
[0000]           CL_DEVICE_MAX_WORK_GROUP_SIZE: 0256 (0100).
[0000]      CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 0003 (0003).
[-030]           CL_DEVICE_MAX_WORK_ITEM_SIZES: 542613872 (2057a170).
[-030]                          CL_DEVICE_TYPE: 00 (00).
#define G_NWITEMS 2048
#define L_NWITEMS 128

here are the rest of the relevant code although partial:

...
   cl_uint *int_global_id, *int_global_size;
    int_global_id  = (cl_uint *) clEnqueueMapBuffer( queue, global_id_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
    int_global_size  = (cl_uint *) clEnqueueMapBuffer( queue, global_size_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );

...
    size_t global_work_size = G_NWITEMS;
    size_t local_work_size = L_NWITEMS;
    clSetKernelArg(kernel2, 0, sizeof(local_id_buffer), (void*) &local_id_buffer);
    clSetKernelArg(kernel2, 1, sizeof(local_size_buffer), (void*) &local_size_buffer);
    clEnqueueNDRangeKernel( queue, kernel2,  1,  NULL, &global_work_size, &local_work_size, 0,  NULL, NULL);
    clFinish( queue );

...
    cl_uint *int_local_id, *int_local_size;
    int_local_id  = (cl_uint *) clEnqueueMapBuffer( queue, local_id_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
    int_local_size  = (cl_uint *) clEnqueueMapBuffer( queue, local_size_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );

...
    inc =32;
    for(i=0; i < G_NWITEMS; i+=inc)
    {
        printf("\n%2d: local_id: 0x%08x. local_size: 0x%08x", i, int_local_id[i], int_local_size[i]);
    }

    printf("\n");
...

I am not sure what I did wrong but the resulting output will printout the values (local thread ID and local workgroup size) for first group only and not the subsequent ones.

 0: local_id: 0x00000000. local_size: 0x00000080
32: local_id: 0x00000020. local_size: 0x00000080
64: local_id: 0x00000040. local_size: 0x00000080
96: local_id: 0x00000060. local_size: 0x00000080
128: local_id: 0xbebebebe. local_size: 0xbebebebe
160: local_id: 0xbebebebe. local_size: 0xbebebebe
192: local_id: 0xbebebebe. local_size: 0xbebebebe
224: local_id: 0xbebebebe. local_size: 0xbebebebe

Can you take a look at this snippet to be sure it is doing what you’d like it to do?

       l_local_id[get_local_id(0)] = get_local_id(0);
       l_local_size[get_local_id(0)] = get_local_size(0);

Remember that get_local_id returns the ID within this work group. Since you have 128 items in your work group, the local ID for all of your work items be between 0 and 127 (inclusive), and nothing will be written to indices 128 or greater.

If you use get_global_id instead, which returns the global ID within the ND-Range instead of the local ID within the work group, do you get the results you’d expect?

thanks, I lost the examples or can not find. Now I modified the example and made it more readable code. and did several experiments and those all seem to make sense:

1st variant. In the kernel i got get_local_id function called and if you don’t specify the local_work_size in the clEnqueuNDRangeKernel calls (6th param), then it seems to default to max work size for the particular GPU in my case 256 ( I did not put all code put just only relevant changes):
In this case, global_size 256 (as expected) and global_id is limited to 256 and then reset back to 0 (apparently jumping to next workgroup) since array size is 4096.

"kernel void memset(     global uint *l_global_id, global uint *l_global_size)      \n"
"{                                                                      \n"
"       l_global_id[get_global_id(0)] = get_local_id(0);               \n"
"       l_global_size[get_global_id(0)] = get_local_size(0);           \n"
"} \n";
...
#define NWITEMS 4096
...
clEnqueueNDRangeKernel( queue, kernel,  1, NULL, &global_work_size, NULL 0,  NULL, NULL);

output:
0: local_id: 0x00000000. local_size: 0x00000256
100: local_id: 0x00000100. local_size: 0x00000256
200: local_id: 0x00000200. local_size: 0x00000256
300: local_id: 0x00000044. local_size: 0x00000256
400: local_id: 0x00000144. local_size: 0x00000256
500: local_id: 0x00000244. local_size: 0x00000256
600: local_id: 0x00000088. local_size: 0x00000256
700: local_id: 0x00000188. local_size: 0x00000256
800: local_id: 0x00000032. local_size: 0x00000256
900: local_id: 0x00000132. local_size: 0x00000256
1000: local_id: 0x00000232. local_size: 0x00000256
1100: local_id: 0x00000076. local_size: 0x00000256
1200: local_id: 0x00000176. local_size: 0x00000256
1300: local_id: 0x00000020. local_size: 0x00000256
1400: local_id: 0x00000120. local_size: 0x00000256
1500: local_id: 0x00000220. local_size: 0x00000256
1600: local_id: 0x00000064. local_size: 0x00000256
1700: local_id: 0x00000164. local_size: 0x00000256
1800: local_id: 0x00000008. local_size: 0x00000256
1900: local_id: 0x00000108. local_size: 0x00000256
2000: local_id: 0x00000208. local_size: 0x00000256
2100: local_id: 0x00000052. local_size: 0x00000256
2200: local_id: 0x00000152. local_size: 0x00000256
2300: local_id: 0x00000252. local_size: 0x00000256
2400: local_id: 0x00000096. local_size: 0x00000256
2500: local_id: 0x00000196. local_size: 0x00000256
2600: local_id: 0x00000040. local_size: 0x00000256
2700: local_id: 0x00000140. local_size: 0x00000256
2800: local_id: 0x00000240. local_size: 0x00000256
2900: local_id: 0x00000084. local_size: 0x00000256
3000: local_id: 0x00000184. local_size: 0x00000256
3100: local_id: 0x00000028. local_size: 0x00000256
3200: local_id: 0x00000128. local_size: 0x00000256
3300: local_id: 0x00000228. local_size: 0x00000256
3400: local_id: 0x00000072. local_size: 0x00000256
3500: local_id: 0x00000172. local_size: 0x00000256
3600: local_id: 0x00000016. local_size: 0x00000256
3700: local_id: 0x00000116. local_size: 0x00000256
3800: local_id: 0x00000216. local_size: 0x00000256
3900: local_id: 0x00000060. local_size: 0x00000256
4000: local_id: 0x00000160. local_size: 0x00000256

2nd variant:
If i set the 6th parameter of clEnqueueNDRangeKernel to custom parameter i.e. 128 then global_size is 128 upon return and global_id reset back to after reaching 128, seems it makes sense:

    size_t global_work_size = NWITEMS;
    size_t local_work_size = 128;
...
    clEnqueueNDRangeKernel( queue, kernel,  1, NULL, &global_work_size, &local_work_size, 0,  NULL, NULL);

 0: global_id: 0x00000000. global_size: 0x00000128
100: global_id: 0x00000100. global_size: 0x00000128
200: global_id: 0x00000072. global_size: 0x00000128
300: global_id: 0x00000044. global_size: 0x00000128
400: global_id: 0x00000016. global_size: 0x00000128
500: global_id: 0x00000116. global_size: 0x00000128
600: global_id: 0x00000088. global_size: 0x00000128
700: global_id: 0x00000060. global_size: 0x00000128
800: global_id: 0x00000032. global_size: 0x00000128
900: global_id: 0x00000004. global_size: 0x00000128
1000: global_id: 0x00000104. global_size: 0x00000128
1100: global_id: 0x00000076. global_size: 0x00000128
1200: global_id: 0x00000048. global_size: 0x00000128
1300: global_id: 0x00000020. global_size: 0x00000128
1400: global_id: 0x00000120. global_size: 0x00000128
1500: global_id: 0x00000092. global_size: 0x00000128
1600: global_id: 0x00000064. global_size: 0x00000128
1700: global_id: 0x00000036. global_size: 0x00000128
1800: global_id: 0x00000008. global_size: 0x00000128
1900: global_id: 0x00000108. global_size: 0x00000128
2000: global_id: 0x00000080. global_size: 0x00000128
2100: global_id: 0x00000052. global_size: 0x00000128
2200: global_id: 0x00000024. global_size: 0x00000128
2300: global_id: 0x00000124. global_size: 0x00000128
2400: global_id: 0x00000096. global_size: 0x00000128
2500: global_id: 0x00000068. global_size: 0x00000128
2600: global_id: 0x00000040. global_size: 0x00000128
2700: global_id: 0x00000012. global_size: 0x00000128
2800: global_id: 0x00000112. global_size: 0x00000128
2900: global_id: 0x00000084. global_size: 0x00000128
3000: global_id: 0x00000056. global_size: 0x00000128
3100: global_id: 0x00000028. global_size: 0x00000128
3200: global_id: 0x00000000. global_size: 0x00000128
3300: global_id: 0x00000100. global_size: 0x00000128
3400: global_id: 0x00000072. global_size: 0x00000128
3500: global_id: 0x00000044. global_size: 0x00000128
3600: global_id: 0x00000016. global_size: 0x00000128
3700: global_id: 0x00000116. global_size: 0x00000128
3800: global_id: 0x00000088. global_size: 0x00000128
3900: global_id: 0x00000060. global_size: 0x00000128
4000: global_id: 0x00000032. global_size: 0x00000128

3rd variant:
Now I modifed the kernel to get global_id and global_size and it return non-resetting (since it is no longer local id) current global_id as well as global ID which is essentially the size of array 4096. Makes sense too:

"kernel void memset(     global uint *l_global_id, global uint *l_global_size)      \n"
"{                                                                      \n"
"       l_global_id[get_global_id(0)] = get_global_id(0);               \n"
"       l_global_size[get_global_id(0)] = get_global_id(0);             \n"
"}                                                                      \n";

I am not sure why I was getting garbage in my previous example once current ID exceeds local_work_size, with the code no longer exist in my repository, all I can do for now is sigh and moving assuming there has been some code error, unless I find it back.