I did a simple parallel programming project to do same thing as in cuda. I know there is a amd maintained cuda to opencl conversion project in its public repository of rocm, but i wanted to it manually by myself to enforce my understanding.
Here is the cuda kernel function, lets say particular gtx card 128 max block size at most then I do something like:
add<<<(N+127)/128, 128>>> (dev_a, dev_b, dev_c);
With that I can print out the values from each of the thread.
If I exceed disregarding the max block size i.e. add<<<1, 1024>>>
when block size is 128
then i found nothing returns only 0
even with the threads that fall withing the 128
limit.
I did something similar in open cl:
kernel void k_get_local_id(global uint *l_local_id, global uint *l_local_size)
{
l_local_id[get_local_id(0)] = get_local_id(0);
l_local_size[get_local_id(0)] = get_local_size(0);
}
so that l_local_id
will return the thread ID for current work-group and l_local_size
will return the size of workgroup.
With vega56 returning max global workgroup size of 256
, I deliberately set global workitem as 2048
an local as 128
twice less than vega56’s.
No. of platforms available: 1
Platform 0: 531901744.
No. of devices available: 1
[0000] CL_DEVICE_NAME: gfx900.
[0000] CL_DEVICE_VENDOR: Advanced Micro Devices, Inc..
[0000] CL_DEVICE_VERSION: OpenCL 2.0 .
[0000] CL_DRIVER_VERSION: 3098.0 (HSA1.1,LC).
[0000] CL_DEVICE_GLOBAL_MEM_SIZE: 4278190080 (ff000000).
[0000] CL_DEVICE_LOCAL_MEM_SIZE: 00065536 (00010000).
[-030] CL_DEVICE_LOCAL_MEM_TYPE: 00 (00).
[0000] CL_DEVICE_MAX_CLOCK_FREQUENCY: 1622 (0656).
[0000] CL_DEVICE_MAX_COMPUTE_UNITS: 0056 (0038).
[0000] CL_DEVICE_MAX_WORK_GROUP_SIZE: 0256 (0100).
[0000] CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 0003 (0003).
[-030] CL_DEVICE_MAX_WORK_ITEM_SIZES: 542613872 (2057a170).
[-030] CL_DEVICE_TYPE: 00 (00).
#define G_NWITEMS 2048
#define L_NWITEMS 128
here are the rest of the relevant code although partial:
...
cl_uint *int_global_id, *int_global_size;
int_global_id = (cl_uint *) clEnqueueMapBuffer( queue, global_id_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
int_global_size = (cl_uint *) clEnqueueMapBuffer( queue, global_size_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
...
size_t global_work_size = G_NWITEMS;
size_t local_work_size = L_NWITEMS;
clSetKernelArg(kernel2, 0, sizeof(local_id_buffer), (void*) &local_id_buffer);
clSetKernelArg(kernel2, 1, sizeof(local_size_buffer), (void*) &local_size_buffer);
clEnqueueNDRangeKernel( queue, kernel2, 1, NULL, &global_work_size, &local_work_size, 0, NULL, NULL);
clFinish( queue );
...
cl_uint *int_local_id, *int_local_size;
int_local_id = (cl_uint *) clEnqueueMapBuffer( queue, local_id_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
int_local_size = (cl_uint *) clEnqueueMapBuffer( queue, local_size_buffer, CL_TRUE, CL_MAP_READ, 0, G_NWITEMS * sizeof(cl_uint), 0, NULL, NULL, NULL );
...
inc =32;
for(i=0; i < G_NWITEMS; i+=inc)
{
printf("\n%2d: local_id: 0x%08x. local_size: 0x%08x", i, int_local_id[i], int_local_size[i]);
}
printf("\n");
...
I am not sure what I did wrong but the resulting output will printout the values (local thread ID and local workgroup size) for first group only and not the subsequent ones.
0: local_id: 0x00000000. local_size: 0x00000080
32: local_id: 0x00000020. local_size: 0x00000080
64: local_id: 0x00000040. local_size: 0x00000080
96: local_id: 0x00000060. local_size: 0x00000080
128: local_id: 0xbebebebe. local_size: 0xbebebebe
160: local_id: 0xbebebebe. local_size: 0xbebebebe
192: local_id: 0xbebebebe. local_size: 0xbebebebe
224: local_id: 0xbebebebe. local_size: 0xbebebebe