When to use get_global_id and get_local_id in OpenCL?

What is the difference between global_work_size and local_work_size?

In the picture above you can see how the IDs work in one dimension.
Get it? In most cases you can use either of them.

Is it correct to assume that the kernel will be run global_work_size times, regardless of the local work size and so on?

Yes, the kernel will run globalWorkSize times.
The work-items of one work-group (localWorkSize of them) can be executed in parallel.

globalWorkSize = localWorkSize * numberOfGroups;
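
To make the relation above concrete, here is a minimal plain-C sketch. The helper num_groups is hypothetical (it is not part of the OpenCL API), and it assumes an explicit local size is given, in which case OpenCL 1.x requires the global size to be an exact multiple of it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not an OpenCL API): number of work-groups in one
   dimension for a given global and local work size. */
size_t num_groups(size_t global_work_size, size_t local_work_size)
{
    assert(local_work_size > 0);
    /* With an explicit local size, OpenCL 1.x expects it to divide the
       global size evenly. */
    assert(global_work_size % local_work_size == 0);
    return global_work_size / local_work_size;
}
```

For example, a global size of 1024 with a local size of 64 gives 16 groups, and 16 * 64 is 1024 again.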

But in many kernels I see get_global_id(0); written. Does that mean only one core of the GPU is being used, so the program is not parallel if the application uses only that kernel? Please clarify this.

get_global_id(0) returns the global ID of the work-item in the first dimension. For 1D problems, this is enough for the algorithm to work. The kernel is still executed many times in parallel, once per work-item.

Take for example:
__kernel void array_sum(__global const float* A, __global const float* B, __global float* C)
{
    int idx = get_global_id(0);   /* this work-item's index in dimension 0 */
    C[idx] = A[idx] + B[idx];
}
Assume A, B and C are arrays of equal size.
This kernel sums the arrays A and B into C in parallel. Each work-item sums one index.
The application launches this kernel with a 1D global size equal to the array size. The division into local sizes is not relevant to the algorithm in this case and can be left to the hardware. If there are enough groups, the device (CPU or GPU) will utilize all available cores.

But get_global_id(0) means the first work-item in one dimension, so how can the other work-items execute the same kernel when the global ID given is 0? Shouldn't it be incremented with ++ for the next work-item? Please clarify.

If you are in the first work-item, get_global_id(0) returns 0. If you are in the second work-item, get_global_id(0) returns 1, and so on. The 0 stands for the dimension.

Does that mean the values are assigned automatically to the cores of the GPU? That is, the kernel automatically runs on the next core, whose value is the previous core's ID + 1?
Is that right?

Yes, you got it.

I think this picture is better than the last one I posted and shows how it works:

One work-item runs on one processor. Every work-item gets a unique ID (get_global_id()). The items inside a work-group are executed in parallel, and this is done for all items.

You could use this page for clarification: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/calling-the-kernel/ It's quite helpful.

If we have, say, n+1 input elements and the global work size is n (meaning the GPU has n cores), how will it execute? Will n elements be processed first, and then in a second cycle
the last element be processed by any core that is free? Am I right, or will something else happen?

I have not heard of 0D; I have only heard of 1D, 2D and 3D.
Please clarify.
If the ID numbering starts from 5, i.e. id = 5, will the array we declared be filled starting from the 5th element rather than the 0th?
And I am not creating any work-groups?


get_local_id, get_group_id and get_local_size are mainly needed when you use local memory shared within a work-group. For the many algorithms that do not, you can ignore them and use get_global_id.

In order to process large global work sizes, they are divided up into work-groups that you can think of as being run sequentially. Each work-group has a size. The local ID (get_local_id) is the index within the group, the group ID (get_group_id) is the index of the group, get_num_groups is the number of groups, and get_local_size is the size of a group.
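
As a rough illustration of how these IDs relate in one dimension, here is a plain-C model. The struct and the function ids_for are hypothetical names made up for this sketch (they are not OpenCL APIs), and the sketch assumes a global offset of zero.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C model of the 1D IDs one work-item would see (hypothetical type,
   not part of the OpenCL API). Assumes a global offset of zero. */
typedef struct {
    size_t global_id;  /* what get_global_id(0) would return */
    size_t group_id;   /* what get_group_id(0) would return  */
    size_t local_id;   /* what get_local_id(0) would return  */
} work_item_ids;

work_item_ids ids_for(size_t global_id, size_t local_size)
{
    work_item_ids ids;
    assert(local_size > 0);
    ids.global_id = global_id;
    ids.group_id  = global_id / local_size;  /* which group the item is in */
    ids.local_id  = global_id % local_size;  /* position inside that group */
    return ids;
}
```

Going the other way, group_id * local_size + local_id gives back the global ID, so with a local size of 64 the work-item with global ID 70 is item 6 of group 1.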

Kernels are 1D, 2D, or 3D. Use get_global_id(0) to get the first dimension (C counts starting at 0; there is no 0D). Use get_global_id(1) for the second dimension when doing 2D kernels, and get_global_id(2) for the third dimension when doing 3D kernels.
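
For the 2D case, a common pattern is to turn the two IDs into one array index. The sketch below is plain C, the helper flat_index_2d is a hypothetical name, and row-major storage is an assumption of this example rather than anything OpenCL requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: flat array index for a 2D work-item, where
   x stands for get_global_id(0), y for get_global_id(1), and width is
   the number of elements per row (row-major layout assumed). */
size_t flat_index_2d(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}
```

With a row width of 4, the work-item at x = 3, y = 2 would touch element 11.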

So this value will be passed as the array index?
__kernel void array_sum(__global const float* A, __global const float* B, __global float* C)
{
    int idx = get_global_id(0);
    C[idx] = A[idx] + B[idx];
}
If the ID is 5, then my array will start storing elements from the 5th element, so will the initial 4 memory locations be wasted?

I don’t understand what you mean by “wasted”.

If you set your global work size to, say, 10, then (in parallel) 10 GPU threads will run, each getting a different result when they call get_global_id. Specifically, one will get 0, one will get 1, etc., up to 9. In your example kernel, this will calculate C=A+B for 10 array elements in parallel.

When you re-code from single-threaded CPU code, you often replace your inner loop (or the two inner loops in the 2D case) with a kernel executed over some global work size. The loop (or loops) go away and the OpenCL runtime takes care of executing your kernel across the global work size. Think of get_global_id as “what is my loop index” from a sequential version of your algorithm.
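
The loop analogy can be sketched in plain C. Both function names below are hypothetical; run_array_sum is only a sequential stand-in for what the OpenCL runtime does, and a real device runs the work-items in parallel and in no guaranteed order.

```c
#include <assert.h>
#include <stddef.h>

/* Body of the kernel for one work-item. The idx parameter plays the role
   of the value returned by get_global_id(0). */
void array_sum_work_item(size_t idx, const float *A, const float *B, float *C)
{
    C[idx] = A[idx] + B[idx];
}

/* Sequential stand-in for the runtime: calls the kernel body once for
   every index from 0 to global_work_size - 1. */
void run_array_sum(size_t global_work_size, const float *A, const float *B, float *C)
{
    size_t idx;
    assert(A != NULL && B != NULL && C != NULL);
    for (idx = 0; idx < global_work_size; ++idx) {
        array_sum_work_item(idx, A, B, C);
    }
}
```

The for loop here is exactly the part that disappears when the code is moved to OpenCL; only the body of array_sum_work_item remains as the kernel.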

I am saying that idx = 5
means a[5] is the 5th storage location in the array, then the 6th, then the 7th, 8th and 9th. Then what will happen? I have 10 elements, but the array is being stored starting from a[5], whereas in the kernel code my array runs from a[0] to a[9]. Is it the same for B[idx]?

OK, but how can I think of get_global_id as a loop index when the global ID is unique across the GPU?

Please study the basics of global IDs from an OpenCL tutorial or book.

You can say that idx = 5 as much as you want, but it’s not. It ranges from 0 to the global size minus one, across that many threads (work-items). If your array is 10 elements and you set the global size to 10, idx will be 0, 1, 2, …, 8, 9 across 10 work-items and each element of your array will be processed.

If you want to start at 5 you can by using the offset parameter (starting in OpenCL 1.1) but that doesn’t seem to be what you’re looking for here.
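
For completeness, here is a small plain-C model of what the offset does to the IDs. The function global_id_with_offset is a hypothetical name for this sketch; in real code the offset is passed through the global_work_offset argument of clEnqueueNDRangeKernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: the value get_global_id(0) would return for the
   i-th work-item (counting from 0) of a launch with the given global
   offset and global work size. */
size_t global_id_with_offset(size_t i, size_t global_offset, size_t global_work_size)
{
    assert(i < global_work_size);
    return global_offset + i;
}
```

With an offset of 5 and a global size of 5, the work-items see the IDs 5 to 9, so elements 0 to 4 of the array are simply not touched by that launch.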