doubts with work items and groups

Hello, I'm new to the forum and I would like to ask something: is the maximum number of work items the one given by the maximum size of each dimension (in my case 512x512x64), or is that in fact the maximum number of work items per work group, so that the total maximum would be that per-dimension maximum times the maximum number of work groups?
If it is the first case, how can I handle a large matrix? And if it is the second case, how do I specify the values, because I only have size_t global_item_size and size_t local_item_size?
Thanks in advance, I hope it's clear because my English is not so good!!


It’s fairly simple:
MAX_WORK_ITEM_SIZES[] is the maximum size each dimension can be. i.e. 512x512x64 means x must be <= 512, y <= 512, and z <= 64.
MAX_WORK_GROUP_SIZE is the maximum number of work items that can execute in a single workgroup. i.e. x*y*z <= MAX_WORK_GROUP_SIZE (the product of the local sizes, not their sum).

These limits apply only to the 'local work size' values, which reflect the hardware limit of a single GPU core: the global work size can always be much larger, and the runtime just runs multiple work-groups, either on multiple GPU cores, concurrently on the same core, or in batches until they're all done.

All problems must be broken into these limits, so that is the challenge. But without doing this you’re just using a serial processor, so you don’t get the performance either.

Different hardware often has different characteristics: it's often faster to use 64 or 128 work-items instead of, say, 512, because then the problem runs on more cores concurrently and thus executes faster.

thanks for answering!! But then I have a doubt: clEnqueueNDRangeKernel() takes the two parameters const size_t *global_work_size and const size_t *local_work_size. If, for example, I set the first = 512 (suppose I need one dimension) and the second = 32, would the number of work groups be 16, and the work group size 32? And how many global work items will there be?
Thanks again!!


There will be 512 global work items. Each work item gets a unique global id; within its group of 32, its local id satisfies:
local_id = global_id % local_work_size

It’s easiest to think of the local size as the ‘batch’ size, and the global size as the count of all items. global size is not the number of batches.

Thank you again!!! And regarding execution, I have an nvidia card (a 9400/ION): is it correct that each work group is executed by only one CUDA core (the deviceQuery test says it has 2*8 = 16 CUDA cores), but that each CUDA core may execute more than one work group, since there is no limit on how many?
thank you very much for paying attention to my questions which I guess are very basic!!


You need to check the vendor documentation to verify the details once they start using their own jargon; see below. Unfortunately the nvidia documentation is really cuda material, so it's full of cuda jargon and there isn't a 1:1 mapping of similar terms (e.g. blocks and cores don't map directly to work-groups or processing units).

You also need to find the specifics of the hardware device in question to determine how many concurrent workgroups can execute on a given processor. Although a given core can execute an unlimited number of work-groups as far as the programming model is concerned, if they don't all fit at once they are executed in series. This depends on the hardware of course, but also on the complexity of your kernel. Also see below …

e.g. starting with nvidia’s “OpenCL Programming Guide” (i have version 2.3) …
– Section 2.1 “CUDA Architecture”
2.1.1 - explains how "cuda cores" map to an "opencl compute unit", via a "streaming multiprocessor"
– Appendix A
A.1 - explains that a "streaming multiprocessor" is 8 cuda cores, and your device has 2 of them.
A.1.1 - lists the device capabilities such as register counts; which limit the number of concurrent workgroups. Your device is compute capability 1.1 so it has the limits listed there.

And again back to chapter 2.1 - section 2.1.2 explains how the device limits affect the number of work-groups that can execute concurrently.

See, it’s all there in the manual.

Thanks once again!! I'll read those sections!!


I've been reading, and as usual some questions get answered and others arise!! For example, it is not clear, at least to me, what executes the work-items: it seems to be the CUDA cores, but there are too few of them for that to be right. The text says: 'A multiprocessor executes a CUDA thread for each OpenCL work-item and a thread block for each OpenCL work-group.' But it does not specify what a CUDA thread is.
But now I have another question: how can I measure the whole execution time, I mean not just the kernels but the program from start to finish, both for the GPU (with OpenCL) and for the CPU (a version of the program in plain C/C++)? I have tried using the profiling options like this:

/* note: the queue must have been created with CL_QUEUE_PROFILING_ENABLE */
ret = clEnqueueNDRangeKernel(command_queue, kernel, 1, NULL,
        &global_item_size, &local_item_size, 0, NULL, &event);
clWaitForEvents(1, &event);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
total = (double)(end - start) / 1e9;   /* profiling timestamps are in nanoseconds */
tiempo += total;
printf("total= %f\n", total);

but I guess this only measures the kernel's execution time.
Thanks in advance!!


It's easier just not to care! Different devices will be different, and you've got what you've got …

But you can use CL_DEVICE_MAX_COMPUTE_UNITS to tune the code to match the physical characteristics of the device: it is the number of physical cores that can each execute one workgroup. It doesn't tell you how many workgroups each physical core may be able to execute in parallel, so if you use this number you still have to multiply it by a 'fudge factor' which depends on the kernel.

There are also kernel queries available after you compile the code for the device: clGetKernelWorkGroupInfo().

> But now I have another question: how can I measure the whole execution time, I mean not just the kernels but the program from start to finish, both for the GPU (with OpenCL) and for the CPU (a version of the program in plain C/C++)? I have tried using the profiling options like this:
>
> but I guess this only measures the kernel's execution time.

Personally, I just use gettimeofday() or equivalent for that; the profiling stuff is just so much extra boilerplate to muck around with once you have more than a couple of routines. Or, sometimes easier, just use a profiler; the nvidia one is good enough most of the time.

Just add more events if you want to time more parts; timestamps can be compared between events. E.g. to time a round trip (buffers to device, some kernels, device back to buffers), attach events to the first clEnqueueWriteBuffer and the last clEnqueueReadBuffer, wait for everything to be done, and take the CL_PROFILING_COMMAND_QUEUED time from the first and the CL_PROFILING_COMMAND_END time from the last.

I’m usually just interested in the kernel time because the rest of it is not something I can alter without buying a new card or computer, and/or in many cases it can be hidden by asynchronous memory transfers and is thus irrelevant.

Thanks again!! I was using gettimeofday() but it didn't seem very precise; probably I'm wrong. What is bad for me is that I've been trying to implement a version of the conjugate gradient method, but if the time measurement is correct, the GPU version is not faster, and even a bit slower, than a CPU version written in C. I will try to improve it if I can!!