How do I create the memory for a block or workgroup?

How can I create a buffer for a workgroup or block on the GPU so that I can execute threads per work item?

Sorry jai, I don’t understand your question. Are you asking about clCreateBuffer?

I mean, if I have to decide the size of the workgroup manually, how can I do it, and what workgroup size does OpenCL use by default?

Ah, choosing workgroup sizes. If you don’t specify a size, the OpenCL implementation picks one for you. That default is implementation specific and, in my experience, not even close to optimal.

For GPUs, it should be a multiple of 32 because threads are arranged in multiples of 32. A thread warp on Nvidia is 32 threads, scheduled as either two batches of 16 threads or all 32 threads at once. AMD uses wavefronts of 64 threads in their Graphics Core Next architecture, again a multiple of 32.
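To see why the multiple matters, here is a small sketch of the arithmetic. It assumes the hardware runs a workgroup as whole warps/wavefronts, so a size that is not a multiple leaves lanes idle in the last one. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Number of warps (or wavefronts) needed to run one workgroup.
 * warp_size is 32 on Nvidia and 64 on AMD GCN, as described above. */
static size_t warps_needed(size_t work_group_size, size_t warp_size)
{
    assert(warp_size > 0);
    return (work_group_size + warp_size - 1) / warp_size;
}

/* Lanes in the last warp that have no work item assigned to them. */
static size_t idle_lanes(size_t work_group_size, size_t warp_size)
{
    return warps_needed(work_group_size, warp_size) * warp_size
           - work_group_size;
}
```

For example, a workgroup of 100 threads with a warp size of 32 needs 4 warps and leaves 28 lanes unused, while a workgroup of 128 leaves none.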

The maximum size is a little harder to determine as it is limited by the maximum number of threads per workgroup as well as the maximum number of registers and maximum shared memory per compute unit/multiprocessor. That said, keeping to a multiple of 32 will give you the best performance as that will allow you to fully use the GPU. The more threads you can fit in, the better.

And a final note. Multiple workgroups can execute simultaneously on a compute unit/multiprocessor, up to some device dependent limit. So you should try to find the workgroup size that maximises the number of threads per compute unit/multiprocessor, keeping in mind that you will want multiple workgroups per compute unit/multiprocessor.

OK, but how can I do it programmatically? If you could tell me, it would help a lot.

Doing it fully automatically is tricky if not impossible. The important things you need to know about your kernel are the number of registers and the amount of shared memory that it uses. There is no standard way to get the register count at run time, although clGetKernelWorkGroupInfo with CL_KERNEL_LOCAL_MEM_SIZE will report the local (shared) memory a kernel uses. For AMD, you can use KernelAnalyzer to get these parameters for a range of GPUs and then just hard code those numbers into your code. For Nvidia, you can use the -cl-nv-verbose flag when building the kernel and parse the build log to determine how many registers are being used. Sadly, these values change with driver version and hardware.
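As a rough sketch of the Nvidia build log parsing: the code below assumes the log (fetched with clGetProgramBuildInfo and CL_PROGRAM_BUILD_LOG) contains a phrase like "Used 12 registers". That format is my assumption from logs I have seen and may differ between drivers, so treat it as a starting point. The function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the register count from the first "Used N registers" phrase
 * in the build log, or -1 if no such phrase is found.
 * ASSUMPTION: the log format is not guaranteed by any specification. */
static int parse_register_count(const char *log)
{
    assert(log != NULL);
    const char *p = strstr(log, "Used ");
    while (p != NULL) {
        int regs = 0;
        int consumed = 0;
        /* %n is only stored if the literal "registers" matched too. */
        if (sscanf(p, "Used %d registers%n", &regs, &consumed) == 1
            && consumed > 0) {
            return regs;
        }
        p = strstr(p + 1, "Used ");
    }
    return -1;
}
```

If the function returns -1, fall back to a hard coded value or to the default workgroup size rather than guessing.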

With that info about your kernel hard coded, you can use the following procedure to identify the hardware:

  1. Parse the vendor/platform string looking for “amd”, “nvidia”, “intel”, etc. for each vendor you are going to tune your code for.
  2. Parse the device name strings for keywords like GPU code names. If you manage to match one of the code names then you have found enough information for AMD cards to look up the register and shared memory usage that you hard coded. No idea if this works for Intel CPUs.
  3. For Nvidia, you can use Nvidia extensions for getting the compute capability, but I don’t have experience with that.
  4. At this point, you know (hopefully) what the configuration of the hardware is and what resources your kernel needs. Now you need to work out the maximum number of threads that can fit on a single compute unit. Do this by dividing the available registers and shared memory on a compute unit by the amount of registers and shared memory required by your kernel. The smaller of these two values rounded down to a multiple of 32 (Nvidia) or 64 (AMD) gives you the maximum number of threads that can fit on a compute unit.
  5. Now this number of threads must be divided into equal groups that are smaller than the maximum allowed workgroup size returned by clGetDeviceInfo. You will also have to hard code the maximum number of threads that can execute on a compute unit as this determines how many workgroups can reside on a single compute unit. You have to find the workgroup size that maximises the total number of threads on a compute unit, i.e. workgroup size * number of workgroups that fit onto a compute unit.
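Steps 4 and 5 could be sketched roughly like this. All struct and function names are made up, the per-thread local memory figure is a simplification (local memory is often allocated per workgroup), and the limit on resident workgroups per compute unit is something you would have to hard code per device.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t regs_per_cu;         /* registers on one compute unit (hard coded) */
    size_t local_mem_per_cu;    /* bytes of local/shared memory on one CU */
    size_t max_threads_per_cu;  /* resident thread limit per CU (hard coded) */
    size_t max_groups_per_cu;   /* resident workgroup limit per CU (hard coded) */
    size_t max_work_group_size; /* CL_DEVICE_MAX_WORK_GROUP_SIZE */
    size_t warp_size;           /* 32 (Nvidia) or 64 (AMD) */
} device_limits;

typedef struct {
    size_t regs_per_thread;      /* from KernelAnalyzer or the build log */
    size_t local_mem_per_thread; /* bytes, 0 if the kernel uses none */
} kernel_usage;

/* Step 4: threads that fit on one compute unit, rounded down to a
 * multiple of the warp size. */
static size_t max_threads_on_cu(const device_limits *d, const kernel_usage *k)
{
    assert(d->warp_size > 0 && k->regs_per_thread > 0);
    size_t limit = d->regs_per_cu / k->regs_per_thread;
    if (k->local_mem_per_thread > 0) {
        size_t by_mem = d->local_mem_per_cu / k->local_mem_per_thread;
        if (by_mem < limit) limit = by_mem;
    }
    if (d->max_threads_per_cu < limit) limit = d->max_threads_per_cu;
    return (limit / d->warp_size) * d->warp_size;
}

/* Step 5: smallest workgroup size (a multiple of the warp size) that
 * maximises size * resident workgroups. Returns 0 if nothing fits. */
static size_t pick_work_group_size(const device_limits *d, const kernel_usage *k)
{
    size_t threads = max_threads_on_cu(d, k);
    size_t best_size = 0, best_total = 0;
    for (size_t wg = d->warp_size;
         wg <= d->max_work_group_size && wg <= threads;
         wg += d->warp_size) {
        size_t groups = threads / wg;
        if (groups > d->max_groups_per_cu) groups = d->max_groups_per_cu;
        if (groups * wg > best_total) {
            best_total = groups * wg;
            best_size = wg;
        }
    }
    return best_size;
}
```

With illustrative limits of 32768 registers, 1536 resident threads and 8 resident workgroups per compute unit, a kernel using 20 registers per thread comes out at a workgroup size of 192. Check the real limits for your device before relying on numbers like these.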

If you fail to identify the device at the start then you could default to letting the OpenCL implementation choose the workgroup size.
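A sketch of the vendor check and the fallback, assuming you have already read the vendor string with clGetPlatformInfo (CL_PLATFORM_VENDOR) or clGetDeviceInfo (CL_DEVICE_VENDOR). The enum and function names are hypothetical, and the keywords are just the ones I would try first. Passing NULL as local_work_size is how you let the implementation choose.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef enum { VENDOR_UNKNOWN, VENDOR_AMD, VENDOR_NVIDIA, VENDOR_INTEL } gpu_vendor;

/* Case-insensitive substring test (standard C has no strcasestr). */
static int contains_ignore_case(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    for (; *haystack != '\0'; ++haystack) {
        size_t i = 0;
        while (needle[i] != '\0' && haystack[i] != '\0' &&
               tolower((unsigned char)haystack[i]) ==
               tolower((unsigned char)needle[i])) {
            ++i;
        }
        if (needle[i] == '\0') return 1;
    }
    return needle[0] == '\0';
}

/* Step 1: map a vendor string to one of the vendors we tuned for. */
static gpu_vendor classify_vendor(const char *vendor_string)
{
    if (contains_ignore_case(vendor_string, "nvidia")) return VENDOR_NVIDIA;
    if (contains_ignore_case(vendor_string, "amd") ||
        contains_ignore_case(vendor_string, "advanced micro devices"))
        return VENDOR_AMD;
    if (contains_ignore_case(vendor_string, "intel")) return VENDOR_INTEL;
    return VENDOR_UNKNOWN;
}

/* Pointer to pass as local_work_size: NULL means "implementation decides". */
static const size_t *local_size_argument(gpu_vendor v, const size_t *tuned)
{
    return v == VENDOR_UNKNOWN ? NULL : tuned;
}
```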

OK, thanks for the detailed and useful answer. I have an Nvidia GPU with 96 cores and 2 multiprocessors, on Windows 7 64-bit with Visual Studio 2010. Now, how can I set the workgroup size so that a particular amount of processing is given to each workgroup? Is there a function for this? Please help.

You need to set the local_work_size parameter of clEnqueueNDRangeKernel. It is an array of size_t values, one per dimension, giving the number of threads in the workgroup along each dimension. I don’t have any code that will choose local_work_size for you; you will have to do that yourself.
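One thing to watch once you pick a size: in OpenCL 1.x the global work size must be evenly divisible by the local work size, otherwise the enqueue fails with CL_INVALID_WORK_GROUP_SIZE. A minimal sketch of the usual padding trick follows; the helper name is made up, and the queue and kernel in the comment are assumed to exist already.

```c
#include <assert.h>
#include <stddef.h>

/* Round the number of work items up to the next multiple of the
 * workgroup size so it can be used as global_work_size. */
static size_t padded_global_size(size_t work_items, size_t local_size)
{
    assert(local_size > 0);
    return ((work_items + local_size - 1) / local_size) * local_size;
}

/* With a real command queue and kernel (not created here), a 1D launch
 * would look something like this:
 *
 *   size_t local  = 192;                          // chosen workgroup size
 *   size_t global = padded_global_size(n, local); // n = number of items
 *   cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
 *                                       &global, &local, 0, NULL, NULL);
 */
```

Because the padded range can be larger than your data, the kernel should check get_global_id(0) against the real item count and return early for the extra threads.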