confusion compute unit/stream proc/warp/work gp/threads

OK, I am just getting into OpenCL, and I am quite confused about how the hardware groups map onto the software groups.

I have a GTX 285, which has 240 CUDA streaming processors. But when I run the device query program from the NVIDIA GPU Computing SDK for OpenCL, it shows


  1. So why is OpenCL showing 30 compute units when I have 240 streaming processors? I guess a compute unit is not a streaming processor? So then what is a compute unit?

  2. Now the max work group size is 512, so does that mean a work group can have at most 512 threads/work-items? But I can have any number of work groups of any dimension?

  3. A work group is obviously a logical abstraction, so does a work group span multiple stream processors? E.g. a work group with more work-items than the warp size.

  4. What is the logic of 16 KB of local memory being assigned to a work group? If a work group is logical and not hardware, how can it get a fast 16 KB local memory from a streaming processor? (This one drives me nuts.)

  5. Can 2 CPU threads have their own kernels, doing the same task in parallel? Obviously their data is local to them, so there is no need to sync them.

  6. Can I pre-allocate a 1D array of size n on a kernel at program load and use this same kernel, but a different instance, for each CPU thread?

i know too many questions, but a noobs gotta know all this before getting hands dirty or stick to sse

  1. A streaming multiprocessor (compute unit) has 8 streaming processors. Therefore 30*8=240.
  2. Yes. As long as the product of the work-group dimensions is <= 512 (and the kernel doesn’t need too many registers).
  3. Yes, it can have more than a warp, but no, work-groups do not span multiple compute units. A warp uses multiple stream processors already.
  4. A work-group all executes together on the same compute-unit. The compute-unit has physically only 16kB of memory, so for whatever size work-group you choose, they can only access 16kB of shared memory.
  5. Yes, if you have a global-size of 2 and a work-group size of 1, you will get one thread on each CPU.
  6. Not sure I really follow here. If you run a kernel on the CPU device you will run it across all CPU cores just the same way as a kernel on the GPU runs across all GPU stream processors at once. You only need one kernel and it will be run data-parallel on all cores. This is why you only see 1 CPU device on machines with multiple cores.

A ‘compute unit’ is a lump of hardware that executes ‘work groups’. A work group is, as you know, a collection of ‘work items’. On NVIDIA’s hardware each work item maps to a ‘CUDA thread’. A thread executes on a streaming processor (SP), and the collection of SPs that handles all the threads of a work group is called a ‘Streaming Multiprocessor’ (SM). This is NVIDIAish for ‘Compute Unit’. Remember that several threads can be assigned to the same SP, so a single SM can handle work groups larger than the number of SPs it contains. Or rather, each SM can hold several work groups, and at each clock cycle (or every fourth or whatever) it selects one of the work groups it holds, then a set of work items (threads) from that group, and finally lets its SPs execute instructions for the selected work items. The NVIDIA programming guide describes this rather elegantly in terms of ‘warps’.

A quick division gives that, in your case, there are #SPs/#SMs = <from your post>/<CL_DEVICE_MAX_COMPUTE_UNITS> = 240/30 = 8 SPs per SM, which can be verified in the specifications for your card.

It means that you can have at most 512 work items per work group, organized according to the limits imposed by CL_DEVICE_MAX_WORK_ITEM_SIZES. So in your case you can have work groups that are 1x512 or 2x128 or 4x8x16, but not 8x6x16 (>512) and not 1x1x512 (Z-dim>64). The global work size, a multiple of your chosen group size, can be as large as your problem requires. At least that is the intention. I believe NVIDIA has a limit on that as well, can someone confirm this?
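The two limits above are easy to check by hand, but a few lines of arithmetic make them concrete. This is only a sketch in plain Python, not an OpenCL call: the names `fits_device` and `round_up_global` are made up for illustration, and the default limits (512 work items per group, per-dimension maxima of 512 x 512 x 64) are assumptions based on the numbers quoted in this thread. On a real device you would query CL_DEVICE_MAX_WORK_GROUP_SIZE and CL_DEVICE_MAX_WORK_ITEM_SIZES instead.

```python
from math import prod

def fits_device(local_size, max_group_size=512, max_item_sizes=(512, 512, 64)):
    """Return True if a work-group shape respects both device limits.

    The default limits are assumed values for a GTX 285, not queried ones.
    """
    if len(local_size) > len(max_item_sizes):
        return False
    # Each dimension must stay within its own per-dimension maximum ...
    per_dim_ok = all(1 <= n <= limit
                     for n, limit in zip(local_size, max_item_sizes))
    # ... and the total number of work items must not exceed the group limit.
    return per_dim_ok and prod(local_size) <= max_group_size

def round_up_global(problem_size, local_size):
    """Round each global dimension up to the next multiple of the local size."""
    return tuple(-(-p // l) * l for p, l in zip(problem_size, local_size))
```

For example, `fits_device((4, 8, 16))` is true (exactly 512 work items), while `fits_device((8, 6, 16))` fails on the product (768) and `fits_device((1, 1, 512))` fails on the Z limit. If the global size is padded with `round_up_global`, the kernel has to skip the extra work items with a bounds check.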

This question is a bit oddly formulated, and the answer is hinted at in my answer to 1. above. The quick answer is no. A work group gets assigned to an SM when dispatched and stays on that SM until all its work items have finished. The oddity is the second part of your question. A work group can certainly contain more work items than the warp size. The effect is that, as the hardware steps one warp at a time, threads from different warps will progress differently, or out of sync, through the instruction stream (the .cl source). This is one of the reasons for the barrier() intrinsic.
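To get a feel for how a large work group breaks down into warps, here is a small Python sketch. It assumes a warp size of 32 and that consecutive work-item ids of a 1-D work group land in the same warp; `split_into_warps` is a hypothetical helper for illustration, not something the OpenCL API provides.

```python
def split_into_warps(work_items, warp_size=32):
    """Return the (first_id, last_id) range covered by each warp.

    Assumes a 1-D work group whose work items are packed into warps
    in order of their local id; the last warp may be partially filled.
    """
    return [(start, min(start + warp_size, work_items) - 1)
            for start in range(0, work_items, warp_size)]
```

A full 512-item work group gives 16 warps under these assumptions. Since the SM steps them one at a time, work item 0 and work item 511 may be at quite different points in the kernel, which is exactly the situation barrier() is there for.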

A side-effect of the fixed assignment of a work group to a specific SM is that local memory can be physically implemented inside the SM. Each work group is associated with a fixed amount of resources (registers, local memory) required for executing that group. When the kernel is launched, each SM accepts as many work groups as the hardware has resources for and DEDICATES these resources to those work groups. Therefore, each work group gets access to a subset of the 16k of high-performance (SRAM?) local memory available on its housing SM.
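A rough way to picture this resource accounting is the Python sketch below. Only the 16 KB of local memory per SM comes from this thread; the register count (16384), the cap of 1024 resident work items and the cap of 8 resident work groups per SM are assumed values for this hardware generation, and the function name `resident_groups_per_sm` is made up. Real hardware also allocates in coarser chunks, which this ignores.

```python
def resident_groups_per_sm(local_bytes_per_group, regs_per_item, items_per_group,
                           sm_local_bytes=16 * 1024,   # from the thread
                           sm_registers=16 * 1024,     # assumed
                           sm_max_items=1024,          # assumed
                           sm_max_groups=8):           # assumed
    """Estimate how many work groups one SM can hold at the same time."""
    # Hard caps on groups and on total resident work items.
    limits = [sm_max_groups, sm_max_items // items_per_group]
    # Local memory is carved up per work group.
    if local_bytes_per_group > 0:
        limits.append(sm_local_bytes // local_bytes_per_group)
    # Registers are needed for every work item of every resident group.
    if regs_per_item > 0:
        limits.append(sm_registers // (regs_per_item * items_per_group))
    # The scarcest resource decides.
    return min(limits)
```

With these assumed numbers, a kernel using 4 KB of local memory per group, 16 registers per work item and 256 work items per group would fit 4 groups on one SM. A group that asks for the full 16 KB leaves room for only one group per SM, and a request above 16 KB gives 0, meaning the kernel cannot be launched with that configuration.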

This I cannot answer. See this thread for a discussion of what I believe to be the same question.

This I don’t understand either. Arrays are not allocated on kernels, but in ‘contexts’, and in the end on ‘devices’. You can launch the same kernel twice, giving it the same buffer as a parameter, but on current NVIDIA hardware it will just run the kernels serially, or twice in this case as it is the same kernel. On Fermi-like architectures, where several kernel invocations can run concurrently, you would probably get undefined behavior.

I don’t think I understand your question. What are you trying to achieve?

First of all, a big thanks to dbs2 & ibbles. A lot of my doubts are now cleared. It feels like I can breathe some air.

-so basically a compute unit is a multiprocessor that contains 8 stream processors. On these stream processors you can have more than one thread running

-a logical work group binds to a hardware multiprocessor and never spans more than one multiprocessor, even if it has more threads than there are stream processors

-a work group can have more threads than the warp size, but internally the multiprocessor works on one warp at a time in a clock cycle?

About questions 5+6, what I wanted to know was whether I can run concurrent kernels of the same type on the GPU.
@ibbles=> What I am trying to achieve is this. My project is connected to multiple cameras. Each camera is processed by a different CPU thread. Sometimes I need to do some ops on a frame, so I send it to the GPU. But the same op is also done by another thread on a different camera frame. So my question was whether a kernel can run concurrently with itself on the GPU, and I guess your answer was no.

thanks again!

In your case, as with most NVIDIA GPUs, there are 8 stream processors on a multiprocessor. But there is nothing saying that there will always be eight of them, or even stream processors at all. They are an implementation detail of NVIDIA’s OpenCL implementation. It may work differently on ATI cards or ordinary processors.

Yes, at least for the time being.