Question about work-items

I have a question about one of the basic concepts in OpenCL, work-items. Suppose you have an array of 1,000,000 elements and you want to execute some code on each item which is completely independent from other items. Now you can have two scenarios to do so:
1- You can have a work-item for each element, which adds up to 1,000,000 work-items. As the GPU likely would not have this number of PEs to assign to each work-item, I think some work-items will have to wait until the others are completed. Am I correct? How are work-items mapped to PEs during runtime?
2- Now suppose you want to unroll the parallel algorithm so that each work-item deals with more than one element. For example, if the total number of PEs is 100, then each work-item is responsible for processing 10,000 elements. How can I achieve this, assuming that I don’t know the number of PEs in the GPU?

I would really appreciate any suggestions!

Work-items in the same work-group are executed in parallel.

Why don’t you know the number of PEs? You can divide the number of data elements by CL_DEVICE_MAX_WORK_GROUP_SIZE.

But the question is, why do you want to “unroll”?

PE isn’t an OpenCL term; perhaps you mean CU. Not that it is terribly important.

  1. It just runs in batches of the size the hardware can run concurrently, until they’re all done. The number it can run concurrently depends on the hardware: how many CUs there are on the card, how many threads can run at the same time, etc.

  2. Err, write a loop? You answer your own question here.
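A loop of that kind can be sketched roughly as follows. This is a plain-C emulation rather than an actual kernel: `work_item_body` and `run_ndrange` are made-up names, and in OpenCL C the `gid` and `global_size` values would come from `get_global_id(0)` and `get_global_size(0)` instead of being passed in. The doubling is only placeholder work.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-in for the body of one work-item (hypothetical name).
 * In a real kernel, gid would be get_global_id(0) and
 * global_size would be get_global_size(0). */
void work_item_body(size_t gid, size_t global_size, float *data, size_t n)
{
    assert(global_size > 0 && gid < global_size);
    /* Stride loop: this work-item handles elements
     * gid, gid + global_size, gid + 2 * global_size, ... */
    for (size_t i = gid; i < n; i += global_size) {
        data[i] = data[i] * 2.0f;   /* placeholder per-element work */
    }
}

/* Sequentially emulate launching global_size work-items (hypothetical helper). */
void run_ndrange(size_t global_size, float *data, size_t n)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        work_item_body(gid, global_size, data, n);
    }
}
```

Written this way, the number of work-items is independent of the number of elements, so the same code works whether you launch 100 work-items or 1,000,000, and you never need to know the PE count.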

Oops, I got that wrong; for some reason I thought PE was a vendor-specific term, like ‘wavefront’.

In any case, the number of PEs cannot be queried from code and is more of an implementation detail. The number of CUs can be queried (CL_DEVICE_MAX_COMPUTE_UNITS), and thus used to fit an algorithm to a device.
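As a sketch of what fitting to the device could look like: the helper below picks a global work size from the CU count. `choose_global_size`, `local_size` and `groups_per_cu` are hypothetical names and tuning knobs, not part of OpenCL; only the CU count is something a host program would actually read, via `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`. It is plain C so it can be tried without an OpenCL installation.

```c
#include <assert.h>
#include <stddef.h>

/* Pick a global work size from the compute-unit count (hypothetical helper).
 * n             : number of data elements
 * compute_units : value a host would read with
 *                 clGetDeviceInfo(..., CL_DEVICE_MAX_COMPUTE_UNITS, ...)
 * local_size    : chosen work-group size (assumed tuning value)
 * groups_per_cu : assumed oversubscription factor, to keep every CU busy */
size_t choose_global_size(size_t n, size_t compute_units,
                          size_t local_size, size_t groups_per_cu)
{
    assert(n > 0 && compute_units > 0 && local_size > 0 && groups_per_cu > 0);
    size_t wanted = compute_units * groups_per_cu * local_size;
    /* No point launching more work-items than there are elements,
     * rounded up to a whole number of work-groups. */
    size_t needed = ((n + local_size - 1) / local_size) * local_size;
    return wanted < needed ? wanted : needed;
}
```

For 1,000,000 elements on a device reporting 8 CUs, with a work-group size of 64 and 4 groups per CU, this gives 2048 work-items, each of which would then loop over several hundred elements. Good values for the two knobs depend on the device and the kernel, so treat them as starting points to measure against.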

Thanks for your replies, but I didn’t actually get my answer.
Let me ask my question in a different way: according to the specification there is no limit on the global number of work-items, but logically, at any given time, there should be a limit on the number of work-items that can be mapped to real hardware processing elements. In this regard, what is the maximum number of work-items that can run simultaneously on a GPU?


I answered that. “batches of the size the hardware can run concurrently”

How big this number is depends entirely on the hardware (we don’t know what you’re using, and even if we did, we don’t know enough about it to tell you accurately) and on your code (which we know nothing about). So there’s no possible way to be any more specific than that.
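To put the “batches” idea into numbers, here is an idealized bit of arithmetic in C. `batch_count` and `concurrent` are hypothetical: as far as I know OpenCL has no single query that returns how many work-items can be resident at once, and real schedulers hand out work-groups as resources free up rather than in strict lockstep, so this is only a mental model.

```c
#include <assert.h>
#include <stddef.h>

/* Idealized number of batches needed when at most `concurrent`
 * work-items can run at the same time (hypothetical figure that
 * depends on the device and on the kernel's resource usage). */
size_t batch_count(size_t total_work_items, size_t concurrent)
{
    assert(concurrent > 0);
    /* Ceiling division: a partially filled last batch still counts. */
    return (total_work_items + concurrent - 1) / concurrent;
}
```

With 1,000,000 work-items and an assumed 2048 running concurrently, that comes to 489 batches, the last one only partly full.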

You have to study the details of the specific hardware and vendor implementation, and your own code to determine this. Or at least run it on a given piece of hardware and see what the profiler tells you it did.

Thanks notzed,

I didn’t ask for the number of work-items on my GPU :) Of course it is hardware-dependent and different for each device. I was asking a more general question about work-item assignment and scheduling. However, I think I have found my answer.