Workgroups and global IDs

I have a beginner's question about how to set the number of work items correctly for my program.

My program does some calculations on a grid, and my parallel calculation runs column-wise. All the calculations on a column need to finish before the next column starts, so my plan is a single work group in which every work item processes one row, with a barrier after each column is finished. I cannot use more than one work group, because each cell depends on all the values in the previously calculated columns of the grid.

I was then planning to use the global ID (get_global_id(0)) to set the row number for each work item, but if my program has a very large number of rows (e.g. 250k or 2.5m) I’m not sure that I can do this.

Can I use the global_id in this way, and if so how would I call clEnqueueNDRangeKernel?
Does my program structure sound sensible?
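To make the structure concrete, the single-work-group scheme can be emulated sequentially in Python. The recurrence used here (each cell depends on the whole previous column via a minimum, plus a local cost) is a hypothetical stand-in for the real calculation; the inner loop over rows plays the role of the work items, and the end of each column iteration corresponds to the barrier:

```python
def dp_single_workgroup(cost):
    """Emulate the single-work-group scheme: every 'work item' handles one
    row, and a barrier separates columns.  `cost` is a rows x cols grid;
    the recurrence (min of the previous column plus the local cost) is a
    hypothetical placeholder for the real calculation."""
    rows, cols = len(cost), len(cost[0])
    value = [[0] * cols for _ in range(rows)]
    for r in range(rows):                      # "work items" fill column 0
        value[r][0] = cost[r][0]
    for c in range(1, cols):                   # columns are strictly sequential
        prev_best = min(value[r][c - 1] for r in range(rows))
        for r in range(rows):                  # each iteration = one work item
            value[r][c] = prev_best + cost[r][c]
        # in the OpenCL kernel a barrier() would sit here, so that no work
        # item starts column c+1 before column c is complete
    return value
```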

Many thanks

Do you understand that the performance with a single work group will be very poor? The kernel will utilize just one compute unit, while an AMD 6950 has 22 CUs and a GF114 has 8 CUs. Even an Intel Core i5-2500 has 4 cores.

I do understand, thanks, but as I need the synchronisation I don’t believe I can split my work across more than one work group - it’s a dynamic programming problem, and as such each row and column depends on all the previous rows and columns.


What about organizing work in the following way:

  1. The kernel generates one value per work item for a single column, using the values from previous columns produced by earlier kernel runs.
  2. The global work size (number of work items) is the row count. 250k is more than enough to keep the device busy.
  3. The number of kernel runs (clEnqueueNDRangeKernel calls) is equal to the number of columns.
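Steps 1-3 can be emulated in plain Python. The `launch` helper below stands in for clEnqueueNDRangeKernel on an in-order queue, and the recurrence inside the hypothetical kernel body (min of the previous column plus a local cost) is only a placeholder for the real calculation:

```python
def launch(kernel, global_size, *args):
    """Stand-in for clEnqueueNDRangeKernel on an in-order queue: run every
    work item of a 1-D NDRange; successive launches execute in order."""
    for gid in range(global_size):     # gid plays the role of get_global_id(0)
        kernel(gid, *args)

def run_columns(cost):
    """One 'kernel launch' per column; within a launch every row can run in
    parallel because it only reads the previous, already finished column."""
    rows, cols = len(cost), len(cost[0])
    value = [[0] * cols for _ in range(rows)]

    def column_kernel(gid, c):         # hypothetical kernel body
        if c == 0:
            value[gid][0] = cost[gid][0]
        else:
            prev_best = min(value[r][c - 1] for r in range(rows))
            value[gid][c] = prev_best + cost[gid][c]

    for c in range(cols):              # one enqueue per column, in order
        launch(column_kernel, rows, c)
    return value
```

The column index `c` would be passed to the real kernel as an argument (clSetKernelArg) before each enqueue.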

But wait… Do you mean that the value of a specific cell depends not only on the values in the previous columns but also on some values in the same column?

The value is only dependent upon the previous columns.

This sounds like an interesting suggestion - an in-order command queue guarantees that the commands are executed in order, so it should work.

I knocked up a quick version using the method you suggested. The results are quite interesting: if I run this OpenCL code on the CPU (max work group size = 1) it’s about 25% faster than a software equivalent; if I run it on my GPU (max work group size = 320) it’s about 50% slower.

My approach with barriers, on the other hand, runs about 80% faster on the CPU (OpenCL) and 15% faster on the GPU, relative to my pure software benchmark.

Strange results, as in theory the GPU should blitz this, but I guess the memory transfers are too expensive.

I don’t understand what the workload looks like. You’ve described it as a grid - that’s all I get. What are the columns? What are the rows? Can you execute all the items in a column in parallel, and once that is finished execute the next column?

What is the typical height and width of that grid?

What is the global work size?

In my test I was running on a grid of about 1000 rows * 10k columns.
Every item in a column can be executed in parallel. So I have two approaches at the moment:

  1. A single work group, where every work item is assigned to a row (so the entire work group works on one column). I then use a barrier so that no work item moves on to the next column early.
    I call clEnqueueNDRangeKernel numCols/workGroupSize times, with both the global and local work sizes set to workGroupSize (320 on my GPU for this kernel).

  2. Maxim’s suggestion, where each kernel run works on a single column. I use multiple work groups to maximise the number of work items.
    I call clEnqueueNDRangeKernel numCols times, and the in-order command queue thus ensures that every column is processed in sequence.
    Perhaps if I rotated the grid so that there were more rows than columns this approach would perform better.

As per your suggestion, I set the global work size to the number of rows in the grid (and pass NULL for the local work size to let the implementation decide how best to allocate the work between groups).

  1. 1000 work items are most probably not enough to fully load a high-end GPU.
  2. Make the global work size a multiple of 128. So if it is 10,000, change it to 10,112. In fact, you can experiment with that number - try a multiple of 64 or 256 instead. The reason is that some implementations are VERY bad at choosing the local work size.
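Padding the global work size means some work items fall outside the real problem, so the kernel needs an early-exit guard, exactly like `if (get_global_id(0) >= n) return;` in OpenCL C. A small sketch of the rounding and the guard (the per-item work here is hypothetical):

```python
def round_up(n, multiple):
    """Round n up to the next multiple of `multiple`,
    e.g. a global size of 10000 becomes 10112 for multiple=128."""
    return ((n + multiple - 1) // multiple) * multiple

def kernel_with_guard(gid, n, out):
    """Padding the global size launches extra work items; they must exit
    early so they never touch memory outside the real problem."""
    if gid >= n:
        return                  # the padded work items do nothing
    out[gid] = gid * 2          # hypothetical per-item work
```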

I should say that my “GPU” is a 9400M part, running on Snow Leopard, so not exactly state of the art.

Thanks for the suggestion - I will give that a try

And one more piece of advice: if you need to enqueue a batch of kernels, it is better to issue all the clEnqueueNDRangeKernel calls back to back and only then call clFlush (or clFinish).
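The batching advice can be pictured with a toy model of an in-order queue - enqueueing is cheap and non-blocking, and only a final finish forces completion, so you pay the synchronisation cost once rather than per launch:

```python
class InOrderQueue:
    """Toy model of an in-order OpenCL command queue: enqueueing a command
    is non-blocking, and the work only has to be complete once finish()
    returns (the analogue of clFinish)."""
    def __init__(self):
        self._pending = []

    def enqueue(self, fn, *args):        # stands in for clEnqueueNDRangeKernel
        self._pending.append((fn, args))

    def finish(self):                    # stands in for clFinish
        for fn, args in self._pending:   # commands execute in enqueue order
            fn(*args)
        self._pending.clear()
```

With this pattern you would enqueue one command per column and call finish() once at the end, rather than finishing after every enqueue.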

There’s not a lot of parallelism in that problem if you can only execute 1000 work-items at a time. Calling clEnqueueNDRangeKernel() 10,000 times to execute 1000 work-items each time is not going to be very efficient.

I would try to find ways to expose more parallelism in the problem.