Work-item cooperation possible?

Hi *,

It would be great if someone could give me some quick advice on this one:

  • I’ve written a kernel in which all work-items operate in parallel on the same chunk of data (say A); however, each work-item also has a lot of private memory of its own (B_i).
  • Each work-item computes some matrix multiplications using its private data (B_i) and the common data (A).
    So, in effect, they all work in parallel on the same (large) A, but each work-item performs its matrix multiplications on A sequentially, with different numbers (B_i).

When B_i (the private memory per work-item) gets too large, I get register spilling. However, I need every work-item to carry this amount of memory, because the private data (B_i) is different for each work-item.

So my question is:

  • Can I give the work-items within a workgroup different tasks? This would mean that item 1 does its original job of processing A, and all other work-items wait until item 1 needs a matrix multiplication, which items 2-256 then compute for it.
  • Item 1 would be the master, and the others would exist only to compute matrix multiplications.
  • The problem is that this is severe branching, and I can’t imagine it being efficient, since the items are all part of the same wavefront.

How can I tell some “master kernels” to do job A and “call” other kernels to do some minor work for them? I know that there is no sync among different workgroups, so is there any other way?

Thanks for reading!


I don’t get it. What is that master item supposed to do? Where do the B_i come from? Can’t you just assign a job to each workgroup on the host side and call a kernel that performs those multiplications?

Alright, I’m training an LSTM network (a recurrent long short-term memory net) via PSO (particle swarm optimization). All particles are launched in parallel (as OpenCL kernels), but they process the same time series data. However, each particle is testing a different LSTM network, hence each particle (= each kernel) needs lots of private memory for itself.
If I just use the GPU for the matrix multiplications, I would need to write to the GPU, launch kernels, and read from the GPU as often as I have time series values. I guess that’s too much overhead. So I was thinking of a hybrid approach: work-item 1 is one LSTM network and processes the time series data, and the other work-items do matrix multiplications for item 1. Is that possible?

Simply split the computation into two kernels. The first, kernelComputeCoefficients, runs over NUM_OF_NETWORKS * NUM_OF_NEURONS work-items; the second, kernelSimulateNetworks, runs over NUM_OF_NETWORKS work-items. You have to make sure both run longer than a couple of ms (otherwise the launch overhead dominates), but overall this is pretty efficient.
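To make the index split concrete, here is a rough CPU-side sketch in plain C, not actual OpenCL code. Each loop iteration stands in for one work-item of the corresponding kernel. The function names, the tiny sizes, the memory layout, and the "sum" that replaces the real LSTM update are all assumptions made for illustration only.

```c
#include <assert.h>

/* Hypothetical, tiny sizes chosen only for illustration. */
enum { NUM_OF_NETWORKS = 2, NUM_OF_NEURONS = 3, NUM_OF_INPUTS = 2 };

/* Stand-in for kernelComputeCoefficients: one iteration per work-item,
   with gid = net * NUM_OF_NEURONS + neuron (assumed layout).
   weights: [networks][neurons][inputs]  (the per-network data, B_i)
   input  : [inputs]                     (the shared data, A)
   coeff  : [networks][neurons]          (output) */
void compute_coefficients(const float *weights, const float *input, float *coeff)
{
    assert(weights && input && coeff);
    for (int gid = 0; gid < NUM_OF_NETWORKS * NUM_OF_NEURONS; ++gid) {
        float acc = 0.0f;
        for (int k = 0; k < NUM_OF_INPUTS; ++k)
            acc += weights[gid * NUM_OF_INPUTS + k] * input[k];
        coeff[gid] = acc;
    }
}

/* Stand-in for kernelSimulateNetworks: one iteration per network.
   The real LSTM state update is replaced by a plain sum (placeholder). */
void simulate_networks(const float *coeff, float *state)
{
    assert(coeff && state);
    for (int net = 0; net < NUM_OF_NETWORKS; ++net) {
        float sum = 0.0f;
        for (int n = 0; n < NUM_OF_NEURONS; ++n)
            sum += coeff[net * NUM_OF_NEURONS + n];
        state[net] += sum;
    }
}
```

On the device, the first function would become a kernel with many small work-items (few registers each), and the second a kernel with one work-item per network, so the heavy private data no longer has to live in the registers of a single work-item.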

Your way should work with a simple if statement, though.

__kernel void simulation(/* kernel arguments */) {
    // Each work-item should perform this on multiple entries for the barrier to work
    for (int i = get_global_id(0); i < get_global_id(0) + JOBS_PER_WORK_ITEM; ++i) {
        // Do stuff
        work_group_barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE); // barrier(...) before OpenCL 2.0
        if (get_local_id(0) == 0) {
            // Simulate stuff
        }
    }
}

It should not be TOO disastrous divergence-wise (i.e., at least 75% of the cores stay busy on AMD, which is not too awful, and it will be even better on NVIDIA), but GPU occupancy can be low if the simulation part requires a lot of registers.
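For what it’s worth, here is a serial plain-C emulation of the data flow inside one work-group. It is only a sketch: the inner loop over lid stands in for the work-items running in parallel, and the comments mark where the barriers would sit on the device. GROUP_SIZE, VEC_LEN, run_group, and the placeholder update rule are all made up for this example. One thing it makes visible: if the master writes data that the workers read again in the next iteration, a second barrier after the master-only section is needed as well.

```c
#include <assert.h>

/* Hypothetical sizes: 4 "work-items" per group, state vector of length 3. */
enum { GROUP_SIZE = 4, VEC_LEN = 3 };

/* Serial emulation of one work-group processing `steps` time-series values.
   matrix: [GROUP_SIZE][VEC_LEN], one row per worker
   series: [steps], the shared time series
   Returns x[0] of the state kept by the master (local id 0). */
float run_group(const float *matrix, const float *series, int steps)
{
    float x[VEC_LEN] = {0};       /* shared state; __local memory on the device */
    float scratch[GROUP_SIZE];    /* per-worker results; also __local */
    assert(matrix && series && steps >= 0);

    for (int t = 0; t < steps; ++t) {
        /* Parallel section: on the device all work-items run this at once. */
        for (int lid = 0; lid < GROUP_SIZE; ++lid) {
            float acc = 0.0f;
            for (int k = 0; k < VEC_LEN; ++k)
                acc += matrix[lid * VEC_LEN + k] * x[k];
            scratch[lid] = acc;
        }
        /* First barrier here: scratch[] must be complete before it is read. */

        /* Master-only section: if (get_local_id(0) == 0) { ... } on the device.
           The update rule below is a placeholder, not an LSTM step. */
        float total = 0.0f;
        for (int lid = 0; lid < GROUP_SIZE; ++lid)
            total += scratch[lid];
        for (int k = 0; k < VEC_LEN; ++k)
            x[k] = series[t] + total;
        /* Second barrier here: x[] must be updated before workers read it. */
    }
    return x[0];
}
```

While the master-only section runs, the other work-items of the group sit idle at the barrier, which is where the divergence and occupancy cost comes from.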

Thanks a lot! I’ll try it this way.