I’m solving a 2D problem where I only need to process work-items with i>j. I could process all work-items, but the result of the [i,j] item is guaranteed to be the same as that of the [j,i] item, so computing both symmetric items is a waste of resources.
What would be the most efficient way of doing it?
Just make work-items return without doing anything when j>=i?
Or perhaps enqueue the kernel several times with different global work sizes (one enqueue per row, so that each row only creates work-items with i>j)?
Or doing this: http://stackoverflow.com/questions/24021305/opencl-efficient-way-to-group-a-lower-triangular-matrix ?
(Btw, the advice at that URL may have been written by the same Dithermaster who posts in this forum.)
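To make the third option concrete, here is a minimal sketch in plain C (not kernel code) of one common way to enumerate only the i>j pairs: launch a 1D range of n*(n-1)/2 items and convert each linear index k back into (i, j). I have not checked that this is exactly what the linked answer does. The names `isqrt_ul` and `tri_index_to_ij` are my own, and I'm assuming row i holds the i items j = 0..i-1, so row i starts at linear index i*(i-1)/2.

```c
/* Integer square root: largest r such that r*r <= x (no libm needed). */
static unsigned long isqrt_ul(unsigned long x)
{
    unsigned long r = 0;
    unsigned long bit = 1UL << (sizeof(unsigned long) * 8 - 2);

    while (bit > x)          /* start at the highest power of 4 <= x */
        bit >>= 2;
    while (bit != 0) {
        if (x >= r + bit) {
            x -= r + bit;
            r = (r >> 1) + bit;
        } else {
            r >>= 1;
        }
        bit >>= 2;
    }
    return r;
}

/* Hypothetical helper: map a linear index k (0 <= k < n*(n-1)/2)
 * to a pair (i, j) with i > j >= 0.
 * Assumes 8*k + 1 does not overflow unsigned long. */
static void tri_index_to_ij(unsigned long k, unsigned long *i, unsigned long *j)
{
    /* Largest i with i*(i-1)/2 <= k, i.e. i = floor((1 + sqrt(1 + 8k)) / 2). */
    unsigned long row = (1 + isqrt_ul(1 + 8 * k)) / 2;

    *i = row;
    *j = k - row * (row - 1) / 2;   /* offset inside row i */
}
```

In a kernel, k would presumably come from `get_global_id(0)` of a 1D launch, and the square root could be done in floating point with a correction step instead of the integer routine above; that part is an assumption on my side.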
The advice at that URL scares me a bit. Wouldn’t my first idea above (i.e., just making work-items return when j>=i) already be efficient? I tend to believe that when a work-item returns, the OpenCL runtime starts executing a new work-item as soon as a compute unit is able to take one, so maybe that is efficient too. Is that right?
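To put a number on what the first idea costs, here is a small plain C sketch (the function names are mine, purely illustrative) that counts how many work-items a full n x n launch creates, how many of them satisfy i>j, and how many would return right away. It only counts work-items; it says nothing about how a particular device schedules them.

```c
/* Work-items created by a full n x n global range. */
static unsigned long launched_items(unsigned long n)
{
    return n * n;
}

/* Work-items that do real work: the pairs with i > j. */
static unsigned long useful_items(unsigned long n)
{
    return n * (n - 1) / 2;
}

/* Work-items that would return immediately because j >= i. */
static unsigned long idle_items(unsigned long n)
{
    return launched_items(n) - useful_items(n);
}
```

For n = 4 that gives 16 launched, 6 useful and 10 returning immediately, so slightly more than half of the launched work-items do nothing, and that fraction approaches one half as n grows.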
Thanks!