Great performance loss on conditionals

sansan · April 2, 2010, 6:31am

Hello everybody!

Although I am only just getting started with OpenCL, I have already noticed a strange effect.
Namely, my OpenCL code works very efficiently until I add a very simple “if” statement that chooses the larger of two floats.
By my estimate, it consumes about 2/3 of the algorithm’s total running time.
I have tried various ways to avoid using “if”:

calling max() and fmax(),
using a formula based on sign(x - y)

but it is always too slow.
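For readers who want to see what such alternatives might look like, here is a minimal sketch in plain C rather than OpenCL C. The exact formula used in the original code is not shown in this thread, so the sign-based version below is an assumption, and the function names are hypothetical. In a kernel, the built-in fmax() would be the library counterpart of these helpers.

```c
/* Plain "if" version, as described in the question. */
float max_if(float x, float y)
{
    if (x > y)
        return x;
    return y;
}

/* Ternary select: compilers can often map this to a select or
   predicated instruction instead of a real branch. */
float max_select(float x, float y)
{
    return (x > y) ? x : y;
}

/* Arithmetic version based on sign(x - y).
   Assumed formula: max = (x + y + sign(d) * d) / 2, where d = x - y.
   Rounding may differ from the versions above when x and y have
   very different magnitudes. */
float max_sign(float x, float y)
{
    float d = x - y;
    float s = (float)((d > 0.0f) - (d < 0.0f)); /* -1, 0 or +1 */
    return 0.5f * (x + y + s * d);
}
```

All three return the same value for ordinary inputs, so if each of them is equally slow, the branch itself is probably not the real cost.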

Because of this, the same algorithm runs 3 times faster on an AMD Athlon™ II X4 620 (2.61 GHz) CPU than on the NVIDIA GeForce 9600 GT I’m using as the OpenCL device. So the idea of GPU-based computing seems quite “unripe”…

Are there any general recommendations on how to avoid a dramatic performance loss on conditional statements? Or is it unavoidable?

coleb · April 2, 2010, 7:59am

What you’re seeing is quite common on GPU hardware, which is not designed for code with complex branching and decision making. What seems odd is that you’re seeing it on such a small branch, which should be handled by instruction predication just fine.

Are you sure you’re fully utilizing the GPU hardware? What is your global size and your work group size? Feel free to post some example code here as well.

dbs2 · April 4, 2010, 4:52am

I agree with coleb. It seems like you’re running into problems with instruction divergence. (google opencl instruction divergence) However, it also seems strange that this should be so much of a problem since simple branches should be automatically predicated. One thing you can try is putting in a barrier after the conditional so the work-items resync at that point to avoid further instruction divergence in the execution. This might be particularly useful if you have a conditional branch at the beginning followed by a lot of computation.
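To make “instruction divergence” a bit more concrete, here is a rough host-side model in plain C. It is not OpenCL code and does not measure anything on a GPU. It only counts how many SIMD groups would contain work-items on both sides of a branch. The group width of 32 matches NVIDIA warps but is an assumption for other hardware, and the function name is hypothetical.

```c
#include <stddef.h>

#define WARP_SIZE 32 /* assumed SIMD width; 32 on NVIDIA GPUs */

/* Count the warp-sized groups in which the branch "a[i] > b[i]" is
   taken by some work-items and not by others. Such groups would have
   to execute both sides of the branch. */
size_t count_divergent_warps(const float *a, const float *b, size_t n)
{
    size_t divergent = 0;
    for (size_t start = 0; start < n; start += WARP_SIZE) {
        size_t end = (start + WARP_SIZE < n) ? start + WARP_SIZE : n;
        int taken = 0, not_taken = 0;
        for (size_t i = start; i < end; ++i) {
            if (a[i] > b[i])
                taken = 1;
            else
                not_taken = 1;
        }
        divergent += (size_t)(taken && not_taken);
    }
    return divergent;
}
```

A count close to the total number of groups would suggest that the data makes neighbouring work-items disagree on the branch most of the time.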

sansan · April 4, 2010, 10:46pm

You were right!!!
The problem was that I had not been using the GPU in the right way, i.e. the work group size was zero.
Correcting this by passing a non-zero local_work_size to the clEnqueueNDRangeKernel() call (16 x 16 in my case) gave me what I needed.
Now the GeForce 9600 GT is faster than the CPU (though not by much).
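For anyone hitting the same issue, here is a minimal sketch of how the sizes could be prepared on the host. With an explicit local_work_size, each global dimension has to be a multiple of the corresponding local dimension, so the sketch pads the global size. The names round_up, choose_sizes, width and height are hypothetical, and the 16 x 16 group is simply the value that worked in my case.

```c
#include <stddef.h>

/* Round n up to the nearest multiple of the work-group dimension. */
size_t round_up(size_t n, size_t multiple)
{
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Fill local[] with 16 x 16 and global[] with the problem size padded
   so that each dimension is divisible by the local size. */
void choose_sizes(size_t width, size_t height,
                  size_t global[2], size_t local[2])
{
    local[0] = 16;
    local[1] = 16;
    global[0] = round_up(width, local[0]);
    global[1] = round_up(height, local[1]);
}

/* Host-side call, assuming queue and kernel have already been created:
 *   clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
 *                          global, local, 0, NULL, NULL);
 * Because of the padding, the kernel should return early for
 * work-items whose coordinates fall outside width x height. */
```

The best group size is likely to differ between devices, so it is worth trying a few values rather than treating 16 x 16 as a rule.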

Thanks a lot for a good idea!
It was not that obvious from the OpenCL manual how to approach the problem.

dbs2 · April 5, 2010, 7:53am

That sounds a bit strange. A local work-group size of 0 should give an error. (NULL should be fine, as the driver will attempt to pick a good one.) I’d still suggest investigating why the local work-group size should have such an impact on the if statement.

sansan · April 6, 2010, 2:42am

I am sorry for my poor English!
I meant precisely a NULL pointer passed as the local_work_size argument.
It seems that NVIDIA’s current OpenCL implementation is not very good at splitting the global work-items into work-groups automatically.

A slightly off-topic point (addressed to the forum administrators): wouldn’t it be worth creating a sort of “OpenCL troubleshooting” sub-forum, or just a single “always on top” thread, to collect OpenCL programming mistakes and their resolutions in one place?