How fast is OpenCL compared to OpenMP?

sayush · October 16, 2009, 12:39pm

Has anyone compared OpenCL with OpenMP for some distributed tasks? I just want to know how much of a speed-up I can expect if I write my code in OpenCL.

Jofo · October 17, 2009, 8:33am

Look here:
http://www.macresearch.org/opencl_episode1

It is demonstrated in the video tutorial, and it's quite a notable difference (when run on a GPU).

dbs2 · October 17, 2009, 11:39am

Unfortunately, it’s impossible to answer that question in general. Here are a few parameters to consider:

If your algorithm maps well to a GPU, you can see a 10x-100x speedup.
If your algorithm maps well to the vector units on a CPU and you’re not already using them, you can see a 2-4x speedup.
If the OpenCL implementation has performance bugs, you can see a slowdown.

So without knowing more about your algorithm there’s no way to answer that question. The first things I would ask are:

How data-parallel is your algorithm? (the more the better)
How much inter-thread synchronization do you need? (the less the better)
What is your computation-to-communication ratio? (the higher the better)
Do you need double precision? (supported on only a few GPUs, and currently slower than single precision)
How much data do you need? (GPUs generally have less than 2 GB of memory)
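
To make the computation-to-communication question concrete, here is a back-of-the-envelope sketch in Python. The function name, the 50x kernel speedup, and the 4 GB/s bus figure are all hypothetical numbers chosen for illustration, not measurements, and the model assumes transfers and compute do not overlap.

```python
def effective_speedup(cpu_seconds, kernel_speedup, bytes_moved, bus_bytes_per_second):
    """Estimate end-to-end speedup when offloading work to a GPU.

    Simplifying assumption: host<->device transfers and kernel
    execution happen one after the other (no overlap).
    """
    transfer_seconds = bytes_moved / bus_bytes_per_second
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds
    return cpu_seconds / gpu_seconds


# Hypothetical numbers: the kernel itself runs 50x faster on the GPU,
# and 1 GB has to cross a bus that sustains 4 GB/s.
GB = 1e9
compute_heavy = effective_speedup(10.0, 50.0, 1 * GB, 4 * GB)   # 10 s of CPU work
transfer_heavy = effective_speedup(0.2, 50.0, 1 * GB, 4 * GB)   # 0.2 s of CPU work
print(f"compute-heavy: {compute_heavy:.1f}x, transfer-heavy: {transfer_heavy:.2f}x")
```

With these made-up inputs the compute-heavy case keeps a speedup of roughly 22x, while the transfer-heavy case drops below 1x, i.e. it gets slower than staying on the CPU.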

udb · October 20, 2009, 9:41am

http://developer.amd.com/gpu/ATIStreamS … nCL_6.aspx

See the last section. Btw, it is a perf comparison on the same Phenom CPU; it should be faster on a GPU.

dbs2 · October 21, 2009, 1:47am

On a purely data-parallel operation (such as convolution) there is no reason OpenCL on a CPU should be any slower than OpenMP on the same CPU. They should both be able to split the work into large chunks and therefore have negligible overhead. If you are seeing OpenCL running significantly slower than OpenMP on such code, it is likely due to performance issues in the OpenCL implementation.

One thing to consider is that the local work-group size is an artificial construct on today's CPUs. By that I mean CPU cores only run one work-item (=thread) at a time*, so there will be overhead from having multiple work-items in a work-group, as that has to be handled either through multiple threads (inefficient) or compiler tricks to approximate threading (complicated). GPUs physically execute multiple threads concurrently in a work-group, so this is a natural concept for them. It might be worth investigating a local size of 1 and a global size of 2-4 times the number of cores to see if you get better performance. That configuration should give you the best performance on current CPUs.

*SMT does run multiple threads on a core at once, but the OS sees them as separate threads which implies much more costly synchronization than the work-items on a GPU.
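
The sizing suggestion above (local size 1, global size a small multiple of the core count) can be sketched as follows. This is plain Python with no OpenCL calls; `cpu_work_sizes` and `chunk_bounds` are hypothetical helper names, and the idea that each work-item processes one contiguous chunk of the input is an assumption about how the kernel would be written.

```python
import os


def cpu_work_sizes(num_cores, items_per_core=4):
    """Return (global_size, local_size): local size 1 and a global
    size of items_per_core (2-4 in the post) times the core count."""
    return num_cores * items_per_core, 1


def chunk_bounds(n, global_size, gid):
    """Half-open [start, end) range of elements handled by work-item gid."""
    chunk = (n + global_size - 1) // global_size  # ceiling division
    start = min(gid * chunk, n)
    end = min(start + chunk, n)
    return start, end


num_cores = os.cpu_count() or 1  # cpu_count() may return None
global_size, local_size = cpu_work_sizes(num_cores)
print(global_size, local_size, chunk_bounds(1_000_000, global_size, 0))
```

Inside a kernel, each work-item would loop over its own `[start, end)` range, so the per-work-item overhead is paid only a handful of times per core. Whether this beats a larger local size depends on the implementation, so it is worth measuring both.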

udb · October 21, 2009, 11:24am

Ultimately perf depends on the particular (OCL or OMP) implementation.

On the AMD implementation for CPUs:

The work-item to OS-thread mapping is many-to-one.
Use a large work-group size (definitely greater than 1) to get good performance.
Use vectorized ops to get better performance than OMP.