Has anyone compared OpenCL vs. OpenMP for some distributed tasks? I just wanted to know how much of a speed-up I can expect if I code in OpenCL.
It is demonstrated in the video tutorial, and it's quite a notable difference (when run on a GPU).
Unfortunately it’s impossible to answer that question. Here are a few parameters to consider:
- If your algorithm maps well to a GPU, then you can see 10x-100x speedup
- If your algorithm maps well to the vector units on a CPU and you’re not already using them, then you can see a 2-4x speedup
- If the OpenCL implementation has performance bugs, then you can see a slowdown
So without knowing more about your algorithm there’s no way to answer that question. The first things I would ask are:
- How data-parallel is your algorithm? (the more the better)
- How much inter-thread synchronization do you need? (the less the better)
- What is your computation-to-communication ratio? (the higher the better)
- Do you need double precision? (currently supported on only a few GPUs, and slower than single precision)
- How much data do you need? (GPUs generally have <2GB of storage)
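To make the computation-to-communication question concrete, here is a rough sketch in C that counts floating-point operations per byte moved for two textbook cases. The function names are made up for illustration, and the model is deliberately crude (single-precision data, every input and output transferred exactly once, no caching), so treat the numbers as an order-of-magnitude guide only.

```c
#include <assert.h>
#include <stddef.h>

/* Vector add c[i] = a[i] + b[i]: n flops, 3 arrays of n floats moved. */
double vector_add_ratio(size_t n)
{
    assert(n > 0);
    double flops = (double)n;
    double bytes = 3.0 * (double)n * sizeof(float);
    return flops / bytes;
}

/* Dense n x n matrix multiply: about 2*n^3 flops, 3 matrices of n*n floats moved. */
double matmul_ratio(size_t n)
{
    assert(n > 0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    double bytes = 3.0 * (double)n * (double)n * sizeof(float);
    return flops / bytes;
}
```

Under these assumptions the vector add stays at 1/12 flop per byte no matter how large n gets, while the matrix multiply ratio grows as n/6. The second kind of workload is the one more likely to justify the cost of moving data to a GPU.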
See the last section. Btw, it is a perf comparison on the same Phenom CPU; it should be faster on a GPU.
On a purely data-parallel operation (such as convolution) there is no reason OpenCL on a CPU should be any slower than OpenMP on the same CPU. They should both be able to split the work into large chunks and therefore have negligible overhead. If you are seeing OpenCL running significantly slower than OpenMP on such code it is likely to be due to performance issues with the OpenCL implementation.
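For reference, the OpenMP side of such a comparison could look roughly like the sketch below: a 1D "valid" convolution in C where each thread gets one large contiguous chunk of the output. The function name and signature are my own choices for illustration, not taken from the benchmark discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* 1D "valid" convolution: out[i] = sum_k in[i + ntaps - 1 - k] * taps[k].
   out must have room for n - ntaps + 1 elements. */
void convolve_omp(const float *in, size_t n,
                  const float *taps, size_t ntaps, float *out)
{
    assert(ntaps >= 1 && n >= ntaps);
    long n_out = (long)(n - ntaps + 1);
    long i;

    /* schedule(static) hands each thread one big contiguous block of
       outputs, so scheduling overhead is paid once per thread. */
    #pragma omp parallel for schedule(static)
    for (i = 0; i < n_out; ++i) {
        size_t base = (size_t)i;
        float acc = 0.0f;
        for (size_t k = 0; k < ntaps; ++k)
            acc += in[base + ntaps - 1 - k] * taps[k];
        out[i] = acc;
    }
}
```

Build with your compiler's OpenMP switch (for example `-fopenmp` with GCC); without it the pragma is ignored and the loop simply runs serially, which is handy as a baseline. Each output element is independent, so there is no synchronization inside the loop.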
One thing to consider is that the local work-group size is an artificial construct on today's CPUs. By that I mean CPU cores only run one work-item (=thread) at a time*, so there will be overhead from having multiple work-items in a work-group, as that has to be handled either through multiple threads (inefficient) or compiler tricks to approximate threading (complicated). GPUs physically execute multiple threads concurrently in a work-group, so this is a natural concept for them. It might be worth investigating using a local size of 1 and a global size of 2-4 * number of cores to see if you get better performance. That configuration should give you the best performance on current CPUs.
*SMT does run multiple threads on a core at once, but the OS sees them as separate threads which implies much more costly synchronization than the work-items on a GPU.
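A minimal sketch of how the sizes for that experiment could be computed, in plain C with no OpenCL dependency. The struct and function names are hypothetical; the idea is that each work-item then loops over its own chunk of the data inside the kernel instead of handling a single element.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: launch sizes for a CPU-oriented 1D NDRange. */
typedef struct {
    size_t global_size;          /* total number of work-items          */
    size_t local_size;           /* work-items per work-group (here 1)  */
    size_t items_per_work_item;  /* elements each work-item must cover  */
} cpu_ndrange;

/* groups_per_core is the multiplier from the post above (try 2 to 4). */
cpu_ndrange plan_cpu_ndrange(size_t n_elements, size_t n_cores,
                             size_t groups_per_core)
{
    assert(n_cores > 0 && groups_per_core > 0);
    cpu_ndrange p;
    p.local_size = 1;
    p.global_size = n_cores * groups_per_core;
    /* Round up so the chunks cover every element; the last work-item
       may get a short chunk and must check bounds. */
    p.items_per_work_item =
        (n_elements + p.global_size - 1) / p.global_size;
    return p;
}
```

The resulting `global_size` and `local_size` are what you would pass as the `global_work_size` and `local_work_size` arguments of `clEnqueueNDRangeKernel`. For one million elements on a 4-core CPU with a multiplier of 4, this gives 16 work-items of 62500 elements each. Whether this beats a larger work-group size depends on the implementation, so measure both.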
Ultimately perf depends on the particular (OCL or OMP) implementation.
On the AMD implementation for CPUs:
- work-items to OS-thread mapping is many-to-one.
- use large work-group size (definitely greater than 1) to get good performance.
- use vectorized ops to get faster performance than OMP.