OpenCL Intel x86_64 code generation

I recently got an ATI 5870 and recoded my neural network to run on it. The run time went from 200 sec. to 65 sec. Just for fun, I changed the device to CPU, and the time went down to 55 sec. I am very interested in finding out what OpenCL is doing to get this performance on my CPU, particularly the threading model it is using. Is this TBB? Pthreads? Where can I find out?

Also, I have code that takes an integer array and uses the int4 type to grab 4 values at a time. Is this using SSE2? Again, where can I find out?
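For reference, the kernel described probably looks something like this (a minimal sketch; the kernel name and argument layout are my assumptions, not the poster's actual code):

```c
/* OpenCL C kernel sketch: reads 4 ints at a time via int4.
 * Whether this becomes SSE2 on the CPU depends entirely on the
 * vendor's kernel compiler -- int4 is a hint, not a guarantee. */
__kernel void sum_int4(__global const int4 *in, __global int *out, int n4)
{
    size_t gid = get_global_id(0);
    if (gid >= (size_t)n4)
        return;
    int4 v = in[gid];                    /* one 128-bit load */
    out[gid] = v.x + v.y + v.z + v.w;    /* horizontal sum of the 4 lanes */
}
```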

First, what CPU are you running? Each OpenCL implementation has its own low-level compiler for kernel code. Some perform automatic vectorization and emit SSE instructions; some do not.

TBB and pthreads are high-level threading abstractions and are most probably not being used. I’m not sure what low-level threading models are being used though…

As for SSE2 or other vectorized instructions: some OpenCL SDKs ship offline compilers (the Intel OpenCL Offline Compiler and the ATI Stream Kernel Analyzer, for example) that let you inspect the assembly instructions being generated.

Depending on your code, some algorithms work better on CPU than GPU. If you are using extensive branching operations, for example, a CPU may be much better suited.

ah, you specified the CPU in the thread topic :stuck_out_tongue:

Well, Intel’s implementation most probably vectorizes code automatically and will be using SSE instructions extensively. To make sure of this, play with __attribute__((vec_type_hint(&lt;typen&gt;))) and see how it affects the generated assembly.
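For example (a sketch of my own; the kernel and type chosen are just illustrations), the hint is attached to the kernel like this:

```c
/* vec_type_hint declares which vector width the kernel was written
 * for; the CPU compiler can use it when deciding how to vectorize
 * across work-items. */
__kernel __attribute__((vec_type_hint(int4)))
void scale(__global const int4 *in, __global int4 *out)
{
    size_t gid = get_global_id(0);
    out[gid] = in[gid] * 2;
}
```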

The code in question is a simple loop with no branches: it sums a 200,000 × 100 array in groups of 100. This area of code is executed a lot, which is why I was thinking about loop unrolling and vectorization, as well as multi-threading.

I didn’t explicitly mention it, I am using an Intel 5520 Xeon processor.

Using OpenMP, I’ve been able to get very close to the performance of OpenCL on my CPU, within 10% of the OpenCL performance. I am going to try adding SSE intrinsics to the mix, and see if that establishes performance parity.

You should try to look at the generated assembly code for each version you have; it is usually quite obvious when SSE is being used (look for the XMM registers). OpenMP, in simple cases and depending on which implementation you are using, should be able to use SSE. If you inject your own SSE intrinsics, the compiler may no longer be able to do that, and you could see performance get worse. If you are skilled with SSE then you ought to be able to get the best performance on your hardware by using the intrinsics directly; however, on other processors your code will not be able to leverage any new instructions they introduce (e.g. Sandy Bridge’s AVX), whereas OpenMP and OpenCL will.

More important than these factors, however, is that your code sounds like very simple computation over a large amount of data. In that case you are likely to be memory bound, and it doesn’t really matter how you implement your loop as long as your memory accesses are done right.