Vectorization on various OpenCL implementations

I recently made the following change to an OpenCL code of mine, and the code gained 40% speed when running on Intel’s OpenCL.

https://github.com/fangq/mcxcl/commit/4bcfebdf37fb36fba56fd9bb46c12771e21a64b1#diff-3e7bff849d973dfbbbf2ff6591ee8862L216

The above change manually vectorizes several consecutive scalar statements into operations on a float4 short vector.
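To illustrate the kind of change meant here, below is a hypothetical sketch, not the code from the commit. It is a plain C model of advancing a position along a direction by a step length. The `vec4` struct and both function names are invented for illustration; in OpenCL C the second form would be a single `float4` expression.

```c
#include <assert.h>

/* Hypothetical stand-in for OpenCL's float4; not a real OpenCL type. */
typedef struct { float x, y, z, w; } vec4;

/* Before: three consecutive scalar statements, one per component. */
static void advance_scalar(float p[3], const float v[3], float len)
{
    assert(p && v);
    p[0] += v[0] * len;
    p[1] += v[1] * len;
    p[2] += v[2] * len;
}

/* After: one operation on the whole short vector.
   In OpenCL C, with float4 p and v, this body is just: return p + v * len; */
static vec4 advance_vec4(vec4 p, vec4 v, float len)
{
    vec4 r = { p.x + v.x * len, p.y + v.y * len,
               p.z + v.z * len, p.w + v.w * len };
    return r;
}
```

Both forms compute the same values; the difference is only whether the compiler is handed three scalar statements or one 4-wide expression.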

My questions are:

  1. Intel’s OpenCL claims to do “auto-vectorization”. However, in my case, it appears that it was not effective, and I had to do it manually. Is there a flag that I have to set in order for Intel OCL to group vector operations?

  2. How about AMD’s OpenCL and NVIDIA’s OpenCL? Do they vectorize such operations automatically? If not, is there a benefit to doing manual vectorization on the GPU?

  3. My code runs 3x faster on Intel’s OCL than on AMD’s OCL on the same CPU (Intel i7-4770K: Intel OCL produced 612 photon/ms, while AMD’s OCL produced only 205 photon/ms). Is there a reason for such a difference? My speed comparisons are shown in the spreadsheet below.

https://docs.google.com/spreadsheets/d/1QRILShH95S53SqlLabDxXB98JZ1ZzZ9TcZngjP6W3B8/edit#gid=0

  4. I did some additional vectorization (mostly in the logistic_step function), which gave another 15% speed-up with Intel OCL, but the code then failed to run on NVIDIA’s OCL.

https://github.com/fangq/mcxcl/commit/fc2d97830a380c73a427eab84a64aca3f3531dcc

Can someone tell me what I did wrong with this change? (I cast a float[5] pointer to a float8 pointer in logistic_step(), but I did not read/write beyond the 5-float boundary.)
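One possible cause, offered as an assumption rather than a diagnosis of the commit: OpenCL defines float8 as 32 bytes with 32-byte alignment, while a float[5] array is 20 bytes and is only guaranteed 4-byte alignment. Dereferencing a float8 pointer aimed at such an array may load or store all 8 lanes even if the source only names 5 of them, and the address may be misaligned. Some compilers tolerate this and others may not. The plain C model below shows the size mismatch and a padded alternative; `vec8`, `RAND_WORDS`, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define RAND_WORDS 5   /* hypothetical: number of state floats actually used */

/* Hypothetical model of OpenCL's float8: 8 floats, aligned to 32 bytes. */
typedef struct { alignas(32) float s[8]; } vec8;

static_assert(sizeof(vec8) == 32, "model assumes 4-byte floats");

/* Bytes a whole-vector access touches past the end of a float[5] buffer. */
static size_t overrun_bytes(void)
{
    return sizeof(vec8) - RAND_WORDS * sizeof(float);
}

/* Safer layout: keep the state in a full, aligned 8-lane vector and
   leave lanes 5..7 unused, so no pointer cast is needed. */
static vec8 scale_state(vec8 t, float k)
{
    for (int i = 0; i < 8; ++i)
        t.s[i] *= k;      /* in OpenCL C, with float8 t: t *= k; */
    return t;
}
```

Declaring the state as a float8 from the start avoids the cast entirely. If the state has to stay a float array, the OpenCL built-ins vload4/vstore4 on a float pointer only require element alignment, so they are another option to consider.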

Try experimenting with compiler options (see Appendix B of the user guide): https://software.intel.com/sites/landingpage/opencl/user-guide/index.htm?wapkw=(Apple)

  2. How about AMD’s OpenCL and NVIDIA’s OpenCL? Do they vectorize such operations automatically? If not, is there a benefit to doing manual vectorization on the GPU?

Their modern GPUs are scalar, so there should be no impact at all. For example, CUDA C has no vector arithmetic (its float4-style types are plain structs without operators). AMD’s docs say “vectorization is not required nor desirable”, which seems at odds with the fact that their hardware has vector registers.

  3. My code runs 3x faster on Intel’s OCL than on AMD’s OCL on the same CPU (Intel i7-4770K: Intel OCL produced 612 photon/ms, while AMD’s OCL produced only 205 photon/ms). Is there a reason for such a difference? My speed comparisons are shown in the spreadsheet below.

Intel’s compiler is just better. There are cases where Intel OCL is faster even on an AMD CPU (when it works at all). :)