Higher register usage after migration from CUDA


I’ve just migrated my program from CUDA to OpenCL. It involved a bit of work to change all the host code: device initialization, memory allocation, kernel execution, and so on.

For the device code (kernels), however, the changes were very small:

  • replacing __syncthreads() with barrier(CLK_LOCAL_MEM_FENCE)
  • changing from sqrtf(x) to sqrt((float)x)
  • allocating constant memory dynamically (via a __constant kernel argument and a matching clSetKernelArg() call on the host side) instead of statically as in CUDA.
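For the constant-memory change, the before/after looks roughly like this (the kernel and buffer names are invented for illustration, not the actual code):

```c
/* CUDA: constant memory declared statically at file scope,
   filled from the host with cudaMemcpyToSymbol(). */
__constant__ float coeffs[256];

/* OpenCL: the same data becomes a __constant kernel argument,
   bound from the host with clSetKernelArg(). */
__kernel void filter(__global const float *in,
                     __global float *out,
                     __constant float *coeffs)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * coeffs[i % 256];
}
```

On the host side the buffer is created as an ordinary CL_MEM_READ_ONLY cl_mem object and bound with clSetKernelArg(kernel, 2, sizeof(cl_mem), &coeffs_buf).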

Those were the only changes made.

Unfortunately, the OpenCL version consumes more registers than the native CUDA version.
Even specifying -cl-nv-maxrregcount (in clBuildProgram) and setting the maximum register count to the number achieved by the native CUDA compilation didn’t help much. The cap takes effect (judging by the BUILD_LOG from clGetProgramBuildInfo), but registers spill to private memory (“local” memory in CUDA nomenclature) and overall kernel performance is lower.
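For anyone reproducing this, a sketch of the build call and log dump I’m describing (it assumes `program` and `device` already exist; the register cap value is just an example):

```c
/* Cap registers via the cl_nv_compiler_options extension and dump the
   build log, which is where the compiler reports register counts and
   spills to private ("local" in CUDA terms) memory. */
const char *opts = "-cl-nv-maxrregcount=20 -cl-nv-verbose";
cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);

size_t log_size = 0;
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      0, NULL, &log_size);
char *log = malloc(log_size);
clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                      log_size, log, NULL);
fprintf(stderr, "%s\n", log);
free(log);
```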

I’ve been experimenting with the -cl-nv-opt-level build option, but it achieved nothing.
With both CUDA and OpenCL I’m using the fast-math option (-use_fast_math and -cl-fast-relaxed-math, respectively); I also checked without it, with no change. Every time, OpenCL register usage is higher, or there are spills. The difference can be as high as 7 registers, and it ruins performance: fewer blocks/work-groups can be processed simultaneously on a multiprocessor.

What is the cause of this behaviour? Is it simply a weaker OpenCL compiler?

EDIT: I forgot to mention: CUDA compiles my kernels for arch 11, while OpenCL compiles only for arch 10. Since I don’t know how to force OpenCL to compile for a particular architecture (is it even possible?), I recompiled the CUDA version for arch 10: no change, register usage is still higher.