Slow OpenCL compilation with multiple kernels

I have many kernels in one OpenCL file, and compiling it with the NVIDIA OpenCL implementation takes a full 40 seconds. With just 1 kernel in the file, compiling takes 0.42 seconds.

I have isolated the slow compile to a single kernel, which takes roughly 37 seconds on its own. Is there a way to speed up the compilation of the OpenCL code itself?

Any suggestions will be appreciated.


It is recommended to build your kernels just once when the application is installed on the user’s computer and after that rely on prebuilt program binaries.

See clGetProgramInfo(…, CL_PROGRAM_BINARIES, …) to obtain the program binaries during installation and clCreateProgramWithBinary() to load those binaries every time the application is executed.

You still have to call clBuildProgram() on a program loaded with clCreateProgramWithBinary(), but that tends to be a lot faster than building a program created with clCreateProgramWithSource().
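The disk side of such a cache might look roughly like the sketch below. It is plain C with no OpenCL calls, so it compiles without an SDK; the comments mark where the OpenCL calls would go. The names cache_path, cache_save and cache_load and the "clcache-" file prefix are made up for illustration. Keying the file name on device name, driver version, build options and source is an assumption on my part: binaries are generally only valid for the device and driver that produced them, so a driver update should lead to a cache miss rather than a failed load.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a (64-bit): a simple non-cryptographic hash, used here only to
 * derive a cache file name. */
static uint64_t fnv1a(uint64_t h, const char *s)
{
    for (; *s; ++s) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Hypothetical helper: derive a file name from everything that should
 * invalidate a cached binary. device and driver would come from
 * clGetDeviceInfo() with CL_DEVICE_NAME and CL_DRIVER_VERSION; options is
 * the string passed to clBuildProgram(). */
static void cache_path(char *out, size_t n, const char *device,
                       const char *driver, const char *options,
                       const char *source)
{
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    h = fnv1a(h, device);
    h = fnv1a(h, "\n");
    h = fnv1a(h, driver);
    h = fnv1a(h, "\n");
    h = fnv1a(h, options);
    h = fnv1a(h, "\n");
    h = fnv1a(h, source);
    snprintf(out, n, "clcache-%016llx.bin", (unsigned long long)h);
}

/* Store the blob returned by clGetProgramInfo(..., CL_PROGRAM_BINARIES, ...).
 * Its size comes from CL_PROGRAM_BINARY_SIZES. Returns 0 on success. */
static int cache_save(const char *path, const unsigned char *data, size_t size)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, size, f);
    if (fclose(f) != 0 || written != size) {
        remove(path); /* do not leave a truncated binary behind */
        return -1;
    }
    return 0;
}

/* Load a cached blob to hand to clCreateProgramWithBinary(). Returns a
 * malloc'd buffer (caller frees) or NULL on a cache miss, in which case the
 * caller falls back to clCreateProgramWithSource(). */
static unsigned char *cache_load(const char *path, size_t *size)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;
    unsigned char *buf = NULL;
    long len = 0;
    if (fseek(f, 0, SEEK_END) == 0 && (len = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0) {
        buf = malloc((size_t)len);
        if (buf && fread(buf, 1, (size_t)len, f) != (size_t)len) {
            free(buf);
            buf = NULL;
        }
        if (buf)
            *size = (size_t)len;
    }
    fclose(f);
    return buf;
}
```

At startup you would call cache_load() first. On a hit, pass the buffer to clCreateProgramWithBinary() and then call clBuildProgram(). On a miss, build from source once, fetch the binary with clGetProgramInfo() and store it with cache_save().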

I remember some particularly slow (and unexpected) compiles which seemed to be bugs or issues in the compiler.

From memory, it had to do with unrolled loops and register allocation. Fiddling with the source eventually got it building faster, but I can't remember the specifics.