I’m writing an ant simulation. The kernel performance is very bad: compared to a standard C++ solution, the OpenCL version has a large performance disadvantage. I don’t understand why, since the operations in the kernel mostly avoid control structures (like if/else). I ran a benchmark, and the OpenCL kernel performance is clearly worse.
(Benchmark plot — left axis: execution time in ms, bottom axis: number of simulated ants)
Can you give me advice?
You can find the whole code in the Git repo if you are interested (the OpenCL setup happens here: https://github.com/Furtano/BA-Code-fuer-Mac/blob/master/BA/clInitFunctions.cpp).
Your kernels could be optimized, but the most important parameter when using a GPU is the local work size. NVIDIA GPUs, for instance, tend to perform best with a local work size of 128, so you should try again with an explicit local work size (and, of course, make the global work size a multiple of the local work size).
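A minimal sketch of the padding this implies (the helper name `round_up` is my own; the usage comment assumes a valid queue and kernel, which are placeholders):

```cpp
#include <cstddef>

// Round the global work size up to the next multiple of the local work size.
// clEnqueueNDRangeKernel requires the global size to be divisible by the
// local size when an explicit local size is passed.
static std::size_t round_up(std::size_t global, std::size_t local) {
    return ((global + local - 1) / local) * local;
}

// Usage sketch (queue/kernel/numAnts are placeholders):
//   size_t local  = 128;                       // try 64/128/256 and measure
//   size_t global = round_up(numAnts, local);  // pad up to a multiple of 128
//   clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, &local,
//                          0, nullptr, nullptr);
// Inside the kernel, guard the padded range:
//   if (get_global_id(0) >= numAnts) return;
```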
Not every use case is suitable for the GPU. Your kernel has lots of divergent branches, which are generally bad for GPU performance: threads in the same warp that take different paths through an if/else are serialized.
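One common way to reduce divergence is to replace an if/else that merely chooses between two values with arithmetic selection. A minimal sketch in plain C++ (in OpenCL C, the built-in `select()` or `clamp()` does the same on the device):

```cpp
// Branching version: on a GPU, threads of one warp taking different
// paths here execute serially.
static int clamp_branchy(int x, int lo, int hi) {
    if (x < lo) return lo;
    if (x > hi) return hi;
    return x;
}

// Branchless version: same result, but every thread executes the same
// instruction sequence; the ternaries typically compile to conditional
// moves / selects rather than jumps.
static int clamp_branchless(int x, int lo, int hi) {
    int t = x < lo ? lo : x;
    return t > hi ? hi : t;
}
```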
One thing I notice is that you are reading back several buffers and then writing them again. All this data transfer in/out of the cl_mem buffer objects is going to carry a substantial performance penalty. You want to minimize memory traffic wherever possible, and if you don’t need something on the host between kernel calls, don’t copy it back.
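As a sketch of that pattern (buffer, kernel, and type names are placeholders, and error checking is omitted): keep the `cl_mem` object on the device across kernel calls and read back only the final result.

```cpp
// Assumed setup: a valid context and queue, two compiled kernels
// (moveAnts, updatePheromones) operating on the same buffer, and a
// host-side Ant struct. All names here are hypothetical.
cl_mem ants = clCreateBuffer(context, CL_MEM_READ_WRITE,
                             numAnts * sizeof(Ant), nullptr, &err);
clEnqueueWriteBuffer(queue, ants, CL_TRUE, 0,
                     numAnts * sizeof(Ant), hostAnts, 0, nullptr, nullptr);

for (int step = 0; step < numSteps; ++step) {
    // The buffer stays resident on the device between kernels:
    // no clEnqueueReadBuffer / clEnqueueWriteBuffer inside the loop.
    clEnqueueNDRangeKernel(queue, moveAnts, 1, nullptr,
                           &global, &local, 0, nullptr, nullptr);
    clEnqueueNDRangeKernel(queue, updatePheromones, 1, nullptr,
                           &global, &local, 0, nullptr, nullptr);
}

// One transfer at the end instead of one (or more) per step.
clEnqueueReadBuffer(queue, ants, CL_TRUE, 0,
                    numAnts * sizeof(Ant), hostAnts, 0, nullptr, nullptr);
```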