I am writing a school project (an implementation of Maximum Intensity Projection on the GPU with OpenCL) and I am now facing an interesting problem.
With CL_DEVICE_TYPE_CPU the kernel takes roughly 2-3 seconds and returns correct results, which is good compared to the plain C++ program (~30 s). But when the same kernel is executed on the GPU, it takes about 60 seconds (still with correct results) and the computer seems to freeze.
I have done some research and thought it might be incorrect memory usage, but in that case the same problem should appear on the CPU, and it doesn't.
Has anybody encountered a similar problem? How can it be solved?
Thanks a lot for any replies.
CPU : i5 430m
GPU : AMD m5650
It’s not possible to answer your question without understanding the algorithm you are using and seeing the code.
There are several reasons why an OpenCL application may be slower running on a GPU than on a CPU. For example:
The code is memory bound, not ALU bound. What is the ratio of bytes transferred from/to global memory to the number of ALU instructions? In some cases you may benefit from using local memory.
The code has very complex control flow. GPUs are basically SIMD processors, and if different work-items take different code paths, execution will be slower.
Some work-items take much longer to finish than others. You may need to transform the algorithm or break the work into smaller pieces so that this doesn’t happen.
The code or some part of it does not have significant parallelism.
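For MIP specifically, a straightforward kernel in which each work-item scans the depth axis for one output pixel illustrates the first two points: the inner loop is almost pure global-memory traffic with very little arithmetic, and using fmax instead of an if keeps the control flow uniform. This is only a hypothetical sketch (the buffer names and x-major volume layout are my assumptions, not the asker's code):

```
// Hypothetical MIP kernel: one work-item per output pixel,
// taking the maximum intensity along the depth (z) axis.
__kernel void mip(__global const float *volume,
                  __global float *image,
                  const int width, const int height, const int depth)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    if (x >= width || y >= height) return;

    float best = -FLT_MAX;   // or 0.0f if intensities are non-negative
    for (int z = 0; z < depth; ++z)
        best = fmax(best, volume[(z * height + y) * width + x]);

    image[y * width + x] = best;
}
```

Note that with this x-major layout, consecutive work-items read consecutive addresses, so the global-memory reads are coalesced; if the volume were stored depth-major instead, every read would be strided and the same kernel could run many times slower on a GPU.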
Using a profiling tool to find out where time is being spent in your application and kernels is strongly recommended.
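As a first step before reaching for a full profiler, OpenCL's built-in event profiling gives per-kernel timings. The snippet below is a sketch of host code, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and kernel/queue variables already set up:

```
cl_event evt;
cl_ulong start, end;

/* Enqueue the kernel and keep its event for profiling. */
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_size, NULL,
                       0, NULL, &evt);
clWaitForEvents(1, &evt);

/* Device timestamps are reported in nanoseconds. */
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(start), &start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(end), &end, NULL);
printf("kernel time: %.3f ms\n", (end - start) * 1e-6);
clReleaseEvent(evt);
```

Comparing this device-side time against your wall-clock 60 seconds will also tell you whether the time is actually spent in the kernel or in buffer transfers and host overhead.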