Use CPU and GPU wisely

Hey, I’m a newbie to OpenCL, just started learning. I wanted to know whether it is possible to execute a few threads on the GPU and the remaining threads on the CPU? In other words, if I launch 100 threads and assume I’ve got an 8-core CPU, is it possible that 8 of the 100 threads will execute on the CPU and the remaining 92 threads will run on the GPU? Can OpenCL help me do this job smoothly?

You have to open both the GPU and CPU as separate devices and do this sort of stuff yourself (OpenCL provides all the necessary synchronisation primitives). Or simply use OpenCL for the GPU and C (or whatever) for the CPU code.

OpenCL ‘threads’ are not threads on a CPU either; each work-group is executed on a single CPU thread using loops. I.e. it’s more meaningful to talk about work-items than threads.
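A toy sketch of that execution model (plain Python with made-up sizes and a stand-in kernel, not a real OpenCL runtime): each work-group maps to one OS thread, and the work-items inside it are just a serial loop.

```python
from concurrent.futures import ThreadPoolExecutor

GLOBAL_SIZE, LOCAL_SIZE = 16, 4   # 16 work-items in work-groups of 4

def kernel(gid, out):
    out[gid] = gid * gid          # stand-in for the kernel body

def run_work_group(group_id, out):
    # the whole work-group runs as a plain loop on a single thread
    for lid in range(LOCAL_SIZE):
        kernel(group_id * LOCAL_SIZE + lid, out)

out = [0] * GLOBAL_SIZE
with ThreadPoolExecutor() as pool:
    for g in range(GLOBAL_SIZE // LOCAL_SIZE):
        pool.submit(run_work_group, g, out)
# the with-block waits for all work-groups to finish
print(out)
```

So with a work-group size of 4 on a CPU device, 16 “threads” really means 4 threads each looping 4 times.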

To rephrase my question: let’s say I have a 1D array of 100 elements and I want to process the last 4 elements on the CPU (with four separate threads) and the remaining 96 elements on the GPU. Can I do this with OpenCL smoothly? I want to use only OpenCL, no other language, for the host-side code…

You can do this, and it really is not that hard. One thing you should pay attention to is how long your kernels take. Using the CPU in parallel with the GPU can result in slower code than if you hadn’t used it at all.

Let’s say your GPU finishes the 96 elements in 10 seconds and your CPU finishes the 4 elements in 12 seconds. You can guess yourself that if you had given all the work to the GPU, you could have finished the calculation in roughly 10.4 seconds, faster than using the CPU as well. (This is not theoretical; it happens many times. Even widespread applications like LuxMark (even v2.0) have this design flaw.)
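The arithmetic behind that example, using the throughputs implied by the numbers above:

```python
# Throughputs implied by the example: 96 elements in 10 s on the GPU,
# 4 elements in 12 s on the CPU (hypothetical numbers from the post).
gpu_rate = 96 / 10.0   # 9.6 elements per second
cpu_rate = 4 / 12.0    # ~0.33 elements per second

# Wall-clock time of the split is the slower of the two partitions.
split_time = max(96 / gpu_rate, 4 / cpu_rate)

# Versus giving all 100 elements to the GPU at the same throughput.
all_gpu_time = 100 / gpu_rate

print(split_time, round(all_gpu_time, 2))  # 12.0 vs 10.42
```

The CPU partition is the straggler: the split finishes in 12 s even though the GPU was idle after 10 s.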

There is practically one way around this problem: profile your application before the actual run to see how many elements the GPU processes in a given time and how many the CPU can take, and always give the CPU a few percent less work than the profiling suggests it can handle.
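A minimal sketch of that profiling-driven split, assuming you have already measured per-device throughput (the function name and the margin value are illustrative, not from any real API):

```python
def split_work(total, gpu_rate, cpu_rate, cpu_margin=0.9):
    """Partition `total` work-items between GPU and CPU in proportion
    to measured throughput, handing the CPU slightly less than its
    'fair' share (cpu_margin < 1) so it never becomes the straggler."""
    cpu_share = cpu_rate / (gpu_rate + cpu_rate) * cpu_margin
    n_cpu = int(total * cpu_share)
    return total - n_cpu, n_cpu

# Using the throughputs from the 96/4 example above:
n_gpu, n_cpu = split_work(100, gpu_rate=9.6, cpu_rate=0.33)
print(n_gpu, n_cpu)  # 98 2
```

Note how lopsided the fair split actually is: at these throughputs the CPU earns only 2 of the 100 elements, not the 4 the original question proposed.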

This first approach is for a CPU-and-GPU multi-device setup only. If you’re creating a multi-device application, there are more sophisticated methods that also allow multi-GPU setups, or heterogeneous device usage with workload balancing.

Yeah, sure. Perhaps you need to read the specification or introductory documentation on multi-device support, which doesn’t preclude this. It’s no different from using two GPUs from the API perspective. You just have to do the scheduling and job management manually (using OpenCL APIs). And for best performance you may need to write two different kernels, each tuned for one type of device, as their architectures are so different.
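The manual job management amounts to bookkeeping like the following (a plain-Python sketch; `partition` is a hypothetical helper that mirrors the global_work_offset / global_work_size pair you would pass to clEnqueueNDRangeKernel per device, not a real binding):

```python
def partition(global_size, fractions):
    """Split [0, global_size) into contiguous sub-ranges, one per
    device, sized by the given fractions. Each (offset, size) pair
    corresponds to one clEnqueueNDRangeKernel call with that
    global_work_offset and global_work_size."""
    ranges, offset = [], 0
    for frac in fractions[:-1]:
        size = int(global_size * frac)
        ranges.append((offset, size))
        offset += size
    ranges.append((offset, global_size - offset))  # remainder to last device
    return ranges

# 96% of the range to the GPU, the rest to the CPU device:
ranges = partition(100, [0.96, 0.04])
print(ranges)  # [(0, 96), (96, 4)]
```

Each device then gets its own command queue and its own enqueue over its sub-range; the sub-ranges are contiguous and cover the whole problem, so no work-item is run twice or skipped.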

However, I wouldn’t see the point beyond intellectual curiosity: for many algorithms a typical discrete GPU is 10-100x (or more) faster than a CPU, so it isn’t likely to be worth all the extra overhead and system design, coding, and debugging involved. Splitting memory and moving it between different sub-tasks won’t be cheap on a discrete GPU either. And usually the CPU has other things it can be doing in the system, such as I/O or setup, or even just letting the user move the mouse.

Sometimes you cannot parallelize an algorithm; in that case the CPU works faster than the GPU, so I wanted to give such tasks to the CPU and let the GPU do the work it handles better.
I read a few research papers about the performance difference between CUDA and OpenCL and found that OpenCL does NOT perform better than CUDA. So apart from portability, is it worthwhile to invest time in OpenCL for a performance benefit?

People on this forum will most likely tell you that it is worthwhile; that’s why we learned the API ourselves. :slight_smile:

From your perspective, it pretty much depends. OpenCL can in some cases be faster than CUDA, but generally it underperforms CUDA by 5-10%. (This discrepancy is purely artificial, as kernels that do not use hardcore CUDA features should perform the same.) If you are not willing to sacrifice this performance for portability, then don’t. If you read articles about portability vs. performance, in the long run the former has always proved more important.

OpenCL (we think) is worthwhile because, with one carefully written GPU kernel, you can get roughly 80% of the peak of any device you’re using. If someone does not have a GPU at hand, your GPU-optimized kernels will perform like well-written, vectorized CPU code, which you would otherwise have to learn another API (such as OpenMP) to obtain. OpenCL is one very powerful API which, if you learn it properly, lets you target all sorts of parallel HW. CUDA is a very powerful API for targeting one vendor’s devices, with features that are exclusive to that vendor.

The choice is yours.

I didn’t realise it was a CUDA vs. OpenCL issue. Which to use is up to you. Among other things, vendor portability was a killer feature for me. There’s always added risk inherent in being tied to a single vendor (just look at what happened in the consumer space with the GTX 680, for instance).

Anyway, if the algorithm runs faster on a CPU, just run it there; that seems like a no-brainer to me. OpenCL at least provides a path that enhances the optimisability and execution of CPU code using the same API as the GPU, and that is quite a different case from your original question. The direction OpenCL is headed is toward heterogeneous devices running with unified memory, and the entire reason for that is to make it easier and more efficient to run code on the hardware where it executes most efficiently, i.e. to run it on a CPU when that makes sense. This is not what your initial question suggested either: that suggested your GPU code was 10x faster than the CPU code, and in that case the minuscule speed-up you might get in total computation time will very likely be more than eaten up by the extra overhead of trying to run it that way.

Trying to split workloads and manage the memory for uneven splits like the one in your original question will be difficult, buggy, and will likely not perform well.

Forget about processing “some elements” on the CPU. If the task is any more complex than vector addition, your CPU will be busy enough anyway, and the GPU work-items may need values produced by other work-items. The more interesting question is how to split the work across 2 GPUs when, as said before, some values cross the boundary. There are extremely few examples of that. I think even the developers haven’t thought enough about many-device features.

I split my job between GPU and CPU all day long.

For me, such a split results in about 75% of the execution time of doing it all on the GPU. Well worth my while.

I tried a three-way split, adding a second (weaker) GPU, but that really wasn’t worth it…

For a while I was tuning it to something like 65% GPU / 35% CPU depending on which machine I was running on, but that didn’t gain me much, so now I just use a 50/50 split across machines. [Any incremental change I make does not manifest in the finished product unless I cross a 16 ms boundary, since I’m generating live frames, so I stopped worrying about the tiny stuff!]

My advice is to spend a day or two implementing it and see what it does for you…