Some examples I’ve seen simply use the first device returned by clGetContextInfo with CL_CONTEXT_DEVICES. This is obviously fine for single-GPU systems, but what happens on multi-GPU systems? Will all but one GPU sit idle, or does OpenCL spread the load to all devices even if there aren’t any command queues created for them? What is the right way to make sure a program scales well from single to multiple GPUs (devices)?
You’ve got it right, I believe. Only devices for which you’ve created command queues and assigned some work will be used; the runtime doesn’t do any sort of automatic load balancing.
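For what it's worth, here's a rough host-side sketch of the "one queue per device" idea: query the context for its devices, then create a command queue for each. This assumes a `cl_context` has already been created (e.g. with clCreateContextFromType), and error handling is trimmed for brevity, so treat it as an outline rather than production code:

```c
#include <CL/cl.h>
#include <stdlib.h>

/* Sketch: give every device in an existing context its own command
 * queue. Work only reaches a device if you enqueue it on that
 * device's queue -- the runtime won't spread it around for you. */
cl_command_queue *create_queues_for_all_devices(cl_context context,
                                                cl_uint *num_devices_out)
{
    /* First call gets the size of the device list, second fills it. */
    size_t size = 0;
    clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &size);

    cl_uint num_devices = (cl_uint)(size / sizeof(cl_device_id));
    cl_device_id *devices = malloc(size);
    clGetContextInfo(context, CL_CONTEXT_DEVICES, size, devices, NULL);

    cl_command_queue *queues = malloc(num_devices * sizeof(*queues));
    cl_int err;
    for (cl_uint i = 0; i < num_devices; ++i)
        queues[i] = clCreateCommandQueue(context, devices[i], 0, &err);

    free(devices);
    *num_devices_out = num_devices;
    return queues;
}
```

A single-GPU system just falls out as the `num_devices == 1` case, so the same code scales without special-casing.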
As to the “right way” to ensure scalability, I can only offer the truly sucky “It depends on your problem.” Keep in mind that you’ll have to either copy all the data for your problem to each card, or split the data up on the CPU based on the number of cards you’re working with. Combining results is your job as well.
Thanks, that cleared things up a bit.
And how do multi-GPU cards, like some high-end GeForce 2xx models, come into play? Do the GPUs share memory (I suppose they should)? Do they appear as different devices, or as one compute device?
I think these cards (like the GTX 295) just show up as two devices with two separate memories. So from the perspective of the OpenCL runtime, it's no different from just chucking two physically separate cards into the machine.