Are there any OpenCL tools for dealing with multiple different GPUs?

tdchen · November 2, 2018, 8:03am

Otherwise I have to judge the capabilities of every GPU and send a different amount of work to each one for a huge computation.

It seems an awful task.

Are there any OpenCL tools that can help me use all OpenCL devices to complete a huge computation?

Or any suggestions?

Thanks in advance.



Dithermaster · November 2, 2018, 4:17pm

Divide your work up into smaller bites and feed them to the GPUs at the rate they can eat them. In more detail: use OpenCL Events with each clEnqueueNDRangeKernel. Enqueue 3 jobs to each GPU. As jobs finish (detect this using Events), queue up another to that GPU. The faster GPUs will go through more jobs; the slower ones will go through fewer.
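
To make the scheduling idea above concrete, here is a minimal simulation sketch. It does not call OpenCL at all: the heap of finish times stands in for OpenCL events, and the function name, device names, and per-job timings are all made-up assumptions for illustration.

```python
import heapq

def simulate_dynamic_schedule(job_count, seconds_per_job, in_flight=3):
    """Simulate feeding equal-sized jobs to devices, keeping `in_flight` queued on each.

    seconds_per_job maps a (hypothetical) device name to its time per job.
    Returns (jobs completed per device, total elapsed time).
    """
    completed = {name: 0 for name in seconds_per_job}
    busy_until = {name: 0.0 for name in seconds_per_job}  # when each queue drains
    events = []  # min-heap of (finish_time, device), standing in for OpenCL events
    remaining = job_count

    def enqueue(name):
        # A device runs its queued jobs back to back, so the new job
        # finishes one job-duration after the previously queued one.
        nonlocal remaining
        busy_until[name] += seconds_per_job[name]
        heapq.heappush(events, (busy_until[name], name))
        remaining -= 1

    # Prime every device with up to `in_flight` jobs.
    for _ in range(in_flight):
        for name in seconds_per_job:
            if remaining > 0:
                enqueue(name)

    elapsed = 0.0
    while events:
        elapsed, name = heapq.heappop(events)  # next job to complete
        completed[name] += 1
        if remaining > 0:
            enqueue(name)  # refill the queue of the device that just finished

    return completed, elapsed

done, elapsed = simulate_dynamic_schedule(40, {"fast": 1.0, "slow": 3.0})
print(done, elapsed)
```

In this toy run the faster device ends up taking most of the 40 jobs without the host ever having to estimate how capable each device is.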

tdchen · November 2, 2018, 7:58pm

Thank you Dithermaster.
If I divide my work into small pieces, will it take longer than the original?
I can’t always divide my data block into pieces, though I can always split the computation.
BTW, I think what you told me is to send a new job to a GPU just after one of the three jobs has finished. Is that right?
Yours, chen.

Dithermaster · November 3, 2018, 3:02pm

As long as the jobs aren’t too small it won’t run slower. Jobs are subdivided by the runtime into what the hardware can do, so once it is doing that, larger jobs run at the same speed (ignoring the inefficiency due to the last bits of the job not filling the device). The reason I had you queue up three to each device is so the work queue never goes empty. An empty work queue means an idle device, which means you’re not going as fast as you could. So, yes, after the first job finishes, queue up another. Then when the second job finishes, queue up another. Keep the pipeline moving.
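
The effect of queue depth can be sketched with a small timing model. This is an illustration under stated assumptions, not a measurement: it models one device, a fixed time per job, and a fixed delay (`host_latency`, a hypothetical parameter) between a job finishing and the host enqueuing its replacement.

```python
def total_time(job_count, job_seconds, host_latency, depth):
    """Time for one device to finish `job_count` jobs when `depth` jobs are kept queued.

    Job i (for i >= depth) is enqueued `host_latency` after job i - depth finishes;
    a job starts once it is enqueued and the previous job has finished.
    """
    finish = []
    for i in range(job_count):
        enqueued_at = 0.0 if i < depth else finish[i - depth] + host_latency
        previous_finish = finish[i - 1] if i > 0 else 0.0
        start = max(enqueued_at, previous_finish)  # idle gap if enqueued late
        finish.append(start + job_seconds)
    return finish[-1]

# With one job in flight the device idles during every host round trip;
# with three in flight the next job is already waiting in the queue.
print(total_time(10, 1.0, 0.5, depth=1))  # 14.5
print(total_time(10, 1.0, 0.5, depth=3))  # 10.0
```

Under these assumed numbers, a depth of 1 adds the host delay to nearly every job, while a depth of 3 hides it completely.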

tdchen · November 4, 2018, 6:28am

Thank you so much Dithermaster.
I have tested what you said, and it is all true!
During my testing, I thought of some questions:

1) When is the data object actually transferred between main memory and the OpenCL devices?
clCreateBuffer? clSetKernelArg? or clEnqueueNDRangeKernel?
2) How to deal with the GPU used by the OS? If it is overloaded this guy likes to stop you!
Is there any means to find which GPU is used by the OS? I don’t want to touch this camel.
3) If I have a 2-level loop, and the first level is big enough to parallelize, I want to know which one is better:
a) make a 1-dimensional clEnqueueNDRangeKernel, and in the kernel handle the other level with a for statement.
b) just make a 2-dimensional clEnqueueNDRangeKernel.
I was told that for, if, and while statements will badly slow the kernel, but in my case the for loop lets me avoid the synchronization problem for the sum.
I am sorry for so many questions.
Thanks again and again.

Dithermaster · November 4, 2018, 2:11pm

> 1) When is the data object actually transferred between main memory and the OpenCL devices?

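During clEnqueueRead*/clEnqueueWrite* operations (for example clEnqueueReadBuffer and clEnqueueWriteBuffer) or clEnqueueMap*/clEnqueueUnmapMemObject operations.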

> 2) How to deal with the GPU used by the OS? If it is overloaded this guy likes to stop you!
> Is there any means to find which GPU is used by the OS? I don’t want to touch this camel.

Use OpenGL “get device” commands to find the GPU being used by the OS. It’s not perfect, but often works.

> 3) If I have a 2-level loop, and the first level is big enough to parallelize, I want to know which one is better:
> a) make a 1-dimensional clEnqueueNDRangeKernel, and in the kernel handle the other level with a for statement.
> b) just make a 2-dimensional clEnqueueNDRangeKernel.
> I was told that for, if, and while statements will badly slow the kernel, but in my case the for loop lets me avoid the synchronization problem for the sum.

Implement and try both and measure results.

tdchen · November 6, 2018, 6:07am

Thank you Dithermaster for the detailed suggestions.
I will try the third one.
I would like you to confirm two things:
1) clSetKernelArg also transfers the data object to the device, right?
2) Does OpenGL have a function like “get device”? I use OpenGL (version 1.1) very often, but I don’t know this function.
Thank you again.

Dithermaster · November 6, 2018, 2:27pm

> 1) clSetKernelArg also transfers the data object to the device, right?

No, it doesn’t cause data transfers. It’s just used to pass the handle to the cl_mem object.

> 2) Does OpenGL have a function like “get device”? I use OpenGL (version 1.1) very often, but I don’t know this function.

Look at glGetString(GL_VENDOR). Check sub-strings for AMD, ATI, NVIDIA, Intel. Compare to similar strings in your OpenCL platform. We use it before trying CL/GL interop; you could use it to avoid the primary GPU.
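
The string comparison described above might look like the following sketch. It only shows the matching step and assumes the vendor strings have already been obtained from glGetString(GL_VENDOR) and from the OpenCL platform info. The helper names are hypothetical, the sample strings are illustrative, and treating "Advanced Micro Devices" as an alias for AMD is an added assumption. Note that a naive substring test for "ATI" would also match "Corporation", so whole-word matching is used here.

```python
import re

# Whole-word patterns, so that "ATI" does not match inside "CORPORATION".
_VENDOR_PATTERNS = [
    ("AMD", re.compile(r"\b(AMD|ATI|ADVANCED MICRO DEVICES)\b")),
    ("NVIDIA", re.compile(r"\bNVIDIA\b")),
    ("INTEL", re.compile(r"\bINTEL\b")),
]

def vendor_key(vendor_string):
    """Reduce a vendor string to 'AMD', 'NVIDIA', 'INTEL', or None if unrecognized."""
    upper = vendor_string.upper()
    for key, pattern in _VENDOR_PATTERNS:
        if pattern.search(upper):
            return key
    return None

def platforms_to_avoid(gl_vendor, cl_platform_vendors):
    """Return the OpenCL platform vendor strings that match the display GPU's vendor."""
    display = vendor_key(gl_vendor)
    if display is None:
        return []  # unknown display vendor: nothing can be ruled out
    return [v for v in cl_platform_vendors if vendor_key(v) == display]

# Illustrative strings only; real drivers may report different text.
print(platforms_to_avoid(
    "NVIDIA Corporation",
    ["Intel(R) Corporation", "NVIDIA Corporation", "Advanced Micro Devices, Inc."],
))
```

As noted in the post, this is not perfect: it cannot tell apart two GPUs from the same vendor.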

tdchen · November 6, 2018, 7:22pm

OK, Dithermaster!
I think I am ready for my work after your guidance.
You have made my thinking much clearer and more practical.
Thank you so much for your professional help.
Yours Chen.

fangqq · December 3, 2018, 10:41am

That’s pretty much what we did in this paper (see the 3 paragraphs below Fig. 3).

We had to do manual benchmarks to get the throughput of each device, and used them to predict the workload. If the job is very long, you can do the throughput tests dynamically by running a small load on each device, and then use the results to partition the remaining large load. For me, the overhead of dynamic characterization is too much, so we did it statically.
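
A static split of the kind described above could be sketched as follows. This is not the code from the paper: the function name, device names, and throughput figures are hypothetical, and the rule for handing out the rounding remainder (one extra item each to the fastest devices) is an assumption.

```python
from fractions import Fraction

def partition_by_throughput(total_items, throughput):
    """Split `total_items` across devices in proportion to measured throughput.

    throughput maps a device name to its benchmarked rate (any consistent unit).
    Fractions keep the arithmetic exact so the shares always sum to total_items.
    """
    rates = {name: Fraction(rate) for name, rate in throughput.items()}
    total_rate = sum(rates.values())
    # Floor of each device's proportional share.
    shares = {name: (total_items * rate) // total_rate for name, rate in rates.items()}
    leftover = total_items - sum(shares.values())
    # Hand the few remaining items to the fastest devices, one each.
    fastest_first = sorted(rates, key=rates.get, reverse=True)
    for name in fastest_first[:leftover]:
        shares[name] += 1
    return shares

# Hypothetical benchmark results: gpu0 is three times as fast as gpu1.
print(partition_by_throughput(1000, {"gpu0": 300.0, "gpu1": 100.0}))
```

The same function could be applied to the remaining load after a short dynamic test run on each device, if the job is long enough to justify that overhead.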

tdchen · December 5, 2018, 4:47am

Hi, Dithermaster.
I think I have finished what you described.
I made one context for every device.
First, I put 3 subtasks into every device queue; then, after one subtask finished (signaled by an event), I put another subtask into that queue.
At first it seemed OK, but I found it has problems.
It is slow, and sometimes it can’t finish the whole task.
After careful study, I think I have found the reason.
The subtask (a clEnqueueNDRangeKernel) is buffered in the queue! So the program can’t be driven by events to the end.
Do you think so?

tdchen · December 5, 2018, 5:15am

Hi, fangqq.
I am studying your paper.
Thank you so much.

Dithermaster · December 9, 2018, 9:12pm

tdchen, I don’t know. It sounds like a timeline profiling tool would be helpful.

tdchen · December 9, 2018, 9:27pm

Dear Dithermaster,
Yes, you couldn’t have known, because I found that I had made a mistake.