I have an algorithm that can be implemented with 34 work items executing the same kernel (clEnqueueNDRangeKernel), i.e. SIMD (the data-parallel method) in OpenCL. In this case only 34 work items are used, so the GPU is badly underutilized.
To measure the maximum throughput of the GPU, I want to push as many instances of this algorithm as possible onto the GPU so that all compute elements are in use, i.e. I want task parallelism at the same time. Can anyone tell me how to do that? My understanding is that a command queue in OpenCL is like a single-server queue: two clEnqueueNDRangeKernel commands can't execute at the same time on the GPU, even when there are resources available. How can I make the device execute multiple algorithm instances, each with its own data parallelism inside the algorithm?
The next-generation Fermi cards will be able to execute multiple kernels at the same time. Current cards, however, can only execute one kernel at a time, and there is no way around this. Why not put each instance's 34 work items into its own work group, and then launch many work groups with a single kernel invocation?
Remember, you need thousands of work items running in order to fully utilize a GPU.
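A rough sketch of that suggestion, in case it helps. The kernel name `batched`, the buffer layout (instances stored back to back) and the "double each element" body are placeholders I made up, not the actual algorithm. The idea is only that the work-group id selects the algorithm instance and the local id selects one of its 34 work items. The plain C loop emulates the same index mapping on the CPU so it can be checked without a device.

```c
#include <assert.h>
#include <stddef.h>

#define ITEMS_PER_INSTANCE 34  /* work items per algorithm instance (from the question) */

/* Illustrative OpenCL C kernel source: one work group per algorithm instance.
   The body is a placeholder; the real per-instance work would go there. */
const char *kBatchedKernelSrc =
    "__kernel void batched(__global const float *in, __global float *out) {\n"
    "    size_t instance = get_group_id(0);  /* which algorithm instance */\n"
    "    size_t lane     = get_local_id(0);  /* 0..33 within the instance */\n"
    "    size_t idx      = instance * get_local_size(0) + lane;\n"
    "    out[idx] = 2.0f * in[idx];          /* placeholder work */\n"
    "}\n";

/* 1-D NDRange sizes for a batched launch. */
typedef struct {
    size_t global_size;  /* total work items = instances * 34 */
    size_t local_size;   /* work-group size  = 34 */
} LaunchDims;

LaunchDims batched_launch_dims(size_t num_instances) {
    LaunchDims d;
    assert(num_instances > 0);
    d.local_size  = ITEMS_PER_INSTANCE;
    d.global_size = num_instances * ITEMS_PER_INSTANCE;
    return d;
}

/* CPU emulation of the kernel's indexing: every global id is split into
   (instance, lane) exactly as get_group_id / get_local_id would report. */
void emulate_batched(const float *in, float *out, size_t num_instances) {
    LaunchDims d = batched_launch_dims(num_instances);
    for (size_t gid = 0; gid < d.global_size; ++gid) {
        size_t instance = gid / d.local_size;  /* get_group_id(0) */
        size_t lane     = gid % d.local_size;  /* get_local_id(0) */
        size_t idx      = instance * d.local_size + lane;
        out[idx] = 2.0f * in[idx];             /* same placeholder work */
    }
}
```

On the host side, the two sizes would be passed to a single clEnqueueNDRangeKernel call with work_dim = 1, as global_work_size and local_work_size. With 1000 instances that is 34000 work items in one launch. One thing to keep in mind: 34 is not a multiple of the 32-wide warp on NVIDIA hardware, so it may be worth experimenting with packing several instances into one larger work group.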