Looking at clCreateCommandQueue from the OpenCL specification,
it seems clear that one can create multiple command queues from the same device.
Why does this possibility exist? Is it meant to improve performance in specific cases?
Is there any case where multiple command queues (bound to the same device)
can perform better than a single command queue? For example, one queue per compute unit?
This feature is provided for flexibility. In particular, one can imagine an application that writes to and reads from a device via one queue while enqueuing kernels on another. Whether this performs better depends on both the implementation and the application, but it is certainly possible for multiple command queues to give better performance.
The next-generation NVIDIA chip (Fermi) will be able to run up to 16 kernels at a time, so I imagine running multiple kernels through multiple command queues would be another way to saturate the chip and hide memory latencies.
If your device supports out-of-order queues, you should be able to get performance equivalent to that of multiple queues (depending on the implementation, of course). My only insight into this is that you should be able to enqueue to multiple queues on the same device safely from multiple threads.