Multiple host threads with single command queue and device

weliad · August 22, 2010, 1:46pm

Hello,

Regarding OpenCL 1.1 spec, the glossary defines what thread-safe means for OpenCL:

Thread-safe: An OpenCL API call is considered to be thread-safe if the internal state as managed by OpenCL remains consistent when called simultaneously by multiple host threads. OpenCL API calls that are thread-safe allow an application to call these functions in multiple host threads without having to implement mutual exclusion across these host threads i.e. they are also re-entrant-safe.
And from the appendix:
All OpenCL API calls are thread-safe except clSetKernelArg.
From this I conclude that if an implementation conforms to the OpenCL 1.1 spec, it should be possible to enqueue multiple kernels from different host threads onto the same command-queue, with a single device executing those kernels.

Is this correct? I’m asking because with OpenCL 1.0 this does not seem to be the case, and with the NVIDIA implementation, attempting this causes a crash according to what I read in the message boards.

Can someone confirm?

david.garcia · August 22, 2010, 6:09pm

Is this correct? I’m asking because with OpenCL 1.0 this does not seem to be the case

Yes, it is correct for OpenCL 1.1. In CL 1.0, however, you cannot enqueue commands into the same command queue using multiple host threads. The exact wording appears in Appendix A2:

The OpenCL implementation is thread-safe for API calls that create, retain and release objects such as a context, command-queue, program, kernel and memory objects. OpenCL API calls that queue commands to a command-queue or change the state of OpenCL objects such as command-queue objects, memory objects, program and kernel objects are not thread-safe.
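To illustrate what "not thread-safe" means for the host code: when several threads share one kernel object and one queue, the set-argument and enqueue steps can be serialized as a pair with a host-side mutex. The following is only a minimal sketch in C with pthreads. `sim_kernel`, `sim_queue`, `sim_set_kernel_arg` and `sim_enqueue` are hypothetical stand-ins for a cl_kernel, a cl_command_queue, clSetKernelArg and clEnqueueNDRangeKernel, so that it builds without an OpenCL runtime; it assumes, as in OpenCL, that the argument value is captured when the command is enqueued.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

enum { NUM_THREADS = 4, ITERATIONS = 1000,
       MAX_COMMANDS = NUM_THREADS * ITERATIONS };

/* Hypothetical stand-ins, not OpenCL types. */
struct sim_kernel  { int arg; };
struct sim_command { int owner; int arg_seen; };
struct sim_queue   { struct sim_command commands[MAX_COMMANDS]; int count; };

static struct sim_kernel shared_kernel;
static struct sim_queue  shared_queue;
static pthread_mutex_t   submit_lock = PTHREAD_MUTEX_INITIALIZER;

/* Plays the role of clSetKernelArg. */
static void sim_set_kernel_arg(struct sim_kernel *k, int value) {
    k->arg = value;
}

/* Plays the role of clEnqueueNDRangeKernel: records the argument
   value that the kernel holds at enqueue time. */
static void sim_enqueue(struct sim_queue *q, const struct sim_kernel *k,
                        int owner) {
    q->commands[q->count].owner = owner;
    q->commands[q->count].arg_seen = k->arg;
    q->count++;
}

static void *submit_worker(void *p) {
    int id = *(int *)p;
    for (int i = 0; i < ITERATIONS; i++) {
        /* Set-argument and enqueue form one critical section, so no
           other thread can change the argument in between. */
        pthread_mutex_lock(&submit_lock);
        sim_set_kernel_arg(&shared_kernel, id);
        sim_enqueue(&shared_queue, &shared_kernel, id);
        pthread_mutex_unlock(&submit_lock);
    }
    return NULL;
}

/* Returns the number of commands whose captured argument differs from
   the submitting thread's id, or -1 if commands were lost. */
int run_submit_demo(void) {
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];
    shared_queue.count = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        int rc = pthread_create(&threads[t], NULL, submit_worker, &ids[t]);
        assert(rc == 0);
        (void)rc;
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);
    if (shared_queue.count != MAX_COMMANDS)
        return -1;
    int mismatches = 0;
    for (int i = 0; i < shared_queue.count; i++)
        if (shared_queue.commands[i].owner != shared_queue.commands[i].arg_seen)
            mismatches++;
    return mismatches;
}
```

Calling `run_submit_demo()` from a `main` function should return 0 with the lock in place. In real code, giving each thread its own kernel object is another way to keep threads from overwriting each other's arguments, but that would need to be checked against the implementation in use.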

weliad · August 24, 2010, 12:12pm

Thank you for the confirmation.

cantallo · February 5, 2011, 3:02am

In OpenCL 1.0, is it still possible to bracket the calls to clEnqueueNDRange…/clEnqueueWriteBuffer… with a single shared mutex lock (thus forcing only one thread at a time to load into the queue)?

I think the answer is yes, but the difficulty is the following:

Can multiple threads SIMULTANEOUSLY wait for different events?
Is clWaitForEvents… reentrant?

The idea is that each thread is woken up once its own event_list has completed. A mutex mechanism cannot be used for that, because the first thread to arrive will wait for its events while the others wait for the mutex to be released. They therefore wait for the first thread's events to complete BEFORE checking whether their own event_list is complete…
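The lock-scope problem described above can be sketched as follows: hold the shared mutex only while submitting, and wait for the event after releasing it. This is a minimal sketch in C with pthreads, not OpenCL code. `sim_event`, `sim_event_complete` and `sim_wait_for_event` are hypothetical stand-ins for a cl_event, the device finishing a command, and clWaitForEvents. It only shows the locking pattern; whether concurrent waits are actually safe on a given OpenCL 1.0 implementation is a separate question that the sketch does not answer.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for cl_event, not an OpenCL type. */
struct sim_event {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             complete;
};

static pthread_mutex_t  queue_lock = PTHREAD_MUTEX_INITIALIZER;
static int              commands_enqueued;
static struct sim_event events[2];

static void sim_event_init(struct sim_event *ev) {
    pthread_mutex_init(&ev->lock, NULL);
    pthread_cond_init(&ev->cond, NULL);
    ev->complete = 0;
}

/* Plays the role of the device marking a command as finished. */
static void sim_event_complete(struct sim_event *ev) {
    pthread_mutex_lock(&ev->lock);
    ev->complete = 1;
    pthread_cond_broadcast(&ev->cond);
    pthread_mutex_unlock(&ev->lock);
}

/* Plays the role of clWaitForEvents on a single event. */
static void sim_wait_for_event(struct sim_event *ev) {
    pthread_mutex_lock(&ev->lock);
    while (!ev->complete)
        pthread_cond_wait(&ev->cond, &ev->lock);
    pthread_mutex_unlock(&ev->lock);
}

static void *wait_worker(void *p) {
    struct sim_event *my_event = p;
    /* Hold the shared lock only while submitting a command. */
    pthread_mutex_lock(&queue_lock);
    commands_enqueued++;
    pthread_mutex_unlock(&queue_lock);
    /* Wait with the lock released, so other threads can keep enqueuing
       and each thread wakes on its own event only. */
    sim_wait_for_event(my_event);
    return NULL;
}

/* Returns the number of commands that were enqueued. */
int run_wait_demo(void) {
    pthread_t threads[2];
    commands_enqueued = 0;
    for (int i = 0; i < 2; i++) {
        sim_event_init(&events[i]);
        int rc = pthread_create(&threads[i], NULL, wait_worker, &events[i]);
        assert(rc == 0);
        (void)rc;
    }
    /* Complete the second thread's event first: it must be able to
       finish while the first thread is still waiting. */
    sim_event_complete(&events[1]);
    pthread_join(threads[1], NULL);
    sim_event_complete(&events[0]);
    pthread_join(threads[0], NULL);
    return commands_enqueued;
}
```

Calling `run_wait_demo()` should return 2. If `wait_worker` kept `queue_lock` held during the wait and the first thread acquired it first, the second thread could never enqueue and the first `pthread_join` would block forever, which is the behaviour described above.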

What I want to do is have one thread «feeding a pipelined processing» with a lock in the middle, and another thread «unloading the results» at a different rate (typically 32 × 32K blocks on input, versus 2k × 10K blocks on output).

Algorithm outline:

[1] read blocks of 32 × 32K data,

[2] process the 32 rows of data (FFT and so on)
put them in a buffer and go back to [1]

[3] when the buffer (typically a few hundred or a thousand rows) is full, process the columns and feed them to a second buffer, then rotate buffers and go back to [1] until the data ends

[4] when the second buffer is full, process by row / column / block and so on and feed a third buffer

[5] when the third buffer is full, process it and output it.

The problem is that while the third buffer has not been output, stage [4] should not write to it (a cl_event is fine for ensuring that), but the thread that reads the results from the GPU has to wait for an event (completion of phase [5]) while, simultaneously, the thread that writes the data to the GPU may be waiting for stage [2] to complete…

david.garcia · February 5, 2011, 7:29am

cantallo, is there a reason why you wouldn’t use OpenCL 1.1? In OpenCL 1.0 the only thread-safe operations are creating, retaining and releasing objects; everything else is not thread safe.

cantallo · February 6, 2011, 4:38am

It is simply that I (sometimes) debug my programs at home, and the only OpenCL-capable card there is my wife’s GeForce 8400M GS (on an old laptop), for which the SDK I installed only supports OpenCL 1.0.

Of course, for normal use of the program I need some 2 GB of video RAM, hence I use a system with Fermi cards.

That may explain some of the problems I encountered (not to mention the 100× slowdown).

thanks,