Control for busy-waits in blocking commands

This is not true; the other thread will be pre-empted DESPITE being of equal priority.

Think of it this way: the thread which is blocked still had its timeslice unfinished at the time of blocking (apart from Sleep and SwitchToThread, no other blocking syscalls give up the remainder of the thread’s timeslice). So the blocking syscall essentially “lends” that time to another thread to run while it is still our turn to run. When our thread is unblocked, it is immediately switched to and continues its unfinished timeslice.

When our thread is unblocked, it is immediately switched to and continues its unfinished timeslice.

So let me get this straight. Every time you release a mutex, every thread that was blocked on that mutex (there can be lots) immediately becomes active. So releasing a mutex basically means, “my timeslice is done; somebody else take over.”

I’m afraid I’m going to need to see some documentation or other evidence of that. Especially since you have admitted that you are “not familiar nor interested in the internal guts of ms windows” (and whether you believe it or not, this is all about the internal guts of Windows). So I want to see something that proves that this is how it is implemented.

Also, we’re talking about a cross-platform API in OpenGL. So not only do I want to see documentation on that for Windows, but I’ll need to see some on Linux, other flavors of UNIX, BSD, and Mac OS X. And if this is going to propagate to mobile platforms with OpenGL ES, now I need info for iOS and Android too.

And then there are fences themselves. Fences are not OS mutexes; they are GPU constructs. So even if you can show that, under all of these systems, mutex release will instantly restore a previous thread to functioning, that doesn’t show that fence completion can instantly restore a thread to functioning.

If even one of these GPUs is incapable of doing that, then what you are asking for would be impossible on that platform. And therefore, it would not be a good idea to implement it.

I think we slipped too deep into the mud.

What l_belev wants to say is that operating systems (including Windows and Linux) can move a thread from the running state into a waiting state, and automatically put it back into the running state as soon as some OS synchronization primitive is signaled by another thread/process. This way there is no need to switch to the context of the waiting thread; in fact, the scheduler does not even see the thread while it is in the waiting state, as it only deals with runnable threads.
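To illustrate, here is a minimal Win32 sketch of that mechanism (illustrative only, not from any driver or from this thread; the POSIX equivalent would be a condition variable or a semaphore). The waiter thread costs no CPU inside WaitForSingleObject() and only becomes runnable again once the event is signaled:

    /* Minimal Win32 sketch (illustrative only): the waiter thread consumes no
       CPU inside WaitForSingleObject(); the kernel parks it in the waiting
       state and makes it runnable again only when the event is signaled. */
    #include <windows.h>
    #include <stdio.h>

    static HANDLE done_event;

    static DWORD WINAPI waiter(LPVOID arg)
    {
        (void)arg;
        WaitForSingleObject(done_event, INFINITE);  /* blocks, no polling */
        printf("woken up\n");
        return 0;
    }

    int main(void)
    {
        done_event = CreateEvent(NULL, FALSE, FALSE, NULL);  /* auto-reset event */
        HANDLE t = CreateThread(NULL, 0, waiter, NULL, 0, NULL);

        Sleep(100);            /* pretend to do some work */
        SetEvent(done_event);  /* this moves the waiter back to the runnable state */

        WaitForSingleObject(t, INFINITE);
        CloseHandle(t);
        CloseHandle(done_event);
        return 0;
    }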

This is where blocking has an edge over a busy loop, and this is a busy loop:

    while (glClientWaitSync(hsync, 0, 0) == GL_TIMEOUT_EXPIRED)  /* zero timeout: just polls the fence */
        SwitchToThread();                                        /* yield and immediately try again */

In this case the thread still gets scheduled, wasting precious time on context switches, driver calls, and so on.
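For contrast, the blocking form of the same wait hands the sleep to the driver (a sketch using the same hsync as above; whether the implementation really puts the thread to sleep or spins internally is up to the driver, which is exactly what the proposed hint is about):

    /* Blocking form (sketch): give the driver a real timeout instead of a
       zero-timeout poll. Whether it actually sleeps the thread or spins
       internally is implementation-specific -- that is what the requested
       hint would control. */
    GLenum r = glClientWaitSync(hsync, GL_SYNC_FLUSH_COMMANDS_BIT,
                                16000000ull /* 16 ms, in nanoseconds */);
    if (r == GL_WAIT_FAILED)
    {
        /* handle the error */
    }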

While I still believe such an indication has no place in the GL spec, I agree with l_belev on the point that a busy loop is never a good choice, even if you are yielding.

Alfonse, as evidence:
At my workplace we were working on a server that we inherited from previous developers (Linux platform).
They said threading is expensive, so they implemented the southbound interface on three threads: a sender, a receiver and a worker.
They did polling in the receiver thread (i.e. a busy wait, but with yielding). This way the server consumed roughly 30% of the CPU time even when it was idle: no requests, nothing.
We changed this so that instead of those three threads there are thousands (1000 to 3000) of threads, each dealing with its own job, including sending and receiving internal messages. The key point here is that we used blocking receives (i.e. select with proper parameters) and semaphores for blocking idle threads.
In this case (even with the thousands of threads) the processor load was less than 0.01%, and even when handling requests the load barely passed a few percent, since most of the time the server threads were communicating with other internal modules; the blocking waits let them really go idle and consume no processor time.
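To make the pattern concrete, here is a rough sketch of such a blocking receive loop (made-up code, not the actual server; error handling trimmed):

    /* Rough sketch (hypothetical code, not the actual server): select() puts
       the thread to sleep in the kernel until the socket becomes readable,
       so an idle receiver thread costs essentially 0% CPU. */
    #include <sys/types.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    static void receive_loop(int sock)
    {
        char buf[4096];

        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(sock, &rfds);

            /* Blocks in the kernel; no polling, no yielding. */
            if (select(sock + 1, &rfds, NULL, NULL, NULL) < 0)
                break;  /* error handling elided */

            ssize_t n = recv(sock, buf, sizeof buf, 0);
            if (n <= 0)
                break;  /* connection closed or error */

            /* ... hand the message to a worker, e.g. by posting a semaphore ... */
        }
    }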

Maybe this is not good enough evidence for you, but believe me, every modern OS implements blocking waits efficiently.

What l_belev wants to say is that operating systems (including Windows and Linux) can move a thread from the running state into a waiting state, and automatically put it back into the running state as soon as some OS synchronization primitive is signaled by another thread/process. This way there is no need to switch to the context of the waiting thread; in fact, the scheduler does not even see the thread while it is in the waiting state, as it only deals with runnable threads.

That sleeping threads are woken when a mutex is released was never contested. What was contested is when those newly awoken threads get a timeslice. The OS provides no guarantees about when exactly that happens: it may switch over immediately, or it may wait until the scheduler’s next update, or even longer.

It also doesn’t deal with the deeper issue: fences are not OS mutexes but GPU-signaled state. Can a client thread block on GPU-signaled state without polling? On all GPUs?

As aqnuep said, this thread has gone too far astray.
If you are interested in learning about mutexes, please find a book on the matter. This forum is about OpenGL.

The latest edit to your original post is now nice and coherent. I support the suggested addition of a hint/flag (specified at glFenceSync) and the relaxed timing requirements. This way devices that can actually issue interrupts on such events will be able to wake the thread, while others can implement this by quickly polling a list of sync objects each time they handle some existing interrupt.
:slight_smile:
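Roughly, the usage might look like this (the flag name below is invented just for illustration; in the current spec the flags parameter of glFenceSync must be 0, which is exactly what the proposal would relax):

    /* Hypothetical sketch -- GL_SYNC_BLOCKING_WAIT_HINT_BIT does not exist;
       it stands in for the hint/flag proposed above. In the current spec the
       flags parameter of glFenceSync must be 0. */
    GLsync hsync = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE,
                               GL_SYNC_BLOCKING_WAIT_HINT_BIT /* hypothetical */);

    /* With the hint, the implementation is asked (not required) to put the
       waiting thread to sleep instead of polling, in exchange for relaxed
       wake-up latency. */
    glClientWaitSync(hsync, GL_SYNC_FLUSH_COMMANDS_BIT,
                     1000000000ull /* 1 second, in nanoseconds */);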

Very interesting post! I encountered that type of problem a while ago, and it ended up pretty ugly and buggy in my code.