Control for busy-waits in blocking commands

The problem: as it is now (at least on NVIDIA), the driver implements glClientWaitSync with a busy-wait instead of releasing the CPU.
I know that releasing the CPU implies a context switch, which is a heavier operation with higher latency, but sometimes it is really needed.

For example, in one of my applications I need a “waiter” thread whose sole purpose is to block on fences and raise flags when the fences are passed, while consuming as little CPU as possible,
whereas various other threads are doing hard work on the CPU (the OpenGL drawing is done by another thread with a shared context).
The working threads need all the available CPU, and wasting it on busy-waiting is extremely unwanted; it degrades the overall performance a great deal.
In contrast, the higher latency of glClientWaitSync, if it blocked instead of busy-waiting, would be completely fine.
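
To make the intent concrete, here is a minimal sketch of the waiter thread I have in mind (the PendingFence struct, the event handle and the timeout are mine, invented for illustration; the point is that glClientWaitSync should put the thread to sleep rather than spin):

#include <windows.h>
#include <GL/glcorearb.h>   // GLsync, GLenum, sync constants (GL entry points assumed already loaded)

// Hypothetical shared state: the rendering thread fills this in after glFenceSync;
// worker threads (GL or not) wait on doneEvent with WaitForSingleObject.
struct PendingFence {
    GLsync sync;        // created with glFenceSync on the rendering thread
    HANDLE doneEvent;   // the "flag" the waiter raises when the fence is passed
};

// Runs on a thread that has its own shared GL context current.
void WaiterThread(PendingFence* fence)
{
    // The whole point: this call should sleep the thread until the GPU passes
    // the fence (or the timeout expires), not spin on a CPU core.
    GLenum r = glClientWaitSync(fence->sync, GL_SYNC_FLUSH_COMMANDS_BIT,
                                1000000000ull /* 1 s, in nanoseconds */);

    if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED)
        SetEvent(fence->doneEvent);   // raise the flag for the worker threads

    glDeleteSync(fence->sync);
}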

My suggestion: please define a new flag for glClientWaitSync that forces the driver to block the thread (release the CPU) instead of busy-waiting.
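
Something along these lines, perhaps (the flag name and its value are invented here purely for illustration):

// Hypothetical flag; name and value are made up for this example.
#define GL_SYNC_CPU_BLOCK_BIT 0x00000002

// "I accept the extra latency; put my thread to sleep instead of spinning."
GLenum r = glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT | GL_SYNC_CPU_BLOCK_BIT,
                            1000000000ull /* 1 s, in nanoseconds */);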

It would also appear that the driver is doing other internal busy-waits; this shows up as abnormal CPU consumption by internal driver threads for no apparent reason.
Again, there are cases where the latency of the wait operations is less important than CPU utilization.
Please provide a means for the application to express its preference between lower latency and lower CPU wastage by the driver. Maybe the OpenGL hint mechanism could be used.

For example, in one of my applications I need a “waiter” thread whose sole purpose is to block on fences and raise flags when the fences are passed, while consuming as little CPU as possible

That doesn’t sound like a good idea. Wouldn’t it make more sense to just test the fences when you might be interested in seeing if they’re done? Testing a fence takes very little CPU, so I don’t see the problem. Indeed, you could implement your “waiter” thread exactly that way: just test the fence, and not block the CPU if it isn’t finished yet.

The threads that need the fence info are not OpenGL threads - they do not and can NOT have their own current contexts, and they may even be located in a separate process. Please don’t assume other people don’t know what they are doing.

Anyway, I only gave an example; the problem is more general. Please don’t focus on my concrete example and try to find a workaround for it - that’s not the point.

Alfonse, there was a guy named Korval on these mailing lists who took his greatest pleasure in endless, pointless, mindless carping and in annoying people.
Your behavior very closely resembles his, and for this reason I will be ignoring you from now on.

This is not something that should be in the specification. It’s the implementation’s responsibility to choose the most efficient way of implementing glClientWaitSync.

Have you ever seen anything in the GL spec like “The GL implementation must not use busy-wait for the implementation of the glFinish command”?

I think you should rather write about your problem to NVIDIA.

I would also like to point out that blocking is not necessarily as expensive compared to busy-waiting as people think. I would always vote for blocking instead of busy-waiting.

I agree this is a somewhat strange thing to be included in the spec, but I don’t like the notion of “implementation’s responsibility to choose the most efficient way”. The implementation has no way of knowing which way is the most efficient, because it depends on the particular application and the particular situation. That’s why I would prefer that there be a way for the app to choose one way or the other.

Then again:

I would also like to point out that blocking is not necessarily as expensive compared to busy-waiting as people think. I would always vote for blocking instead of busy-waiting.

I agree.

Note that if the driver blocks (yielding the CPU), the application can still busy-wait if it wishes, by calling glClientWaitSync in a loop with a zero timeout.
In contrast, the application can do nothing if it needs to block but glClientWaitSync busy-waits.
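
That is, an application that really wants to spin can always do something like this (a sketch; sync is the fence in question):

// App-chosen busy-wait: a zero timeout makes each call a pure test that
// returns immediately, so any spinning happens in my loop, under my control.
GLenum r;
do {
    r = glClientWaitSync(sync, 0, 0);
} while (r == GL_TIMEOUT_EXPIRED);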

I would prefer glClientWaitSync to never, ever busy-wait, but it seems “some vendors” are really fond of doing that, and that’s the reason I suggested a flag as a compromise: if the flag is not specified, let them do whatever they think is “the most efficient way”, but if the flag is specified, let the application have its blocking wait.

Have you ever seen anything in the GL spec like “The GL implementation must not use busy-wait for the implementation of the glFinish command”?

What kind of argument is this? Of course things can change. With time, people discover shortcomings in the API and patch them. glFinish also does not have a timeout; is that a reason glClientWaitSync should not have one either?

Btw, context switching in Windows is not really a heavy operation. I measured it at something like 370-450 cycles, consistently, on older CPUs.
So, could you try polling with:


while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED) SwitchToThread();

I need exactly the opposite - not to poll.
That’s what glClientWaitSync is doing - polling in a loop and burning all the CPU time it can get in the process.

I need the thread to SLEEP 99.99% of the time and wake only when a GL fence is passed.

SwitchToThread() causes the thread to yield the CPU momentarily, but it remains runnable and Windows will switch to it again. I don’t want that.

The bad thing is that, from the documentation, one can assume that glClientWaitSync is a BLOCKING function in the same sense as e.g. the Windows WaitForSingleObject() or the Unix select(). This assumption sounds very logical, and one designs one’s software around it.
Then suddenly one discovers that glClientWaitSync actually does dumb busy-loop polling, because someone decided that while their driver is in use it can assume exclusive ownership of any and all CPUs in the machine and nobody else could possibly need to do work on them!

The threads that need the fence info are not OpenGL threads - they do not and can NOT have their own current contexts, and they may even be located in a separate process. Please don’t assume other people don’t know what they are doing.

I didn’t say those threads should be waiting on the fences. I said that your “waiter thread” should periodically check the fences and fire off whatever it needs to if it finds that they have completed. If no fences have completed, it can release the CPU and get a timeslice later to test again. This is as opposed to forcing implementations to implement things a certain way.

Unless you’re saying that the “waiter thread” isn’t an OpenGL thread. And if that’s the case, I have no idea how you planned to have it call glClientWaitSync, regardless of its CPU behavior.

Anyway, I only gave an example; the problem is more general. Please don’t focus on my concrete example and try to find a workaround for it - that’s not the point.

You’re talking about a feature that exists to force implementations to implement something in a very specific way. That is not a minor thing, and it is not something that the OpenGL spec should do. Therefore, if your concrete example can be solved in another way, one that is fairly simple and easily implemented, then that concrete example simply is not a good reason for doing that.

I don’t like the notion of “implementation’s responsibility to choose the most efficient way”. The implementation has no way of knowing which way is the most efficient, because it depends on the particular application and the particular situation.

But you have all the tools you need to implement it yourself in the way that is most efficient for your needs. I consider the timed version of glClientWaitSync to mean, “I don’t care how you halt the CPU”, since the untimed version (wait time = 0) already allows you to implement the exact behavior you need.

You are asking for something that you could do yourself. And in so doing, guarantee that it would have the efficiency you need.

OK, please tell me how to do what I need. I still fail to understand how.

You do what Ilian Dinev suggested:

while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED)
    SwitchToThread();

Where “SwitchToThread” is a sleep function or whatever.

If you pass 0 for the wait time to glClientWaitSync, then it will test the fence and return immediately. Either the fence will have been completed, or the timeout of 0 will have expired. If it expired, then it hasn’t been completed and you can relinquish your timeslice.

while(glClientWaitSync(hsync,0,0) == GL_TIMEOUT_EXPIRED) SwitchToThread();

This code will not work, for the following reason:
suppose it has just executed SwitchToThread(), which managed to find another thread ready to run, so our waiter thread is put to sleep. At that moment some fence gets passed by the GPU, but our waiter thread remains sleeping, because the condition for it to wake has NOTHING to do with the fences - it will wake when Windows decides to give it a slice again.
This defeats the whole point of the waiter thread - it was supposed to react immediately when a fence is passed. A delay of a few context switches is OK, but not much more.
With this code, however, we have a potential delay equal to the Windows time slice.

The Windows time slice can be very long. I think it is typically 5 or 10 ms, but it can be over 100 ms. In any case, it is a whole eternity compared to the timings we need.

At that moment some fence gets passed by the GPU, but our waiter thread remains sleeping, because the condition for it to wake has NOTHING to do with the fences - it will wake when Windows decides to give it a slice again.

This defeats the whole point of the waiter thread - it was supposed to react immediately when a fence is passed.

How is this different from what you’re asking for? Do you believe that the driver is somehow able to wake a user-created thread immediately upon the completion of a fence? I highly doubt it.

Does Windows even have a way to immediately wake a thread? Or is that function just a way of saying, “The next time you pick who gets a timeslice, give priority to this thread”? There is no guarantee that the given thread will awaken immediately, or even in the immediate future.

If the driver is blocking on a fence and relinquishing its timeslice, there’s no guarantee that it will get a timeslice immediately after the fence completes. Fences are not the same as OS mutexes, and even if they were, not even OS mutexes guarantee that blocked threads will get a timeslice immediately after the mutex is released.

The only way to ensure a prompt response to a fence being completed is to sit there and wait for it. Once you relinquish your timeslice, when you get another one is in the hands of the OS. This is just as true for the driver as for your code.

Therefore, even if this proposal were accepted, the simple fact is that it wouldn’t get you the timings you want. There would always be a “potential delay equal to the Windows time slice.” If that is unacceptable to you, then the only alternative is to waste precious CPU time sitting there and waiting for the fence to complete.

Which is likely why NVIDIA implemented it this way. If you want to give up your timeslice (and therefore have the OS decide when you get another), you have the means to do that with the aforementioned code. But if you want to sit and wait within your timeslice, you have a way to do that by giving glClientWaitSync a non-zero time to wait.
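
Roughly, the two options look like this (a sketch; the timeout value is arbitrary):

// Option 1: test and give up the timeslice yourself; the OS decides when you run again.
while (glClientWaitSync(sync, 0, 0) == GL_TIMEOUT_EXPIRED)
    SwitchToThread();

// Option 2: stay on the CPU and let the driver wait for you, bounded by an
// explicit timeout (here 2 ms, expressed in nanoseconds).
GLenum r = glClientWaitSync(sync, GL_SYNC_FLUSH_COMMANDS_BIT, 2000000ull);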

There is no third option. There is no way to give up a timeslice and be given one immediately after a fence completes. That’s simply not possible, for you and for the driver.

Well, I’m not going to present a lecture on how modern OSes work, but yes, blocking system calls (like WaitForSingleObject) most definitely have far finer time granularity than the slice. Basically they wake the thread immediately when its condition is met, with the only delay being one or two switches between kernel and user mode plus a thread context switch, unless there are complications (e.g. all available CPUs are busy with higher-priority threads).

Do you really imagine that all inter-thread synchronization (critical sections, etc.), which is based on blocking system calls, has a timing accuracy of ~10 ms?
You are not serious, are you?
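
For illustration, this is the kind of blocking handoff I mean - a plain Win32 sketch, nothing GL-specific, with made-up names:

#include <windows.h>
#include <cstdio>

static HANDLE g_event;   // auto-reset event playing the role of the "flag"

// Producer: pretends to work, then signals the event.
DWORD WINAPI Producer(LPVOID)
{
    Sleep(100);           // simulate work
    SetEvent(g_event);    // the waiter below wakes almost immediately,
                          // not on the next multiple of the timeslice
    return 0;
}

int main()
{
    g_event = CreateEvent(nullptr, FALSE, FALSE, nullptr);
    HANDLE producer = CreateThread(nullptr, 0, Producer, nullptr, 0, nullptr);

    // Blocks without consuming CPU; the wake-up latency is a context switch or
    // two, not a full scheduler quantum (barring higher-priority load on all cores).
    WaitForSingleObject(g_event, INFINITE);
    puts("woke up");

    WaitForSingleObject(producer, INFINITE);
    CloseHandle(producer);
    CloseHandle(g_event);
    return 0;
}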

Could you then quickly tell me the KeXYZ function that a DPC queued by the ISR must use to force a specific thread to be resumed immediately once the IRQL is low enough? I can’t find it yet.
(And btw, I doubt the sync specification requires hardware to have facilities to issue IRQs on command completion; so far it seems to have always been easier and preferable to simply map some device memory into userspace, have the GPU dump some data into 4-16 bytes there, and have the userspace part poll that value.)
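
Roughly this kind of thing - a hedged sketch, not any real driver interface; the mapped pointer and the fence id are invented:

#include <cstdint>

// Hypothetical mechanism: the GPU writes the id of the last completed fence
// into a word of device memory that the driver has mapped into the process.
void WaitForFence(volatile uint32_t* completedFenceId, uint32_t fenceId)
{
    // Userspace just polls the mapped value - each test is trivially cheap,
    // but the loop keeps a CPU core 100% busy until the fence is passed.
    while (*completedFenceId < fenceId)
        ;   // spin
}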

Was this question for me?
If so, I don’t know what these abbreviations mean or what their relation to the subject is.

Oh, and that remark about WaitForSingleObject… sorry, but no. Reread the DDK. WFSO removes the thread from scheduling, gets checked annually for its timeout, and should the object be signalled, the thread is queued for scheduling; and then it hopes to get a timeslice sometime this week. So, not “finer granularity than a slice”, but “hope that 100 other running threads quickly finish their polling, while 500 threads sit in the non-scheduled graveyard it just came out of”.
Right?

Aha, now I think I’m starting to get an idea of what you are talking about :)
Well, I’m not a hardware vendor, so I don’t know. I would guess the hardware has the ability to trigger an IRQ on fence crossing, which the driver can process to cause the interested threads to be awakened.
All this should not be too hard for the HW vendors to do, because even back in the (good old) VGA times the IRQ was already in place. Back then it only served as the vertical retrace signal, because GPU fences had not been invented yet, but I would imagine its function might have been extended by now to include servicing the fences too.

I think you are wrong here. If waking from blocking syscalls were rounded up to the next slice, that would make any conceivable multi-threading impossible. But I’m not going to argue about this anymore.

I’m neither familiar with nor interested in the internal guts of MS Windows, but that does not mean I don’t know how certain user-space APIs work.
Just as it is not necessary to know that water consists of oxygen and hydrogen in order to know how to drink it.

I mean, your argument is invalid in the same way as the argument of someone who tells me he knows better than me how to drink water because he knows its internal structure and I don’t. You see?

“If waking from blocking syscalls were rounded up to the next slice” - this happens if another thread with the same priority is compute-intensive and so doesn’t let the dispatcher run all live (“ready”) threads multiple times per 16 ms (assuming they are just polling and using SwitchToThread).

My point above was that, AFAIK, Windows won’t let a device driver force a switch to a specific thread after I/O completion, or from a DPC or an ISR, or ever. All the driver can do is give a hint - a small thread-priority boost, a value which is checked only the next time the thread dispatcher is weighing its options.

So, with the way the sync APIs are now, you have the flexibility to tune your app for more different things than if you wanted the driver+OS to try to handle it with heavier code (which would affect everyone else who happens to care more about getting the result ASAP than about yielding to other threads).