glFlush or glFinish with multithreading?

imported_pjcozzi · June 16, 2010, 5:47am

Hello,

We are doing multithreaded GL using a (shared) context per-thread with 3.2 core profile. The main thread renders and worker threads create and fill textures, vertex buffers, etc. The question is when a worker thread is done preparing a GL resource, should it call glFlush or glFinish so it can be used by the main thread?

After reading (and re-reading) this, I am convinced you only need to call glFlush. This works, as best as unit tests can show, on several different machines with different flavors of Windows and NVIDIA/ATI cards.

On one machine, it does not work consistently, e.g., textures have wrong colors. However, it does work consistently if we replace glFlush with glFinish. This machine has 4 hyperthreaded cores, Windows 7 64, and NVIDIA drivers 197.45. We think there is a race condition somewhere that is most commonly exposed when the level of parallelism goes up.

We hope this is a problem in our code but it could be an issue in the drivers since even very simple tests can fail, e.g.:

Create context for worker thread on the main thread
Create shared context for main thread and set current
Spawn worker thread
---- Set context current
---- Create and fill texture
---- glFlush (glFinish works!)
Main thread blocks until worker thread is finished
Verify texture contents - fails!

Source for this is here: TextureMultiThreadingTests and TextureFactory

So, should we be able to use glFlush? If not, is glFinish going to kill performance? It is certainly a problem in the single-threaded world.

Regards,
Patrick

overlay · June 16, 2010, 8:04am

glFinish() makes sense; glFlush() does not guarantee the completion of the commands.

But with OpenGL 3.2, the recommended way is glFenceSync() with glWaitSync().

The 3.2 spec is pretty clear:

“D.3.1 Determining Completion of Changes to an object” page 325.

“Completion of a command may be determined either by calling Finish, or by calling FenceSync and executing a WaitSync command on the associated sync object. The second method does not require a round trip to the GL server and may be more efficient, particularly when changes to T in one context must be known to have completed before executing commands dependent on those changes in another context.”

See also “5.2 Sync Objects and Fences” page 241–246.
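As a rough illustration of the fence-based handoff the spec text describes, here is a minimal C++ sketch of the call ordering only. The SyncCalls struct and the two function names are hypothetical, introduced so the ordering can be shown without a live context. In real code the three callbacks would be glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0), glFlush(), and glWaitSync(sync, 0, GL_TIMEOUT_IGNORED), and void* stands in for GLsync. The flush right after the fence follows the advice that comes up later in this thread.

```cpp
#include <cassert>
#include <functional>

// Hypothetical bundle of the GL calls involved (not part of any real API).
// In real code these would wrap:
//   fence      -> glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
//   flush      -> glFlush()
//   serverWait -> glWaitSync(sync, 0, GL_TIMEOUT_IGNORED)
struct SyncCalls {
    std::function<void*()> fence;          // returns the sync handle
    std::function<void()> flush;
    std::function<void(void*)> serverWait;
};

// Worker thread side: fill the resource, fence it, flush, and return the
// sync handle so it can travel to the main thread with the resource.
void* publishResource(const SyncCalls& gl, const std::function<void()>& upload) {
    upload();                  // e.g. create and fill the texture
    void* sync = gl.fence();   // marks "everything above is done"
    assert(sync != nullptr);   // glFenceSync returns zero on failure
    gl.flush();                // make sure the fence actually gets submitted
    return sync;
}

// Main thread side: queue a server-side wait on the worker's fence before
// issuing any command that reads the resource.
void useResource(const SyncCalls& gl, void* sync, const std::function<void()>& draw) {
    gl.serverWait(sync);       // the GPU waits; the CPU keeps queuing commands
    draw();
}
```

How the sync handle gets from the worker to the main thread (a mutex-protected queue, for instance) is left to the application; the sketch only fixes the order upload, fence, flush, wait, draw.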

Bruce_Wheaton · June 16, 2010, 11:54am

I’m doing something similar, and it was a bit of a nightmare until recent drivers (Nvidia).

I do a glFlush, then poll for a fence to finish (old-style NV fence, not the new sync fences). If things seem to be taking too long, I do wait for completion, but I often have other work to do.

I specifically don’t glFinish - I think it will impose a big workload at an arbitrary time, when the screen threads might need to do work. If that makes sense.

Bruce

imported_pjcozzi · June 17, 2010, 6:06am

Good call - for some reason I thought you couldn’t share sync objects across contexts. That might have been true in an older version but that is certainly not the case now.

I have sync objects working (although I still need to test on the 4 core machine). I have a question/comment on ClientWaitSync vs WaitSync. At first only ClientWaitSync worked for me and WaitSync caused deadlock later in the tests. So I added a glFlush after WaitSync as Bruce mentioned and as shown in issue 20 in GL_ARB_sync.

Why exactly is this required? Is it because without flushing the worker thread, the main thread could wait on an unsignaled fence forever since the wait blocks the GL server?

Regards,
Patrick

overlay · June 17, 2010, 6:45am

I don’t understand why it works with glClientWaitSync() without glFlush() but I may have an explanation for why there is a deadlock with glWaitSync() without glFlush().

OpenGL commands are accumulated in a command queue before being sent to the server. The commands are sent in only two cases: when a newly added command fills the queue, so the driver flushes it automatically, or when you flush explicitly. If you don’t add an explicit glFlush() after your last command, then there is a chance that the queue is not full, so it will never be flushed.

If FenceSync() is the last command, then it will not be executed, hence the deadlock. If FenceSync() is not the last command, then the following commands will make the queue full and trigger an implicit flush.

In single-threaded mode with double buffering, you usually don’t care, because the buffer swap after rendering involves an implicit flush. But glFlush() is required in single-buffer mode. It is the same for a thread that does not do OpenGL rendering but just OpenGL resource allocation/initialization.

Don’t take my words for being the truth. This is just my understanding of the issue.
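The explanation above can be pictured with a toy model. This is only a sketch of the idea as described in this post, not of how any real driver works: ToyCommandQueue and its methods are hypothetical names, the capacity is arbitrary, and real drivers may flush for other reasons as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy model of a client-side command buffer: commands reach the "server"
// only when the buffer fills up or when flush() is called explicitly.
class ToyCommandQueue {
public:
    explicit ToyCommandQueue(std::size_t capacity) : capacity_(capacity) {
        assert(capacity > 0);
    }

    // Buffer a command; submit the whole batch once the buffer is full.
    void enqueue(const std::string& command) {
        pending_.push_back(command);
        if (pending_.size() >= capacity_) {
            flush();
        }
    }

    // Explicit flush, standing in for glFlush().
    void flush() {
        submitted_.insert(submitted_.end(), pending_.begin(), pending_.end());
        pending_.clear();
    }

    // True once the named command has been submitted to the "server".
    bool reachedServer(const std::string& command) const {
        for (const std::string& c : submitted_) {
            if (c == command) {
                return true;
            }
        }
        return false;
    }

private:
    std::size_t capacity_;
    std::vector<std::string> pending_;    // still buffered on the client
    std::vector<std::string> submitted_;  // handed to the server
};
```

In this model, a worker that enqueues a few upload commands followed by "glFenceSync" into a queue with spare capacity leaves the fence sitting in pending_ until flush() is called, which is the situation a waiter on that fence would be stuck behind.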

imported_pjcozzi · June 17, 2010, 10:36am

Your explanation sounds pretty accurate given the behavior I am seeing. In the tests I am using, the worker thread(s) only issue a handful of GL commands to create and fill a GL resource, then the last command they issue is glFenceSync. So you’re saying the glFenceSync was sitting around in the command queue, so back in the main thread glWaitSync was waiting on a sync object that was never signaled.

Thanks,
Patrick

mfort · June 17, 2010, 11:26pm

I am also developing an application with 2 OpenGL threads (main + worker).

I ran into the same trouble with glFenceSync and glWaitSync when there is no other OpenGL command following the glFenceSync. Actually, I do not see a deadlock. This is probably because of a timeout in glWaitSync.

What overlay said is probably the correct explanation of this situation. But I do not think it is good to assume this behaviour is correct. I read the ARB_sync spec several times and there is nothing about using glFlush in connection with glWaitSync.

I am sure the ARB was aware of the glFlush problem because they solved it in glClientWaitSync with the SYNC_FLUSH_COMMANDS_BIT flag. There is nothing similar in glWaitSync.

I used glFlush as a workaround. I believe either the spec should be changed or the NVIDIA drivers fixed. The flush is definitely a bad solution, as it can hurt performance when there are other OpenGL commands after the glFenceSync.

I also noticed that glFlush after glFenceSync is more important on Windows XP than on Windows 7 (where it works most of the time).

Alfonse_Reinheart · June 18, 2010, 10:10am

“I used glFlush as a workaround. I believe either the spec should be changed or the NVIDIA drivers fixed. The flush is definitely a bad solution, as it can hurt performance when there are other OpenGL commands after the glFenceSync.”

If I understand the behavior of glWaitSync correctly, it stops the GPU from getting any new commands until the sync object has signaled (i.e., the fence has completed). I’m pretty sure that’s going to hurt performance more than doing a glFlush.

What I’m most curious about is why people need glWaitSync to begin with? What is the use case for this function?

imported_pjcozzi · June 20, 2010, 1:39pm

I can’t speak for everyone but I am using it to synchronize between two threads - a worker thread that produces a GL resource like a texture and a main thread that consumes the resource for rendering. Even with application level synchronization, without GL synchronization, it appears rendering on the main thread can start before the resource was fully initialized.

Or are you asking why use glWaitSync instead of glClientWaitSync? Because I actually have the opposite question, why would someone want to block everything instead of just the GL server?

Regards,
Patrick

Alfonse_Reinheart · June 20, 2010, 3:11pm

“Or are you asking why use glWaitSync instead of glClientWaitSync? Because I actually have the opposite question, why would someone want to block everything instead of just the GL server?”

If you’re using fences to, for example, check to see if a particular buffer object is still in use (so that you can use MapBufferRange’s GL_MAP_UNSYNCHRONIZED_BIT flag), then only glClientWaitSync will let you know when the fence has completed.

Now, the timeout value will be 0 in this case, so you’re not really blocking.
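A small sketch of how the result of such a zero-timeout poll might be interpreted. The constant values are the ones defined for these enums in the GL 3.2 / ARB_sync headers, redeclared here only so the snippet compiles without GL; BufferState and classifyPoll are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Return values of glClientWaitSync, redeclared to keep the sketch standalone.
constexpr std::uint32_t kAlreadySignaled    = 0x911A;  // GL_ALREADY_SIGNALED
constexpr std::uint32_t kTimeoutExpired     = 0x911B;  // GL_TIMEOUT_EXPIRED
constexpr std::uint32_t kConditionSatisfied = 0x911C;  // GL_CONDITION_SATISFIED
constexpr std::uint32_t kWaitFailed         = 0x911D;  // GL_WAIT_FAILED

enum class BufferState { Free, Busy, Error };

// Interprets the status returned by a poll such as
//   glClientWaitSync(fence, 0, 0)
// issued for the fence that follows the last command using the buffer.
BufferState classifyPoll(std::uint32_t status) {
    switch (status) {
        case kAlreadySignaled:
        case kConditionSatisfied:
            return BufferState::Free;   // GPU is done; unsynchronized map is safe
        case kTimeoutExpired:
            return BufferState::Busy;   // fence not signaled yet; try again later
        case kWaitFailed:
        default:
            return BufferState::Error;  // check glGetError in real code
    }
}
```

With a live context this would be called as classifyPoll(glClientWaitSync(fence, 0, 0)), and the buffer would only be mapped with the unsynchronized flag once the result is BufferState::Free.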

imported_pjcozzi · June 21, 2010, 4:54am

I see. Good call.

Patrick

imported_pjcozzi · June 22, 2010, 10:46am

I’d also like to ask your opinions on best practices for sync objects. Specifically, if you have worker thread(s) creating and uploading GL resources, let’s assume textures, that will be rendered by the main thread, where should the wait on the fence happen? I can imagine a few possibilities:

a) A single worker thread uploads a texture then creates a fence. Both of these are put on a queue for the main thread. The worker thread can then continue uploading textures and creating fences. The main thread uses a server wait, glWaitSync, before using the corresponding texture.

b) Several worker threads each upload a texture, create a fence, and immediately issue a client side wait, glClientWaitSync. The texture is not added to the queue for the main thread until the fence is signaled. It is safe for the main thread to use a texture once it is on the queue.

c) A single worker thread uploads a texture then creates a fence. A client side wait with a timeout of zero is used to poll the fence. A texture is not added to the queue until its fence is signaled. Meanwhile, the worker thread can upload additional textures.

My thoughts are:

a) May be ok because by the time the main thread issues glWaitSync, the fence may already be signaled. If this is not the case, the wait would stall the GPU but not the CPU.

b) Is a bit harder to manage but could perform well if glClientWaitSync is not implemented with busy waiting and only blocks the GL client for the thread’s context. Does anyone know if this is the case or will other contexts also be blocked? Likewise, does glWaitSync block just the current context or all contexts?

c) I hate to poll the fence but the implementation is a bit simpler than (b) and doesn’t have the potential of stalling the GL server like (a).

What are your thoughts? Are there other worthwhile approaches I have not considered?

Thanks,
Patrick

Bruce_Wheaton · June 22, 2010, 12:37pm

Huh. Further to my reply a few posts back, I found that in my multi-threaded texture transfer system, I wasn’t actually calling glFlush, to my surprise.

Which is very nice and mellow from a threading perspective, but probably slows things down quite a bit.

I’m trying that now. But it looks like you can skip it, with the caveat that you’re reliant on other commands and/or flushes to get things started.

Bruce

mfort · June 23, 2010, 12:44pm

Polling does not work well for me. See this thread:

http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Main=53526&Number=277431

pdaniell · June 23, 2010, 4:51pm

You do need a glFlush() sometime after the glFenceSync() call, otherwise anyone that waits on this sync object may wait forever. It doesn’t need to be immediately after the glFenceSync() call, but just at some point so that it doesn’t unnecessarily stall the other thread. This applies to both glWaitSync() and glClientWaitSync().

The advantage of using glWaitSync() vs glClientWaitSync() is that glWaitSync() doesn’t stall your application. It can continue to queue commands to the GPU, while the GPU itself waits on the other thread. When it’s released, the GPU immediately has stuff to continue working on without having to wait for the application to feed it more stuff.

imported_pjcozzi · June 25, 2010, 5:05am

To be more clear, I would not use GetSynciv in a different thread to poll. The worker thread would create/upload the texture, call FenceSync, and put this pair on a local queue. The worker thread would check the return of ClientWaitSync with a timeout of zero. If the fence is signaled, the texture is moved from the local queue to a shared queue for use by the main thread, otherwise the worker thread goes on with its work and checks the fence later.
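The worker-side bookkeeping described above could look roughly like the following C++ sketch. PendingUpload and releaseSignaled are hypothetical names, void* stands in for GLsync, and the isSignaled callback stands in for a zero-timeout poll such as glClientWaitSync(fence, 0, 0) == GL_ALREADY_SIGNALED, so that the logic can be shown without a GL context.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical pairing of a texture name with the fence issued after its upload.
struct PendingUpload {
    unsigned texture;  // GL texture name
    void* fence;       // stands in for a GLsync handle
};

// Moves every upload whose fence has signaled from the worker's local queue
// into `ready`, the list consumed by the main thread. Uploads whose fence is
// still unsignaled stay in `local` to be checked again later.
// Returns the number of textures released.
std::size_t releaseSignaled(std::deque<PendingUpload>& local,
                            std::vector<unsigned>& ready,
                            const std::function<bool(void*)>& isSignaled) {
    std::size_t released = 0;
    for (auto it = local.begin(); it != local.end();) {
        if (isSignaled(it->fence)) {
            ready.push_back(it->texture);
            it = local.erase(it);  // real code would also glDeleteSync the fence
            ++released;
        } else {
            ++it;
        }
    }
    return released;
}
```

The worker would call releaseSignaled between uploads. If `ready` is shared with the main thread, the caller is assumed to guard it with a mutex or a similar application-level mechanism, which is omitted here.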

You had performance problems calling GetSynciv in a different context than FenceSync, correct?

One question I have is if ClientWaitSync (and WaitSync) will block more than the current thread/context. The spec says it “causes the GL to block” - is that just the current context or all contexts in the process? I suppose this isn’t too important when ClientWaitSync is called with a timeout of zero since it should not block at all.

Regards,
Patrick

mfort · June 25, 2010, 8:46am

Thanks for the confirmation. I understand the reason. The only thing that confused me was the documentation. I did not find any info about this situation. Maybe it is worth adding one sentence to the ARB_sync document. On the other hand, it never hangs when glFlush is not called, due to an internal platform-specific timeout.

Yes.

I’ve never tried ClientWaitSync with a zero timeout. I will try it one day. But a timeout of zero and close-to-zero time spent in this API are two different things. I expect ClientWaitSync (…, 0) to have similar performance characteristics to GetSynciv.

On the other hand, I do not see any benefit in having a pure worker thread for loading textures. Do some benchmarks and you will see that if you call some function in one OGL thread, then the other OGL thread is blocked. You will end up chasing obscure timing problems in your main thread. Some trivial OpenGL API call will sometimes take ages just because your worker thread called glTexImage (at least on NVIDIA, Windows).

Use PBOs for loading textures. It is much better.

imported_pjcozzi · June 25, 2010, 11:33am

This is exactly what I am afraid of. Given different contexts in different threads, you would think a large number of GL calls could execute in parallel (with the understanding that there is probably only a single GPU). Of course, I figure calls like glGenTextures need to lock.

I hear ya. We’ve used this successfully in the past by giving a pointer from glMapBuffer to a worker thread. Do you know if this can be done without pointers, e.g., by calling glBufferData in the worker thread? Or is using any GL in the worker thread a potential problem?

Regards,
Patrick

Alfonse_Reinheart · June 25, 2010, 1:53pm

“I expect ClientWaitSync (…, 0) to have similar performance characteristics to GetSynciv.”

Why? One of these is a glGet, which is always considered a bad idea in performance code. The other has specific language in its specification as to how it behaves and performs.