glXSwapBuffers and glXMakeCurrent when streaming to multiple X windows

Hi,

This question is about how OpenGL, glXSwapBuffers and glXMakeCurrent operate when several 20 fps videos are being fed, each one to its own X window.

My scheme goes (omitting some detail) like this:

  1. memcpy to PBO memory addresses (there are several threads doing this). There is a “pool” of PBO objects that is being constantly re-used.
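To make the scheme concrete, here is a rough sketch of how step 1 could look (not the actual code; the names y_pbo, y_size and y_plane_src are just illustrative, and the mapping is assumed to be done by the master thread so the filler threads only see a raw pointer):

[highlight=c++]
// Master thread: map the PBO and hand the pointer to a filler thread.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, y_pbo);
void* dst = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, y_size,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

// Filler thread: plain memcpy, no GL calls (needs <cstring>).
memcpy(dst, y_plane_src, y_size);

// Master thread again, before the PBO=>TEX step: unmap so the upload can use the data.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, y_pbo);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
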

The following steps are performed by a master thread (only thread that touches OpenGL):

  2. PBO => texture “swap” at the GPU (each video stream has its own texture set)

[highlight=c++]
// y
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, y_pbo);
glBindTexture(GL_TEXTURE_2D, y_tex); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
glBindTexture(GL_TEXTURE_2D, 0);

// u
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, u_pbo);
glBindTexture(GL_TEXTURE_2D, u_tex); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w/2, h/2, format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
glBindTexture(GL_TEXTURE_2D, 0);

// v
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, v_pbo);
glBindTexture(GL_TEXTURE_2D, v_tex); // this is the texture we will manipulate
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w/2, h/2, format, GL_UNSIGNED_BYTE, 0); // copy from pbo to texture
glBindTexture(GL_TEXTURE_2D, 0);

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0); // unbind // important!
glBindTexture(GL_TEXTURE_2D, 0); // unbind

// glFinish(); // maybe not use this (see below)



  3. Drawing the image. Each video has its own window_id, VAO, texture set, etc. Drawing goes approximately like this:

[highlight=c++]
glXMakeCurrent(display_id, window_id, glc); // each video in its own X-window ..

glViewport(0, 0, x_window_attr.width, x_window_attr.height);
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);  // clear the screen and the depth buffer .. this can be commented out

shader->use(); // use the shader
  
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, y_index);
glUniform1i(shader->texy, 0); // pass variable to shader

glActiveTexture(GL_TEXTURE1);
glBindTexture(GL_TEXTURE_2D, u_index);
glUniform1i(shader->texu, 1); // pass variable to shader

glActiveTexture(GL_TEXTURE2);
glBindTexture(GL_TEXTURE_2D, v_index);
glUniform1i(shader->texv, 2); // pass variable to shader

glBindVertexArray(VAO);
glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0);
glBindVertexArray(0);

if (doublebuffer_flag) {
  std::cout << "RenderGroup: render: swapping buffers "<<std::endl;
  glXSwapBuffers(display_id, window_id);
}

I have timed all OpenGL operations, and it seems that the problem is in GLX.

A) A Single video - works nicely

[ul]
[li]PBO=>TEX takes systematically 3-4 ms[/li]
[li]The swap-buffer timer might alert every 10 seconds or so that glXSwapBuffers takes around ~3 ms … but most of the time it’s less[/li]
[li]However, if I start to resize the window, then it might take 5+ ms.[/li]
[li]… btw, why is this? Does the window system block glXSwapBuffers?[/li]
[/ul]

B) Two videos - doesn’t go so smoothly …

[ul]
[li]glXSwapBuffers constantly takes 4+ ms[/li]
[li]glXMakeCurrent sometimes starts to take 10+ ms (quite sporadically)[/li]
[li]… these have consequences for PBO=>TEX, as the whole OpenGL pipeline seems to stall[/li]
[li](Late frames are dropped and not PBO=>TEX’d & rendered … so there is no saturation from that)[/li]
[/ul]

With more videos, the whole thing just stalls.

What am I doing wrong here…? Please, help me understand the following issues:

When I have two videos, there can be two frames coming immediately one after the other (one from camera 1 and the other from camera 2). I’m actually testing with two streams coming from the same multicast source, so the frames arrive at almost identical times. Because of this, we can have:

[ul]
[li]Rapid consecutive glXMakeCurrent calls[/li]
[li]Rapid consecutive glXSwapBuffers calls[/li]
[li]etc.[/li]
[li]There is an individual YUV texture set for each stream, so textures will be overwritten every 40 ms or so … enough time for them to get rendered, I guess.[/li]
[/ul]

OpenGL should just queue these requests … right? What about the GLX calls … do they do some blocking?

For example, if I take out that glFinish() call, then the PBO=>TEX part exits immediately, is queued by OpenGL and executed … when? At the latest, when glXSwapBuffers gets the next screen refresh? Removing all glFinish calls does not seem to help things.

Regards,

Sampsa

Something I’ve found:

https://www.khronos.org/opengl/wiki/Swap_Interval#In_Linux_.2F_GLX

https://www.gamedev.net/forums/topic/584025-glxswapbuffers-very-slow-potential-problems/

[QUOTE=sampsa;1290259]This question is about how OpenGL, glXSwapBuffers and glXMakeCurrent operate…

I have timed all OpenGL operations and it seems that the problem is in GLX[/QUOTE]

You’ve got a number of questions here, so let’s talk about them separately. First, re *SwapBuffers() time consumption…

It looks like that only because of the way OpenGL works.

When you issue a GL command, it is not (typically) executed immediately. It is queued and executed later. So if you time the commands on the CPU, most appear to be very cheap (when in actuality some of them may take non-trivial time on the GPU and/or in the GL driver when they finally execute).

*SwapBuffers (whether glX, wgl, egl, etc.) is often the only command you issue that has an implicit glFlush built in. This tells OpenGL to “get to work” executing all those commands you just queued. Further, for it to complete, it needs to finish processing all the rendering you’ve given it and perform the swap-buffers processing (downsamples, blits/flips, waits on swap chain images, etc.). Your GL driver may or may not wait on this processing before returning from *SwapBuffers. Whether it does depends on your GL driver, its configuration, and sometimes the windowing system.

So, bottom line: what you observe as SwapBuffers time is sometimes the accumulated time it takes to finish executing your queued GL commands for the current frame, plus the swap-buffers processing itself.
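If you want to see where the time actually goes on the GPU side rather than on the CPU side, a timer query is one option. A minimal sketch, assuming GL 3.3+ / ARB_timer_query is available:

[highlight=c++]
GLuint query;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... issue the commands you want to measure (e.g. the PBO=>TEX uploads) ...
glEndQuery(GL_TIME_ELAPSED);

GLuint64 elapsed_ns = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsed_ns); // waits until the GPU result is available
std::cout << "GPU time: " << elapsed_ns / 1.0e6 << " ms" << std::endl;

glDeleteQueries(1, &query);
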

Now as to *MakeCurrent time…

It’s very expensive, so you want to avoid calling this whenever possible. Besides the cost of completely unbinding/rebinding contexts, it forces a complete catch-up of all the queued work on the context (like a glFinish). I would shoot for a model where you have one active context per thread which is rarely if ever unbound or rebound. Just as a test, try putting a glFinish() right before your glXMakeCurrent and then see how much time is actually finishing up queued work vs. the cost of switching the bound context.
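Something like this (just a sketch; display_id/window_id/glc are the names from your drawing code, and std::chrono is used for the CPU-side timing):

[highlight=c++]
#include <chrono>
#include <iostream>

auto t0 = std::chrono::steady_clock::now();
glFinish();                                  // catch up on all queued work first
auto t1 = std::chrono::steady_clock::now();
glXMakeCurrent(display_id, window_id, glc);  // now time just the context/drawable switch
auto t2 = std::chrono::steady_clock::now();

std::cout << "glFinish: "
          << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms, "
          << "glXMakeCurrent: "
          << std::chrono::duration<double, std::milli>(t2 - t1).count() << " ms" << std::endl;
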

PBO=>TEX takes systematically 3-4 ms

How do you really know? The GL driver may defer this texture transfer (and associated tiling) until the next time you need to read from this texture.

With more videos, the whole thing just stalls. What am I doing wrong here…?

As a starting point, I’d get rid of the multiple windows/contexts and render into different viewports of a shared window with a single context. Then, when that’s working very well for N videos, extend to your more general use case.
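For illustration only (not your code): the multi-viewport version could be as simple as the sketch below, with a single context bound once and a single swap per frame; shared_window_id, n_cols, cell_w/cell_h and draw_video() are placeholders.

[highlight=c++]
// One context, one window, bound once at startup:
glXMakeCurrent(display_id, shared_window_id, glc);

// Per frame:
glClear(GL_COLOR_BUFFER_BIT);
for (int i = 0; i < n_videos; ++i) {
    int col = i % n_cols;
    int row = i / n_cols;
    glViewport(col * cell_w, row * cell_h, cell_w, cell_h); // each video gets its own sub-rectangle
    draw_video(i); // bind that video's textures + VAO, then glDrawElements as before
}
glXSwapBuffers(display_id, shared_window_id); // one swap for all videos
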

OpenGL should just queue these requests … right? What about the GLX calls … do they do some blocking?

Depending on your GL driver and driver config, all of these GL calls and glXSwapBuffers could just be queued by the driver. OTOH, again depending on driver/driver config, *SwapBuffers could block your CPU thread until all the queued rendering finishes up before returning.

For example, if I take out that glFinish() call, then the PBO=>TEX part exits immediately,

There you go. That’s one tip-off that you’re not timing the actual execution of the commands, but just the queuing time. glFinish() is different though. It says “execute all that queued work I gave you right now and don’t return to me until it’s all finished!”

is queued by OpenGL and executed … when?

Whenever the driver decides to schedule it. Commands like *SwapBuffers(), glFlush(), and glFinish() force the driver to get to work on the queued work (there are a few others), and the driver will internally decide to do so every so often based on internal GL driver behavior.

Thanks Dark Photon - now we’re talking!

About a model where there is “one active context per thread” … I’ve experimented with OpenGL & multithreading, and I believe the problem there is that when different threads make OpenGL calls, there is also a context switch between those threads (i.e. the “normal” thread context switch, not an OpenGL context switch) … and that, in my experience, creates another set of problems. Shared contexts between threads, etc., right?

In the present case, we are merely switching between different windows - I don’t get it … why do we need different OpenGL contexts per window? But I guess there is no way around this.

About the solution “render into different viewports of a shared window with a single context”: in this case, we don’t depend on the X11 window system, but make a “windowing system” of our own inside a single X11 window. Is this what you mean?

Is there any way to circumvent X11 altogether and just address rectangular areas on the screen without context switches, etc.? Maybe with EGL? Any suggestions?

Yes, there will be thread context switches. You can’t really do much about that with that design, except do all the GL context switching in one thread, which, as you’ve observed, has a non-trivial CPU cost. The potential advantage of multiple threads, one context per thread, is that you don’t have the overhead of a bunch of *MakeCurrent() calls going on all the time as well (unbinding/rebinding contexts + implicit glFinishes).

Shared contexts between threads, etc., right?

No. One context per thread, created and made current on startup, and never unbound thereafter. Now you may very well decide to share objects between those contexts, depending on your design.

That said, in the end you may try this approach and instead decide to go with a model where you just have one GL thread, one context, and you just serially render to all of the windows (as you’re doing). Or a model with one GL thread, one context, and you render to multiple viewports within a single OpenGL window.

This link has some good info related to this topic: Parallel OpenGL FAQ (Equalizer)

In the present case, we are merely switching between different windows - I don’t get it … why do we need different OpenGL contexts per window? But I guess there is no way around this.

Think about it this way. A GL context represents “all” of OpenGL’s state, including the hidden state of the command queue you’re submitting commands to, handles to all resources managed in GL driver memory and GPU memory, the state of all outstanding jobs running on the GPU for that context, and the state of the “default framebuffer” (your window, in most cases). If your default framebuffer (window) changes within a thread, you need to unbind that context and rebind one to activate it on the new default framebuffer (window). So you can see how switching the bound GL context might entail quite a bit of work.

Now strictly speaking, you can share a context between windows if they have the same visual/fbconfig (i.e. pixel format). But there’s still an expense in switching from one window to another.

One way around this GL context switching overhead is just to not use different “window system drawables” (e.g. window, P-Buffer, or pixmap) for rendering different framebuffers, but rather use off-screen Framebuffer Objects (FBOs) which do not require a GL context switch to switch between. Alternatively, you can target rendering to different viewports within the same window; that won’t require a GL context switch either.
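A minimal FBO setup sketch (GL 3.0+; error handling abbreviated), just to show that switching render targets is a plain GL bind rather than a *MakeCurrent:

[highlight=c++]
GLuint fbo, color_tex;

glGenTextures(1, &color_tex);
glBindTexture(GL_TEXTURE_2D, color_tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, w, h, 0, GL_RGBA, GL_UNSIGNED_BYTE, nullptr);
glBindTexture(GL_TEXTURE_2D, 0);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, color_tex, 0);
if (glCheckFramebufferStatus(GL_FRAMEBUFFER) != GL_FRAMEBUFFER_COMPLETE) {
    // handle error
}

// Render off-screen:
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
// ... draw ...
glBindFramebuffer(GL_FRAMEBUFFER, 0); // back to the window's default framebuffer
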

About the solution “render into different viewports of a shared window with a single context”: in this case, we don’t depend on the X11 window system, but make a “windowing system” of our own inside a single X11 window. Is this what you mean?

Sort of, though I’m not proposing you make these viewports (just rectangular sub-regions of your window) behave like desktop windows, where they are individually movable, push/poppable, have window decorations (borders), etc. That’s probably way overkill for your needs.

Is there any way to circumvent X11 altogether and just address rectangular areas on the screen without context switches, etc.? Maybe with EGL? Any suggestions?

You can create a full-screen borderless window and do what you want (e.g. target rendering to separate viewports/subregions within it).
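For example, a sketch of creating such a window with plain Xlib (assuming vi is the XVisualInfo you already use for your GLX windows, and screen_w/screen_h are the screen dimensions):

[highlight=c++]
XSetWindowAttributes attr = {};
attr.override_redirect = True;  // bypass the window manager: no borders, no decorations
attr.colormap = XCreateColormap(display_id, DefaultRootWindow(display_id),
                                vi->visual, AllocNone);

Window win = XCreateWindow(display_id, DefaultRootWindow(display_id),
                           0, 0, screen_w, screen_h, 0,
                           vi->depth, InputOutput, vi->visual,
                           CWOverrideRedirect | CWColormap, &attr);
XMapRaised(display_id, win);
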

You might also be able to render to the root window, but I think you have to just take its visual/pixel format as-is and deal with it. Also, this would have the disadvantage of rendering beneath all your desktop windows.

Hi,

Thanks again.

It seems I got confused … I am calling glXMakeCurrent(display, drawable, context) frequently when targeting different windows. But the only thing that changes is the “drawable”, while “context” is always the one-and-only OpenGL context - my implementation is serial, as you said.

Surely GLX wouldn’t be stupid enough to do context flushes if I call glXMakeCurrent with the same context (i.e. the only thing that changes is the drawable).

Well… I don’t really know what to do with this. There is the issue of two frames coming in almost simultaneously, fast-switching between the drawables and somehow blocking the whole thing.

I think I’ll debug with a glFinish after PBO=>TEX and after rendering, and comment out glXMakeCurrent and glXSwapBuffers … that should tell me something.

EDIT

I did the following:

Put a glFinish after the PBO=>TEX loading

Put another glFinish after all rendering, but just before glXSwapBuffers

Commented out glXSwapBuffers

So now I’m doing the whole rendering pipeline, measuring all times. The only thing missing is glXSwapBuffers.

Results:

PBO=>TEX loading takes 3-4 ms per bitmap
glXMakeCurrent takes hardly any time at all - only very sporadically a few ms
No frame-dropping anymore

So, it works! …
… except I don’t get any video on screen because swap buffers is disabled. :frowning:

What’s going on!?

I need to know how glXSwapBuffers works … this has something to do with it waiting for the refresh rate or something like that.

With double buffering, you can’t consistently generate frames faster than they’re being consumed. Even if the driver defers the actual buffer swap and returns immediately, now you have one framebuffer (the front buffer) being scanned out and another (which was until now the back buffer) waiting to become the front buffer at the next vsync. Neither of those can now be rendered to.

Now suppose that you immediately start issuing more commands. Initially they will just be enqueued, but eventually they’re going to be executed, and for that to happen there needs to be a framebuffer to render onto. So with only two buffers, execution will stall until there’s a framebuffer available. Even if the driver has a pool of buffers, if you render frames faster than they can be displayed eventually the entire pool will be in the queue for scan-out. And if there are no implicit flushes elsewhere, the one in glXSwapBuffers is where the stall will occur.

Triple buffering avoids that by having a third buffer which is always available for rendering. When you perform a buffer swap, the freshly-rendered back buffer immediately becomes the “candidate” for the front buffer at the next vsync and any previous candidate immediately becomes the back buffer, available for rendering. The drawback is that you spent all those cycles rendering a buffer which was never used. Regardless of how quickly you can render frames, there’s a fixed limit of how many the monitor can display per second. If you render faster than that, it’s just wasting cycles (and power). The main advantage of triple buffering is that if you can’t quite keep up with the refresh rate, you get whatever rate you can handle rather than an integer fraction of the refresh rate (because it will end up swapping every second, third, etc vsync, so you get a half, a third, etc of the rate).

Hi,

Thanks for the comments, but …

Frames from only one stream are being written to each window (and to that window’s double buffer).

Each stream runs at max. ~20 fps, i.e. a frame interval of ~50 ms, which is much slower than the typical 16 ms refresh interval.

So what’s the problem… or maybe I didn’t get it?

[QUOTE=sampsa;1290283]No frame-dropping anymore

So, it works! …
… except I don’t get any video on screen because swap buffers is disabled. :frowning:

What’s going on!?[/QUOTE]

GClements has already given you some good stuff here, so I’ll just add to what he’s said.

First, I assume for you to be getting the times you are getting, that you already have Sync-to-VBlank (aka VSync) disabled. That is glXSwapInterval( interval = 0 ). Again, tailing onto GClements’ glXSwapBuffers() comments, this tells the GL driver not to wait for the next vertical blank interval, but to swap immediately as soon as a new frame is available. This is fine for testing and benchmarking, but it has the disadvantage of causing tearing artifacts.

I need to know how glxswapbuffers works … this has something to do it waiting for the refresh rate or something like that.

At a high level, with pure double-buffering, rendering directly to the display (without a compositor in the loop; you don’t want one of those if you care about efficiency), glXSwapBuffers processing goes something like this:

  1. Finish up all rendering to the default framebuffer,
  2. Perform any downsampling operation necessary (if the default framebuffer was allocated as multisample),
  3. Wait for the current frame to finish scan-out (i.e. wait for the VSync clock)
  4. Swap buffers to display the newly rendered frame (this may involve a blit or a flip).

Disabling VSync gets rid of step #3.

That’s double-buffering. However, on some GPUs+drivers, there are other options. Which GPU and GPU driver are you running there?

For instance, NVidia drivers have Triple Buffering. This gives you a 3rd buffer in your swap chain, rather than just 2 buffers as in double-buffering. The buffers are still processed as a ring buffer (a FIFO queue), so if your app is rendering faster than the display, your app will eventually block. But triple buffering allows your app to start rendering the next frame before the currently-displayed frame has finished scan-out.

There’s also an option called Fast Sync on some of the newer GPUs. This is like triple buffering but there’s no longer a ring buffer. Frames can be skipped. The frame displayed next is “the newest completely-rendered but-not-yet-displayed frame” as opposed to the “oldest completely-rendered but-not-yet-displayed” frame. This yields low-latency app frame rendering, less time blocked in the draw thread, and avoids tearing, but it can waste cycles and power (I think this is what GClements was describing as triple buffering).

You might give Fast Sync a try if it’s available to you, particularly if you like the frame draw performance of disabling VSync but don’t want tearing.

With both the previous methods, you’d of course have VSync enabled to avoid tearing.

Hi,

Thank you both for contributing to solving this! I really appreciate it.

But I think there is a misunderstanding here … I am not sending frames faster than the refresh rate (per window, at least).

One picture tells more than a thousand words, so let’s visualize this:

  • No multithreading
  • Just a single thread (“master thread”) that calls OpenGL commands
  • Master thread has received frames for three streams almost simultaneously … now it’s putting those streams through OpenGL

                                        OpenGL queue                                                        VSync (could be elsewhere..)
-------------------------------------------------------------------------------------------------------------|--------------------------
To window 1 [TEX_1]* [R_1] [CTX_1]* [SWAP_1]                                                                 |
To window 2                                 [TEX_2]* [R_2] [CTX_2]* [SWAP_2]                                 |
To window 3                                                                 [TEX_3]* [R_3] [CTX_3]* [SWAP_3] |
-------------------------------------------------------------------------------------------------------------|--------------------------



*          : glFinish()
TEX_N      : PBO=>TEX
R_N        : render: glBindVertexArray, glDrawElements, etc.  Does YUV => RGB interpolation on the GPU using fragment shaders
CTX_N      : glXMakeCurrent(display_id, WINDOW_N, ..); 
SWAP_N     : glXSwapBuffers(display_id, WINDOW_N); PER WINDOW!

  • Each window receives a frame on an interval of 40-50 ms (much slower than the refresh rate).
  • There are lots of SWAP calls … faster than the refresh rate, but …
  • … PER WINDOW, there is a SWAP call only each 40-50 ms

I was suspecting that the GPU interpolation would be sluggish and kill the performance, but that is not the case: when I comment out the SWAP_N commands, everything works smoothly! The glFinish calls force the GPU to do all the heavyweight stuff (interpolation, etc.).

But without SWAP I get no video on the screen, of course.

I’ve tried both with and without VSync (calling either glXSwapInterval(0) or glXSwapInterval(1) before starting to feed frames), but that seems to make no difference.

What a mystery.

Good info! Thanks for posting the pictures and the detailed explanation.

I’m not quite understanding why your R_N is before you’ve actually bound WINDOW_N, rather than the reverse. Seems like an off-by-one, but that’s probably a side issue.

You didn’t say which GPU and GPU driver you’re working with. Some of this behavior is going to vary per driver.

As a test, try reconfiguring your windows as single-buffered rather than double-buffered. Then instead of calling *MakeCurrent/*SwapBuffers on each window, call glFlush(). You’ll get tearing, but it would be interesting to see if this performs better for you.
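A sketch of how that test could look, assuming you create your windows via glXChooseVisual (omitting GLX_DOUBLEBUFFER from the attribute list requests single-buffered visuals only):

[highlight=c++]
// No GLX_DOUBLEBUFFER in the list => single-buffered visuals only
static int attribs[] = { GLX_RGBA, GLX_RED_SIZE, 8, GLX_GREEN_SIZE, 8, GLX_BLUE_SIZE, 8, None };
XVisualInfo* vi = glXChooseVisual(display_id, DefaultScreen(display_id), attribs);
// ... create the window and context with vi as before ...

// Per frame, per window: draw as usual, then
glFlush(); // no *SwapBuffers; with single buffering the rendering goes straight to the window
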

Re *SwapBuffers cost, you might see if your GL driver has a setting for the max number of queued frames. On NVidia drivers, it’s called “Maximum pre-rendered frames” (not exactly the clearest name IMO). It regulates the number of frames that your GL driver will queue commands for ahead of the frame that’s actually being rendered. In other words, how far ahead of the GPU the CPU is allowed to get (how “deep” the command queue is in frames). IIRC, once you get to this limit, the GL driver starts blocking in SwapBuffers. You can try adjusting this to see if it makes any difference in how often you block in swap or the min/max swap time consumption. But with you rendering to each window much slower than the VSync rate (40-50 ms vs. 16.6 ms), this may not make much difference. It could also be (and I’d almost bet on it) that rebinding GL contexts so frequently will totally thwart this setting, as part of unbinding a context is catching up on all the queued drawing. The way you’re using GL is just not intended to be a fast path. Binding contexts is supposed to be rare.

  • There are lots of SWAP calls … faster than the refresh rate, but …

As far as queuing of SwapBuffers in your draw thread, that’s only going to be true in general if you have VSync disabled, or you are using something like Fast Sync, or you are submitting frames slower than the VSync rate (which you are).

As far as the actual execution of SwapBuffers on the GPU/GPU driver, that’s only going to be true if you have VSync disabled.

Note that both assume you have double-buffering enabled, because SwapBuffers is a no-op if you don’t.

I would seriously consider trying the single-thread, single-context, single-window, multi-viewport option so you can eliminate your GL context switches. That’s a use case that is likely to net you better performance without much work.

If you really do need separate window-manager-managed top-level windows, you could also look into having your GL draw thread render or copy your video frames into buffers/resources available outside of OpenGL (e.g. GLXPixmaps). Then you can just use Xlib to blit the contents of the GLXPixmap to the correct X window, potentially avoiding the need to do any GL context switches to a different window. See the forum archives for details. A single process can easily draw to multiple X windows. So your problem becomes just how to do those draws/blits from a single thread in the cheapest manner possible. The advantage of this method is you don’t need any additional libraries or tools (you’re already linking with GL and Xlib), nor do you need multiple threads.
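As a very rough sketch of just the final blit step (assuming a GLXPixmap backed by an X pixmap x_pixmap has already been created and rendered into, and gc is an X graphics context for the target window):

[highlight=c++]
XCopyArea(display_id, x_pixmap, window_id, gc,
          0, 0, w, h,   // source rectangle in the pixmap
          0, 0);        // destination position in the window
XFlush(display_id);
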

Also, since you’re dealing with rendering video frames on-screen, and you’re on Linux, have you surveyed the Xv extension (sometimes called the XVideo extension)? It provides GPU-accelerated video playback, resizing, and I think color space conversion. There’s also XvMC (XVideo-Motion Compensation). Other related APIs: VDPAU, VAAPI, and Crystal HD. If you have a specific GPU/GPU vendor in mind, there are probably other options as well.

What’s the source of your YUV video frames? Is it MPEG-encoded video loaded off disk or received over the network? There are fast decode-and-playback libs for that too.

Good news!

Mystery solved! :slight_smile: It was, after all, about glXSwapInterval.

Calling glXSwapInterval(0) did the trick. The quirk here is that it must be called before creating the X windows (and in the same context).

My problem was that …

  1. The target X windows were created from a separate Qt thread.
  2. After that, I start my own thread that controls OpenGL (with a separate context)
  3. I call glXSwapInterval(0)
  4. I take the window ids from the Qt-created windows and pass them to my library … and render to those windows
    => glXSwapInterval(0) has no effect on the windows

The problem disappears if I …

  1. Start my own thread and context
  2. Call glXSwapInterval(0)
  3. Create the X windows using my own thread / context
  4. Render to “my own” X windows

At the moment I’m rendering four full-HD videos, each at 25 fps, on the screen … and the Linux box doesn’t even flinch. :slight_smile:

Thanks for the help… it would have been impossible to have my “eureka” moment without this discussion.

I’m not sure if this breaks the rules of this forum, but I must advertise the project a bit… :wink:

https://github.com/elsampsa/valkka-core

Regards,

Sampsa

As a side note, I think

export vblank_mode=0

does the same thing. One can test by first running

glxgears

and observing the framerate that’s written to the terminal, and then repeating the same test with

export vblank_mode=0
glxgears

[QUOTE=sampsa;1290302]Good news!

Mystery solved! :slight_smile: It was, after all, about glXSwapInterval. Calling glXSwapInterval(0) did the trick.[/QUOTE]

Good. I’m glad you got it figured out. Sounds like you were just waiting on VSync sometimes.

You’ll see tearing now on your video windows. If that becomes a problem, you now have several options to fix that.

The quirk here is that it must be called before creating the X windows (and in the same context).

I hear what you’re saying, but it’s supposed to be dynamically changeable. What GPU and GL driver are you testing on, anyway?

[QUOTE=sampsa;1290307]As a side note, I think export vblank_mode=0

Does the same thing…
[/QUOTE]

That doesn’t work for all GL drivers (though it does for Mesa3D). For NVidia GL drivers, you can use:
export __GL_SYNC_TO_VBLANK=0

Or you can use nvidia-settings to force this from their GUI (or from the command-line).

Here’s a cheat sheet for these two and other GL drivers: Disable vertical sync for glxgears (stackoverflow).

Best bet for cross-platform, just use glXSwapIntervalEXT().
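For instance, a sketch of calling it via the extension mechanism (assuming the GLX_EXT_swap_control extension is exposed by the driver; check the extension string in real code):

[highlight=c++]
typedef void (*SwapIntervalEXTProc)(Display*, GLXDrawable, int);

SwapIntervalEXTProc swapIntervalEXT = (SwapIntervalEXTProc)
    glXGetProcAddressARB((const GLubyte*)"glXSwapIntervalEXT");

if (swapIntervalEXT) {
    swapIntervalEXT(display_id, window_id, 0); // 0 = no VSync, 1 = sync to vertical blank
}
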