OpenGL device partitioning or time sharing for two parallel workloads?

snoukkis · February 8, 2015, 3:22am

Hi!

I would like to do onscreen and offscreen rendering in parallel. The onscreen job is relatively fast (e.g. 3 ms / frame). The offscreen job is relatively slow (e.g. 100 ms / frame). Both happen periodically, but only the onscreen job is strictly time critical (locked to refresh rate). The offscreen job is updating data for the onscreen job, but it doesn’t matter if it takes a little bit more or less time to get it done.

Continuing with example values:
With a 144 Hz screen the frame period is 6.94 ms. That leaves an excess of 3.94 ms of GPU time per onscreen frame.

Is it possible to utilize that excess GPU time to continue the offscreen rendering job in the background without disrupting the onscreen rendering (i.e. time sharing)? I’d also be fine with any kind of fixed computing resource partitioning.
A CPU world analog would be doing the work in realtime priority foreground and low priority background threads without caring if it’s actually getting done on two separate cores or on one context switching core.

All I can think of is using a frame queue. Unfortunately I need to avoid extra latency. Also on average I would need to store 100 / 6.94 = 14.4 frames. At 1080p that’s 117 MB memory wasted.
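As a rough check of the example numbers in this post, the arithmetic can be sketched as below. The function name `frame_budget` and the 4 bytes per pixel (an RGBA8 frame) are assumptions made for illustration; the post does not state a pixel format.

```python
def frame_budget(refresh_hz, onscreen_ms, offscreen_ms,
                 width, height, bytes_per_pixel=4):
    """Derive the timing and memory figures from the example values.

    bytes_per_pixel=4 assumes RGBA8 frames; this is an assumption,
    not something stated in the post.
    """
    period_ms = 1000.0 / refresh_hz            # time between refreshes
    slack_ms = period_ms - onscreen_ms         # GPU time left per frame
    queued_frames = offscreen_ms / period_ms   # frames shown per offscreen job
    queue_bytes = queued_frames * width * height * bytes_per_pixel
    return {
        "period_ms": period_ms,
        "slack_ms": slack_ms,
        "queued_frames": queued_frames,
        "queue_bytes": queue_bytes,
    }


budget = frame_budget(144, 3.0, 100.0, 1920, 1080)
print(f"period  {budget['period_ms']:.2f} ms")
print(f"slack   {budget['slack_ms']:.2f} ms")
print(f"frames  {budget['queued_frames']:.1f}")
print(f"queue   {budget['queue_bytes'] / 1e6:.1f} MB")
```

Under these assumptions the queue comes out at roughly 119 MB (about 114 MiB), in the same ballpark as the 117 MB quoted above; the exact figure depends on the pixel format and on how a megabyte is counted.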

I can use OpenGL versions up to 4.4.

Alfonse_Reinheart · February 8, 2015, 7:15am

OpenGL’s internal command queue does not feature preemptive multithreading. You could create two contexts and try to shove commands at both simultaneously, but that will only end up with the driver executing one side’s commands first, then the other (and not in a task-switching, priority-based way). Even in multi-GPU contexts, implementations use the two GPUs to execute the same commands (either in alternate-frame rendering with one GPU chewing on the last frame’s data, or in interleaved rendering with both GPUs executing the same commands).

If you have two GPUs, you could use one of the proprietary extensions for creating GPU-specific contexts. Then you’ll be able to have real parallelism. But otherwise, you’ll have to explicitly do task switching yourself.
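To illustrate what doing the task switching yourself could look like, here is a minimal scheduling model, not real rendering code. It assumes the offscreen job can be cut into equal slices (for example, bounded batches of draw calls) whose GPU cost is known in advance; in practice that cost would have to be measured and will vary. The function name `frames_to_finish` and the slice sizes are hypothetical.

```python
import math


def frames_to_finish(period_ms, onscreen_ms, offscreen_ms, slice_ms):
    """Count onscreen frames needed to complete one offscreen job when
    slices of it are only run in the time left over after each
    onscreen frame.

    Assumes every slice costs exactly slice_ms, which real GPU work
    will not guarantee.
    """
    if slice_ms <= 0:
        raise ValueError("slice_ms must be positive")
    slack_ms = period_ms - onscreen_ms
    if slice_ms > slack_ms:
        # A slice that never fits would stall the offscreen job forever.
        raise ValueError("slice does not fit into the per-frame slack")

    slices_left = math.ceil(offscreen_ms / slice_ms)
    frames = 0
    while slices_left > 0:
        frames += 1          # the onscreen frame is rendered first
        used_ms = 0.0
        # Fill the remaining time with as many whole slices as fit.
        while slices_left > 0 and used_ms + slice_ms <= slack_ms:
            used_ms += slice_ms
            slices_left -= 1
    return frames


period = 1000.0 / 144
print(frames_to_finish(period, 3.0, 100.0, 1.0))  # 1 ms slices
print(frames_to_finish(period, 3.0, 100.0, 0.5))  # 0.5 ms slices
```

With the example values from the first post, 1 ms slices fit three per frame and the job takes 34 frames, while 0.5 ms slices fit seven per frame and take 29. Smaller slices waste less of the slack, at the price of more submission overhead, which this model ignores.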

snoukkis · February 8, 2015, 8:31am

Too bad. I would have expected something as massively parallel as a GPU to be a good match for such a capability. I vaguely remember it could be done in OpenCL.

Thank you for the information.

Alfonse_Reinheart · February 8, 2015, 9:10am

Actually, it’s the massively parallel nature of the GPU that makes it a bad match for doing that.

Consider what you would have to do to task-switch a GPU the way you do a CPU. You have to stop dozens-if-not-hundreds of computational units, figure out exactly where each computational unit is within its own shader code, copy several megabytes’ worth of local storage out to memory (uniforms and such), copy that shader code to memory, preserve dozens-if-not-hundreds of various fixed-function registers (viewport, blending, texture bindings, etc.), and so forth. Then you have to copy in all of the stuff needed for the other rendering operation, and start the pipeline up again.

There’s no way that’s going to be fast. Especially since you basically have to shut down all rendering to do that.

OpenCL could handle something like this because all it does is compute. It doesn’t have to use the fixed-function parts of the pipeline. So it’s possible to have multiple compute operations, where you dedicate some percentage of computational resources to individual processes.

That doesn’t mean that something similar would be fundamentally impossible on GPUs. But it would require specialized hardware to handle.

snoukkis · February 8, 2015, 9:30am

Okay, I guess GPU context switching does sound kind of ridiculous. What about resource partitioning? Say, split all the stream processors, ROPs and texture units but share the memory? Would you say that could be done on existing hardware?

Alfonse_Reinheart · February 8, 2015, 10:34am

Bindless texturing would take care of the texture state, and ROPs are already scaled up. But you’d still have some problems.

First up is primitive assembly. For point of reference, it was not so long ago (I think the Radeon 6xxx series) that AMD was touting the fact that their hardware has dual primitive assembly units. So even if high-end hardware had 3-4 assembly units, the smallest share of throughput you could dedicate to a secondary task would be 25%. And given the disparity in how much GPU time you want to give your background process, that probably is too much to dedicate to it.

Another problem is the command processor itself. Compute operations are pretty simple to execute: you pick some number of shaders to execute them on, and you fire them off. So you could reasonably write compute operations for different sets of shader resources, since each compute operation is an island.

For rendering commands, you would effectively want two separate contexts, with two separate command queues that deal with two separate sets of global state. It would be very difficult to emulate that with a single command processor.

And then you’d have to deal with the vertex puller. While apparently AMD’s hardware doesn’t have dedicated vertex pulling logic anymore, that’s far from true for everyone’s hardware. Such dedicated hardware would probably not be designed to simultaneously handle two separate rendering operations. It, like the command processor, would be intended to operate single-threaded.

snoukkis · February 8, 2015, 10:54am

Thank you for explaining it.

Dark_Photon · February 9, 2015, 6:04am

snoukkis, now that you understand the single-GPU situation, I’ll just mention that if you need to do something like this, using multiple GPUs works really well. Create separate CPU threads/contexts, one for each GPU, and with a decent vendor GL driver, you get perfect parallelism.

Alfonse_Reinheart · February 9, 2015, 8:28am

To add to what Dark Photon said about multi-GPU, here are the AMD and NVIDIA extensions that allow you to create contexts that are associated with a specific GPU. They both basically do the same job, but they do them in very different ways.

The NVIDIA extension is all about creating a special device context associated with a GPU, using wglCreateAffinityDCNV. These “affinity DCs” are slightly different from a regular HDC, but they can be passed to the standard device context functions. This includes pixel format selection, though the framebuffer bits will generally be ignored. Any HGLRCs created from an affinity DC are affinity contexts, and they can only be made current alongside an affinity DC that uses the same GPU. NVIDIA’s approach based on HDCs allows you to use most of the current WGL infrastructure, changing only how you create HDCs.

The AMD approach is very different. It adds new functions for creating GPU-associated contexts by adding a whole new HGLRC-creation function. It basically bypasses the need for a HDC at all. Of course, this means that you have to use their all-new functions to manage their special HGLRCs.

Neither extension allows you to render to a window directly (the default framebuffer will be incomplete with GL_FRAMEBUFFER_UNDEFINED). However, you can always render to an FBO, then use a normal window HDC and rendering context to blit the image into the window’s default framebuffer.

Dark_Photon · February 10, 2015, 6:05am

And to that, I’ll just note that this vendor-specific sauce is only needed on Windows (WGL window system interface). On Linux/Unix (GLX), X11 has long provided methods for creating content on specified displays/screens, which can be associated with separate GPUs. Then just create a GL window on that display/screen, and render away.

“Neither extension allows you to render to a window directly (the default framebuffer will be incomplete with GL_FRAMEBUFFER_UNDEFINED). However, you can always render to an FBO, then use a normal window HDC and rendering context to blit the image into the window’s default framebuffer.”

That’s interesting! Didn’t know that.

There’s no such restriction on multi-GPU rendering via Linux/GLX. You can create on-screen GL windows associated with any desired GPU, and then render to them directly, w/ or w/o VSync support.

Alfonse_Reinheart · February 10, 2015, 8:06am

“There’s no such restriction on multi-GPU rendering via Linux/GLX. You can create on-screen GL windows associated with any desired GPU, and then render to them directly, w/ or w/o VSync support.”

Is there some way to create an off-screen GL context via Linux/GLX?

GClements · February 10, 2015, 1:39pm

A context is neither on-screen nor off-screen. But the GLXDrawable to which it is bound can be either a Window (on-screen) or a GLXPixmap or GLXPbuffer (off-screen).

Pixmaps have more restrictions than pbuffers; they typically don’t work with direct rendering contexts, and a context which was originally bound to a pixmap cannot subsequently be bound to a window or pbuffer, and vice versa.