I would like to do onscreen and offscreen rendering in parallel. The onscreen job is relatively fast (e.g. 3 ms / frame). The offscreen job is relatively slow (e.g. 100 ms / frame). Both happen periodically, but only the onscreen job is strictly time critical (locked to refresh rate). The offscreen job is updating data for the onscreen job, but it doesn’t matter if it takes a little bit more or less time to get it done.
Continuing with example values:
With a 144 Hz screen the frame period is 6.94 ms. That leaves an excess of 3.94 ms of GPU time per onscreen frame.
Is it possible to utilize that excess GPU time to continue the offscreen rendering job in the background without disrupting the onscreen rendering (i.e. time sharing)? I’d also be fine with any kind of fixed computing resource partitioning.
A CPU world analog would be doing the work in realtime priority foreground and low priority background threads without caring if it’s actually getting done on two separate cores or on one context switching core.
All I can think of is using a frame queue. Unfortunately I need to avoid extra latency. Also on average I would need to store 100 / 6.94 = 14.4 frames. At 1080p that’s 117 MB memory wasted.
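For reference, the numbers above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are made up for illustration, and the 4 bytes per pixel (RGBA8) is an assumption, since the post does not say which pixel format the queue would hold.

```python
def frame_period_ms(refresh_hz):
    """Length of one display frame in milliseconds."""
    return 1000.0 / refresh_hz


def spare_ms_per_frame(refresh_hz, onscreen_ms):
    """GPU time left in each frame once the onscreen job is done."""
    return frame_period_ms(refresh_hz) - onscreen_ms


def frames_per_offscreen_job(refresh_hz, offscreen_ms):
    """How many display frames pass while one offscreen job runs."""
    return offscreen_ms / frame_period_ms(refresh_hz)


def queue_bytes(frames, width=1920, height=1080, bytes_per_pixel=4):
    """Memory for a frame queue; 4 bytes per pixel is an assumption."""
    return frames * width * height * bytes_per_pixel


if __name__ == "__main__":
    hz, onscreen, offscreen = 144.0, 3.0, 100.0
    frames = frames_per_offscreen_job(hz, offscreen)
    print("frame period  : %.2f ms" % frame_period_ms(hz))
    print("spare per frame: %.2f ms" % spare_ms_per_frame(hz, onscreen))
    print("frames queued : %.1f" % frames)
    print("queue size    : %.0f MB" % (queue_bytes(frames) / 1e6))
```

With the example values this gives about 3.94 ms of spare time per frame, 14.4 queued frames, and roughly 119 MB at 4 bytes per pixel, which is in the same ballpark as the estimate above.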
OpenGL’s internal command queue does not feature preemptive multithreading. You could create two contexts and try to shove commands at both simultaneously, but that will only end up with the driver executing one side’s commands first, then the other (and not in a task-switching, priority-based way). Even in multi-GPU contexts, implementations use the two GPUs to execute the same commands (either in alternate-frame rendering with one GPU chewing on the last frame’s data, or in interleaved rendering with both GPUs executing the same commands).
If you have two GPUs, you could use one of the proprietary extensions for creating GPU-specific contexts. Then you’ll be able to have real parallelism. But otherwise, you’ll have to explicitly do task switching yourself.
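To make "do task switching yourself" a bit more concrete, here is a minimal sketch of the scheduling side only, with no GL calls. It assumes the offscreen job can be split into small chunks whose GPU cost is known or estimated ahead of time (for example from earlier timer query measurements); the function name, the safety margin and the chunk costs are all hypothetical. Each frame, the onscreen work is issued first and then as many offscreen chunks as fit into the remaining budget.

```python
def schedule_chunks(chunk_costs_ms, refresh_hz=144.0, onscreen_ms=3.0,
                    safety_ms=0.5):
    """Greedily pack offscreen chunks into each frame's spare GPU time.

    Returns one list of chunk indices per display frame, in issue order.
    The costs are estimates; the GPU gives no guarantee it will match them.
    """
    budget = 1000.0 / refresh_hz - onscreen_ms - safety_ms
    frames = []
    i = 0
    while i < len(chunk_costs_ms):
        if chunk_costs_ms[i] > budget:
            # A chunk this large would make the onscreen job miss its
            # deadline, so it has to be split further.
            raise ValueError("chunk %d does not fit in one frame" % i)
        remaining = budget
        this_frame = []
        while i < len(chunk_costs_ms) and chunk_costs_ms[i] <= remaining:
            remaining -= chunk_costs_ms[i]
            this_frame.append(i)
            i += 1
        frames.append(this_frame)
    return frames


if __name__ == "__main__":
    # A 100 ms offscreen job split into 100 chunks of about 1 ms each.
    plan = schedule_chunks([1.0] * 100)
    print("frames needed:", len(plan))
    print("chunks in first frame:", plan[0])
```

With the example values only three 1 ms chunks fit per frame, so the 100 ms job is spread over 34 frames (about 236 ms of wall-clock time). That is the price of never letting the offscreen work delay the onscreen frame.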
Actually, it’s the massively parallel nature of the GPU that makes it a bad match for doing that.
Consider what you would have to do to task-switch a GPU the way you do a CPU. You have to stop dozens-if-not-hundreds of computational units, figure out exactly where each computational unit is within its own shader code, copy several megabytes worth of local storage out to memory (uniforms and such), copy that shader code to memory, preserve dozens-if-not-hundreds of various fixed-function registers (viewport, blending, texture bindings, etc.), and so forth. Then you have to copy in all of the stuff needed for the other rendering operation, and then start the pipeline up again.
There’s no way that’s going to be fast. Especially since you basically have to shut down all rendering to do that.
OpenCL could handle something like this because all it does is compute. It doesn’t have to use the fixed-function parts of the pipeline. So it’s possible to have multiple compute operations, where you dedicate some percentage of computational resources to individual processes.
That doesn’t mean that something similar would be fundamentally impossible on GPUs. But it would require specialized hardware to handle.
Okay, I guess GPU context switching does sound kind of ridiculous. What about resource partitioning? Say, split all the stream processors, ROPs and texture units but share the memory? Would you say that could be done on existing hardware?
Bindless texturing would take care of the texture state, and ROPs are already scaled up. But you’d still have some problems.
First up is primitive assembly. For point of reference, it was not so long ago (I think the Radeon 6xxx series) that AMD was touting the fact that their hardware has dual primitive assembly units. So even if high-end hardware had 3-4 assembly units, the smallest share of throughput you could dedicate to a secondary task would be 25%. And given the disparity in how much GPU time you want to give your background process, that is probably too much to dedicate to it.
Another problem is the command processor itself. Compute operations are pretty simple to execute: you pick some number of shaders to execute them on, and you fire them off. So you could reasonably write compute operations for different sets of shader resources, since each compute operation is an island.
For rendering commands, you would effectively want two separate contexts, with two separate command queues that deal with two separate sets of global state. It would be very difficult to emulate that with a single command processor.
And then you’d have to deal with the vertex puller. While apparently AMD’s hardware doesn’t have dedicated vertex pulling logic anymore, that’s far from true for everyone’s hardware. Such dedicated hardware would probably not be designed to simultaneously handle two separate rendering operations. It, like the command processor, would be intended to operate single-threaded.
snoukkis, now that you understand the single-GPU situation, I’ll just mention that if you need to do something like this, multiple GPUs work really well. Create separate CPU threads/contexts, one for each GPU, and with a decent vendor GL driver, you get perfect parallelism.
To add to what Dark Photon said about multi-GPU, here are the AMD and NVIDIA extensions that allow you to create contexts that are associated with a specific GPU: WGL_AMD_gpu_association and WGL_NV_gpu_affinity. They both basically do the same job, but they do it in very different ways.
The NVIDIA extension is all about creating a special device context associated with a GPU, using wglCreateAffinityDCNV. These “affinity DCs” are slightly different from a regular HDC, but they can be passed to the standard device context functions. This includes pixel format selection, though the framebuffer bits will generally be ignored. Any HGLRCs created from an affinity DC are affinity contexts, and they can only be made current alongside an affinity DC that uses the same GPU. NVIDIA’s approach based on HDCs allows you to use most of the current WGL infrastructure, changing only how you create HDCs.
The AMD approach is very different. It adds a whole new HGLRC-creation function for creating GPU-associated contexts, which basically bypasses the need for an HDC at all. Of course, this means that you have to use their all-new functions to manage their special HGLRCs.
And to that, I’ll just note that this vendor-specific sauce is only needed on Windows (WGL window system interface). On Linux/Unix (GLX), X11 has long provided methods for creating content on specified displays/screens, which can be associated with separate GPUs. Then just create a GL window on that display/screen, and render away.
A context is neither on-screen nor off-screen. But the GLXDrawable to which it is bound can be either a Window (on-screen) or a GLXPixmap or GLXPbuffer (off-screen).
Pixmaps have more restrictions than pbuffers; they typically don’t work with direct rendering contexts, and a context which was originally bound to a pixmap cannot subsequently be bound to a window or pbuffer, and vice versa.