Multi Threaded Rendering - how do you actually use it?

Apologies in advance if I am missing something totally obvious here, but …

I don’t get exactly “the advantage/how you actually take advantage of multi threading”.

Ok first of all “does it make any sense on mobile architecture ?” I mean “does it make any sense/difference if the CPU is NOT truly multi-core” ?

I mean I understand that instead of having ONE thread generating a very complex command buffer or N command buffers full of things, you could have N threads each one generating its own CMD buffer and then in the end have “a submit to the queue of the N cmds”.

In this case, while one thread is “preparing” a CMD buffer the others can do the same, in theory cutting the time to prepare all this by, let’s say, a factor of N.

But … if the CPU is single core … I suppose “you are actually faking it” or make it even worse because you have the time to run each “thread” + time to switch between a thread and another that I think “generally is not so small”.

Also overall if I understand “you are not actually cutting rendering time at all”, you are mostly cutting “setup time” ( i.e. time to prepare all the rendering ), again assuming you truly are on a multi-core CPU ?

I am working on some stuff for Oculus Quest, I don’t think it has a multi-core CPU so I am asking “can I get any benefit in trying to use multi-threading for rendering” ?

Can it really improve things if, let’s say, instead of having one thread that renders N things I try to have N threads each one creating its own CMD buffer and then finally submit them all at once in another thread ?


Most mobile CPUs are multi-core, so your question seems rather moot. According to Wikipedia, the Oculus Quest uses the Snapdragon 835 SOC, which has 8 cores in it.

… I suppose my brain is just melting trying to study too many things of Vulkan at once …

I am just reading/trying to understand better about multi-threading and the use of secondary command buffers.

I should have googled better … … I can’t even go fishing, the weather is horrendous at the moment and I don’t like fishing …

Sorry I think I am just trying to make a sense of “too many things” …


What I am getting a bit “confused” about is what you actually achieve with multi-threading. I mean “obviously” you can not “speed up the GPU”, in the sense that “once the cmds you submit are in the GPU, it goes as it goes”.

However if I understand correctly one possible scenario is that instead of having one single thread where let’s say you build CMD1, CMD2, CMD3, CMD4 then “submit to queue” ( that would mean the total time spent to arrive at that “submit” is 1+2+3+4 ) you could have 4 threads each one building in parallel one of those CMDx, so the total time to arrive at “submit” would be the MAX(1,2,3,4), which should be, hopefully, less than before.
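As a rough illustration of the 1+2+3+4 versus MAX(1,2,3,4) idea, here is a minimal C++ sketch. `CommandBuffer`, `record` and `recordInParallel` are hypothetical stand-ins and not Vulkan API; the strings only mimic recorded commands. In real Vulkan each recording thread would also need its own VkCommandPool, because a command pool must not be used from several threads at once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a VkCommandBuffer: just a list of recorded commands.
struct CommandBuffer {
    std::vector<std::string> commands;
};

// Hypothetical stand-in for the CPU work of recording one command buffer
// (in Vulkan: vkBeginCommandBuffer, binds, draws, vkEndCommandBuffer).
void record(CommandBuffer& cb, std::size_t index, std::size_t drawCount) {
    assert(cb.commands.empty());  // each buffer is recorded exactly once
    for (std::size_t d = 0; d < drawCount; ++d) {
        cb.commands.push_back("cmd" + std::to_string(index) +
                              ":draw" + std::to_string(d));
    }
}

// Records `count` command buffers, one thread per buffer, and returns them
// in submission order. Every thread writes only to its own buffer, so no
// locking is needed while recording.
std::vector<CommandBuffer> recordInParallel(std::size_t count,
                                            std::size_t drawsPerBuffer) {
    std::vector<CommandBuffer> buffers(count);
    std::vector<std::thread> workers;
    workers.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        workers.emplace_back(record, std::ref(buffers[i]), i, drawsPerBuffer);
    }
    // The submitting thread has to wait for the slowest recorder: MAX(1,2,3,4).
    for (std::thread& t : workers) {
        t.join();
    }
    return buffers;  // a single "submit" of all buffers would happen here
}
```

Calling `recordInParallel(4, 3)` gives four buffers in a fixed order regardless of which thread finished first, which is what a single queue submission would need.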

It seems to me that in both cases part of the rendering “has to wait” for completion of 1,2,3,4; the difference is that with multi-threading it’ll have to wait less time.

Unless of course you are “even more clever” and able to do something else with the GPU while the CMDx are being formed.

The thing that also confuses me a bit is “where is the time actually spent while you build up a command buffer ?” I mean all those vkXxx functions you call before a queue.submit, are they going to “spend more time in the driver than in the GPU itself ?” I mean, is it more CPU intensive than GPU intensive ?

Because if it’s more CPU usage than GPU “that makes more sense” of why multi-threading works; if more time were spent inside the GPU while you build a CMD buffer then “it would make less sense”, I think.

I am still learning/experimenting, I see “documentation” around that seems to be “adamant” on the fact that “in Vulkan you should always use multi-threading if you can” otherwise “there’s not much gain”.

I’ve seen “demo code” doing stuff like what I described above by literally creating 4 threads, launching them, making them build the CMDx, wait for all of them to finish/exit, submit the thing to the queue ( multithreaded_command_buffers.cpp ).

But I think “that’s only good for a demo”, I suppose in a real situation you want those threads always to remain in some running state and synchronize them by some other means, otherwise I suppose a lot of time would be wasted in launching/stopping threads.
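For what it is worth, the usual way to keep the threads alive between frames is a small worker pool. The following is only a sketch under that assumption: `WorkerPool`, `enqueue` and `waitIdle` are hypothetical names, and a real engine would normally use a more elaborate job system.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical pool: threads are created once and reused every frame.
class WorkerPool {
public:
    explicit WorkerPool(std::size_t threadCount) {
        for (std::size_t i = 0; i < threadCount; ++i) {
            workers_.emplace_back([this] { run(); });
        }
    }

    ~WorkerPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (std::thread& w : workers_) w.join();
    }

    // Hands one job (e.g. "record command buffer i") to the pool.
    void enqueue(std::function<void()> job) {
        assert(job);  // an empty std::function would throw when called
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
            ++pending_;
        }
        wake_.notify_one();
    }

    // Blocks until every job enqueued so far has finished.
    void waitIdle() {
        std::unique_lock<std::mutex> lock(mutex_);
        idle_.wait(lock, [this] { return pending_ == 0; });
    }

private:
    void run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // run outside the lock so other workers can proceed
            {
                std::lock_guard<std::mutex> lock(mutex_);
                --pending_;
            }
            idle_.notify_all();
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;  // signalled when a job arrives or on shutdown
    std::condition_variable idle_;  // signalled when a job completes
    std::queue<std::function<void()>> jobs_;
    std::size_t pending_ = 0;       // jobs queued or currently running
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};
```

Each frame would then enqueue one recording job per command buffer, call `waitIdle()`, and submit; the thread creation cost is paid once at startup instead of every frame.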

But yeah what is/was puzzling me is fundamentally “where you are actually going to save/improve time by using multi-threading ?” and I suppose one possible answer is “exactly in building those CMDs” ?


First, “please” stop “putting random phrases” in quotation “marks”. It makes it “really difficult” to “read” your posts. If you need to emphasize a statement, you have a multitude of formatting options to choose from (though you should also remember that the more text you emphasize, the less emphasis each use has). If you aren’t quoting someone other than yourself, you shouldn’t be using quotation marks.

What do you mean by “part of the rendering”? The act of submission of CBs to the queue does require that all threads have generated their CBs, obviously. So whichever thread is responsible for queue submission must hypothetically wait for the other threads to finish with their CBs.

However, waiting does not mean that nothing can be done during that time.

If thread 1 is responsible for submitting the CBs, and some other thread isn’t finished generating CBs yet, thread 1 can instead spend some time working on some other task. This requires developing a task-based system that can pass around small, locked-off tasks which can be executed on whichever thread has some time to spare.

You wouldn’t want to use pre-emptive multitasking here; you’d instead be relying on cooperative multitasking. At the end of the small task, the system checks to see if the other CBs are done, and if so, it switches back to the submission task. If it still needs to wait, then it executes another small task.

And threads 2-4, having finished with rendering work, can also be pulling tasks from the task manager.
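A minimal sketch of that "do small tasks while waiting" loop might look like the following. Everything here is an assumption for illustration: `TaskQueue`, `tryPop` and `helpUntilRecorded` are hypothetical names, and `remaining` stands for a counter that each recording thread decrements when its CB is finished.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical shared queue of small, self-contained tasks.
class TaskQueue {
public:
    void push(std::function<void()> task) {
        assert(task);  // empty tasks are a programming error
        std::lock_guard<std::mutex> lock(mutex_);
        tasks_.push(std::move(task));
    }

    // Moves the next task into `out` and returns true, or returns false
    // if the queue is currently empty. Never blocks waiting for work.
    bool tryPop(std::function<void()>& out) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (tasks_.empty()) return false;
        out = std::move(tasks_.front());
        tasks_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::queue<std::function<void()>> tasks_;
};

// Runs on the thread responsible for submission. While some CBs are still
// being recorded (`remaining` != 0), it executes one small task at a time
// and re-checks between tasks. Returns how many side tasks it ran.
std::size_t helpUntilRecorded(const std::atomic<std::size_t>& remaining,
                              TaskQueue& queue) {
    std::size_t tasksRun = 0;
    while (remaining.load(std::memory_order_acquire) != 0) {
        std::function<void()> task;
        if (queue.tryPop(task)) {
            task();
            ++tasksRun;
        } else {
            std::this_thread::yield();  // nothing to do; let other threads run
        }
    }
    return tasksRun;  // the caller would now submit the finished CBs
}
```

The tasks need to be short, since the submission is delayed by at most the length of whichever task is running when the last CB completes. Recording threads would signal completion with something like `remaining.fetch_sub(1, std::memory_order_release)`.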

That’s not “more clever”; that’s just how rendering is supposed to work.

The general expectation is that while you’re building the CBs and GPU data for generating frame 12, the GPU is busy rendering frame 11. You never have the CPU wait for the GPU to finish the previous frame. Not unless you’re doing a full teardown of the Vulkan device (because you’re closing your application or lost the device or whatever).
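One common way to arrange this is to keep a small ring of per-frame resources, so the CPU only has to wait when it wants to reuse a slot whose frame the GPU has not finished yet. The sketch below models just the bookkeeping; `FrameRing` and its members are hypothetical names, and the boolean per slot stands in for what would be a VkFence (checked with vkGetFenceStatus or waited on with vkWaitForFences) in real code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for N frames in flight.
class FrameRing {
public:
    explicit FrameRing(std::size_t framesInFlight)
        : inFlight_(framesInFlight, false) {
        assert(framesInFlight > 0);
    }

    // Frames reuse slots round-robin: frame 12 shares a slot with frame 10
    // when two frames are in flight.
    std::size_t slotFor(std::size_t frame) const {
        return frame % inFlight_.size();
    }

    // True if the CPU may start overwriting this frame's slot right now,
    // i.e. the GPU is done with the older frame that used the same slot.
    bool canRecord(std::size_t frame) const {
        return !inFlight_[slotFor(frame)];
    }

    // The CPU has submitted the frame; its slot now belongs to the GPU.
    void submit(std::size_t frame) {
        assert(canRecord(frame));
        inFlight_[slotFor(frame)] = true;
    }

    // The GPU has signalled completion of the frame (fence signalled).
    void gpuFinished(std::size_t frame) {
        inFlight_[slotFor(frame)] = false;
    }

private:
    std::vector<bool> inFlight_;  // one flag per slot, like one fence per slot
};
```

With two slots, the CPU can record frame 1 while the GPU is still rendering frame 0, and it only has to stall if it gets as far as frame 2 before frame 0 has completed.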

This is one of the advantages of Vulkan compared to something like OpenGL and the like. In OpenGL, if you look at pretty much every tutorial out there which tries to update data on the GPU, unless it’s a tutorial specifically about streaming data efficiently, it will just call functions like glBufferSubData or whatever without regard to anything else. If you attempt to overwrite data that’s being read from in the current frame, the function will force a full GPU/CPU synchronization operation without you being aware of it. It’s perfectly valid OpenGL code, but if you want performance, you have to know to avoid doing those things.

By contrast, if you tried the equivalent in Vulkan, you would encounter undefined behavior. It isn’t valid for the CPU to upload to a buffer that the GPU has not finished reading. If you want to do a CPU/GPU sync, you have to actually do it. And thus, you know exactly what you’re getting into.

Just because you submit work to the GPU doesn’t mean the GPU is finished with it when vkQueueSubmit returns.

That’s broadly true, but not entirely.

Vulkan is more low-level, so you get to do things like play around with memory storage directly. This lets you do streaming data gymnastics much more easily in Vulkan than in a more abstract API like OpenGL. Vulkan lets you play around with the bytes of memory more readily, so if you have a program that just needs to write some pixels into memory and blast them to the screen, Vulkan is really good at that.

But overall, the primary reason Vulkan et al. exist is for threading. Those other things were just problems that could be solved along the way.

The thing you may not know is that GPUs have gotten faster faster than CPUs have gotten faster. The time it took to feed GPUs was regularly exceeding the time the GPUs took to digest the food. This meant that your shiny new GPU was probably spending most of its time with its mouth open waiting for food.

Coupled with that was that single threads of CPUs weren’t getting faster by all that much. CPUs got “faster” by having more threads, not by making a single thread that much faster. OpenGL (and D3D pre-12) were pretty rigidly single-threaded. You could offload some basic memory tasks to other threads, but they had no effective way of having multiple threads build rendering operations effectively.

Vulkan restores the balance by allowing CPUs to more efficiently feed GPUs.
