Does Vulkan parallel rendering rely on multiple queues?

I’m new to Vulkan and not very clear on how parallel rendering works. Here are some questions (the “queue” mentioned below refers specifically to the graphics queue):

  1. Does parallel rendering rely on a device that supports more than one queue?
  2. If the answer to question 1 is yes, what if the physical device only has one queue but Vulkan exposes 4 queues (which is the actual case with my MacBook’s GPU)? Will the rendering in this case really be parallel?
  3. If the answer to question 1 is yes, what if there is only one queue in Vulkan’s abstraction? Does that mean the device definitely cannot render objects in parallel?

P.S. About question 2: when I use the Metal API there is only one queue, but with the Vulkan API there are 4, so I’m not sure it is right to say “the physical device only has one queue”.

Thanks for helping. Let me make my question clearer:

  1. When it comes to the concept of “parallel” here, I mean parallel “tasks” or parallel “workloads”.
  2. Are you suggesting that multiple “tasks” or “workloads” are not really necessary, since once a command buffer is submitted it will likely take all the computation resources, so that parallel tasks actually run one by one?
  3. Are you suggesting that queues are not for parallel tasks, but for making it easier to work with threads?

That doesn’t really clear anything up. What do you mean by a “task” or “workload” in the context of rendering?

Well, I use the word “workload” because I read this:

GPUs have proven extremely useful for highly parallelizable data processing use-cases. The computational paradigms found in machine learning & deep learning, for example, fit extremely well to the processing architecture of graphics cards.

However, when it comes to multiple GPU workloads, one would assume that these would be processed concurrently, but this is not the case. Whilst a single GPU compute workload is parallelized across the numerous GPU cores, multiple workloads are run one by one sequentially. That is of course until recent improvements in graphics card architectures, which are now enabling hardware parallelization across multiple workloads. This can be achieved by submitting the workloads to different underlying physical GPU “queue families” that support concurrency. Practical techniques in machine learning that would benefit from this include model parallelism.

If using this term for rendering causes confusion, I can switch to another example:

If my device has two transfer queues in a transfer family, what is the behaviour when sending two vertex buffers through the two queues separately vs. sending two vertex buffers one by one through the same queue? (“Sending a vertex buffer” is probably not the right term, newbie here…)

Or, if I’m doing an off-screen rendering job that renders an object with different lights into two different images, these two independent rendering jobs are the “tasks” or “workloads” I mean.

That’s different.

If hardware offers a dedicated transfer queue, that usually represents specialized hardware for doing DMA, one that’s typically separate from other DMA channels. So if there’s a transfer queue family, and that family offers 2 queues, it’s reasonable to assume that this represents two different pieces of DMA hardware that won’t significantly compete with each other for transfer resources.
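As a rough sketch of how an application might look for such a family: the code below scans a list of queue families for one that advertises transfer but neither graphics nor compute. `QueueFamily` and `findDedicatedTransferFamily` are hypothetical names; the struct only mirrors the `queueFlags` and `queueCount` fields of `VkQueueFamilyProperties` so the example compiles without the Vulkan SDK. In a real program the list would come from `vkGetPhysicalDeviceQueueFamilyProperties`.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Flag values as defined by Vulkan's VkQueueFlagBits.
constexpr std::uint32_t kGraphicsBit = 0x1; // VK_QUEUE_GRAPHICS_BIT
constexpr std::uint32_t kComputeBit  = 0x2; // VK_QUEUE_COMPUTE_BIT
constexpr std::uint32_t kTransferBit = 0x4; // VK_QUEUE_TRANSFER_BIT

// Hypothetical stand-in for the two VkQueueFamilyProperties fields used here.
struct QueueFamily {
    std::uint32_t queueFlags;
    std::uint32_t queueCount;
};

// Returns the index of the first family that supports transfer but neither
// graphics nor compute, or std::nullopt if the device exposes no such family.
std::optional<std::uint32_t> findDedicatedTransferFamily(
    const std::vector<QueueFamily>& families) {
    for (std::uint32_t i = 0; i < families.size(); ++i) {
        const std::uint32_t flags = families[i].queueFlags;
        const bool hasTransfer = (flags & kTransferBit) != 0;
        const bool hasGraphicsOrCompute = (flags & (kGraphicsBit | kComputeBit)) != 0;
        if (hasTransfer && !hasGraphicsOrCompute && families[i].queueCount > 0) {
            return i;
        }
    }
    return std::nullopt;
}
```

If nothing is found, transfers can still go through a graphics or compute queue, since those families support transfer operations as well.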

When it comes to graphics and compute tasks, they will compete with each other for execution resources. There are only so many compute cores available, and such tasks will use as many as are available.

The advantage of more advanced queueing is primarily about more efficient use of available resources. Compute tasks in particular often have sharp dependencies between sequences of tasks. Consider having compute operations A and B submitted in sequence, such that B depends on A completing.

There will likely be some unused cores between these two operations. If you had some other compute operation C that had no dependency on A or B, you could insert C between them to keep the compute units filled. However, this is a manual process that requires the code which builds the command buffer containing A and B to know about C and submit it appropriately. That can create significant complexity.

Having the GPU deal with that by submitting A/B and C on different queues makes for easier CPU management.
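To make the A/B/C argument concrete, here is a deliberately simplified toy model, not a description of how any real GPU schedules work. It assumes every operation is a bag of unit-length work items, that the device has a fixed number of cores, and that B cannot start until every item of A has finished. All function names are hypothetical.

```cpp
#include <algorithm>

// Time steps needed to run `items` unit-length work items on `cores` cores.
int stepsFor(int items, int cores) {
    return (items + cores - 1) / cores;
}

// A, B and C each run as their own batch, one after another.
int serializedSteps(int a, int b, int c, int cores) {
    return stepsFor(a, cores) + stepsFor(b, cores) + stepsFor(c, cores);
}

// B still waits for all of A, but C's items may fill cores left idle in the
// last step of A and the last step of B. Whatever is left of C runs afterwards.
int overlappedSteps(int a, int b, int c, int cores) {
    const int stepsA = stepsFor(a, cores);
    const int stepsB = stepsFor(b, cores);
    const int idleSlots = (stepsA * cores - a) + (stepsB * cores - b);
    const int leftoverC = std::max(0, c - idleSlots);
    return stepsA + stepsB + stepsFor(leftoverC, cores);
}
```

With 4 cores, 10 items each for A and B, and 4 items for C, the serialized order takes 7 steps in this model, while letting C fill the gaps takes 6. The saving comes only from cores that would otherwise sit idle, which matches the point above that the tasks still compete for the same execution resources.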


So in practice, is it true that one uses multiple transfer queues for data transfer, but only one graphics queue for rendering in most cases?

You probably mean concurrent, not parallel. Your quote also says concurrent. It means you are able to submit to multiple queues asynchronously (i.e. from multiple threads). Nothing more, nothing less.

If my device has two transfer queues in a transfer family, what is the behaviour when sending two vertex buffers through the two queues separately vs. sending two vertex buffers one by one through the same queue? (“Sending a vertex buffer” is probably not the right term, newbie here…)

Implementation dependent. Likely no different. It is doubtful that drivers would penalize you for using only one transfer queue.

Or, if I’m doing an off-screen rendering job that renders an object with different lights into two different images, these two independent rendering jobs are the “tasks” or “workloads” I mean.

They compete for the same underlying resources. On the GPU it is probably no different, or worse. It is implementation dependent, but a sane GPU driver would either serialize the work anyway or pre-empt it, instead of running it in parallel. It might be better on the CPU side though, by using multiple cores to generate the independent command buffers and submitting them independently. That assumes the CPU was your bottleneck in the first place. Meanwhile, you can record command buffers asynchronously even if you are submitting to a single queue; you just need to join the threads before submitting it all.
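A minimal sketch of that last pattern, using stand-in types instead of real Vulkan objects so it compiles on its own: `CommandBuffer`, `recordScene` and `recordInParallel` are hypothetical names. In actual Vulkan code each thread would typically allocate from its own `VkCommandPool`, because command pools must be externally synchronized, and the joined batch would be handed to one `vkQueueSubmit` call.

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in for a recorded command buffer.
struct CommandBuffer {
    std::vector<std::string> commands;
};

// Stand-in for recording the off-screen pass for one light.
CommandBuffer recordScene(int lightIndex) {
    CommandBuffer cb;
    cb.commands.push_back("begin");
    cb.commands.push_back("draw object with light " + std::to_string(lightIndex));
    cb.commands.push_back("end");
    return cb;
}

// Records one command buffer per light on its own thread, joins all threads,
// and returns the batch that would then go into a single queue submission.
std::vector<CommandBuffer> recordInParallel(int lightCount) {
    std::vector<CommandBuffer> batch(lightCount);
    std::vector<std::thread> workers;
    for (int i = 0; i < lightCount; ++i) {
        // Each thread writes only to its own slot, so no locking is needed.
        workers.emplace_back([&batch, i] { batch[i] = recordScene(i); });
    }
    for (auto& worker : workers) {
        worker.join(); // all recording must finish before submission
    }
    return batch;
}
```

Calling `recordInParallel(2)` corresponds to the two-lights example: recording is spread across CPU cores, while the GPU still sees a single submission on one queue.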
