Does a Semaphore Signal Operation Completely Drain a Queue?

The specification says the following in chapter 7.4.1, Semaphore Signaling:

The first synchronization scope includes every command submitted in the same batch. In the case of vkQueueSubmit2, the first synchronization scope is limited to the pipeline stage specified by VkSemaphoreSubmitInfo::stageMask. Semaphore signal operations that are defined by vkQueueSubmit or vkQueueSubmit2 additionally include all commands that occur earlier in submission order. Semaphore signal operations that are defined by vkQueueSubmit or vkQueueBindSparse additionally include in the first synchronization scope any semaphore and fence signal operations that occur earlier in signal operation order.

Paraphrased, I read this as follows:

  • The whole batch must finish execution before the semaphore is signaled (okay, that’s totally expected).
  • Every command that has ever been submitted to the same queue earlier (i.e., including every command not included in the current batch) must finish execution before the semaphore is signaled.
  • Every semaphore signal or fence signal (even if it refers to totally different semaphores/fences than used in the current batch) must have been signaled before any semaphore of the current batch is signaled.

Did I get this right?
If I did, that means that semaphores employ very heavy synchronization, because a semaphore is not allowed to be signaled if there are still any previous commands executing (from previous submissions, even if they are totally unrelated).

If I got it right, I guess the only way to support more parallelization is to use multiple queues, right?

It is also stated in chapter 3.2.1. Queue Operation:

Before a fence or semaphore is signaled, it is guaranteed that any previously submitted queue operations have completed execution, and that memory writes from those queue operations are available to future queue operations. Waiting on a signaled semaphore or fence guarantees that previous writes that are available are also visible to subsequent commands.

So, I guess I got it right.

The requirement that each and every previously submitted command must have completed execution and that there cannot be parallelization of work w.r.t. a totally unrelated set of commands seemed just a bit too heavy to me. But looks like this is how it is.

vkQueueSubmit2 allows semaphore signaling to be scoped with a stage mask, but otherwise, yes.

That doesn’t make sense. I mean yes, they are “heavy” in the sense that they don’t specify a stage mask (though again, vkQueueSubmit2 allows such a thing). But events and pipeline barriers also wait for the execution of commands from previous submissions to reach their stage mask too.

I mean yes, that’s correct. But remember: semaphore signal operations are meant specifically for inter-queue synchronization: synchronization between multiple queues or between queues and external stuff (like the display engine). Indeed, a batch cannot wait on a semaphore signaled on the same queue at all (unless it’s a timeline semaphore), since you’re not allowed to submit a batch unless the batch that signals the semaphore was already submitted.

So it’s not clear what the concern here is.

What I mean is the following:

Imagine, we have submitted a COMPUTE shader command which takes forever to compute.
Afterwards we submit a small and totally unrelated COPY command to the same queue and we signal a semaphore upon completion, and we assume that COPY completes earlier than COMPUTE.

I.e., with vkQueueSubmit, the signal after COPY waits for COMPUTE to finish execution before signaling, right?

And with vkQueueSubmit2, the signal after COPY can be performed before COMPUTE has finished, because vkQueueSubmit2 allows limiting the first synchronization scope to the COPY stage, is that right?

So it’s not clear what the concern here is.

The concern here is whether I understand the specification correctly.

Hmm, on second thought, I think I described the difference between vkQueueSubmit and vkQueueSubmit2 incorrectly. Because I interpret this part of the specification:

The first synchronization scope includes every command submitted in the same batch. In the case of vkQueueSubmit2, the first synchronization scope is limited to the pipeline stage specified by VkSemaphoreSubmitInfo::stageMask.

to mean that vkQueueSubmit2 only limits the first synchronization scope of the batch submitted.

But the next sentence in the specification:

Semaphore signal operations that are defined by vkQueueSubmit or vkQueueSubmit2 additionally include all commands that occur earlier in submission order.

would indicate that in my previous post’s example, both vkQueueSubmit and vkQueueSubmit2 require COMPUTE to finish before the COPY batch can signal, even if the COPY batch finishes way earlier than the COMPUTE batch.
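
To make this reading concrete, here is a minimal toy model in Python. It is not Vulkan code and does not model a real driver; it only encodes the interpretation discussed here, namely that the signal's first synchronization scope covers the signaling batch plus everything earlier in submission order on the same queue. The names `Batch`, `finish_time`, and `earliest_signal_time` are hypothetical, the finish times are made up, and stageMask scoping is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass
class Batch:
    name: str
    finish_time: float  # made-up time at which the batch's commands complete


def earliest_signal_time(queue, index):
    """Earliest time the semaphore signaled by queue[index] may fire.

    Assumption: the first synchronization scope covers the signaling batch
    and every batch earlier in submission order on the same queue.
    """
    covered = queue[: index + 1]  # later submissions are not in scope
    return max(batch.finish_time for batch in covered)


# Batches listed in submission order on a single queue.
queue = [
    Batch("COMPUTE", finish_time=100.0),
    Batch("COPY", finish_time=5.0),     # this batch signals the semaphore
    Batch("LATER", finish_time=500.0),  # submitted afterwards
]

copy_signal = earliest_signal_time(queue, 1)
print(copy_signal)
```

In this model the COPY batch's semaphore cannot signal before COMPUTE has finished, even though COPY itself is done much earlier, while the batch submitted afterwards has no influence on it.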

So, if this ^ is true, then this is what I meant by “heavy”. No concern, actually. Just wondering why it has to be so heavy to also include all the previous commands.

If I got it right, I guess the only way to support more parallelization is to use multiple queues, right?

What more parallelization would you even want? You have only one GPU. It cannot compute two GPUs worth of work in parallel. At most you get concurrency, not parallelism.

I’m just trying to get the specification totally right.

I would not assume that.

While queues are in fact allowed to execute commands in an arbitrary order (sans explicit synchronization), this should not be used to assume that a GPU queue is some kind of magical box that will read far ahead in a queue to find all of the commands that can execute out-of-order and then do so.

That’s not to say that two fundamentally different kinds of commands will never execute out of order. My point is that, outside of profiling on specific GPUs, you shouldn’t stake your application’s performance on the GPU doing substantial reordering of operations on a single queue.

If you have a transfer operation that is completely independent of a compute operation, you should probably avoid putting them on the same queue unless you have a specific reason to do so (the device only having one queue, not wanting to do memory queue ownership stuff, etc).

Yes, but what does that matter? A batch of work signals a semaphore if someone else is going to wait on that semaphore, yes? So is the code waiting on that semaphore waiting for the compute operation or the transfer operation? If these truly are independent operations, why do you force the waiting code to wait for both of them to complete (which is what you did by putting it on the same queue)?
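
As a rough illustration of the difference between sharing a queue and splitting the work, here is a toy Python model, not Vulkan code. It assumes that a semaphore signal only covers batches earlier in submission order on its own queue, that the two queues really do execute concurrently, and that there are no cross-queue waits. The queue names, the finish times, and `earliest_signal_time` are all hypothetical.

```python
def earliest_signal_time(queues, queue_name, index):
    """Earliest time the semaphore signaled by batch `index` on `queue_name` may fire.

    Assumption: only the signaling batch and the batches submitted earlier
    to the same queue are in the signal's first synchronization scope.
    """
    covered = queues[queue_name][: index + 1]
    return max(finish for _name, finish in covered)


# Each queue is a list of (batch name, made-up finish time) in submission order.
same_queue = {"q0": [("COMPUTE", 100.0), ("COPY", 5.0)]}
split_queues = {
    "compute_q": [("COMPUTE", 100.0)],
    "transfer_q": [("COPY", 5.0)],
}

shared = earliest_signal_time(same_queue, "q0", 1)
split = earliest_signal_time(split_queues, "transfer_q", 0)
print(shared, split)
```

Under these assumptions, whoever waits on the COPY semaphore waits for COMPUTE as well when both share a queue, and only for COPY when the transfer has its own queue. Whether a real device behaves this way still has to be profiled.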

Right. Well, as said above: a semaphore is a more brutal barrier, and a barrier already covers everything in submission order. So yeah.

It does not exactly drain a queue completely. One has to carefully read the spec’s definition of “submission order” to see what is covered and what isn’t. Largely, things that have aspects that might be external to a queue might not be covered, such as swapchain and sparse bind.

Though it is largely what “queue” and “semaphore” mean semantically. Imagine you submit a semaphore into the operation stream (i.e. queue). How would you even specify to the semaphore “cover this, but don’t cover that, but I want to cover this but maybe not this”? No. The concept of a semaphore has been pretty much the same since Dijkstra. Basically at some point in the queue a SEMAPHORE_RELEASE semaphore1 op will be executed, and that’s it.

If one wanted point-to-point synchronization of sorts, there are Events. But they are no silver bullet, and IMO one has to be extra careful not to hurt oneself with them (and measure). They are light in the sense that they are more specific about what they synchronize, but they are heavy in the sense that they might have larger overhead than Pipeline Barriers.

Yeah, I totally agree. And thanks for your explanation.
My intention with this thread is just to understand the specification correctly.

So is the code waiting on that semaphore waiting for the compute operation or the transfer operation?

On the transfer operation (more precisely: on the COPY)!

(Sorry, I messed the description up at one point above. It is corrected now. But I think we have clarified with this post anyways.)

If these truly are independent operations, why do you force the waiting code to wait for both of them to complete (which is what you did by putting it on the same queue)?

Yeah, that’s exactly my question: Do I force the waiting code to wait for both of them by putting them on the same queue? Even if I specify that the first synchronization scope is limited to the COPY stage (using vkQueueSubmit2)?
(And assuming that COMPUTE takes longer and furthermore assuming that the GPU/driver schedules the commands so that they are performed concurrently.)

Yeah, thanks for pointing this out – I wasn’t totally precise with my question title.

But I think, a semaphore signal operation drains everything that has been submitted to the queue before the batch with the semaphore signal operation was submitted (but not the batches that were submitted afterwards, and might already have started executing).

Do I force the waiting code to wait for both of them by putting them on the same queue?

Yes, as stated by submission order. Every previous vkQueueSubmit command is covered by submission order. So if a synchronization primitive specifies that submission order is part of its synchronization scope, then all those vkQueueSubmit calls are included.

There is some leeway under the as-if principle. But in Vulkan, driver implementations are disincentivized from being too smart.

But I think, a semaphore signal operation drains everything that has been submitted to the queue before the batch with the semaphore signal operation was submitted (but not the batches that were submitted afterwards, and might already have started executing).

Not only that. Note a suspicious lack of vkAcquireNextImageKHR and vkQueueBindSparse in the specification of submission order.

There is a mundane reason for it. The swapchain might need to invoke system calls that have nothing to do with the queue. And similarly, sparse binding is basically virtual memory, which equally might need OS calls that have no concept of Vulkan’s queue at all.

So, a semaphore signal might not really include everything that was in the queue before.

A semaphore signal of vkQueueBindSparse instead invokes a separate concept, signal operation order. So a vkQueueBindSparse semaphore signal implicitly covers any semaphore signal previously submitted via vkQueueSubmit, but not batches without a semaphore, nor semaphore signals of previous vkQueueBindSparse calls.
