Render pass dependencies and conditional execution

Julian_Edgar · March 27, 2016, 3:48am

If I declare a render pass dependency on a fragment shader stage, and the geometry shader culls all the primitives, will the dependency be culled as well? I need a guarantee that my dependent commands will not execute in that case.

Alfonse_Reinheart · March 27, 2016, 6:08am

It’s not clear what you’re asking about. Vulkan defines no concept called a “render pass dependency”.

If you’re just talking about whether fragment shaders get invoked if the GS doesn’t output primitives, the answer is no. Obviously, since without primitives, the FS has no fragments to act on.

If you’re referring to some kind of explicit execution dependency within a render pass, as detailed in chapter 6.4, that’s something else. But again, it’s not clear what exactly you mean.

Chapter 6.4 outlines the meaning of a dependency between two commands. So what are the two commands that have a dependency? And which stages between them have these dependencies?

Julian_Edgar · March 27, 2016, 7:00am

As outlined in section 7.1, Render Pass Creation, there can be execution and memory dependencies between the subpasses of a render pass. So if I have an execution dependency on a fragment shader stage, but that stage is not executed because the geometry shader stage of that pass culls all primitives and thus bypasses the fragment shader, will the other subpass that depends on the former pass’s fragment shader stage be executed when that fragment shader is never invoked? It seems pretty straightforward: if no fragment shader is processed, then the dependent pass would not execute at all if it were dependent on that stage of execution. But I just need to make sure.

For example, I declare two subpasses and make the latter execution dependent (VkSubpassDependency::srcStageMask = VK_PIPELINE_STAGE_FRAGMENT_SHADER_BIT) on the former’s fragment shader stage. The former subpass would not render fragments because its geometry shader culls all primitives. So would the latter subpass not execute its commands?

Alfonse_Reinheart · March 27, 2016, 7:49am

So if I have an execution dependency on a fragment shader stage, but that stage is not executed because the geometry shader stage of that pass culls all primitives and thus bypasses the fragment shader, will the other subpass that depends on the former pass’s fragment shader stage be executed when that fragment shader is never invoked?

I think you’re really overthinking this whole thing.

When you create an execution dependency between the FS stage before a certain point (whether it’s a subpass dependency or a pipeline barrier) and the FS stage after that point, what you are saying is that all FS stage executions before that point will have completed before the FS stages after that point are allowed to execute.

“All FS stage executions” doesn’t have a number attached to it. It could be 20. 50. 100,000. Or even zero, if no primitives are ever actually rasterized for whatever reason. The dependency simply says that whatever FS invocations happen before this point will certainly have happened in their entirety before allowing FS invocations after this point.
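The "zero is still a valid count" point can be illustrated with a toy model. This is only a conceptual sketch in plain C: the function name is hypothetical, nothing here is Vulkan API, and it does not describe how any driver implements dependencies. It only shows that "all earlier invocations have finished" is trivially true when there were no earlier invocations, so the later work is ordered, never cancelled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: not Vulkan API and not a description of any driver.
 * finished[i] records whether the i-th earlier fragment shader invocation
 * has completed. The later work may start once ALL of them are done. */
bool later_work_may_start(const bool *finished, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (!finished[i]) {
            return false; /* an earlier invocation is still running: wait */
        }
    }
    /* Also reached when count == 0: with nothing to wait for, the
     * dependency is satisfied immediately and the later work still runs. */
    return true;
}
```

Calling later_work_may_start(NULL, 0) returns true, which mirrors the case where the geometry shader culled everything: the dependent subpass has nothing to wait on, but it is not skipped.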

Commands aren’t conditioned on dependencies. That’s not what dependencies are for.

Julian_Edgar · March 27, 2016, 8:01am

So sadly you’re saying that even if executions == 0, my dependency is still processed…? If so, then there must be some way to conditionally execute commands based on runtime metrics. All I have so far is a bunch of indirect draw commands where vertexCount = 0, but I wish to cull all these commands so they don’t need to be evaluated, since I know the case and could sweep them all out with a single condition. So I need some way to conditionally execute my commands, perhaps through an indirect variable… Indirect commands plus some form of conditional command execution, laid out in static form (to avoid rebuilds), and even looping or jumping, would be optimal for Vulkan.

Alfonse_Reinheart · March 27, 2016, 8:37am

All I have so far is a bunch of indirect commands where vertexCount = 0, but I wish to cull all these commands so they don’t need to be evaluated since I know the case and could sweep them all out with a single condition.

I’m going to assume that there’s a good reason why you have “a bunch of indirect commands where vertexCount = 0”. Perhaps you have a fixed buffer size for indirect commands, and you don’t want to do a GPU/CPU sync just to pass an exact draw count to vkCmdDrawIndirect. So you clear the memory to zero before writing the indirect commands to the GPU.
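A minimal sketch of the "clear to zero, then write only the commands you need" pattern described here. It is a CPU-side model for illustration only: DrawCommand is a local mirror of the four uint32_t fields of VkDrawIndirectCommand so the code compiles without the Vulkan headers, and MAX_DRAWS, fill_draws and count_live_draws are hypothetical names. In the scenario from the thread the commands would be written by shaders on the GPU rather than by a loop like this.

```c
#include <stdint.h>
#include <string.h>

/* Local mirror of the field layout of VkDrawIndirectCommand, declared here
 * so that the sketch does not depend on the Vulkan headers. */
typedef struct DrawCommand {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t firstVertex;
    uint32_t firstInstance;
} DrawCommand;

enum { MAX_DRAWS = 8 }; /* hypothetical fixed capacity of the buffer */

/* Zero the whole buffer, then fill in only the draws that are needed.
 * Slots that are never written keep vertexCount == 0 and draw nothing. */
void fill_draws(DrawCommand out[MAX_DRAWS],
                const uint32_t *vertex_counts, uint32_t used)
{
    memset(out, 0, sizeof(DrawCommand) * MAX_DRAWS);
    for (uint32_t i = 0; i < used && i < MAX_DRAWS; ++i) {
        out[i].vertexCount = vertex_counts[i];
        out[i].instanceCount = 1;
    }
}

/* Count the slots that would actually produce work. */
uint32_t count_live_draws(const DrawCommand draws[MAX_DRAWS])
{
    uint32_t live = 0;
    for (uint32_t i = 0; i < MAX_DRAWS; ++i) {
        if (draws[i].vertexCount != 0 && draws[i].instanceCount != 0) {
            ++live;
        }
    }
    return live;
}
```

With three vertex counts written into an eight-slot buffer, the remaining five slots stay zeroed. The full fixed-size buffer can then be submitted every frame without the CPU ever learning how many entries are live.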

Which leads to the question: so what?

Do you have genuine performance tests that show that these zero-sized draw commands are actively damaging to your framerate in any significant way? If not, I really wouldn’t worry about it. You’re already generating your draw commands on the GPU and shoving them into a command in a (presumably static) command buffer.

I’m sure there’s lower-hanging fruit on the performance tree than the cost of zero-sized indirect commands for you to pick.

Julian_Edgar · March 27, 2016, 8:59am

Haha at “lower-hanging fruit”. There is a worst case of 32767 zero-count indirect draw commands, although split up between blits. I just thought that if I could cull my static indirect draw buffers it would be more efficient, obviously. In better cases there are not as many. But I like efficiency. And yes, there is a really good reason… And no CPU! It’s hierarchical, which is why culling could be optimal.

Julian_Edgar · March 27, 2016, 12:00pm

One reason I am so meticulous about this is that I have a barrier between each indirect draw; I have load/barrier/access passes. Each barrier would have no prior writes in these sections when they are logically (not physically, given your synopsis) culled (vertexCount = 0). Still, I think that being able to physically cull whole sections of these contiguous vkCmdDrawIndirect|vkCmdPipelineBarrier pairs would undoubtedly be more efficient if they could be conditioned, as there can be large groups (32767 at most) of these command couplets. They are hierarchically deduced, which is why they could be culled efficiently, but there is no command conditioning in Vulkan.

These command sections are conditioned by my shaders, so I need runtime conditioning to direct iteration of my static command buffers: their branching draw cases are static and fixed in memory (prerecorded), but their branching draw logic at runtime is decided by the shaders. It looks like I am forced to implement a complete iteration no matter what, resulting in a worst-case per-frame walk of the buffers and barriers regardless of runtime logic. It seems you imply that even though I issue my commands there is no (noticeable) “real” overhead penalty where they have vertexCount = 0, but it was the barrier that I was most worried about… Is it low enough overhead to call with no previous writes? (or do you think I am overthinking this all again…? )

I will mostly have “real” work in these compute slices, but without culling of the subpasses I am still forced to call them at least 32767 times per walk. I haven’t tested any of this yet; I am still collecting my apples. My algorithm changes many aspects of graphics preparations and is optimal (derived from ample study) in its own but it requires this conditional command rendering to be most efficient (in practice) resulting in these culls which can be large groups of iterations at times. I have avoided many processes because of it, so I am clearly going to stick with it. Even though that may sound like a lot of draw calls and barriers, it is actually minimal given its theoretical throughput and its dismissal of companion processes generally required for graphics. It is an algorithm I couldn’t achieve with classical GL due to its limited STA feature and lack of device-side synchronization and execution primitives, which Vulkan provides.

These 32767 iterations are also split across about 16 framebuffer blits. So I am assuming you would say just go for it, as GPUs are monsters anyway. But I want the highest framerate possible, and Vulkan could really use extensions for conditional/indirect command execution.

Alfonse_Reinheart · March 27, 2016, 6:45pm

It seems you imply that even though I issue my commands there is no (noticeable) “real” overhead penalty where they have vertexCount = 0, but it was the barrier that I was most worried about… Is it low enough overhead to call with no previous writes? (or do you think I am overthinking this all again…? )

There’s no way to know; it depends on the hardware and the situation. But we do know what the command processor has to do at minimum.

It must read every byte of those commands (so up to 32K * 16-to-20 bytes). And it has to think about executing them. And then it has to not execute anything. It should be noted that it was going to have to do all of those steps anyway, whether those commands were empty or not.
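The "16-to-20 bytes" figure matches the sizes of the two indirect draw argument layouts (non-indexed and indexed). As a rough worked example, the sketch below mirrors those layouts locally and multiplies by the worst-case command count from the thread. DrawArgs, IndexedDrawArgs and indirect_bytes are hypothetical names used only for this illustration; the structs copy the field layouts of VkDrawIndirectCommand and VkDrawIndexedIndirectCommand so that the code builds without the Vulkan headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of VkDrawIndirectCommand (non-indexed draws). */
typedef struct DrawArgs {
    uint32_t vertexCount;
    uint32_t instanceCount;
    uint32_t firstVertex;
    uint32_t firstInstance;
} DrawArgs;

/* Local mirror of VkDrawIndexedIndirectCommand (indexed draws). */
typedef struct IndexedDrawArgs {
    uint32_t indexCount;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  vertexOffset;
    uint32_t firstInstance;
} IndexedDrawArgs;

/* Both layouts are tightly packed 32-bit fields. */
_Static_assert(sizeof(DrawArgs) == 16, "non-indexed draw arguments are 16 bytes");
_Static_assert(sizeof(IndexedDrawArgs) == 20, "indexed draw arguments are 20 bytes");

/* Bytes of indirect arguments that must be read for command_count draws. */
size_t indirect_bytes(size_t command_count, size_t bytes_per_command)
{
    return command_count * bytes_per_command;
}
```

For 32767 commands this gives 524,272 bytes with the 16-byte layout and 655,340 bytes with the 20-byte layout, so just under 512 KiB and 640 KiB of argument data per full walk. This counts only the indirect arguments, not the barriers or the command buffer itself.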

However, if the command processor spends enough time sitting there and reading NOPs, then eventually the entire rest of the pipeline will be empty, waiting for a useful command to come down the queue.

Sometimes, you accept some inefficiency in one place to make the overall framerate more smooth. For example, if I have a particle system with an indirect draw buffer of 20,000 entries, I may not care if 19,950 of those entries go unused. Why? Because I budgeted my particle system’s frame time for 20,000 particles, so 50 + a long pipeline stall will still be well within the frame budget.

But again, it depends on the situation and specifics. It’s not something that you can know a priori. The only effective way to test these sorts of things is to test them on actual hardware. Preferably multiple pieces of actual hardware.

My algorithm changes many aspects of graphics preparations and is optimal (derived from ample study) in its own but it requires this conditional command rendering to be most efficient (in practice) resulting in these culls which can be large groups of iterations at times.

How can an algorithm be “optimal (derived from ample study)” if its performance (in your estimation) relies on a feature that the API you’re using lacks? Indeed, on a feature that no graphics API has.

We’re not talking about simple occlusion queries and conditional rendering here (the latter of which Vulkan 1.0 lacks too). You’re talking about full arbitrary conditional logic within command buffers. You want programmable command processing.

And at no time did Khronos promise anything of the kind for Vulkan.