Stencil-based ordering performance best practice

Greetings,

I’m preparing to port an OpenGL ES 3.1 application to Vulkan and there is a particular scenario where I’d like to determine the most performant action first:

I issue multiple drawcalls from the same VBO, sometimes over the same range, sometimes over different ones. There has to be an absolute ordering between pixels of different drawcalls: a pixel of one drawcall may never appear in front of a pixel of certain other drawcalls. I cannot use the depth buffer for this because it is used for other things. For the ordering, I use glStencilFunc with a certain reference value and GL_GREATER/GL_LESS. Multiple VBOs are rendered per frame, and data from different VBOs may use the same reference value. The problem is that I issue a lot of drawcalls and each call needs a glStencilFunc update, which causes a lot of CPU overhead. For the original OpenGL app, an alternative has been devised that fetches pixels from the framebuffer using GL_EXT_shader_framebuffer_fetch and then discards pixels in the GLSL shader. Since I can now avoid the glStencilFunc calls and there are no other state changes, this allows me to use 1 drawcall per VBO.
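To make the ordering rule concrete, here is a minimal CPU-side model of what the stencil test does per pixel. It is only a sketch: the `Pixel` struct and `shade_pixel` helper are made up for illustration, and it assumes the stencil op replaces the stored value with the reference on pass (GL_REPLACE), which is not stated above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical framebuffer pixel: a color plus an 8-bit stencil value. */
typedef struct {
    uint32_t color;
    uint8_t stencil;
} Pixel;

/* GL_GREATER semantics: the test passes when (ref & mask) > (stored & mask). */
static bool stencil_test_greater(uint8_t ref, uint8_t stored, uint8_t mask) {
    return (ref & mask) > (stored & mask);
}

/* Assumed stencil op: write ref into the stencil buffer on pass, keep it on fail.
 * Returns true when the fragment was written. */
static bool shade_pixel(Pixel *p, uint8_t ref, uint32_t color) {
    if (!stencil_test_greater(ref, p->stencil, 0xFF)) {
        return false; /* a higher-or-equal priority fragment already owns this pixel */
    }
    p->color = color;
    p->stencil = ref;
    return true;
}
```

With these assumptions, a fragment drawn with reference 1 can never overwrite a pixel already covered with reference 2, regardless of the order in which the drawcalls are issued.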

Now, for the planned switch to Vulkan, I learned two things: first, that draw commands are supposedly cheaper on the CPU than in OpenGL; second, that it is possible to make the stencil reference a dynamic state of a pipeline using VkDynamicState, VK_DYNAMIC_STATE_STENCIL_REFERENCE and vkCmdSetStencilReference.

So my question is, what do you think would be more performant for the upcoming Vulkan implementation:

  1. Port the blending/GL_EXT_shader_framebuffer_fetch based solution to Vulkan with one drawcall per VBO.

  2. Port the original solution with multiple drawcalls per VBO using VkDynamicState and replacing each glStencilFunc call with a vkCmdSetStencilReference call.

I hesitate to use the GL_EXT_shader_framebuffer_fetch solution unless it is absolutely necessary, because it is not available on all platforms, discard might be expensive on the GPU, and I have to take additional care with how the framebuffer is used. So I’d be quite happy if Vulkan/VkDynamicState made it superfluous.
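For reference, option 2 boils down to a recording loop like the sketch below. It does not call Vulkan; it builds a plain command list so the pattern can be seen in isolation. `DrawRange`, `Cmd` and `build_commands` are hypothetical names, and in a real port the two command types would presumably map to vkCmdSetStencilReference and vkCmdDraw on a command buffer. The sketch also assumes that consecutive drawcalls often share a reference value, so the update can be skipped when the value has not changed.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one drawcall: a vertex range plus its stencil reference. */
typedef struct {
    uint32_t first_vertex;
    uint32_t vertex_count;
    uint32_t stencil_ref;
} DrawRange;

typedef enum { CMD_SET_STENCIL_REF, CMD_DRAW } CmdType;

/* For CMD_SET_STENCIL_REF: a = reference value, b unused.
 * For CMD_DRAW: a = first vertex, b = vertex count. */
typedef struct {
    CmdType type;
    uint32_t a;
    uint32_t b;
} Cmd;

/* Emits the draws in their original order and inserts a reference update
 * only where the value differs from the previous draw.
 * Returns the number of commands written; stops early if out is too small. */
static size_t build_commands(const DrawRange *draws, size_t count,
                             Cmd *out, size_t capacity) {
    size_t n = 0;
    for (size_t i = 0; i < count; ++i) {
        bool ref_changed =
            (i == 0) || draws[i].stencil_ref != draws[i - 1].stencil_ref;
        size_t needed = ref_changed ? 2u : 1u;
        if (n + needed > capacity) {
            break;
        }
        if (ref_changed) {
            out[n++] = (Cmd){ CMD_SET_STENCIL_REF, draws[i].stencil_ref, 0u };
        }
        out[n++] = (Cmd){ CMD_DRAW, draws[i].first_vertex, draws[i].vertex_count };
    }
    return n;
}
```

Whether skipping redundant updates matters in practice depends on the driver; the point is only that the per-draw cost in option 2 is one small state command plus the draw itself.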

From the following link (on the ARM website), it sounds like, to get the Vulkan equivalent of OpenGL ES’s EXT_shader_framebuffer_fetch / ARM_shader_framebuffer_fetch_depth_stencil and EXT_shader_pixel_local_storage, you encode each group of draw calls in Vulkan as a separate subpass within a single, shared render pass.

As the article describes, all subpasses within a render pass are rasterized one after another for a single screen tile, using a shared on-chip cache for the tile framebuffer. So subsequent subpasses can cheaply read the framebuffer results of prior subpasses in that same render pass via the on-chip tile framebuffer.

Interesting, I shall have a look at it.

What remains is the second part of the question: On Vulkan, will

  1. a two-pass approach that discards pixels in the second pass based on the results from the first pass, with one drawcall per pass

be faster than

  2. a one-pass, multiple-drawcall approach that only switches the stencil reference value between drawcalls using VkDynamicState?