Stencil Routed K-Buffer optimization with multisample and stencil test

AlSilvae · January 3, 2020, 8:53am

Hi!
I was able to successfully implement the stencil routed k-buffer algorithm for transparent objects.
Here you can find the short paper with the explanation.

The algorithm uses multisampling and the stencil test.
I found out that my frame rate drops drastically because of the following group of instructions in my render loop:

const uint32_t m_nTransparencyLayers = 0x1u << 3; // I use 8 transparency layers

// I need to clean my stencil buffer on every iteration of the render loop
for (uint32_t i = 0; i < m_nTransparencyLayers; i++)
{
    glStencilFunc(GL_ALWAYS, i + 2, m_nTransparencyLayers - 1); // reference value for sample i
    glSampleMaski(0, 0x1u << i);                                // write only to sample i
    m_pFrameBuffer->DrawBlank(m_pShaderBlank);                  // full-screen quad
}

Comments on the last function (DrawBlank):

I bind a custom framebuffer if it is not already bound; it is multisampled, as the algorithm requires.
I draw a full-screen quad: the vertex shader only outputs the four vertices, and the fragment shader outputs black.
I need to call this function to “flush”, for each sample (enabled using glSampleMaski), the correct stencil reference value set with glStencilFunc.
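To make the seeding loop concrete, here is a small CPU-side model of one pixel. It is only a sketch: buildSeedPasses and routeFragments are hypothetical names, and the routing state it assumes (stencil test GL_EQUAL against 2, GL_DECR on both pass and fail) is the usual description of stencil routing, not something shown in this post.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One seeding pass per MSAA sample: the stencil reference to write and
// the sample mask that restricts the write to that single sample.
struct SeedPass {
    std::uint32_t stencilRef;
    std::uint32_t sampleMask;
};

// Mirrors the loop in the post: sample i is seeded with i + 2.
std::vector<SeedPass> buildSeedPasses(std::uint32_t layers) {
    assert(layers > 0 && layers <= 32);
    std::vector<SeedPass> passes;
    for (std::uint32_t i = 0; i < layers; ++i) {
        passes.push_back({i + 2, 0x1u << i});
    }
    return passes;
}

// CPU model of a single pixel. ASSUMPTION: the transparent geometry is
// drawn with stencil test EQUAL against 2 and stencil op DECR on both
// pass and fail. Returns, per sample, the index of the fragment stored
// there, or -1 if the sample was never written.
std::vector<int> routeFragments(std::uint32_t layers, int fragmentCount) {
    std::vector<std::uint32_t> stencil;
    for (const SeedPass& p : buildSeedPasses(layers)) {
        stencil.push_back(p.stencilRef);
    }
    std::vector<int> stored(layers, -1);
    for (int f = 0; f < fragmentCount; ++f) {
        for (std::uint32_t s = 0; s < layers; ++s) {
            if (stencil[s] == 2) stored[s] = f;  // test passes: sample keeps fragment f
            if (stencil[s] > 0) --stencil[s];    // DECR clamps at zero
        }
    }
    return stored;
}
```

Under those assumptions, the n-th fragment that reaches a pixel ends up in sample n, and fragments beyond the number of layers are dropped, which is why the seed values must be restored before every frame.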

The algorithm works perfectly but calling this “cleaning” loop kills my performance.

If you have any tips or suggestions, I would really appreciate them.
Thanks

Dark_Photon · January 4, 2020, 2:05am

Ok. What GPU, GPU driver, and driver version?

What does this mean? You bind a custom framebuffer (FBO) if one is not already bound?

Binding a framebuffer is an expensive operation, particularly on mobile GPUs. You should not bury this in an internal function like this. I’d move it up-top just above your FBO clear, before you do this “seed the stencil buffer” loop on that FBO.

Alg ref:

Stencil-routed A-Buffer (PPT) (SIGGRAPH 2007, NVIDIA, Myers and Bavoil)

AlSilvae · January 7, 2020, 8:13am

Hey! Thanks for your answer.

So here is my info:
My GPU: AMD Radeon ™ R5 M330
My Driver Version: 19.50.02-191204a-349781C-RadeonSoftwareAdrenalin2020

What does this mean? You bind a custom framebuffer (FBO) if one is not already bound?

Sorry, my mistake, you are right.
My DrawBlank function is actually this one:

   pShader->Use();
   glBindVertexArray(m_nCanvasVAO);//the VAO is for a full quad for the screen
   glDrawArrays(GL_TRIANGLES, 0, 6);
   pShader->Unload();//this shader only outputs the vertices for the quad screen

I’d move it up-top just above your FBO clear, before you do this “seed the stencil buffer” loop on that FBO.

I did it, I forgot to report it. I bind and unbind this FBO only twice in my render cycle.

Alg ref:

Stencil-routed A-Buffer (PPT) (SIGGRAPH 2007, NVIDIA, Myers and Bavoil)

Thanks for the resource, I used that presentation plus the paper to implement the algorithm

What I have tried so far, and what I plan to try:

update the driver (same performance problem)
test the code on NVidia
last option that I am evaluating: implement a different algorithm, Per-Pixel Linked Lists, and compare the performance.

Dark_Photon · January 8, 2020, 1:46am

Ok. Reading around, it sounds like this is basically a low-end mobile GPU used in entry-level laptops 5 years ago. It’s low-power and backed by slow DDR3, which according to some sources, makes its 14.4 GB/sec memory bandwidth its Achilles heel.

This low memory bandwidth may very well be a problem for the Stencil-routed A-buffer technique you’re trying to use. For these clear passes, it explicitly performs N fullscreen passes (N being the number of MSAA samples; 8 in your case), writing individual sub-samples in each pixel across all pixels. Each of these write passes is going to completely defeat MSAA bandwidth compression for every pixel! As I understand it, that makes the pixel sample data more expensive to read and write later.

On a fast discrete GPU using GDDR and a high-bandwidth memory bus, this technique’s clear passes are less likely to be a performance limiter. But on your low-end GPU using slow DDR, depending on the driver, it may very well be (and your results are suggesting that it is).
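As a rough back-of-the-envelope check of the bandwidth argument, the sketch below estimates how long it takes just to write every sample of every pixel once. The resolution and the bytes per sample are assumptions (neither is stated in this thread), seedBytes and transferMs are hypothetical helper names, and the model ignores compression, caches and any reads, so treat the result as an order-of-magnitude figure only.

```cpp
#include <cstdint>

// Bytes touched when every sample of every pixel is written exactly once.
constexpr std::uint64_t seedBytes(std::uint64_t width, std::uint64_t height,
                                  std::uint64_t samples,
                                  std::uint64_t bytesPerSample) {
    return width * height * samples * bytesPerSample;
}

// Milliseconds needed to move `bytes` at `bytesPerSecond`.
constexpr double transferMs(std::uint64_t bytes, double bytesPerSecond) {
    return static_cast<double>(bytes) / bytesPerSecond * 1000.0;
}
```

For example, with an assumed 1920x1080 target, 8 samples, and 8 bytes per sample (4 for an RGBA8 color sample plus 4 for a depth-stencil sample), seedBytes gives 132,710,400 bytes, and transferMs at 14.4e9 bytes per second gives roughly 9.2 ms. That is a large share of a 16.7 ms frame before any transparent geometry has been drawn.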

Even on mobile tile-based GPUs, this kind of thing would be a killer if the GL-ES driver breaks render passes at sample mask changes. There you’d want to seriously look at using pixel local storage to avoid this, if possible (to avoid all the extra memory bandwidth to/from slow DDR for these N sequential render passes, each with its own full-screen write [and possibly read] of memory bandwidth).

On your GPU+drivers with this technique, your best bet to speed up those clear passes with this technique is probably to either: 1) reduce your screen resolution, 2) reduce your MSAA sample count (e.g. 8->4->2), or both! If you do this and see a roughly linear change in perf w.r.t. the number of pixels or samples you’re writing, then you’re likely fill limited.
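One way to apply that linear-scaling test to measured timings is sketched below. Measurement and looksFillLimited are hypothetical names and the tolerance is an arbitrary choice; the idea is simply that, if the passes are fill limited, the time per written sample should stay roughly constant as resolution or sample count changes.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct Measurement {
    std::uint64_t pixels;   // width * height of the render target
    std::uint32_t samples;  // MSAA sample count
    double ms;              // measured time of the seeding passes
};

// Returns true when the time per written sample of every run stays within
// `tolerance` (relative) of the first run, i.e. cost scales about linearly
// with the number of samples written.
bool looksFillLimited(const std::vector<Measurement>& runs, double tolerance) {
    if (runs.size() < 2) return false;  // need at least two points to compare
    auto costPerSample = [](const Measurement& m) {
        return m.ms / (static_cast<double>(m.pixels) * m.samples);
    };
    for (const Measurement& m : runs) {
        if (m.pixels == 0 || m.samples == 0 || m.ms <= 0.0) return false;
    }
    const double base = costPerSample(runs.front());
    for (const Measurement& m : runs) {
        if (std::abs(costPerSample(m) - base) > tolerance * base) return false;
    }
    return true;
}
```

With made-up timings of 9.0 ms, 4.6 ms and 2.3 ms at 8, 4 and 2 samples on the same resolution, a 10% tolerance reports fill-limited behaviour; if halving the sample count barely changes the time, it does not, and the bottleneck is probably elsewhere.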

…

By the way, if this low-end AMD GPU is switchable between an integrated Intel GPU, and you’re running Windows, set the Power Plan to High Performance to prefer the AMD. But even then, I wouldn’t expect much from this.

I would be sure to test it on a mid- or high-end NVidia discrete GPU, and a mid- or high-end AMD discrete GPU if you have one. That is, a GPU add-on card with its own high-speed GDDR memory, memory bus, and GPU cores.

Dark_Photon · January 8, 2020, 2:31am

Another idea for you…

Rather than doing those 8 “clear” passes at the top of your frame “every frame” to seed the stencil buffer values…

What you might do instead is, on init, just do these 8 stencil “clear” passes in some saved stencil buffer (e.g. to an MSAA renderbuffer or MSAA texture attached to FBO #1). And then at the start of each frame, just copy that saved stencil buffer over to the stencil buffer for the FBO (call this FBO #2) that you’re rendering to this frame to initialize it. This should at least reduce those 8 fullscreen write passes every frame down to 1 pass.

For best perf, you may want to make the internal format of this saved renderbuffer or texture GL_DEPTH24_STENCIL8, as that tends to be accelerated.

There are a couple ways you can do this copy, glBlitFramebuffer() being one of them. Just use the method that’s fastest on your GPU+driver.

AlSilvae · January 8, 2020, 4:13pm

I really appreciate your answer, thanks.

I tried both. Of course I get better performance, but because of my application’s purpose I cannot reduce the resolution, and with only 2 MSAA samples I lose important “transparent” fragments.
I know, it’s a trade-off between performance and quality.

Yes, I always test my application with High Performance in my AMD settings.

Interesting update.
MSAA is certainly slowing down my application, but I found out that the texelFetch instruction in my fragment shader also contributes to it:

uniform sampler2DMS depth;
vec4 c0 = texelFetch(depth, ivec2(TexCoords * texSize), 0);  //  subsample 0

sampler2DMS depth is coming from my FBO and it is created as:

GLuint m_nDepthId;
glGenTextures(1, &m_nDepthId);
glBindTexture(GL_TEXTURE_2D_MULTISAMPLE, m_nDepthId);
glTexImage2DMultisample(GL_TEXTURE_2D_MULTISAMPLE, 8, GL_DEPTH24_STENCIL8, m_nWidth, m_nHeight, GL_TRUE);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_DEPTH_STENCIL_ATTACHMENT, GL_TEXTURE_2D_MULTISAMPLE, m_nDepthId, 0);//here I add the multisample depth-stencil to my FBO and the FBO is completed successfully

I think texelFetch is performing a conversion from the internal depth-stencil format to vec4.
In fact, if I perform the same texelFetch on a “normal” sampler2DMS texture with an RGBA format, performance improves.

Yes, this is really interesting, thanks. I will try it for sure.
Silly question: I don’t understand whether glBlitFramebuffer can also be used for an MSAA stencil buffer. Can you clarify how this operation works with MSAA, or suggest other methods that I can test?
Thanks

Dark_Photon · January 9, 2020, 1:59am

Sure. Something like:

glBindFramebuffer( GL_READ_FRAMEBUFFER, fbo1 );
glBindFramebuffer( GL_DRAW_FRAMEBUFFER, fbo2 );
glBlitFramebuffer( 0,0,w,h, 0,0,w,h,
                   ( GL_DEPTH_BUFFER_BIT | GL_STENCIL_BUFFER_BIT ),
                   GL_NEAREST );

As I recall, you can copy from MSAA to MSAA so long as the formats match and the src and dest rectangles are the same size. But I’ve not actually tried that. See Framebuffer#Blitting in the OpenGL wiki.
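The constraints mentioned here can be checked on the CPU before issuing the blit. The sketch below is a partial checklist for the MSAA-to-MSAA depth/stencil case only, not the full set of rules in the specification; FramebufferInfo, BlitRect and checkDepthStencilBlit are hypothetical names, and the caller is assumed to already know each framebuffer's sample count and depth-stencil internal format.

```cpp
#include <string>

struct BlitRect {
    int x0, y0, x1, y1;
};

struct FramebufferInfo {
    int samples;                  // 0 for a single-sample framebuffer
    unsigned depthStencilFormat;  // numeric value of the GL internal format enum
};

// Returns an empty string when the depth/stencil blit passes these checks,
// otherwise a short reason why it is expected to fail.
std::string checkDepthStencilBlit(const FramebufferInfo& src, const BlitRect& srcRect,
                                  const FramebufferInfo& dst, const BlitRect& dstRect,
                                  bool filterIsNearest) {
    if (!filterIsNearest) {
        return "depth/stencil blits require GL_NEAREST";
    }
    if (src.depthStencilFormat != dst.depthStencilFormat) {
        return "depth/stencil formats differ";
    }
    const bool bothMultisampled = src.samples > 0 && dst.samples > 0;
    if (bothMultisampled && src.samples != dst.samples) {
        return "sample counts differ";
    }
    const bool sameSize =
        (srcRect.x1 - srcRect.x0) == (dstRect.x1 - dstRect.x0) &&
        (srcRect.y1 - srcRect.y0) == (dstRect.y1 - dstRect.y0);
    if (bothMultisampled && !sameSize) {
        return "source and destination rectangles differ in size";
    }
    return "";
}
```

If the function returns a non-empty string, calling glGetError() after the real blit should report GL_INVALID_OPERATION for the same reason, so a check like this is mainly useful as a debugging aid.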

Alternatively, you can use glBlitNamedFramebuffer() (from DSA). This avoids having to explicitly bind the framebuffers to the context to do the blit.

Also, you might check out glCopyImageSubData() for an alternate copy method. I haven’t dug into it to be sure it supports MSAA and/or depth+stencil formats, but it might. If it does, it may not be as efficient as it supports format conversions between the source and dest internal formats if they differ. However, there may be a fast path provided for when the formats match.

Those are just two ideas.