Per-sample operation performance problem for deferred shading with MSAA

Hi all:

I am a beginner who is learning OpenGL in order to make a good-looking video game. Currently I am working on a deferred shading pipeline to get ready for SSAO, and it’s getting pretty tricky.
In the geometry pass I render to multiple textures that store the screen’s position, normal, and material information, then do the lighting calculation using that geometry information. But I am also using MSAA throughout my program, and I can’t resolve my multisampled buffers too soon: if I resolve them before the lighting pass, the geometry information I read during the lighting pass is no different than if I weren’t using MSAA at all.

So I did the lighting pass at the per-sample level in the fragment shader, deferring the MSAA resolve until after the lighting pass has finished with the geometry buffer. Even without adding SSAO, the lighting pass alone takes 3 ms on my GTX 660. So I was wondering: am I doing something wrong, or is getting good performance with MSAA and deferred shading a lost cause? It seemed like an appropriate time to ask the experts on the forum.

Here’s how I’m doing deferred shading while still trying to do MSAA:

For geometry pass:

  1. Initialize a framebuffer with three multisampled, high-precision color attachments.
  2. Each frame, bind that framebuffer, specify all three attachments as render targets, and draw position, normal, and material information out to the three textures simultaneously (see the sketch after this list).
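
A minimal sketch of that kind of setup, assuming 4x MSAA, GL_RGBA16F attachments, a multisampled depth renderbuffer, and an illustrative resolution — the sample count, formats, and variable names are placeholders, not the poster’s actual code:

```cpp
// Multisampled G-buffer: three MRT color attachments plus depth.
const int samples = 4;                   // assumed MSAA level
const int width = 1280, height = 720;    // illustrative resolution

GLuint gBufferFBO;
GLuint gTex[3];                          // 0: position, 1: normal, 2: material
GLuint gDepthRBO;

glGenFramebuffers(1, &gBufferFBO);
glBindFramebuffer(GL_FRAMEBUFFER, gBufferFBO);

glGenTextures(3, gTex);
for (int i = 0; i < 3; ++i) {
    glBindTexture(GL_TEXTURE_2D_MULTISAMPLE, gTex[i]);
    glTexImage2DMultisample(GL_TEXTURE_2D_MULTISAMPLE, samples, GL_RGBA16F,
                            width, height, GL_TRUE);
    glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0 + i,
                           GL_TEXTURE_2D_MULTISAMPLE, gTex[i], 0);
}

// Multisampled depth buffer so the geometry pass is depth-tested per sample.
glGenRenderbuffers(1, &gDepthRBO);
glBindRenderbuffer(GL_RENDERBUFFER, gDepthRBO);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, samples,
                                 GL_DEPTH_COMPONENT24, width, height);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,
                          GL_RENDERBUFFER, gDepthRBO);

// Route the fragment shader's three outputs to the three attachments (MRT).
const GLenum drawBuffers[] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1,
                               GL_COLOR_ATTACHMENT2 };
glDrawBuffers(3, drawBuffers);

// glCheckFramebufferStatus(GL_FRAMEBUFFER) should report GL_FRAMEBUFFER_COMPLETE.
```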

For lighting pass:

After that, I do the lighting pass, which is slow because the work is per-sample. All it does is bind the color attachments of the geometry-pass framebuffer as sampler2DMS textures, read from them with texelFetch(), and use the fetched samples to do the lighting calculation for each sample. I force the fragment shader to run per-sample by referencing gl_SampleID. After that I just call glBlitFramebuffer() to let OpenGL handle the resolve by blitting the lighting-pass buffer into a non-multisampled buffer, and I do the rest of my post-processing from there.
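
For reference, here is a hedged sketch of what such a per-sample lighting shader can look like. The uniform names and the trivial diffuse term are placeholders (not the original code); reading gl_SampleID is what forces the shader to execute once per covered sample:

```glsl
#version 400 core
// Per-sample lighting pass over a multisampled G-buffer (sketch).

uniform sampler2DMS uPosition;   // G-buffer: position
uniform sampler2DMS uNormal;     // G-buffer: normal
uniform sampler2DMS uMaterial;   // G-buffer: albedo/material

uniform vec3 uLightPos;
uniform vec3 uLightColor;

out vec4 fragColor;

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // texelFetch on a sampler2DMS takes an explicit sample index;
    // using gl_SampleID here triggers per-sample shading.
    vec3 P      = texelFetch(uPosition, coord, gl_SampleID).xyz;
    vec3 N      = normalize(texelFetch(uNormal, coord, gl_SampleID).xyz);
    vec3 albedo = texelFetch(uMaterial, coord, gl_SampleID).rgb;

    // Minimal diffuse term, standing in for the real lighting calculation.
    vec3 L = normalize(uLightPos - P);
    vec3 lit = albedo * uLightColor * max(dot(N, L), 0.0);

    fragColor = vec4(lit, 1.0);
}
```

The render target of this pass must itself be multisampled so that the later glBlitFramebuffer() call has per-sample lighting results to resolve.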

The previous two operations (geometry pass and lighting pass) cost 6.5 ms in total on my GTX 670, and a hideous 20 ms on my laptop with an integrated Intel chip. Am I doing something wrong, or is per-sample operation just that expensive? I’d love to hear your opinions, thanks a bunch!

That’s one option. Another is to render to a single-sampled framebuffer in your lighting pass, but take multiple taps (multiple texelFetch()es) in the lighting shader to read the samples of each pixel from your MSAA G-buffer, do your lighting calc for each, blend the results, and write that out. Then you don’t need per-sample shading, which yields a lot fewer fragment shader executions and less lighting-buffer read/write bandwidth.
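
A rough sketch of that alternative, assuming a 4-sample G-buffer and the same placeholder uniforms and lighting term as the earlier sketch: the shader runs once per pixel, loops over the samples itself, and writes the averaged result (one simple way to blend them) to a single-sampled target, so no separate resolve blit is needed afterwards.

```glsl
#version 330 core
// Single-sampled lighting pass that resolves the MSAA G-buffer itself (sketch).

uniform sampler2DMS uPosition;
uniform sampler2DMS uNormal;
uniform sampler2DMS uMaterial;

uniform vec3 uLightPos;
uniform vec3 uLightColor;

const int NUM_SAMPLES = 4;   // assumed: match your actual MSAA level

out vec4 fragColor;

vec3 shadeSample(ivec2 coord, int s)
{
    vec3 P      = texelFetch(uPosition, coord, s).xyz;
    vec3 N      = normalize(texelFetch(uNormal, coord, s).xyz);
    vec3 albedo = texelFetch(uMaterial, coord, s).rgb;

    vec3 L = normalize(uLightPos - P);
    return albedo * uLightColor * max(dot(N, L), 0.0);
}

void main()
{
    ivec2 coord = ivec2(gl_FragCoord.xy);

    // Light every G-buffer sample of this pixel and average the results.
    // The shader still executes only once per pixel.
    vec3 sum = vec3(0.0);
    for (int s = 0; s < NUM_SAMPLES; ++s)
        sum += shadeSample(coord, s);

    fragColor = vec4(sum / float(NUM_SAMPLES), 1.0);
}
```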

The previous two operations (geometry pass and lighting pass) cost 6.5 ms in total on my GTX 670, and a hideous 20 ms on my laptop with an integrated Intel chip. Am I doing something wrong, or is per-sample operation just that expensive?

It’s not cheap, and is only aggravated by blend cost and overlap between blends.

Question: how are you rendering your lights? Are you binning them by screen tile (tile-based deferred shading), or is there potentially a lot of overlap in the bounding primitives you’re rendering into the lighting buffer? Anything you can do to reduce that overlap will help. Also, consider ditching per-sample shading altogether, unless you’re sure you need it.

Re the GTX 670, that GPU isn’t very recent, so that could be part of the issue there. Also, are you using a lot of batches and state changes to render your lighting pass? If so, ask about ways to reduce them to minimize driver overhead and let the GPU run as fast as possible.

Finally, re integrated Intel… I’m not sure, but I think it’s probably a tile-based (sort-middle) GPU like mobile GPUs, since it uses slow DRAM for rendering. There’s a whole set of special techniques you need to use to get good performance out of those, particularly when doing multipass techniques like deferred shading. You want to investigate ways to keep the G-buffer on-chip in the tile buffers for best performance; that yields huge speed-ups. Just ask if you’re interested in more detail here.

Hi Dark Photon:

Thank you for your reply; that was really helpful, and now I can see how to make it run faster. And yeah, I’m binning the lights by screen tile and doing the lighting pass with one draw call.

I’ve never dealt with mobile GPUs or anything similar, so I’d love to hear about the special techniques for speeding up integrated Intel chips. Could you elaborate on that or share some links? That’d be really helpful, thanks!

I figured before I started filling your head with GL tricks for optimizing for tile-based GPUs, I should make sure that Intel HD IGPs really are tile-based. Surprise! They’re not. Sorry about that.

Here’s what appears to be a decent link on optimizing OpenGL apps for Intel HD Graphics IGPs:

However, it’s unclear how much of this applies to Intel’s Windows GL driver (this is geared toward Linux/Mesa3D).

Awesome! That seems like a very nice introduction to Intel HD graphics, and I can already see where it applies to my situation. Thank you for linking me to it!