Multisample shadow samplers

One of the few remaining bottlenecks our renderer hits is shadow rendering in scenes with a lot of foliage. Normally shadow rendering is extremely cheap, but the use of alpha discard in the shadow fragment shader causes a significant slowdown (~50% drop in framerate). I have brought this idea up before, but the bottleneck I have discovered with alpha discard in outdoor scenes makes it much more pertinent.

To use shadow maps, we render high-resolution images (2 x 2048 x 2048 per frame for directional lights with default settings), and then we take a lot of samples in the final pass to blur these high-resolution images.

If we could render to a multisample texture and then sample that texture, performing the depth comparison on each of the sub-samples, it would eliminate a lot of inefficiency. There would be fewer pixels to draw in the shadow pass, where each pixel is expensive when alpha discard is used, and fewer samples required in the final pass when rendering shadows.

We’re rendering these oversized shadow textures, and then trying to “blur” them in the final pass. It would make so much more sense to use the MSAA sampling pattern on a lower-resolution image to eliminate both of these bottlenecks at once.

Using 4x MSAA, the two 2048x2048 images in the scenario above would be reduced to two 512x512 images with 4 samples per pixel, or 25% of the samples they had before. This would reduce the expensive foliage shadow rendering by 75%, reduce the number of samples needed in the final pass by perhaps 50%, and cut memory usage by 75%. The only thing missing is a sampler2DShadowMS object that provides the free linear filtering we get with sampler2DShadow. I have tried a lot of alternate shadow filtering techniques, and nothing comes close to the performance of sampler2DShadow.
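For reference, this is roughly what the hardware PCF we rely on today looks like. With GL_TEXTURE_COMPARE_MODE set to GL_COMPARE_REF_TO_TEXTURE and GL_LINEAR filtering, a single texture() call on a sampler2DShadow returns a bilinearly filtered comparison result rather than a raw depth value (shadowMap and shadowCoord are placeholder names, not anything from our actual codebase):

```glsl
#version 330 core
// Hardware PCF: one lookup compares shadowCoord.z against the four
// nearest depth texels and filters the 0/1 comparison results for free.
uniform sampler2DShadow shadowMap;
in vec4 shadowCoord;   // assumed: light-space position, depth reference in .z
out vec4 fragColor;

void main()
{
    float lit = texture(shadowMap, shadowCoord.xyz);
    fragColor = vec4(vec3(lit), 1.0);
}
```

A hypothetical sampler2DShadowMS would just do the same comparison-then-filter over the samples of one texel instead of over neighboring texels.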

Why isn’t this a thing?

I actually think this would be possible if you copied each sub-image of a 2DMS texture into a separate 2D texture. The texture copy would incur additional overhead, but it would probably still be faster than conventional shadow maps.

Using 4x MSAA you would be rendering to 25% as many pixels, and then performing 4 texture lookups in the final pass that would simulate 16 texture samples.

Is it possible to copy the contents of one sub-image in a 2DMS texture to a 2D texture without resolving the subpixels, or using a fragment shader?

You can conceptually think of each sample, across all of the texels in a multisample image, as forming one “sub-image”. But that’s not how the image data are stored. The data are generally stored swizzled, with the samples of the same texel near each other. This is because when one shader invocation reads a sample, neighboring shader invocations generally also want to read neighboring samples.

This means that samples are often in the same cache line. As such, if you were to try to copy all of a particular sample to a separate image, that process would likely read every single byte of that image anyway.

It’d be like trying to copy just the red channel of an RGBA8 texture. That process must involve reading the GBA data too and just ignoring it.

You could read the whole thing and parcel each sample out to its own image at once. But there is no hardware support for doing that. And I’m not sure how that would ultimately help when it comes to shadow mapping (or solving the discard issue you talked about in the first post).

That makes sense, thank you for the information.

I am actually thinking I could render an extra pass from the MSAA shadow into four separate depth textures, using sampler2DMS to get each sub-pixel, and then writing that depth to one of the four outputs. I would not be surprised if reading four samples from the same coordinate of a 2DMS texture is more efficient than reading four arbitrary texture samples.
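A sketch of what that extra pass might look like, assuming a 4x MSAA depth texture bound as a sampler2DMS (compare mode off) and four R32F color attachments on the target FBO; all names here are hypothetical:

```glsl
#version 330 core
// Split pass: copy each sample of a multisample depth texture into
// one of four single-sample R32F color targets, one fullscreen quad.
uniform sampler2DMS shadowDepthMS;   // assumed: 4x MSAA depth texture
layout(location = 0) out float depth0;
layout(location = 1) out float depth1;
layout(location = 2) out float depth2;
layout(location = 3) out float depth3;

void main()
{
    ivec2 p = ivec2(gl_FragCoord.xy);
    // texelFetch with an explicit sample index reads one sub-pixel.
    depth0 = texelFetch(shadowDepthMS, p, 0).r;
    depth1 = texelFetch(shadowDepthMS, p, 1).r;
    depth2 = texelFetch(shadowDepthMS, p, 2).r;
    depth3 = texelFetch(shadowDepthMS, p, 3).r;
}
```

Since all four fetches hit the same texel, they should land in the same cache line, which is why this might beat four arbitrary texture samples.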

The shadow-rendering step itself seems to incur a significant performance penalty when alpha discard is used, one we don’t normally associate with shadow rendering. It seems to be a function of the number of fragments written. So if we reduce the number of pixels by 75%, I think this would target that bottleneck.

It will be interesting to see if the result actually is faster overall. :slight_smile:

But it would also reduce the alpha discard fidelity by 75%.

Doing multisample rendering to a 512x512 4-sample image is not the same thing as rendering to a 1024x1024 image. The whole point of multisampling is that you execute the fragment shader only once for all samples within a pixel, masked by the coverage of the primitive over that pixel. This means that any discard operation (executed by the fragment shader) will affect either all of the covered samples or none of them. So the granularity of your shadow map with regard to discarding would be 512x512.

And if you use per-sample shading, which causes the fragment shader to be executed once per sample… then you’re just rendering to a 1024x1024 image.
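To illustrate (the material texture name is hypothetical): under GLSL 4.00 / ARB_sample_shading, merely reading gl_SampleID in a fragment shader is enough to force per-sample execution, which gives the alpha test per-sample granularity but also shades every sample:

```glsl
#version 400 core
// Reading gl_SampleID anywhere in the shader forces per-sample shading,
// so this depth-only alpha-test shader runs once per sample -- i.e. at
// effectively full resolution, erasing the MSAA savings.
uniform sampler2D albedoMap;   // hypothetical alpha-masked material texture
in vec2 uv;

void main()
{
    int s = gl_SampleID;       // the read alone triggers per-sample shading
    if (texture(albedoMap, uv).a < 0.5)
        discard;
}
```

(glMinSampleShading can force the same behavior from the API side without touching the shader.)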

The reason using discard in a fragment shader causes a performance issue is that discard requires shutting off all forms of early-depth processing (hierarchical depth, etc). Early depth processing doesn’t just do the test first; it does the entire read/modify/write operation first. So if a fragment shader could come along and discard a fragment that it has written the depth for, that would be bad; a sample would have its depth modified but not its color or other parameters (and you’re only interested in a correct depth value). As such, if an FS has a discard written anywhere, then the renderer will shut off all early depth logic.

Shadow rendering normally uses an insignificantly-tiny fragment shader, so its primary bottlenecks are bandwidth and depth testing. Naturally, turning off one of the main tools for speeding up depth testing in an operation whose performance is largely dominated by the cost of depth testing is not a recipe for high performance.

And in case you’re wondering, modifying the fragment’s depth conditionally also will do the same thing that discard does.

Ah…good point. I know exactly what you are talking about.

And in case you’re wondering, modifying the fragment’s depth conditionally also will do the same thing that discard does.

By “does the same thing” I am pretty sure you mean it disables early depth testing. Each sub-pixel in a depth 2DMS texture normally contains a different depth value.

I wonder if writing the depth would produce the desired outcome? Like, instead of alpha discard, write 1.0 to the depth output when a pixel is supposed to be discarded. It would be better for performance to have no alpha discard and no depth modification, of course, but this would only be used for alpha-masked materials.
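Something like this sketch, assuming a 0.5 alpha cutoff and a hypothetical albedoMap (note that once gl_FragDepth is written on any path, it has to be written on every path, so the pass-through case writes gl_FragCoord.z):

```glsl
#version 330 core
// Sketch: replace alpha discard with a depth write. Masked-out texels
// are pushed to the far plane instead of being discarded.
uniform sampler2D albedoMap;   // hypothetical alpha-masked material texture
in vec2 uv;

void main()
{
    float alpha = texture(albedoMap, uv).a;
    gl_FragDepth = (alpha < 0.5) ? 1.0 : gl_FragCoord.z;
}
```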

That’s why I brought depth modification up: it would have the exact same performance issue as using discard. And the same granularity issue, since the fragment shader is only executed once for a group of samples, so all samples in the coverage area would get the same depth.

That having been said, I’d forgotten about a more “recent” feature: conservative depth. This allows a fragment shader to tag gl_FragDepth with some information about how you intend to modify the depth value. If you will always make the depth value greater than its current value, then some of the early depth functionality may be preserved, depending on the nature of the depth test.

It’d be worth trying that out to see if it helps.
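A sketch of that idea with the conservative depth redeclaration (GLSL 4.20 / ARB_conservative_depth; albedoMap and the 0.5 cutoff are assumptions). Because the shader only ever writes gl_FragCoord.z or 1.0, the depth_greater promise holds, so with a GL_LESS-style test the hardware can still reject fragments early:

```glsl
#version 420 core
// Conservative depth: promise that we only ever increase the depth
// value, so some early depth rejection can be preserved.
layout(depth_greater) out float gl_FragDepth;

uniform sampler2D albedoMap;   // hypothetical alpha-masked material texture
in vec2 uv;

void main()
{
    float alpha = texture(albedoMap, uv).a;
    // Masked-out texels go to the far plane; both branches write a
    // value >= gl_FragCoord.z, satisfying the depth_greater contract.
    gl_FragDepth = (alpha < 0.5) ? 1.0 : gl_FragCoord.z;
}
```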

I need to verify that. I access individual sub-pixels in a 2DMS texture in our SSAO post-process effect to eliminate edge artifacts, and it seems to work.

Will do!