Is there anything to ignore "discard" GLSL instructions to improve depth-buffer performance?

Mick_P · August 7, 2022, 10:05pm

Supposedly discard can be bad for performance, but is it really good to make 2 copies of a shader to prefer one without clip with textures that don’t require it? There used to be alpha-tests as a global state. Ignoring it would allow the shader and state to not be moved, if that’s a problem. (I don’t actually know how GPUs manage multiple shaders at the same time, assuming they can.)

GClements · August 8, 2022, 12:18pm

Using discard prevents the early depth test optimisation. How much effect that has depends upon the extent to which you’re benefiting from the optimisation, which in turn depends upon the amount of overdraw, whether you’re actively rendering from near to far, and the cost of the fragment shader (in terms of both GPU cycles and the bandwidth used for texture fetches).

Using two shaders shouldn’t be an issue so long as you group draw calls by shader so you aren’t constantly switching shaders.
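To make the grouping idea concrete, here is a minimal CPU-side sketch. The DrawCall struct and both function names are hypothetical (nothing here is a GL API); "program" simply stands in for a GL program object name. It sorts a draw list by program and counts how often the bound program would change.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical draw record; "program" stands in for a GL program object name.
struct DrawCall {
    unsigned program;  // which shader program the draw uses
    unsigned mesh;     // placeholder for whatever else identifies the draw
};

// Count how many times the bound program changes when the list is
// submitted in order (the first bind counts as one change).
std::size_t countProgramSwitches(const std::vector<DrawCall>& draws) {
    std::size_t switches = 0;
    for (std::size_t i = 0; i < draws.size(); ++i) {
        if (i == 0 || draws[i].program != draws[i - 1].program) ++switches;
    }
    return switches;
}

// Group draws by program. stable_sort keeps the original relative order
// inside each group, so a near-to-far ordering within a group survives.
void groupByProgram(std::vector<DrawCall>& draws) {
    std::stable_sort(draws.begin(), draws.end(),
                     [](const DrawCall& a, const DrawCall& b) {
                         return a.program < b.program;
                     });
}
```

In a real renderer each counted switch is roughly where a glUseProgram call would go, so a list that alternates between a discard and a no-discard program drops to two binds once grouped.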

Dark_Photon · August 8, 2022, 12:43pm

+1 what GClements said. TBD based on how heavy your frag shader load is.

There still is an alpha test as global state, for any driver that supports both Compatibility and Core profiles. It’s just a matter of whether the driver will throw an error if you try to use it.

Not sure if I catch your drift here. But they totally can and do all the time, abstracted behind a single GL program object handle in fact. I’ve tracked/fixed a number of these in the past. You use a compiled+linked+prerendered shader program (so shader ISA hot-and-ready-to-use in the driver) with a new set of GL context state, and – for driver-specific reasons – this causes the driver to stop right there, go and build a new shader program that matches that GL context state, and only then can drawing continue. Of course, this is a major obstacle to achieving consistent rendering performance.

If you do find that you need the perf benefit of alpha-test but don’t want to conditionally use it (i.e. never use it or always use it), you might consider variable rate shading (VRS) and/or alpha-to-coverage. VRS of course launches fewer frag shader invocations, reducing frag shader load, while still giving you the capability to discard/ALPHA_TEST if you want. Alpha-to-coverage handles alpha differently, populating a coverage mask with it and letting the MSAA downsample perform the blending. I don’t know for sure, but I’d expect that even with ALPHA_TEST disabled, there’s probably some optimization in GPU’s MSAA rasterization such that if the coverage mask is 0 across an entire sample/pixel/warp/wavefront, that the back-end rasterizer might skip some needless memory writes/blends. Dunno; worth trying anyway.
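As a rough illustration of the alpha-to-coverage idea, here is a CPU-side sketch. It assumes the simplest possible mapping (round(alpha * sampleCount) low bits set); real GPUs pick which samples to set in implementation-defined, often dithered, ways, and both function names are made up for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Illustrative only: map alpha to a coverage mask with
// round(alpha * sampleCount) low bits set. This is an assumed mapping,
// not what any particular GPU does.
std::uint32_t alphaToCoverageMask(float alpha, int sampleCount) {
    assert(sampleCount >= 1 && sampleCount <= 32);
    if (alpha < 0.0f) alpha = 0.0f;
    if (alpha > 1.0f) alpha = 1.0f;
    const int covered = static_cast<int>(std::lround(alpha * sampleCount));
    if (covered >= 32) return 0xFFFFFFFFu;
    return (1u << covered) - 1u;
}

// Resolve step: the fraction of covered samples is the opacity the
// MSAA downsample effectively produces for this fragment.
float resolvedOpacity(std::uint32_t mask, int sampleCount) {
    int bits = 0;
    for (int i = 0; i < sampleCount; ++i) {
        if ((mask >> i) & 1u) ++bits;
    }
    return static_cast<float>(bits) / static_cast<float>(sampleCount);
}
```

With 4x MSAA this gives only five opacity levels per pixel, which is why alpha-to-coverage tends to suit cutout edges better than smooth gradients. A mask of 0 is the case speculated about above, where nothing needs to be written.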

Mick_P · August 8, 2022, 4:03pm

Are you recommending VRS because I mentioned I was doing VR in another topic yesterday? Because it just happens I spent most of my time yesterday implementing that, i.e. fixed foveated rendering, and it was definitely worth it.

I avoid back-end AA techniques. I was very happy that yesterday I managed to eliminate all signs of pixels in my game app in full super-sampling mode in VR (looks like 3x native) by applying a no-cost AA technique that I may be the only person in the world to know about and use. It is pretty cool to see VR without any pixels, but since you work with Varjo you probably know all about that. (EDITED: I mean, it’s not full FOV, but to me it looks as detailed as reality itself, short of maybe some crispness here and there, and without square pixels. I’m very impressed with PCVR except for god rays and dim/banded colors. I’m using HP Reverb G2.)

RE “rebuilding the shader to match state”: I saw some of that when I tried to use ANGLE with the newer graphics APIs. It was inside the driver and not in ANGLE’s software layer. That’s pretty crazy and probably shouldn’t be allowed… but I assume that the driver keeps both/all mutated versions of shaders and doesn’t just thrash them, but it still has to switch in response to state. I don’t think sorting textures based on the presence of colorkey style pixels would be super practical in terms of all the other sorting goals. It does seem like it would be good to have a hardware toggle to force off discard so what @GClements said could be relied on… but what do I know.

As for what I meant by this: I assume that GPU workloads have multiple shader pipelines operating simultaneously, but I’ve never looked into it. I’m sure it must at least have multiple GPU “cores” working on different apps/contexts simultaneously… anyway, that’s all I meant by “multiple shaders” since you asked.

Dark_Photon · August 9, 2022, 2:58am

Ok, good deal! No, I didn’t suggest it because you were working with VR. But rather because in the past I’d used VRS to greatly reduce a bottleneck with some heavy fragment shading at high resolution (w/o VR in-play). VRS is great – very high payback for very little effort.

Yes. The problem is the GPU vendor doesn’t advertise a list of scenarios where one will occur. Leaving the poor GL dev to infer the causes and try to avoid them. Even with Vulkan, we still don’t have these causes on-the-table, and Vulkan is still trying to solve the core problem here without requiring that.

Right. In NVIDIA’s case, when a shader recompile is encountered, you’ll see some message emitted to the debug callback like:

Program/shader state performance warning: Vertex shader in program 2 is being recompiled based on GL state

and if this permutation hasn’t been hit before (or the shader disk cache has been cleared or disabled), it’ll be followed up with:

, and was not found in the disk cache.

But yes, the driver still has to look up the generated shader, load it into the GPU, and get it hot-and-ready to render with, which does take some time. But that’s nothing like the case where it’s not even in the disk cache (or the disk cache is on a network drive). The hit from that is pretty large.
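If you want to catch these in your own logs, a small classifier over the debug message text is enough. This is only a sketch: the matched substrings are taken from the NVIDIA messages quoted above and may differ between driver versions, and the enum and function names are made up for this example.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical classification of a driver debug message.
enum class RecompileKind {
    None,           // not a state-based recompile warning
    Recompiled,     // recompiled, variant presumably served from a cache
    DiskCacheMiss   // recompiled and not found in the shader disk cache
};

// Matches the wording of the NVIDIA messages quoted in this thread.
// Other vendors and driver versions may word this differently.
RecompileKind classifyDebugMessage(std::string_view message) {
    if (message.find("is being recompiled based on GL state") ==
        std::string_view::npos) {
        return RecompileKind::None;
    }
    if (message.find("was not found in the disk cache") !=
        std::string_view::npos) {
        return RecompileKind::DiskCacheMiss;
    }
    return RecompileKind::Recompiled;
}
```

You would call this from the function you register with glDebugMessageCallback, passing the message string it receives, and count or log the DiskCacheMiss cases, since those are the expensive ones.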

Oh, I definitely misinferred there. Sorry for the tangent.

Mick_P · August 9, 2022, 7:19pm

Okay, just out of curiosity and chumminess, do you mean to tell us VRS is being used on flat screens? Is it because people aren’t expected to look at the edges of the screen, or is it because some features (polygons) need their shader more than their detail? I take it you’d just use a 1x1 mask in that case. (FWIW in my case I just knew what GClements was talking about was taking a bite out of my FPS because I never programmed a second path for textures without “cutouts”. I still feel like there should be an off switch for “discard” because it has such a special role. I assume someone has thought of it and decided not to build one into the firmware.)

Dark_Photon · August 10, 2022, 4:01pm

The latter (nothing to do with eye tracking). With today’s super-high display resolutions, fuzzy, animated effects really don’t need that kind of shading rate. So dialing them back to 1 frag shader invocation per 2x2 or 4x4 pixels is just fine. Saves on fragment shading cost, but of course you still have to pay the full blend/ROP cost. A better solution is tile-based shading which can additionally save tons of blend/ROP cost, but I didn’t have that kind of time (it was one of those “oh crap, it’s not fast enough” moments during last minute “turn all the knobs on” testing).
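For a feel of the numbers, here is a back-of-the-envelope sketch. It assumes the whole region is shaded exactly once at a single coarse rate (no overdraw, no per-primitive rate changes), and the function name is made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Number of fragment shader invocations needed to cover a width x height
// region once, when one invocation shades a blockW x blockH pixel block.
// Partial blocks at the right/bottom edges still cost one invocation each.
std::uint64_t fragmentInvocations(std::uint32_t width, std::uint32_t height,
                                  std::uint32_t blockW, std::uint32_t blockH) {
    assert(blockW > 0 && blockH > 0);
    const std::uint64_t blocksX = (std::uint64_t{width} + blockW - 1) / blockW;
    const std::uint64_t blocksY = (std::uint64_t{height} + blockH - 1) / blockH;
    return blocksX * blocksY;
}
```

At 3840x2160 that is 8,294,400 invocations at 1x1, 2,073,600 at 2x2 and 518,400 at 4x4 per full-screen layer. The number of pixels blended stays at 8,294,400 in all three cases, which is the blend/ROP cost that VRS does not reduce.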

Dark_Photon · August 11, 2022, 12:15am

I was just re-noticing that your comment above was in your title for this thread:

Is there anything to ignore “discard” GLSL instructions to improve depth-buffer performance?

There is:

layout(early_fragment_tests) in;

which’ll let you force the early depth/stencil/etc. tests before executing the fragment shader, regardless of discard / GL_ALPHA_TEST presence or operation. But since you have to mod the fragment shader to add this, I’m not sure that this really buys you anything over getting rid of all discard instructions in fragment shaders. Then again, I don’t know the specifics of your use case.
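For context, this is roughly where that line sits in a cutout-style fragment shader, shown as a C++ string you could hand to glShaderSource. The shader is a hypothetical sketch: uTexture, vUV, fragColor and the 0.5 cutoff are illustrative names and values, not anything from the original poster's code.

```cpp
#include <cassert>
#include <string>

// Hypothetical cutout fragment shader. The layout qualifier needs GLSL 4.20
// (or ARB_shader_image_load_store); everything else is illustrative.
const std::string kCutoutFragmentShader = R"glsl(#version 420 core
layout(early_fragment_tests) in;  // run depth/stencil tests before this shader

uniform sampler2D uTexture;  // assumed texture with alpha cutouts
in vec2 vUV;
out vec4 fragColor;

void main() {
    vec4 texel = texture(uTexture, vUV);
    // With early tests forced, the depth write has already happened by
    // the time this discard runs; only the color write is skipped.
    if (texel.a < 0.5) discard;
    fragColor = texel;
}
)glsl";
```

The qualifier is declared once at global scope on the "in" interface, not on any particular input variable, which is why it looks like a declaration with nothing declared.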

Mick_P · August 11, 2022, 12:22am

Yes! I’m just expressing my feelings. RE “use case”: I was definitely thinking (wondering) along the lines of glEnable(GL_SOMETHING) as an analog to alpha-test, an optimization that I think would be nice to have, since “early z” is important but shouldn’t require bending over backward to get that 20% speedup I’ve read about.

P.S. This is rhetorical, but that’s an interesting scenario (layout(early_fragment_tests)) that makes me want to look into why it exists, but I can use a search engine for that, hopefully (they don’t always work… they seem to return less relevant results nowadays for some reason).

Dark_Photon · August 11, 2022, 12:38am

There’s a pretty good summary here:

Early Fragment Test (OpenGL Wiki)

IIRC some GPU drivers/HW support both the concepts of early Z and hierarchical Z. Early Z where individual fragment shader invocations are killed/masked, and hierarchical Z which may pre-kill entire rasterization tiles, avoiding entire warps/wavefronts of frag shader invocations from even being scheduled on the GPU. Similarly for stencil/etc.
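A toy version of the hierarchical Z idea, assuming a GL_LESS depth test and a single coarse level that stores the farthest depth per tile. Actual hardware layouts are vendor-specific and undocumented; the HiZ struct and function names here are made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Coarse buffer: for each tile, the farthest depth currently stored in it.
struct HiZ {
    int tilesX;
    int tilesY;
    int tileSize;
    std::vector<float> maxDepth;
};

// Build the coarse level from a full-resolution depth buffer (row-major).
HiZ buildHiZ(const std::vector<float>& depth, int width, int height, int tileSize) {
    assert(tileSize > 0 && width % tileSize == 0 && height % tileSize == 0);
    assert(depth.size() == static_cast<std::size_t>(width) * height);
    HiZ h{width / tileSize, height / tileSize, tileSize, {}};
    h.maxDepth.assign(static_cast<std::size_t>(h.tilesX) * h.tilesY, 0.0f);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float& m = h.maxDepth[(y / tileSize) * h.tilesX + (x / tileSize)];
            m = std::max(m, depth[y * width + x]);
        }
    }
    return h;
}

// With GL_LESS, if the nearest depth of a primitive inside a tile is not
// nearer than the farthest stored depth, no fragment in the tile can pass.
bool tileRejected(const HiZ& h, int tileX, int tileY, float primitiveMinDepth) {
    return primitiveMinDepth >= h.maxDepth[tileY * h.tilesX + tileX];
}
```

Note how a single far pixel in a tile raises that tile's stored maximum and stops the whole tile from being rejected, so only the per-fragment early Z test can still help there.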

Mick_P · August 11, 2022, 12:55am

Cool! I love to learn about internal details like these. Awesome

Alfonse_Reinheart · August 11, 2022, 2:33am

It should be noted that, if you turn on early fragment tests, your attempts to do “alpha testing” will still write the depth value to the depth buffer even when the alpha test in the FS discards the fragment. So to make this work, you would have to render back-to-front.

Early fragment tests have often been an optimization. As to why the GLSL switch exists, it is mainly for doing image load/store-style operations. These writes are still visible even if a fragment is discarded. Since the depth test is (normally) specified to happen after the FS, this means that fragments culled by the depth test (or stencil) can still update other memories. If you want to prevent that, you have to do such tests before the FS executes.
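The depth-write caveat is easy to see in a single-pixel simulation. This is a simplified model, assuming a GL_LESS depth test with depth writes enabled and an alpha cutoff implemented with discard; the struct and function names are made up for this sketch.

```cpp
#include <cassert>
#include <vector>

struct Fragment {
    float depth;  // smaller is nearer (GL_LESS)
    float alpha;  // below the cutoff, the shader discards
    int color;
};

struct Pixel {
    float depth = 1.0f;  // cleared to the far plane
    int color = 0;       // cleared background
};

// Process fragments in submission order for one pixel.
Pixel shadePixel(const std::vector<Fragment>& frags, bool earlyTests,
                 float cutoff = 0.5f) {
    Pixel px;
    for (const Fragment& f : frags) {
        if (earlyTests) {
            if (!(f.depth < px.depth)) continue;  // depth test before shading
            px.depth = f.depth;                   // depth written before the shader
            if (f.alpha < cutoff) continue;       // discard: color skipped, depth stays
            px.color = f.color;
        } else {
            if (f.alpha < cutoff) continue;       // discard: nothing is written
            if (!(f.depth < px.depth)) continue;  // depth test after shading
            px.depth = f.depth;
            px.color = f.color;
        }
    }
    return px;
}
```

Draw a near cutout (alpha 0) and then a far opaque surface: with late tests the far surface shows through the hole, but with forced early tests the cutout has already written its depth, so the far surface is rejected and the background shows instead. Submitting the same two fragments back-to-front gives the expected result in both modes.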

Mick_P · August 14, 2022, 12:29am

Sorry to bring this back to life, but do you know why the little box on the right side here (Early Fragment Test - OpenGL Wiki) says “core since 4.2” and “core in 4.6”?

I’m curious, but my min version is 4.5 so I’d have to bump it up.

This might help because I got the strong impression that I was seeing a speedup if I disabled all discard in the shaders, but not if I mixed shaders with and without discard. I wonder if changing to early-Z forces the depth-buffer to be synchronized and maybe Nvidia just opts out as soon as it sees a mix and match.

Like maybe changing this state is worth it sometimes and sometimes not, I don’t know. It really seems to me (I’ve said this before, I think, in other topics I may or may not have made) that discard doesn’t really prevent the early-Z test at all, since it doesn’t change the depth… I mean if there’s a Z-test that’s in front it shouldn’t matter… in theory… but what people say, I think, is that it’s just too messy for the hardware to parallelize.

Edited: Also I just wanted to say “in;” is such strange syntax; I wouldn’t have expected that to be a thing.

GClements · August 14, 2022, 1:23am

It (the ARB_shader_image_load_store extension) was added to core OpenGL in version 4.2 and is still in core OpenGL in version 4.6.

At the moment, the “core in 4.6” part is redundant. The wiki only has reference pages (with those summary boxes) for modern OpenGL, and (AFAIK) no extension which has been added to core since 3.0 has been removed.
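In practical terms, "core since 4.2" means any context at version 4.2 or later has the feature without an extension check, so a 4.5 minimum is already enough. A trivial sketch of that comparison (the function name is made up; major and minor would come from glGetIntegerv with GL_MAJOR_VERSION and GL_MINOR_VERSION):

```cpp
#include <cassert>

// True if a context of the given version has early_fragment_tests in core,
// i.e. the version is 4.2 or later.
bool hasCoreEarlyFragmentTests(int major, int minor) {
    assert(major >= 1 && minor >= 0);
    return major > 4 || (major == 4 && minor >= 2);
}
```

On older contexts the same functionality may still be available through the ARB_shader_image_load_store extension, which would need a separate extension-string check.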

Dark_Photon · August 14, 2022, 8:51pm

That would make sense.

Years ago here on the forums when ZCULL / Hi-Z (short for hierarchical Z) and Early Z were discussed, benchmarked by users, with vendor behavior detailed, that was pretty much how it worked. The driver caches some depth-related info that it can use to pre-cull specific fragments or even whole primitives before fragment shading, assuming your code “plays by the rules”. While the driver can use this cached depth info, it can save perf by avoiding some rasterization work (or even scheduling that rasterization work!). But as soon as your code violates “the rules”, then the driver must disable use of that cache because what you did invalidates its contents (or could invalidate it in the future).

For instance, GL_ALPHA_TEST / discard reportedly disabled ZCULL / Hi-Z until the next depth clear IIRC, and could reduce the Early Z benefit. Intuitively, changing things like the depth function in the middle of rasterization pretty much kills the benefit of these internal driver optimizations as well. Once you read up on how to manually perform hierarchical depth testing, this all makes more sense.

Similar internal driver speed-ups for stencil too IIRC, though that got less attention.

Possibly. It depends on how it’s implemented. Early-Z being a per-fragment test (and ZCULL per-primitive). You can easily see that discard would reduce the benefit of Early-Z, what with discard from previous triangles “blowing holes” in an otherwise smooth cached depth buffer representation, causing future early-Z tests to kill fewer fragment shader executions. And if the Early-Z cache is updated before the fragment shader rather than after, then this Early-Z cache is going to go invalid with a discard (and so would be disabled). In fact, see the last link in the post below for AMD/ATI saying that this was what was done in their drivers years ago for “older” GPUs, but not in the “newer” GPUs (keeping in mind this doc is from 15 years ago, so all GPUs mentioned are old now).

Related: A post from 3 years ago with NVIDIA and AMD links on this that still work!

Common optimizations (2019-10)

In the NVIDIA sources, search for ZCULL and Early-Z. In the AMD sources, search for Hierarchical Z (HiZ) and Early-Z. In both, search for discard.
