Cascade Shadow Mapping: possible optimizations

Why is the scissor test a per-sample processing operation?
Why isn’t it performed before rasterization?
And what happens if all vertices of a primitive are located outside of the viewport?

The scissor test happens on fragments. Before rasterization, there are no fragments.

Nothing. Just because all of the vertices of a primitive are outside of the viewport does not mean all of its fragments will be. Consider a triangle at the upper-right corner of the screen, positioned so that part of one edge crosses the screen while none of its vertices are on it.
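To make the corner case concrete, here is a minimal C++ sketch (not taken from any GL implementation) that tests pixel coverage with edge functions. The 800x600 viewport, the triangle coordinates, and the helper names `Point2`, `edgeFunction`, `insideViewport`, and `covers` are all assumptions chosen for illustration.

```cpp
#include <array>

struct Point2 { double x, y; };

// Twice the signed area of triangle (a, b, p); the sign tells
// on which side of the edge a->b the point p lies.
double edgeFunction(Point2 a, Point2 b, Point2 p) {
    return (b.x - a.x) * (p.y - a.y) - (b.y - a.y) * (p.x - a.x);
}

// True if p lies inside a viewport with origin (0, 0) and the given size.
bool insideViewport(Point2 p, double width, double height) {
    return p.x >= 0.0 && p.x <= width && p.y >= 0.0 && p.y <= height;
}

// True if p is covered by the triangle, regardless of winding order.
bool covers(const std::array<Point2, 3>& t, Point2 p) {
    const double e0 = edgeFunction(t[0], t[1], p);
    const double e1 = edgeFunction(t[1], t[2], p);
    const double e2 = edgeFunction(t[2], t[0], p);
    return (e0 >= 0 && e1 >= 0 && e2 >= 0) || (e0 <= 0 && e1 <= 0 && e2 <= 0);
}
```

With a triangle at (600, 700), (900, 400), (900, 700), every vertex is outside the 800x600 viewport, yet the pixel at (790, 590) near the upper-right corner is both covered and inside the viewport.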

If you want to clip the primitives, change the viewport. But note that the set of rasterised fragments isn’t necessarily constrained to the viewport. In particular, wide lines and points can affect pixels outside of the viewport. But if the scissor test is enabled, pixels outside of the scissor rectangle won’t be modified.

So if you want to constrain rendering to a rectangular portion of the window, set both the viewport and the scissor rectangle. Setting the viewport alone won’t prevent rendering from “bleeding” outside the rectangle, while setting the scissor rectangle alone will result in GPU cycles being wasted generating fragments which are subsequently discarded by the scissor test.

If you’re only rendering triangles, then you don’t need to use the scissor test, as rasterising triangles never produces fragments outside of the viewport. However, if you want to maintain the overall transformation between eye coordinates and window coordinates independent of the viewport, you’ll need to adjust the projection transformation to compensate for changes to the viewport.

I render an open world: the visible area is a small part of a huge location.

The scene is rendered with culling (many objects are simply rejected),
but the ‘Cascade Shadow Mapping’ pass renders all objects to the depth buffer three times.
I think this can be reduced dramatically.
One thing makes it not so easy: we can see the shadow of an object that is located outside of the frustum.

I’m just trying to find a way to reduce rendering time of sun shadows (DirectLightShadowRenderer.render - value in microseconds):

root
--L--43753: ZEngine.makeLoopStep
----L--19: ZEngine.update
----L--43375: ZEngine.render
------L--41606: Renderer.render
--------L--29: BufferPreparer.render
--------L--14165: CullNodeRenderer.render
--------L--1989: CullItemRenderer.render
--------L--18740: DirectLightShadowRenderer.render
----------L--4500: DirectLightShadowRenderer.renderCascade[0]
----------L--7053: DirectLightShadowRenderer.renderCascade[1]
----------L--7104: DirectLightShadowRenderer.renderCascade[2]
--------L--9: PointLightShadowRenderer.render
--------L--2: SpotLightShadowRenderer.render
--------L--121: ReprojectionRenderer.render
--------L--5570: Renderer.renderScene
------L--12: ZEngine.sync

If implemented naively, sure. But that’s not a given. For instance, you can bin by shadow map split during shadow frustum cull. Whether you get a benefit is going to depend on how you have your scene batched and the hardware you’re rendering on.

The light frustum is different from the eye frustum. You need to cull for your shadow render pass with a light frustum.

What does that frustum look like? When you’re rendering shadows from the sun, you’re using an orthographic (as opposed to a perspective) projection. So it’s natural to think that of course you’d use this nice 3D box orthographic frustum to cull against. Well, you could. But is that really what you want? Does that really tightly define the space? You really want to capture all occluders toward the light source that could possibly cast on any receiver in your camera view frustum. That’s not a nice 3D box. Culling against that tighter volume can help you cut back on how many objects get culled into your shadow frustum, particularly when the sun angle is low and your frustum is sweeping out large distances along the terrain, depending on how large your scene batches are and how expensive they are to render vs. cull. There are all kinds of tricks like this you can use to optimize shadow frustum culling. For references, see the GPU Gems and ShaderX book series.
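As a rough illustration of "all occluders toward the light that could cast on a receiver in the view frustum", here is a simplified C++ sketch. It is a conservative approximation, not the tight volume described above: it bounds the camera frustum corners in light space on X and Y, and leaves the volume open toward the light. It assumes the corners and object centers are already transformed into light view space, and that the light looks down -Z there (so "toward the light" is +Z). The names `CasterVolume`, `buildCasterVolume`, and `mayCastIntoView` are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <limits>

struct Vec3 { float x, y, z; };

// Light-space bounds of the receivers. There is deliberately no maxZ:
// anything between the receivers and the light may cast a shadow on them.
struct CasterVolume { float minX, maxX, minY, maxY, minZ; };

CasterVolume buildCasterVolume(const std::array<Vec3, 8>& frustumCornersLightSpace) {
    const float inf = std::numeric_limits<float>::infinity();
    CasterVolume v{inf, -inf, inf, -inf, inf};
    for (const Vec3& c : frustumCornersLightSpace) {
        v.minX = std::min(v.minX, c.x);
        v.maxX = std::max(v.maxX, c.x);
        v.minY = std::min(v.minY, c.y);
        v.maxY = std::max(v.maxY, c.y);
        v.minZ = std::min(v.minZ, c.z);  // receiver farthest from the light
    }
    return v;
}

// Conservative bounding-sphere test against the open-ended volume.
bool mayCastIntoView(const CasterVolume& v, Vec3 center, float radius) {
    return center.x + radius >= v.minX && center.x - radius <= v.maxX &&
           center.y + radius >= v.minY && center.y - radius <= v.maxY &&
           center.z + radius >= v.minZ;
}
```

An object far outside the camera frustum but between it and the sun passes the test, while an object behind every receiver (as seen from the light) is rejected.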

I implemented two improvements using this chapter:

  • emit objects to all three layers of the shadow array texture at once in the GS.
  • emit objects to all three layers, each from a different instance of the GS.

That gave me better performance, but not enough.

Also, my app performs occlusion culling using an octree.
I cannot use that technique for shadow mapping because it is too expensive.
But I can use the octree itself, something like this:

  • select the level of octree nodes whose volume is big enough to contain the whole eye frustum.
  • use the objects from the node containing the camera and from the nearest nodes that touch it. (As an improvement, I can also use the camera direction and select only the visible nodes.)

I also have another idea:

  • for the far layer of the shadow map I can just reproject shadows from the previous frame (for example, 4 frames out of 5) and render it only every fifth frame.
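The "render the far layer only every fifth frame" idea could be scheduled roughly like this. This is a minimal C++ sketch of the scheduling only; the reprojection step itself is not shown. The interval table `kCascadeInterval` and the function `cascadeNeedsRender` are hypothetical names, and the values {1, 1, 5} are just the example from the post.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumed update intervals per cascade, in frames:
// near and middle cascades every frame, far cascade every fifth frame.
constexpr std::array<std::uint32_t, 3> kCascadeInterval = {1, 1, 5};

// True if the given cascade should be re-rendered on this frame;
// otherwise the previous frame's result would be reused (reprojected).
bool cascadeNeedsRender(std::size_t cascade, std::uint64_t frameIndex) {
    return frameIndex % kCascadeInterval[cascade] == 0;
}
```

Over any five consecutive frames the far cascade is rendered once and reused four times, while the other two cascades are rendered every frame.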

Whenever referring to old tech sources, you have to keep in mind the age of the source and whether the assumptions behind the articles are still valid today.

For instance, at the time that book was written, Geometry Shaders were all shiny and new, and expectations for them were high. Further, that book was published by NVIDIA and they clearly wanted you to start using them to increase demand for new GPUs.

However, that was 12 years ago, and it’s long since been realized (and repeatedly experienced) that geometry shaders do not perform well when used for geometry amplification (which is what you’re using them for). And there are well defined reasons for this slowness.

That’s not to say that they are a bottleneck for you, much less your chief bottleneck. Just that you shouldn’t have grand expectations that applying geometry shaders here is going to give you a huge speed up over iterating an efficient draw loop which renders well optimized batches.

Don’t let your implementation drive this. First, define what you need (in terms of performance). Then determine what your biggest performance bottleneck is. And then, figure out what options you have to get rid of it. …rinse + repeat until performance meets your goals.

This is a basic rule of any optimization case. And I have already figured out the bottleneck. It is DirectLightShadowRenderer, because it renders all objects of the huge scene (you can see the profiling result in one of my previous comments).
Now I’m trying to find ways to optimize it.
Of course, the most important point is reducing the number of objects that are used for casting shadows.
Now I’m in the analysis stage: just collecting possible improvements to select the best one.

The points:

  • the scene is an open world (a huge location where the camera can see only a small part of it)
  • most objects in the scene are fairly static (objects can be transformed but cannot be moved, so it is possible to precalculate a bounding box covering all transformations).
  • this lets me precalculate some sort of grid for the whole scene and use that grid to cull objects for shadow rendering.

The articles I found and your comments make me think that my optimization should be:

  • split the scene into boxes.
  • make the boxes big enough to contain the whole (eye view) frustum.
  • using the orthographic (sun view) frustum, calculate its bounding box aligned with the world axes.
  • while the main thread of the app calculates culling for scene rendering (on the GPU), do all of these lightweight steps in an additional thread on the CPU.

Maybe I skipped something important?
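The grid lookup in the steps above can be sketched as follows. This is a minimal C++ example under stated assumptions: the grid is uniform, lies on the XZ ground plane (Y up), and cell (0, 0) starts at the world origin. `Aabb2`, `CellRange`, `cellsOverlapping`, and `cellCount` are hypothetical names, not part of any engine.

```cpp
#include <cmath>

// World-axis-aligned bounds of the sun-view frustum, projected onto the ground plane.
struct Aabb2 { float minX, minZ, maxX, maxZ; };

// Inclusive range of grid cells touched by a box.
struct CellRange { int x0, z0, x1, z1; };

// Maps world-space bounds to the cells of a uniform grid with the given cell size.
// std::floor keeps negative coordinates in the correct cell.
CellRange cellsOverlapping(const Aabb2& box, float cellSize) {
    return {
        static_cast<int>(std::floor(box.minX / cellSize)),
        static_cast<int>(std::floor(box.minZ / cellSize)),
        static_cast<int>(std::floor(box.maxX / cellSize)),
        static_cast<int>(std::floor(box.maxZ / cellSize)),
    };
}

// Number of cells in the range; only objects stored in these cells
// would need to be considered as shadow casters.
int cellCount(const CellRange& r) {
    return (r.x1 - r.x0 + 1) * (r.z1 - r.z0 + 1);
}
```

For a box from (-10, 5) to (130, 70) and 64-unit cells, this yields cells x in [-1, 2] and z in [0, 1], i.e. 8 cells to visit instead of the whole scene. Because the lookup is only a few divisions, it is cheap enough to run on a helper CPU thread as proposed.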

That’s the global knob that affects everything, sure. However, do you know how the different stages of your shadow map rendering rank in terms of their performance impact? What’s your primary bottleneck within your shadow render pass?

Are you primarily fragment/rasterization limited? What happens to your perf if you reduce the res of your maps down 2X or 4X? If you’re thinking, no, the res has to be that high to avoid shadow edge crawling/flickering artifacts for all cast shadows, consider that you can completely get rid of 100% of all of those artifacts for objects which are not moving (which you said comprises most of your scene), regardless of the res. Also, are you sure that you have your render state optimized for maximum fill performance on your GPU? Make sure your shadow casting frag shaders aren’t wasting time computing nor writing color (nor even having a color attachment), and you have color buffer writes “disabled”. And if you have much shadow overwrite, maximize your use of early Z (e.g. clearing the depth+stencil buffer + not using discard, alpha-test, alpha-to-coverage, nor setting depth in your frag shader, nor changing the depth function mid-pass, etc.) as well as enable backface culling for opaque objects.

Are you primarily vertex limited? What if your objects had 1/2 or 1/4 the verts that they do now, for the purposes of casting shadows? How does that affect your perf? If this makes a big difference, then ensure that you are making effective use of geometry LOD. Maybe you could easily render everything if it wasn’t just so darn vertex-heavy.

Are you primarily CPU limited, spending too much time just trying to figure out what scene objects to throw at the GPU to rasterize into your shadow maps? If so, with most of your scene static, it sounds like you’ve got huge potential to optimize nearly all of this overhead away with effective use of one or more spatial acceleration data structures, such as a BVH or spatial subdivision (sounds like you’re using the latter).

Yes, that is my case.

It sped up dramatically!
Frame times in microseconds, by shadow map resolution:
4096 x 4096 - 46305
2048 x 2048 - 23738
1024 x 1024 - 17493
512 x 512 - 16826

I realized that the GS optimizations do not increase performance at all; actually, the situation is the complete opposite.
Profiling showed me a speedup for the ‘direct light’ renderer, but the total frame rendering time increased.
I understood that what I see is not the time at which the GPU finishes an operation, but just the time at which the GPU receives the command.
I mean the GPU pipeline works asynchronously, so I cannot know exactly when a command has finished.
Well… the total frame rendering time shows a decrease now that I have gone back to 3-pass rendering without the GS.

Do you mean the technique with an additional ‘silhouette map’?

What do you mean?

I’m using:

    glColorMask(false, false, false, false);
    glDrawBuffer(GL_NONE);
    glReadBuffer(GL_NONE);

I’m using discard because some objects are transparent in some areas.
FS:

#version 420

in vec2 vTexCoord;

struct Material {
  vec4 ambient;
  vec4 diffuse;
  vec4 specular;
  int hasTexture;
  int hasNormalMap;
  float reflectance;
};

uniform Material material;
uniform sampler2D texture_sampler;

vec4 ambientC;

void setupColours(Material material, vec2 textCoord) {
  if (material.hasTexture == 1) {
    ambientC = texture(texture_sampler, textCoord);
  } else {
    ambientC = material.ambient;
  }
}

void main() {
  setupColours(material, vTexCoord);
  if (ambientC.a == 0.0) {
    discard;
  }
}

Why should I clear the stencil buffer? I don’t use it.

There are GL tools built in to tell you when things execute on the GPU. However, for whole-frame profiling purposes, you just want to know when “everything” is done on the GPU.

On a well implemented GL driver, you can determine this easily.

Yes, with CPU draw thread-based timers, you are timing how long it takes to submit commands to the GPU. However, for profiling purposes, you can have these same CPU timers include how long it takes the GPU to finish executing these commands as well. Here is what I’d suggest:

Add a “profiling mode” ON/OFF toggle to your app, with the default being OFF. When OFF, do what you’re doing now. When ON, all that changes is this:

  1. Disable VSync, and
  2. After your SwapBuffers() call, call glFinish(), sample the frame timer, and reset the frame timer to 0.

At this point, your sampled frame timer now gives you the total time it took both the CPU and GPU to both submit and render this frame. Use this for profiling and analysis purposes. This also gives you very consistent frame-to-frame latencies.
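The profiling-mode frame timer could look roughly like the following C++ sketch. To keep it independent of any windowing library, the buffer swap and the GPU wait are passed in as callbacks; in a real application these would wrap your platform's SwapBuffers() call and glFinish(). Disabling VSync is platform-specific and not shown. The class name `FrameProfiler` and its interface are assumptions made for illustration.

```cpp
#include <chrono>
#include <functional>
#include <utility>

class FrameProfiler {
public:
    using Clock = std::chrono::steady_clock;

    // swapBuffers: presents the frame (e.g. wraps SwapBuffers()).
    // finishGpu:   blocks until the GPU is idle (e.g. wraps glFinish()).
    FrameProfiler(std::function<void()> swapBuffers, std::function<void()> finishGpu)
        : swap_(std::move(swapBuffers)),
          finish_(std::move(finishGpu)),
          start_(Clock::now()) {}

    // Call once at the end of every frame. Returns the CPU+GPU time in
    // microseconds since the previous call, then restarts the timer.
    long long endFrame() {
        swap_();
        finish_();  // wait until all submitted GPU work has completed
        const Clock::time_point now = Clock::now();
        const long long us =
            std::chrono::duration_cast<std::chrono::microseconds>(now - start_).count();
        start_ = now;
        return us;
    }

private:
    std::function<void()> swap_;
    std::function<void()> finish_;
    Clock::time_point start_;
};
```

Only construct and use this object when the profiling toggle is ON; in normal mode the frame loop stays as it is, so the driver can keep queuing ahead.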

Yes, this does prevent the CPU from starting to compute+submit frame N+1 (and N+2 and …) while the GPU is still rendering frame N. However for profiling purposes, you don’t care as much. You just want a very solid, consistent, accurate number with which to make optimization decisions. Also, in the general case where you let the driver queue ahead, you have to be much more conscious about what you’re doing and in some cases just know what the GL driver is doing under-the-hood in order to avoid implicit synchronization issues due to resource dependencies between the frames that you submit to the GPU. In Vulkan, you have to know about and actively deal with all this, but in GL, it’s largely hidden under-the-covers. However, in GL it still occasionally rears its ugly head with a nasty performance spike, frame drop, or microstuttering, leaving you the developer holding the bag to somehow figure out what the GL driver is doing and determining what you need to do differently to remove the cross-frame resource dependency.

No, I mean quantizing your shadow frustum movement through space to 1 shadow texel increments. This is so that it “creeps along” with your eyepoint, but to the viewer it looks like it’s not really moving at all.
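The quantization can be sketched in a few lines of C++. This assumes a square orthographic shadow frustum whose world-space size stays constant from frame to frame, and that the frustum origin is expressed in light space (the two axes of the shadow map). `Vec2`, `snapToTexel`, and the parameter names are hypothetical.

```cpp
#include <cmath>

struct Vec2 { float x, y; };

// Snaps the light-space origin of the shadow frustum to whole shadow-map texels.
// orthoSize:  world-space width (and height) covered by the shadow map.
// resolution: shadow map size in texels along one side.
Vec2 snapToTexel(Vec2 lightSpaceOrigin, float orthoSize, int resolution) {
    const float texel = orthoSize / static_cast<float>(resolution);
    return {
        std::floor(lightSpaceOrigin.x / texel) * texel,
        std::floor(lightSpaceOrigin.y / texel) * texel,
    };
}
```

With a 128-unit frustum and a 1024-texel map, one texel is 0.125 units, so origins (10.3, -4.05) and (10.26, -4.1) both snap to (10.25, -4.125). Static geometry therefore rasterizes to the same texels until the eyepoint has moved a full texel.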

See right below that comment. I went through it there. Basically, just doing everything you can to ensure that the GPU+driver is given every opportunity to do the absolute minimum amount of work possible to get everything rendered.

Consider rendering your “opaque stuff” first, using a shader which “does not” use discard or ALPHA_TEST. See if you don’t notice a performance boost when there is a lot of occlusion (e.g. when the sun is down near the horizon)? I can tell you back when I was implementing Cascaded Shadow Maps, I saw a nice boost.

If you set up your rendering state right, the driver is throwing whole triangles in the trash can in the middle of the rendering pipeline before they even reach the fragment shader, which saves fill. It can also suppress fragment shader executions for whole blocks of pixels of triangles which are partially visible, which also saves fill. Since you’re fill limited, it’s worth checking this out.
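One way to arrange the "opaque stuff first, without discard" pass is to split the shadow casters into two draw lists up front. The following C++ sketch shows only that bookkeeping; the `ShadowCaster` record, its `needsAlphaTest` flag, and `splitShadowCasters` are hypothetical, and the actual shader binding and draw calls are left out.

```cpp
#include <vector>

// Minimal per-object record; a real engine would carry far more state.
struct ShadowCaster {
    int meshId;
    bool needsAlphaTest;  // true if the material has cut-out (transparent) areas
};

struct ShadowPassLists {
    std::vector<int> opaque;       // draw first, with a shader that never discards
    std::vector<int> alphaTested;  // draw second, with the discard-based shader
};

// Partitions casters so that fully opaque geometry can benefit from early Z.
ShadowPassLists splitShadowCasters(const std::vector<ShadowCaster>& casters) {
    ShadowPassLists lists;
    for (const ShadowCaster& c : casters) {
        (c.needsAlphaTest ? lists.alphaTested : lists.opaque).push_back(c.meshId);
    }
    return lists;
}
```

Each cascade would then issue two batches: the opaque list with the trivial depth-only shader, followed by the alpha-tested list with the shader that samples the texture and discards.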

If you don’t need to, don’t! GL_DEPTH24_STENCIL8 is pretty universally supported, but if you don’t need the stencil bits, use GL_DEPTH_COMPONENT16 or GL_DEPTH_COMPONENT24. Just for testing purposes, you can try GL_DEPTH_COMPONENT32 and GL_DEPTH_COMPONENT32F as well if you want. But if you don’t need the precision, there’s not much point in using these.

Randy

Yep, I used the default fragment shader (ignoring transparent areas entirely) and I saw a good boost!

So, as a result of our discussion I now know two new points that were hidden from me before:

    • the simplest rasterization (default fragment shader) for opaque objects.
    • reducing the resolution of the shadow texture, and using much cheaper techniques instead of high resolution, such as
      • ‘quantizing movement’
      • a ‘silhouette map’, which is much cheaper and can dramatically increase shadow quality.

Photon, you helped me very much.
Thanks a lot!

Sure thing! Good luck!

BTW, here’s a pointer to the source for the “quantizing the shadow map frusta” technique:

  • Shader X6: Chapter 4.1: Stable Rendering of Cascaded Shadow Maps (Michal Valient)

Or just websearch the author and name of the chapter and you’ll come up with a number of hits. Looks like Jonathan Blow and Matt Pettineo have some good stuff, among others.

Michal’s chapter is an excellent read for this and other reasons. One of a number of good “tips and tricks” sources for Cascaded Shadow Maps.