A bottleneck while rendering to depth map

Definitely an interesting problem. It doesn't sound like we have control over all the differences between UE's rendering and yours, but here are a few more ideas for things you might check to help track this down (there's a state-setup sketch after the list):

  • Backface culling enabled?
  • DepthFunc LESS, not LEQUAL? (higher depth-test rejection rate)
  • SCISSOR_TEST, STENCIL_TEST, ALPHA_TEST, BLEND all disabled?
  • GL_POLYGON_OFFSET_FILL disabled?
  • PolygonMode FRONT_AND_BACK = FILL?
  • ColorMask(0,0,0,0)? StencilMask(0)?
  • GL_MULTISAMPLE and POLYGON_SMOOTH disabled?
  • None of the above changes between glClear( DEPTH ) and the end of the shadow render?
  • 16-bit indices? (2X rate on some hardware; not sure but might affect fragment scheduling)
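
For reference, here's that checklist as a minimal depth-pass state-setup sketch (assuming a 3.3+ core profile context; GL_ALPHA_TEST only exists in the compatibility profile):

```cpp
// Depth-only shadow pass state, per the checklist above.
glEnable(GL_CULL_FACE);                    // backface culling on
glCullFace(GL_BACK);
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);                      // LESS, not LEQUAL
glDepthMask(GL_TRUE);
glDisable(GL_SCISSOR_TEST);
glDisable(GL_STENCIL_TEST);
glDisable(GL_BLEND);
// glDisable(GL_ALPHA_TEST);               // compatibility profile only
glDisable(GL_POLYGON_OFFSET_FILL);
glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);  // no color writes
glStencilMask(0);                          // no stencil writes
glDisable(GL_MULTISAMPLE);
glDisable(GL_POLYGON_SMOOTH);

glClear(GL_DEPTH_BUFFER_BIT);
// ...render shadow casters; verify none of the above changes before the pass ends.
```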

Could be, on your specific GPU+driver.

Might be worth comparing perf against glDrawElements (non-instanced) with pseudo-instanced tree geom. Just to verify that instancing isn't triggering some less efficient rasterization scheduling.
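
Something like this hypothetical A/B test (uInstanceIndexLoc and the shader-side transform fetch are assumptions about your setup, not known API):

```cpp
// Draw the same trees either instanced, or as a loop of plain draws where a
// per-draw uniform stands in for gl_InstanceID.
void drawTrees(bool useInstancing, GLsizei indexCount, GLsizei treeCount,
               GLint uInstanceIndexLoc)
{
    if (useInstancing) {
        glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT,
                                nullptr, treeCount);
    } else {
        for (GLsizei i = 0; i < treeCount; ++i) {
            // Shader indexes its transform array with this instead of gl_InstanceID.
            glUniform1i(uInstanceIndexLoc, i);
            glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, nullptr);
        }
    }
}
```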

And you confirmed it's not MultiDrawIndirect (MDI), so that issue probably isn't coming into play, though it conceivably could for DrawInstanced. Just in case it is, and it's triggering some fragment scheduling inefficiency, it's probably worth comparing against plain glDrawElements (as in the sketch above).

  • How does the UE depth prepass affect shadow gen render? Or does it?
  • I wonder if UE is doing some implicit small feature culling.

That’s worth testing. Though I’d suspect you’re getting good texture cache hits.

Do your tree textures have MIPmaps? And do you have trilinear filtering enabled? (Fewer texture lookups, smaller texture cache footprint)
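
If not, a quick sketch of turning both on (assuming GL-side mip generation rather than offline mips; treeTex is a placeholder for your texture handle):

```cpp
// Make sure the tree textures actually have mips and use them when minifying.
glBindTexture(GL_TEXTURE_2D, treeTex);
glGenerateMipmap(GL_TEXTURE_2D);   // or upload offline-generated mip levels
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR); // trilinear
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
```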

Is the alpha in your renderer’s texture MIPmaps computed the same way as in UE? (That goes to discard rate.)
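
To see where that shows up: a depth-only foliage fragment shader typically looks something like this (GLSL embedded as a C++ string; uDiffuse, uAlphaCutoff, and vUV are assumed names, not anything from your code):

```cpp
// How the mips' alpha was filtered changes how often this discard fires.
const char* depthOnlyFS = R"(
#version 330 core
uniform sampler2D uDiffuse;
uniform float uAlphaCutoff;   // e.g. 0.5
in vec2 vUV;
void main() {
    // No color output needed; depth comes from gl_FragCoord.z automatically.
    if (texture(uDiffuse, vUV).a < uAlphaCutoff)
        discard;
}
)";
```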

It could be. It certainly is a difference. But given what you’ve mentioned above, I’m not sure this conclusion follows.

Could be. But in your image above, you’re looking at a top-down view. So all the trees are probably being plastered into the same split (…depending on how UE chooses the split distances).
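
For what it's worth, here's the usual PSSM "practical split scheme" (a lambda blend of logarithmic and uniform splits; not necessarily what UE does), just to illustrate why a top-down view tends to dump everything into one split:

```cpp
#include <cmath>
#include <vector>

// Returns the far plane of each cascade split for [zNear, zFar].
// lambda = 1 is fully logarithmic, 0 is fully uniform.
std::vector<float> cascadeSplits(float zNear, float zFar, int numSplits,
                                 float lambda = 0.5f)
{
    std::vector<float> splits(numSplits);
    for (int i = 1; i <= numSplits; ++i) {
        float t   = float(i) / float(numSplits);
        float log = zNear * std::pow(zFar / zNear, t);  // logarithmic term
        float uni = zNear + (zFar - zNear) * t;         // uniform term
        splits[i - 1] = lambda * log + (1.0f - lambda) * uni;
    }
    return splits;
}
```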

You could force UE to always use one split. That’d be more similar to your renderer. And on that note…

Syncing the shadow gen near/far between UE and your renderer may be very important for depth precision and fragment acceptance/rejection rate, particularly with a small 16-bit depth buffer and all the geometry rendered at approx the same depth.
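
A minimal sketch of fitting those planes, assuming you already have the casters' bounds transformed into light space (glm used for the ortho; all names are placeholders):

```cpp
#include <glm/glm.hpp>
#include <glm/gtc/matrix_transform.hpp>

// aabbMin/aabbMax: caster bounds in light space. With a 16-bit depth buffer you
// only get 65536 depth steps, so the tighter [near, far] hugs the casters, the
// better the precision and early-Z rejection.
glm::mat4 fitShadowProjection(const glm::vec3& aabbMin, const glm::vec3& aabbMax)
{
    // Usual GL convention: light space looks down -Z, so near/far come from
    // the AABB's Z extents, negated.
    float zNear = -aabbMax.z;
    float zFar  = -aabbMin.z;
    return glm::ortho(aabbMin.x, aabbMax.x, aabbMin.y, aabbMax.y, zNear, zFar);
}
```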

Ok, except if your analysis of “SM warp stall drain” is correct (0.3% → 22%), then more of them are stalled waiting on writes. That might suggest fewer fragments are being rejected by the depth test (and/or discard) in your renderer, leaving more of them to update the depth buffer.

On that note, you might look at these GL pipeline statistics in your renderer, and compare them to the same stats in UE if you can. They could provide some useful clues (there's a query sketch after the list):

  • GL_FRAGMENT_SHADER_INVOCATIONS_ARB
  • GL_CLIPPING_INPUT_PRIMITIVES_ARB
  • GL_CLIPPING_OUTPUT_PRIMITIVES_ARB
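
A sketch of wrapping the shadow pass in those counters (requires GL_ARB_pipeline_statistics_query; renderShadowPass() is a stand-in for your depth pass):

```cpp
// One query object per statistics target; different targets can be active at once.
GLuint queries[3];
glGenQueries(3, queries);
glBeginQuery(GL_FRAGMENT_SHADER_INVOCATIONS_ARB, queries[0]);
glBeginQuery(GL_CLIPPING_INPUT_PRIMITIVES_ARB,   queries[1]);
glBeginQuery(GL_CLIPPING_OUTPUT_PRIMITIVES_ARB,  queries[2]);

renderShadowPass();   // your depth-only pass

glEndQuery(GL_CLIPPING_OUTPUT_PRIMITIVES_ARB);
glEndQuery(GL_CLIPPING_INPUT_PRIMITIVES_ARB);
glEndQuery(GL_FRAGMENT_SHADER_INVOCATIONS_ARB);

// GL_QUERY_RESULT blocks until the counts are available.
GLuint64 fragInvocations, clipIn, clipOut;
glGetQueryObjectui64v(queries[0], GL_QUERY_RESULT, &fragInvocations);
glGetQueryObjectui64v(queries[1], GL_QUERY_RESULT, &clipIn);
glGetQueryObjectui64v(queries[2], GL_QUERY_RESULT, &clipOut);
glDeleteQueries(3, queries);
```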

Maybe some differences in frustum culling at work here?

Ok, that’s still going to kill off more fragment shader executions, and avoid more texture lookups/stalls and fragment depth writes/stalls.