Huge performance drop when rendering 60 tris on screen (overdraw)

Hi,

I’m studying the root cause of a performance hit in one particular case, which I believe is related to hardware fill-rate / overdraw limits. Even so, I want to spend some time finding a solution (or alternatives) to mitigate this behaviour:

So I have a scene with:

  • 30 cubes:
    • Each one composed of 4 faces, each with 2 triangles and drawn with glDrawElements(GL_TRIANGLES, indicesCount, GL_UNSIGNED_INT, 0);
  • Viewport resolution of 1920 x 1027
  • Deferred shading (6 color attachments)
  • 3 Render Passes (geometry + screen shading + screen post processing)

When rendering those 30 objects at once in the same position, with the camera at a good distance from them, I get a steady 60 FPS with low GPU usage (~40%).

But when rendering those 30 objects at once in the same position with the camera close to them, I get around 40 FPS with high GPU usage (~90%).

At the same close distance but with a small separation between the cubes, I get a few more frames, but GPU usage is still high.

I’m sharing this behaviour here to discuss it further and get some good viewpoints on how to work around it, because it could matter when rendering scenes where particles (composed of billboarded sprites) come close to the camera, or when using transparency with multiple layered objects at once.

Is there a name for this, or a known way of dealing with it? I haven’t found anything while searching.

regards,
Jakes

You sure you don’t mean 6 faces per cube?

25 ms for 30 layers of fill at 1920x1027 seems very slow.

What GPU and CPU system is this? And PC/laptop make/model?

That’s likely an aggravating factor. In combination with missing occlusion-based optimizations.

Sure. First thing is to confirm exactly where the bottleneck(s) are though.

How does the time break out between G-buffer filling, lighting/shading, and post-processing? Given that you’re using Deferred Shading and your problem description, I’m going to guess the first is where most time is spent. Verify that though!
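One way to organize that breakdown, as a minimal sketch: collect one GPU time per pass and see which pass dominates. The `PassTime` struct and both helper functions are hypothetical names made up for illustration. The timings themselves would have to come from your own measurements, for example GL_TIME_ELAPSED queries wrapped around each pass.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-pass GPU time in milliseconds. In a real renderer this
// could be filled from a GL_TIME_ELAPSED query (glBeginQuery / glEndQuery /
// glGetQueryObjectui64v, which reports nanoseconds) around each pass.
struct PassTime {
    std::string name;
    double ms;
};

// Returns the index of the most expensive pass.
std::size_t slowestPass(const std::vector<PassTime>& passes) {
    assert(!passes.empty());
    std::size_t worst = 0;
    for (std::size_t i = 1; i < passes.size(); ++i) {
        if (passes[i].ms > passes[worst].ms) worst = i;
    }
    return worst;
}

// Fraction of the summed pass time spent in pass i (0 if the sum is 0).
double shareOfFrame(const std::vector<PassTime>& passes, std::size_t i) {
    assert(i < passes.size());
    double total = 0.0;
    for (const PassTime& p : passes) total += p.ms;
    return total > 0.0 ? passes[i].ms / total : 0.0;
}
```

If the geometry pass turns out to hold most of the frame time, that supports the G-buffer fill guess. If not, the bottleneck is elsewhere.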

Standard Deferred Shading doesn’t do transparency. So at each pixel, you’re trying to capture the 1st opaque fragment along a ray from the camera through the pixel.

So first-things-first. Are you culling backfaces (GL_CULL_FACE)? That’ll toss 50% of the overdraw.

Next-up: Are you doing a Z prepass? If overdraw is an issue, and especially when you’re rasterizing to a very fat framebuffer (6 color attachments), you need to make it quick to reject the stuff that’s occluded. With your geometrically uber-simple scene with very few verts/tris, this should be a big win. The concept is simple. Only rasterize object depths to the depth buffer using standard depth compare (no color writes!). Then when filling your G-Buffer, re-enable color writes, disable depth writes, and re-render with an EQUAL depth test using the same pre-filled depth buffer. Only the opaque fragments closest to the eye will be blasted into all 6 color channels of your G-buffer. The rest will be skipped, either at the fragment or the primitive level. … Alternatively, skip the Z prepass, but render your objects nearest-to-furthest in the view frustum.
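To make the effect concrete, here is a small CPU-side model of a single pixel (not real GL code, just a sketch). It counts how many fragments would write to the G-buffer with and without a depth prepass. The function names and the one-pixel model are assumptions for illustration, and the comments note the GL state each step stands in for.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Builds n distinct layer depths in (0, 1), sorted near-to-far, optionally
// reversed so the farthest layer is drawn first.
std::vector<float> makeLayers(int n, bool backToFront) {
    std::vector<float> depths;
    for (int i = 1; i <= n; ++i) {
        depths.push_back(static_cast<float>(i) / static_cast<float>(n + 1));
    }
    if (backToFront) std::reverse(depths.begin(), depths.end());
    return depths;
}

// Single pass: depth test LESS with depth writes on, like glDepthFunc(GL_LESS).
// Every fragment that passes writes to all G-buffer attachments.
int gbufferWritesSinglePass(const std::vector<float>& depthsInDrawOrder) {
    float depthBuffer = 1.0f;  // cleared depth
    int writes = 0;
    for (float d : depthsInDrawOrder) {
        if (d < depthBuffer) {
            depthBuffer = d;
            ++writes;
        }
    }
    return writes;
}

// Two passes over the same fragments:
//  pass 1: color writes off (glColorMask all GL_FALSE), depth writes on, GL_LESS
//  pass 2: color writes on, depth writes off (glDepthMask(GL_FALSE)), GL_EQUAL
int gbufferWritesWithPrepass(const std::vector<float>& depthsInDrawOrder) {
    float depthBuffer = 1.0f;
    for (float d : depthsInDrawOrder) {
        if (d < depthBuffer) depthBuffer = d;  // depth-only fill
    }
    int writes = 0;
    for (float d : depthsInDrawOrder) {
        if (d == depthBuffer) ++writes;  // only the nearest depth survives
    }
    return writes;
}
```

With 30 distinct layers drawn back-to-front, this model gives 30 G-buffer writes in a single pass and 1 with the prepass. Drawing front-to-back in a single pass also gives 1. Note that layers at exactly the same depth all pass the EQUAL test in this model, so perfectly coincident geometry is worth testing separately.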

In either case, to make maximum use of early Z tests (per primitive and per pixel), be sure to disable any/all pipeline state that might cause problems for it (alpha test, discard, writing depth in frag shader, etc.)

Thanks for your prompt answer.

So here is a little more info on the specs of the soft and hardware I’m testing this on:

Hardware:

  • CPU: AMD Ryzen 5 3400G (3.70 GHz)
  • GPU: NVIDIA GeForce GTX 1650 (GDDR6/4GB)
  • RAM: 24 GB
  • Desktop PC

Software:

  • Windowed rendering
  • glfwSwapInterval(1), to sync it
  • Using VAOs (obviously with the glDrawElements)
  • Yes, glEnable(GL_CULL_FACE) used here

No Z prepass here. In fact, I even tried a simpler setup to eliminate variables: I disabled the lighting pass and rendered directly in the geometry pass into the buffer, then drew that texture to the screen, and I still got the same results.

Although, after some investigation, and following what you said about the very fat framebuffer, I reduced it to a single render target and got the performance back up. GPU usage even dropped to 60%, so this could be a starting point for optimizing performance.

About the Z prepass, I’m a little concerned: it could work for simple scenes, but for more complex ones it adds another pass to the rendering, so I would end up with 2 geometry passes. Couldn’t that create other problems down the line?

This could be a better idea for my purpose, although depending on the complexity and the object types, it could also be tricky to adapt.
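For the nearest-to-furthest ordering suggested earlier as an alternative to the Z prepass, a minimal sketch of the sort might look like the following. The `Drawable` struct, its `drawId` field, and the choice of sorting by the distance from the camera to each object's center are all assumptions for illustration. Large or interpenetrating objects may need a better key.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 {
    float x, y, z;
};

// Hypothetical per-object record; drawId stands in for whatever identifies
// the draw call (VAO, index range, material, ...).
struct Drawable {
    int drawId;
    Vec3 center;  // world-space center of the object's bounds
};

// Squared distance is enough for ordering and avoids a sqrt per comparison.
float distanceSquared(const Vec3& a, const Vec3& b) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    const float dz = a.z - b.z;
    return dx * dx + dy * dy + dz * dz;
}

// Orders opaque objects nearest-first so later, farther fragments are more
// likely to fail the depth test before any G-buffer write happens.
void sortFrontToBack(std::vector<Drawable>& objects, const Vec3& cameraPos) {
    std::sort(objects.begin(), objects.end(),
              [&cameraPos](const Drawable& a, const Drawable& b) {
                  return distanceSquared(a.center, cameraPos) <
                         distanceSquared(b.center, cameraPos);
              });
}
```

The sort only needs to be approximate to help, since the depth test still guarantees the correct result whatever the order.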

Ok, good. This makes sense. Also you can look at packing the data you are writing to your G-Buffer into fewer total bytes and attachments as well. This is often possible using what you know about their content (ranges, required precision, valid combinations, palettes, etc.)
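As one small illustration of packing, four values known to lie in [0, 1] (for example an albedo color plus one material parameter) can share a single 32-bit texel instead of four floats. This sketch mirrors the byte layout of GLSL's packUnorm4x8 on the CPU side. The `unpackChannel` helper is a hypothetical name, and whether 8 bits per channel is enough precision depends on what you are storing.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Packs four values in [0, 1] into one 32-bit word, 8 bits each, with the
// first component in the lowest byte (the layout GLSL's packUnorm4x8 uses).
// Out-of-range inputs are clamped.
std::uint32_t packUnorm4x8(float r, float g, float b, float a) {
    auto toByte = [](float v) -> std::uint32_t {
        const float c = std::min(std::max(v, 0.0f), 1.0f);
        return static_cast<std::uint32_t>(std::lround(c * 255.0f));
    };
    return toByte(r) | (toByte(g) << 8) | (toByte(b) << 16) | (toByte(a) << 24);
}

// Recovers one channel (0 = first component) as a float in [0, 1].
float unpackChannel(std::uint32_t packed, int channel) {
    assert(channel >= 0 && channel < 4);
    return static_cast<float>((packed >> (8 * channel)) & 0xFFu) / 255.0f;
}
```

The round trip loses at most half a step of 1/255 per channel, which is usually fine for colors but often too coarse for things like positions.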

Possibly. However, consider that with your current G-Buffer, your processing is spending a ton of memory bandwidth that’s almost completely wasted, and it’s costing you big time. By contrast, rasterizing depth-only geometry is extremely fast. The pipeline has been hyper tuned for this over decades. And you often don’t need any state changes, making it even faster. Even with a lot of geometry, adding a Z-prepass is likely to be a win, especially if there’s a lot of occlusion. You’re protecting against doing expensive work by doing very cheap work.
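A rough back-of-the-envelope estimate shows the scale of that waste. The thread doesn't say which formats the 6 attachments use, so the 8 bytes per texel below (as for an RGBA16F target) is purely an assumption. So is the worst case of every layer covering the whole viewport with nothing rejected by the depth test.

```cpp
#include <cassert>
#include <cstdint>

// Bytes written to the G-buffer in one frame, assuming every pixel is written
// once per overdraw layer to every color attachment (worst case: full-screen
// coverage, no fragment rejected by the depth test).
std::uint64_t gbufferBytesPerFrame(std::uint64_t width, std::uint64_t height,
                                   std::uint64_t attachments,
                                   std::uint64_t bytesPerTexel,
                                   std::uint64_t overdrawLayers) {
    return width * height * attachments * bytesPerTexel * overdrawLayers;
}
```

Under those assumptions, `gbufferBytesPerFrame(1920, 1027, 6, 8, 30)` comes to about 2.8 GB of writes per frame, against roughly 95 MB if only the nearest layer is written. That is the gap a Z prepass or front-to-back ordering is trying to close.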

That’s one thing the Z pre-pass has going for it. You just toss depths at the GPU and let it sort it out quickly. It’s extremely good at that. It’s also very easy to try too.

I have only one other question, regarding MRT, which I think might be a common misconception: is adding more color attachments to an FBO the same (or nearly the same) as rendering the same content to different FBOs?

What I mean is, doing all the calculations in the frag shader and sending different values to gl_FragData[n] seems faster (because it’s all inline in the same shader) than splitting each target channel into its own FBO (with different sizes). Is this true, or is it more complex than that?

And thanks @Dark_Photon for all the input on this topic, happy holidays.

Probably, but the only way to know for sure is to test it with your use case on your hardware.

The 2 pass/2 FBO approach requires 2 FBO binds, vs. 1 or 0 binds (FBO binds can be expensive, even if you’re not doing a reconfig, and you’re pipelining the inputs and outputs well). And with 2 pass, you have to pass the geom down the pipe 2X and transform it 2X. So it makes sense that 1 pass might be cheaper, particularly if your scenes are vertex heavy. But if your scene is very fragment heavy, that’s your performance limiter, and you can cut that back with the different sizes, then maybe 2 pass could win out.

Sure thing, and you too!