Sequence of shaders suddenly slow if shaders in the middle are optimized

I’m running into a strange performance problem and I can’t find any more angles to attack it from. Maybe somebody has an idea what could cause the problem here.

I’ve got a simple sequence of deferred rendering shaders:

  • Shaders A: render depth buffer
  • Shaders B: render material buffers
  • Shaders C: render SSAO (consumes A, B and writes to material buffer)
  • Shaders D: some lighting stuff
  • Shaders E: render sky shadow (consumes A)
  • Shader F: render sky light (consumes A, B, C and E)

So far nothing special. I optimized the SSAO pass (C) and cut its time down by ~1.2 ms. Measuring the performance with RenderDoc, I see an improvement of ~1.2 ms over the entire frame render time, which is exactly what I hoped for.

I then tested this in VR, where render sizes are a lot higher than on a regular PC monitor. In this situation my SSAO optimization shaved off ~7.2 ms, which is quite substantial. But here comes what I didn’t expect: the entire frame render time as measured in RenderDoc is pretty much unchanged, maybe marginally faster.

Examining the measurements in detail, I noticed that before the optimization shader F clocked in at 260 µs. After the optimization the same shader F suddenly clocked in at 2.6 ms(!), even though I did not change it at all. Some shaders even later in the frame also suddenly exploded in time, eating up all the improvement I made.

After some testing I noticed that if I artificially make shaders C expensive again, the duration of shader F goes back to its original value. So it looks as if shortening shaders C causes shader F to take longer. How can this be?

Shader F consumes the depth buffer from A and the material buffers from B and C.

I know the GPU can only start a pass once all of its input textures have finished being rendered to. But no matter whether shaders C are long or short, the last texture consumed by shader F is the one written by shaders C. Also, shaders C themselves consume the depth from A and the material buffers from B; if that were the problem, shaders C would have to be delayed too, but that is not the case.

Do you have any idea what kind of GPU behavior might cause problems here? How can I debug this further? RenderDoc cannot help here anymore, and AMD no longer provides an OpenGL performance tool.

Besides RenderDoc, you haven’t said anything about how you are timing or what your timings even represent:

  • CPU time?
  • GPU time?
  • From what-to-what?
  • VSync OFF?
  • Any explicit sync forced? Between frames? Between sub-frame timing intervals?

Or how you are rendering?

  • Are you double-buffering any resources?
  • Are any ops likely to trigger implicit sync within your frame draw (e.g. reconfiguring FBOs, etc.)?

Do you know for sure that “measuring the performance” with RenderDoc has absolutely no effect on the resultant timings? If not, I’d ditch it and focus on frame-to-frame times measured by your app, with explicit sync between frames (glClear window + glFinish + stop/start frame timer) to ensure no frame-to-frame overlap of CPU queueing or GPU execution. And VSync OFF, of course. And no VR runtime in the loop!
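
Something like this, roughly (just a sketch; I’m assuming GLFW here purely for the window and timer, and renderFrame() is a stand-in for your A–F passes):

```cpp
// Rough sketch of app-side frame timing with explicit sync between frames.
// Assumes a GLFW window/context is current; renderFrame() is a placeholder for your passes.
#include <GLFW/glfw3.h>
#include <cstdio>

void renderFrame();  // your deferred passes A..F go here

void timeFrames(GLFWwindow* window, int frameCount)
{
    glfwSwapInterval(0);                                   // VSync OFF

    for (int i = 0; i < frameCount; ++i)
    {
        glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
        glFinish();                                        // GPU fully idle before the timer starts
        double t0 = glfwGetTime();

        renderFrame();

        glFinish();                                        // wait for every GPU command of this frame
        double t1 = glfwGetTime();
        std::printf("frame %d: %.3f ms\n", i, (t1 - t0) * 1000.0);

        glfwSwapBuffers(window);
    }
}
```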

Optimize that first. Get that solid. Isolate component timings as needed until everything makes sense. Then build up from that, step by step.

You could consider popping in an NVIDIA GPU, installing the NVIDIA drivers, and using Nsight Systems and Nsight Graphics for CPU and GPU profiling and perf analysis.

Without knowing how you’re timing, this could be any one of a number of things. If you’ve done much optimization, you know that optimizing one stage can yield zero perf improvement if that stage wasn’t the bottleneck. Alternatively, if you’re comparing GPU time vs. CPU time or CPU+GPU time here, then this is an apples-and-oranges comparison. Alternatively, it could be that the times you’re inferring to be execution-only time are in fact queuing + optional flush + optional implicit sync + possibly partial execution times.

Also, anytime I see “VR”, I’m immediately suspicious. VR runtimes (and their compositors plus the boatload of threads/processes that run behind the scenes) can greatly increase the load on the system and reduce the performance of your application. Yes, there’s a difference in render res and refresh rate. But there’s also a big difference in CPU+GPU load and contention that your app has to deal with. They also tend to force VSync ON (even if you had it OFF when rendering to the window), as you are synchronizing rendering to the VR compositor’s scan-out clock (through the VR runtime) and not the desktop window manager’s scan-out clock (possibly monitor-driven or virtualized depending on mode). This can radically affect your frame timings, your frame-to-frame overlap, and the likelihood of triggering implicit sync, depending on how you are collecting timing measurements.

So on this “VR performance issue”… you might do some testing without the VR runtime and its processes in the loop, and instead test your app pushed up to the target VR render resolution but still rendering to a window. That’ll give you crucial info on how much of the “VR performance” problem is due to:

  • your app purely becoming fill limited for scene rendering (at least for portions of its rendering), vs.
  • the added CPU/GPU contention, different workload, and possibly forced-VSync associated with rendering through the VR runtime.

In RenderDoc the timings are GPU times of single draw commands, measured using whatever GPU counters are supported. For groups of commands RenderDoc sums up the individual GPU durations.

In this problem the timing in question (from 260 µs to 2600 µs) belongs to shader F, which is a single draw command (a fullscreen quad).
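
For reference, the GL-level equivalent of that per-draw GPU time would be a GL_TIME_ELAPSED query around the draw, which I could use to cross-check outside RenderDoc. A rough sketch (drawFullscreenQuad() is just a placeholder for the actual quad draw; a GL 3.3+ context and function loader are assumed):

```cpp
// Cross-check of the per-draw GPU time with a timer query, no RenderDoc involved.
// Assumes a GL 3.3+ context and loader; drawFullscreenQuad() is a hypothetical placeholder.
extern void drawFullscreenQuad();

GLuint64 timeShaderFDrawNs()
{
    GLuint query = 0;
    glGenQueries(1, &query);

    glBeginQuery(GL_TIME_ELAPSED, query);
    drawFullscreenQuad();                                  // the single draw command of shader F
    glEndQuery(GL_TIME_ELAPSED);

    GLuint64 ns = 0;
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);    // blocks until the GPU result is ready
    glDeleteQueries(1, &query);
    return ns;                                             // elapsed GPU time in nanoseconds
}
```

If the GPU has to stall in front of that draw while earlier passes finish writing its inputs, I would expect the stall to be included in such an interval, which might be exactly what is showing up here.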

VSync is handled automatically by the system compositor, so it is definitely not off.

Explicit syncing is done with glMemoryBarrier() after compute shaders, and at a few other places with glClientWaitSync(). That fencing, though, is used well outside this code and only outside timed code, since it is for resource loading and for knowing when a read-back has finished. So I do not think it could cause problems here.
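
Roughly, that read-back fencing looks like this (a simplified sketch of the loading/read-back path, well outside any timed rendering code):

```cpp
// Simplified sketch of the fencing used for read-backs, outside the timed render path.
// Issued right after the read-back commands (e.g. glReadPixels into a PBO):
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

// Later, on the loading side: wait until the GPU has passed the fence.
// GL_SYNC_FLUSH_COMMANDS_BIT ensures the fence actually gets submitted before waiting.
GLenum status = glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 16'000'000ull /* 16 ms */);
if (status == GL_ALREADY_SIGNALED || status == GL_CONDITION_SATISFIED)
{
    // the read-back buffer contents are now safe to map/read
}
glDeleteSync(fence);
```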

The actual barriers used for shaders C are GL_SHADER_IMAGE_ACCESS_BARRIER_BIT and GL_TEXTURE_FETCH_BARRIER_BIT. (Incorrectly) leaving them out has no effect on the behavior, though. I can also reproduce the problem with the older shaders from before the optimization, where I had no compute shaders at all: if I force those to be artificially faster, the same behavior shows up.
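
For context, the barrier sits right after the SSAO compute dispatch, roughly like this (ssaoComputeProgram, ssaoTexture and the dispatch sizes are placeholders for the real code):

```cpp
// Rough placement of the barrier in the SSAO pass (names and sizes are placeholders).
glUseProgram(ssaoComputeProgram);
glBindImageTexture(0, ssaoTexture, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_R8);
glDispatchCompute((ssaoWidth + 7) / 8, (ssaoHeight + 7) / 8, 1);

// Make the image writes visible to subsequent image loads and texture fetches
// (shader F samples the SSAO result as a regular texture).
glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT | GL_TEXTURE_FETCH_BARRIER_BIT);
```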

Double buffering is used. FBO reconfiguration is used for shadow mapping, but that affects only the shadow map; no content rendered prior to that point is used for it.

RenderDoc does not inflate the timings, since they are captured by the GPU. It does slow down the application due to the CPU work it does, but so far the timings are comparable. If I do frame-to-frame timing without RenderDoc I see a similar problem, so RenderDoc is not messing something up here.

I’ll try disabling VSync and see how far I can push the resolution up while rendering to a window.
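
Something along these lines is what I have in mind (sceneFbo, eyeWidth/eyeHeight and the window sizes are placeholders for my actual setup; GLFW assumed for the window):

```cpp
// VSync off, scene rendered at the VR per-eye resolution into an offscreen FBO,
// then scaled down into a normal window. All names here are placeholders.
glfwSwapInterval(0);                                    // VSync OFF for the window

glBindFramebuffer(GL_FRAMEBUFFER, sceneFbo);            // FBO sized eyeWidth x eyeHeight
glViewport(0, 0, eyeWidth, eyeHeight);
renderFrame();                                          // passes A..F at VR resolution

glBindFramebuffer(GL_READ_FRAMEBUFFER, sceneFbo);       // scale the result into the window
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);
glBlitFramebuffer(0, 0, eyeWidth, eyeHeight,
                  0, 0, windowWidth, windowHeight,
                  GL_COLOR_BUFFER_BIT, GL_LINEAR);
glfwSwapBuffers(window);
```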
