Odd intermittent frame time spike that resolves when enabling debug context

I’ve got a weird issue that I’ve been able to drill down to a specific point, but I’m now hitting a roadblock. I’ve been working on an engine for a while and recently noticed an intermittent (and drastic) drop in framerate lasting a single frame. The first occurrence is almost exactly 30 seconds into the application runtime, every time. Profiling the application suggested the issue is somehow related to binding textures (or so it seems on the surface): as soon as the drop happened, the profiler reported that my method responsible for binding textures and passing uniform data was taking up to 80 ms to complete (tested multiple times; results varied anywhere between 20 ms and 80 ms).

I find it strange that the issue is this intermittent. The first occurrence always comes at about 30 seconds; after that it might happen again or it might not, and if it does, it’s several minutes after the first time. Other than that first occurrence there’s no recognizable pattern. Can anyone think of something that might cause this behavior? It almost feels like there’s a race condition on the GPU.

Now, what I find very strange: I began debugging with NVIDIA Nsight, thinking that whatever was happening would be pretty apparent once it occurred. However, the issue is non-existent when the OpenGL debug context is enabled. I confirmed this by adding application code to enable and print OpenGL debug messages directly (removing Nsight from the equation), and the issue is likewise non-existent. As soon as I disable the OpenGL debug context, the issue presents itself again.

I know there’s probably no quick, easy answer to why this is happening, but I wanted to see if anyone else has encountered behavior like this, or knows of any other means by which I might track down what is going on.

Below is a stripped-down version of what’s going on (if anyone would like to see the full codebase, it can be accessed via the “massive_refactor” branch in the repo):

// Scene::draw occurs every tick; there's a lot more code within the
// actual method, this is just the relevant part

void Scene::draw(float alpha)
{
	// This reports that useHdr is the bottleneck
	if (r.getShader() != gpu->getActiveShader())
		gpu->useShader(r.getShader());
	if (s.getActiveHdr() != gpu->getActiveHdr())
		gpu->useHdr(s.getActiveHdr());
	if (r.getMesh() != gpu->getActiveMesh())
		gpu->useMesh(r.getMesh());
	if (r.getMaterial() != gpu->getActiveMaterial())
		gpu->useMaterial(r.getMaterial());

	// This reports that useMaterial is the bottleneck; note that
	// after useShader we first call useMesh, which does not exhibit
	// the same behavior
	/*
	if (r.getShader() != gpu->getActiveShader())
		gpu->useShader(r.getShader());
	if (r.getMesh() != gpu->getActiveMesh())
		gpu->useMesh(r.getMesh());
	if (r.getMaterial() != gpu->getActiveMaterial())
		gpu->useMaterial(r.getMaterial());
	if (s.getActiveHdr() != gpu->getActiveHdr())
		gpu->useHdr(s.getActiveHdr());
	*/

	for (auto& a : r.actors.getAll())
		this->drawActor(a, alpha);
}


void GPU::useHdr(HDR* h)
{
	this->activeHdr = h;
	
	// bind pre-computed IBL data
	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_CUBE_MAP, h->irradianceMap);
	this->setShaderInt("irradianceMap", 0);
	
	glActiveTexture(GL_TEXTURE1);
	glBindTexture(GL_TEXTURE_CUBE_MAP, h->prefilterMap);
	this->setShaderInt("prefilterMap", 1);
	
	glActiveTexture(GL_TEXTURE2);
	glBindTexture(GL_TEXTURE_2D, h->brdfLUTTexture);
	this->setShaderInt("brdfLUT", 2);
}

void GPU::useMaterial(Material* m)
{
	this->activeMaterial = m;

	glActiveTexture(GL_TEXTURE3);
	glBindTexture(GL_TEXTURE_2D, m->albedo->id);
	this->setShaderInt("albedoMap", 3);
	
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, m->normal->id);
	this->setShaderInt("normalMap", 4);
	
	glActiveTexture(GL_TEXTURE5);
	glBindTexture(GL_TEXTURE_2D, m->metallic->id);
	this->setShaderInt("metallicMap", 5);
	
	glActiveTexture(GL_TEXTURE6);
	glBindTexture(GL_TEXTURE_2D, m->roughness->id);
	this->setShaderInt("roughnessMap", 6);
	
	glActiveTexture(GL_TEXTURE7);
	glBindTexture(GL_TEXTURE_2D, m->ao->id);
	this->setShaderInt("aoMap", 7);
}

void GPU::setShaderInt(const std::string& name, int value) const
{
    // it was mentioned that I may try caching uniform locations in order
    // to resolve the issue, while this is a good idea, and something to be
    // implemented moving forward, I don't think that would be the root cause
    // of this issue (since enabling the debug context resolves it)
	glUniform1i(glGetUniformLocation(
		this->activeShader->id,
		name.c_str()),
		value);
}
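For reference, the uniform-location caching mentioned in the comment above could look roughly like the sketch below. This is only an illustration, not the actual code from the repo; the uniformLocations map member (and hanging it off the shader object) is an assumption.

// Hypothetical sketch: cache uniform locations per shader so
// glGetUniformLocation is only hit on the first use of each name.
// Assumes the shader object has a member such as:
//   std::unordered_map<std::string, GLint> uniformLocations;  // needs <unordered_map>
void GPU::setShaderInt(const std::string& name, int value)
{
	auto it = this->activeShader->uniformLocations.find(name);
	if (it == this->activeShader->uniformLocations.end())
	{
		GLint loc = glGetUniformLocation(this->activeShader->id, name.c_str());
		it = this->activeShader->uniformLocations.emplace(name, loc).first;
	}
	glUniform1i(it->second, value);
}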

RESOLVED (but not really)

I was able to track the issue down to the methods that set uniform data, at which point caching the uniform locations (as was suggested to me elsewhere) seemed like the logical thing to do. After doing so, the issue appeared to resolve itself. But that leaves me with two questions:

  1. Why did making so many calls to glGetUniformLocation cause this behavior? I can see where a hit in performance would make sense, but how could it have caused such a long, random block?

  2. Why did enabling the OpenGL debug context resolve the issue?

EDIT
So it appears I was a little too eager in reporting that resolution. The issue persists, only now it’s not being reported in my application profiler. If I log frame times, this is what I’m seeing:

Take 1 (elapsed time : frame time, both in seconds):

29.9445: 0.0041667
29.9487: 0.0041667
29.9528: 0.0041667
29.957: 0.0041667
29.9612: 0.0041667
29.9863: 0.0251476
29.9905: 0.0041667
29.9946: 0.0041667
29.9988: 0.0041667
30.003: 0.0041667
30.0071: 0.0041667

Take 2 (elapsed time : frame time, both in seconds):

29.8264: 0.0041667
29.8306: 0.0041667
29.8347: 0.0041669
29.8612: 0.0264651
29.8654: 0.0041667
29.8695: 0.0041667
29.8737: 0.0041667

So it appears I was a little too eager in reporting that resolution. The issue persists, only now it’s not being reported in my application profiler. If I log frame times, this is what I’m seeing:

These are fun to track down, aren’t they?

Keep in mind this could just be some implicit sync going on in the driver, so be cautious about assuming that the point in your CPU command queuing where you see a slowdown is where the problem actually originates. It might be, though.

Before we dive into what you’re doing exactly, how are you profiling? Timing method, NVIDIA Control Panel → Manage 3D Settings options related to perf, OS, VSync, fullscreen Y/N, GPU synchronization, etc.? And what GPU / GPU driver? Are you relatively sure that your app isn’t just getting pre-empted for 20-80 ms? How have you established that?

Once you’re sure the base timing method is dependable, I’d remove everything from your draw loop and verify that the problem “goes away”. Then re-enable things selectively until it comes back. That should help correlate whether something you’re doing is causing the problem.

These are fun to track down, aren’t they?

They most certainly are :smiley:

  • Profiling using the following VS extension
  • Timing - std::chrono
  • OS - Windows 10 x64
  • VSync - Off via glfwSwapInterval(0)
  • Fullscreen - happens regardless of whether or not in fullscreen or windowed mode
  • GPU synchronization - relying upon glfwSwapBuffers method, more on this below
  • GPU - RTX3070
  • Nvidia driver version - 471.41

Are you relatively sure that your app isn’t just getting pre-empted for 20-80 ms?

Not sure I follow what you mean here.

Also, it’s probably worth mentioning that when recompiling GLFW to include the profiler functionality, the random 20-80 ms block shows back up, but I’m unable to get the names of the functions out of the profiler at the moment. I have reached out to the extension author, as this appears to be a known issue. Having that information should be quite useful.

After you mentioned checking GPU synchronization, it made me wonder what would happen if I called glFinish AFTER the call to glfwSwapBuffers (I had tried calling it before swapping buffers, but never after), and so far, after having done so, I have yet to encounter the issue. Should glFinish be required after swapping buffers? I was under the impression that swapping buffers did the sync.

No experience with this one. I have seen profiling tools introduce spikes due to flushing output, though. I’d make sure you can repro your frame time spike outside of MSVS, without it and MicroProfiler in the loop. MSVS is a pig with all kinds of perf problems, and there’s no telling whether it has something to do with the problem you’re seeing.

Ok, sounds good.

Re GPU synchronization, one tip. I’d suggest that, for profiling purposes only, you add a glFinish() after calling SwapBuffers() + glClear( window ), and time your frames like this (see the sketch after the lists below):

  1. glClear() window
  2. glFinish()
  3. <----------------- stop/start frame timer here
  4. submit frame
  5. glfwSwapBuffers()

Why I suggest that is 3-fold:

  1. This accounts for the entire amount of time required to render “this” frame, both CPU and GPU – and this frame alone!
  2. This avoids implicit sync problems due to queue-ahead (i.e. simultaneous CPU queuing of frame N and GPU execution of frame N-1, N-2, and/or N-3, sometimes resulting in added “resource contention” in the driver)
  3. This ordering also supports frame timing well with both VSync OFF and VSync ON (more details below if you care).
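To make that ordering concrete, here’s a minimal sketch of such a frame loop under this thread’s setup (GLFW + std::chrono). submitFrame() is a hypothetical stand-in for whatever actually queues the frame’s draw calls; it isn’t from the codebase above.

// Minimal profiling frame loop (sketch). glFinish() after the window clear
// blocks until the GPU has finished the previous frame and a swap chain image
// is available, so the timer measures one full CPU+GPU frame and nothing else.
#include <chrono>
#include <GLFW/glfw3.h>

void submitFrame(double frameTimeSeconds); // hypothetical: queues this frame's draw calls

void runFrameLoop(GLFWwindow* window)
{
	auto frameStart = std::chrono::steady_clock::now();

	while (!glfwWindowShouldClose(window))
	{
		glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT); // 1. clear window
		glFinish();                                         // 2. wait for the GPU

		auto now = std::chrono::steady_clock::now();        // 3. stop/start frame timer here
		std::chrono::duration<double> frameTime = now - frameStart;
		frameStart = now;

		submitFrame(frameTime.count());                     // 4. submit frame
		glfwSwapBuffers(window);                            // 5. queue the swap
		glfwPollEvents();
	}
}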

Ah! That’s a good lead! It could very well be cross-frame implicit sync issues cropping up in the driver. This can happen for instance when changing a resource on the CPU for which an unexecuted draw is queued in the command queue.
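As an illustration only (the names here are made up, not from the codebase above), the classic pattern is overwriting a buffer that a still-queued draw reads from; orphaning the buffer is one common way to sidestep the stall:

// Sketch: update a streaming vertex buffer without forcing an implicit sync.
void updateStreamingBuffer(GLuint vbo, GLsizeiptr dataSize, const void* newData)
{
	glBindBuffer(GL_ARRAY_BUFFER, vbo);

	// If a previously queued draw still reads from this buffer, overwriting it
	// directly can force the driver to wait (implicit sync) for that draw:
	//   glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, newData);

	// A common way to avoid the stall is to "orphan" the buffer first, so the
	// driver can hand back fresh storage instead of synchronizing:
	glBufferData(GL_ARRAY_BUFFER, dataSize, nullptr, GL_STREAM_DRAW);
	glBufferSubData(GL_ARRAY_BUFFER, 0, dataSize, newData);
}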

Also, check your GPU settings in NVIDIA Control Panel → Manage 3D Settings, make sure you have:

  • Power management mode = Prefer Maximum Performance

I’ve seen all kinds of oddball frame spike issues caused by dynamic power management (PowerMizer or whatever they’re calling it now) in the NVIDIA driver. This tends to avoid those problems. Also in general I’d suggest:

  • Threaded Optimization = Off
  • Triple Buffering = Off
  • Low Latency Mode = On
  • Monitor Technology = Fixed Refresh

Preempted as in “kicked off the CPU”. Not running.

If so you should be able to see this in Nsight Systems. I’d do an Nsight Systems run on your app without markup enabled, and one with markup enabled. KHR_debug markup works, but I find that NVTX markup works even better with Nsight Systems as it seems to be flushed more frequently leading to more accurate start/stop times (KHR_debug events sit in the queue until flushed by GL).

Once you repro the problem in Nsight Systems, you can tune which GL calls occur in the instrumentation to nail down which one(s) might be causing CPU-side bottlenecks. Also note that you don’t need markup (KHR_debug or NVTX) enabled to see the GL calls for which instrumentation collection is enabled.
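For reference, KHR_debug range markup is just a couple of calls around the region you want to see as a named range in the timeline. A minimal sketch (the label string is only an example; requires a GL 4.3 or KHR_debug-capable context):

// Wrap a region of GL work in a KHR_debug group so tools such as Nsight can
// show it as a named range. -1 means the label is null-terminated.
glPushDebugGroup(GL_DEBUG_SOURCE_APPLICATION, 0, -1, "Scene::draw");
// ... issue the GL calls you want grouped under this label ...
glPopDebugGroup();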

Use Nsight Systems. You can get the GL calls with that.

No. However, if you wish to limit driver queue ahead for your own reasons, you can use this or sync objects to do so. glFinish() totally prevents queue-ahead (and so is useful for obtaining consistent frame timings for profiling). Whereas with sync objects you have more flexibility and can bound how much queue ahead you’ll allow the driver to do (for instance limit it to 0, 1, or 2 frames queue-ahead – controllable by you). The more queue-ahead, the more potential for resource contention problems between frames leading to implicit sync. However, when not profiling, for the absolute best frame times, you do want some driver queue-ahead as this lets you spend a little more time queuing your frames on the CPU.
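As a rough sketch of what bounding queue-ahead with sync objects can look like (the one-frame bound and the names are illustrative, not code from this thread; assumes GL 3.2+ sync objects):

// At the end of frame N, wait on the fence from frame N-1, then insert a new
// fence. This keeps the CPU roughly one frame ahead of the GPU at most.
GLsync prevFence = 0;

void endFrameWithQueueAheadLimit(GLFWwindow* window)
{
	glfwSwapBuffers(window);

	if (prevFence)
	{
		// Block until the GPU has passed the previous frame's fence, flushing
		// first so the wait can't stall on unsubmitted commands.
		glClientWaitSync(prevFence, GL_SYNC_FLUSH_COMMANDS_BIT, GLuint64(-1));
		glDeleteSync(prevFence);
	}
	prevFence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
}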

No. It only queues a request for the buffer swap to be performed, but in general it does not wait for it (or even wait for frame rendering to finish on the GPU). If you want to wait for the GPU to finish rendering the image, you have to tell the driver to wait for it (e.g. with glFinish() after SwapBuffers, or equivalent with sync objects).

Also related to this, and regarding that frameloop timing ordering I listed above…

From the GL side, with pure GL, you can’t literally wait for the buffer swap to occur and the image to be displayed. This happens at the back end of the window swap chain, and you don’t have visibility into that – with GL.

However, you do have visibility into when the window image you’re submitting:

  1. completes rendering on the GPU,
  2. is inserted into the front end of the window swap chain, and
  3. the driver has obtained a new swap chain image to render the next frame into.

That occurs right after SwapBuffers() + glClear( window) + glFinish(). Or if running VSync ON, then when your app renders within its time budget, this point will generally line up with the VBlank clock. The reason is that in this case, new swap chain images are available for #3 only on VBlank clocks. So when one frees up, then your app acquires it, allowing the glClear( window ) operation to complete. Note that glFinish() here does an immediate blocking wait, but you could just as well use sync objects to do a deferred wait.

Also regarding your swapbuffers sync question… If you render in a non-fullscreen window, the NV driver will often block your app in SwapBuffers(). However, if rendering to a fullscreen window with focus, the NV driver will often queue SwapBuffers() with no wait and return immediately, allowing you to proceed queuing work for subsequent frames. This is called queue ahead and can be influenced by some settings in Manage 3D Settings. Of course, when desired, you can take control yourself and limit how much queue ahead you want in your app by using glFinish() (as above) or sync objects.

BTW, one other thought for you…

Do you have a glDebugMessageCallback() plugged in? The NVIDIA driver is very good about warning the application when the app does something that could cause reduced performance. Make sure the driver isn’t trying to warn you about one of these.
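For reference, plugging one in is only a few lines. Here’s a minimal sketch (assumes a GL 4.3-capable context and a function loader; enabling every severity is just one reasonable choice):

#include <cstdio>

// Minimal GL debug message callback (GL 4.3 / KHR_debug).
static void GLAPIENTRY debugCallback(GLenum source, GLenum type, GLuint id,
	GLenum severity, GLsizei length, const GLchar* message, const void* userParam)
{
	// Print everything, including the performance warnings the driver emits.
	std::fprintf(stderr, "GL debug [src=0x%X type=0x%X sev=0x%X id=%u]: %s\n",
		source, type, severity, id, message);
}

void enableGLDebugOutput()
{
	glEnable(GL_DEBUG_OUTPUT);
	glEnable(GL_DEBUG_OUTPUT_SYNCHRONOUS); // messages arrive on the offending call
	glDebugMessageCallback(debugCallback, nullptr);
	// Enable all messages, including low-severity notifications:
	glDebugMessageControl(GL_DONT_CARE, GL_DONT_CARE, GL_DONT_CARE, 0, nullptr, GL_TRUE);
}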

Wow! You’ve given me a ton to look into! THANK YOU! I will be going through your recommendations and reporting my findings.

Do you have a glDebugMessageCallback() plugged in?

I have it configured so that I can enable/disable the OpenGL debug context, and currently it is disabled, since the issue disappears when the debug context is enabled. When I do enable it, though, I only ever receive one low-priority message about one of my textures missing mipmaps; resolving that made no difference with regard to the ongoing issue.

Also, I was able to get the function names out of GLFW at the point where the time spike occurs, and it appears to be happening in the function responsible for swapping buffers:

static void swapBuffersWGL(_GLFWwindow* window)
{
    if (!window->monitor)
    {
        if (IsWindowsVistaOrGreater())
        {
            // DWM Composition is always enabled on Win8+
            BOOL enabled = IsWindows8OrGreater();

            // HACK: Use DwmFlush when desktop composition is enabled
            if (enabled ||
                (SUCCEEDED(DwmIsCompositionEnabled(&enabled)) && enabled))
            {
                int count = abs(window->context.wgl.interval);
                while (count--)
                    DwmFlush();
            }
        }
    }

    SwapBuffers(window->context.wgl.dc);
}

One more thing worth mentioning: if I remove my draw logic from my loop (never call Scene::draw), so that I’m simply clearing and then swapping buffers every tick, the issue persists 0_0. As I’m typing this, I’m starting to think that maybe this is some GLFW-specific issue, and that it might be worth my time to create a minimal reproduction project outside of my existing codebase to see if the issue can be reproduced there.

Don’t neglect DWM (the desktop compositor) as a possible culprit. Try rendering to a fullscreen, focused window; your rendered image then takes a faster path to the display. If the window isn’t fullscreen and focused, the image is routed along the longer path through DWM, which composites it with all the other desktop content, and you’re not truly synchronizing with the monitor’s VBlank clock.

Try rendering to a fullscreen, focused window.

Can confirm the issue persists within a focused, fullscreen window.

Also, check your GPU settings in NVIDIA Control Panel → Manage 3D Settings, make sure you have:

  • Power management mode = Prefer Maximum Performance

I’ve seen all kinds of oddball frame spike issues caused by dynamic power management (PowerMizer or whatever they’re calling it now) in the NVIDIA driver. This tends to avoid those problems. Also in general I’d suggest:

  • Threaded Optimization = Off
  • Triple Buffering = Off
  • Low Latency Mode = On
  • Monitor Technology = Fixed Refresh

Alright…here’s where it gets exciting! I set Power management mode as suggested. I then went through each of the other options one at a time, enabling and disabling as I went. It appears Threaded Optimization = Off resolves the issue.

I did a full release build (without any of the profiling code in the final .exe, just to be on the safe side), set it to fullscreen, and capped fps at 1000. When running the application with Threaded Optimization = Off, fps never dips below 1000…rock solid.

If I set Threaded Optimization = On and relaunch the application, right on schedule, about 30 seconds into runtime, the app hangs for a frame and fps drops.

After doing some googling, I came across this stackoverflow post, where the OP describes a situation VERY similar to mine, and apparently this is a known issue. Further down the thread, a user explains how to create a “custom Application Profile” using NVIDIA’s driver-settings API, with which you can change this setting from your application. Not an ideal solution, but better than having a screen telling users to do it manually. Also, that thread is from 2016 0_0. So five years later and this is still a common issue? Odd…
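For anyone who finds this later: the approach described in that thread uses NVAPI’s driver-settings (DRS) interface to force the equivalent of Threaded Optimization = Off for your application. The sketch below is from memory and untested; the function and constant names should be checked against nvapi.h / NvApiDriverSettings.h, and as I understand it the setting is read at GL context creation, so this has to run before the context is created (or take effect on the next launch).

// Rough sketch (untested): disable the driver's threaded optimization via
// NVAPI DRS. Requires linking against NVAPI and the headers below.
// Note: this modifies the base (global) profile; creating a per-application
// profile with NvAPI_DRS_CreateProfile / NvAPI_DRS_CreateApplication is the
// more targeted approach described in that thread.
#include <nvapi.h>
#include <NvApiDriverSettings.h>

bool disableNvThreadedOptimization()
{
	if (NvAPI_Initialize() != NVAPI_OK)
		return false; // not an NVIDIA GPU/driver, nothing to do

	NvDRSSessionHandle session = 0;
	NvDRSProfileHandle profile = 0;
	NvAPI_DRS_CreateSession(&session);
	NvAPI_DRS_LoadSettings(session);
	NvAPI_DRS_GetBaseProfile(session, &profile);

	NVDRS_SETTING setting = {};
	setting.version = NVDRS_SETTING_VER;
	setting.settingId = OGL_THREAD_CONTROL_ID;
	setting.settingType = NVDRS_DWORD_TYPE;
	setting.u32CurrentValue = OGL_THREAD_CONTROL_DISABLE;

	bool ok = NvAPI_DRS_SetSetting(session, profile, &setting) == NVAPI_OK &&
	          NvAPI_DRS_SaveSettings(session) == NVAPI_OK;

	NvAPI_DRS_DestroySession(session);
	NvAPI_Unload();
	return ok;
}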

Very interesting results! Thanks for following up with this! And the stackoverflow link is excellent too.

I only mentioned Threaded Optimization because I’ve personally hit perf problems with it ON before, so I habitually turn it OFF now. It makes assumptions (most of which I don’t know about) about how the application does or does not use GL.

Related to this, one thing NVIDIA has stated repeatedly in the past is that if you do Buffer Object Streaming using GL_MAP_UNSYNCHRONIZED_BIT, it will not play nice with their GL driver in threaded mode (Threaded Optimization = ON). Here’s one of the presentations they mention this in:

That said, I have no idea what else Threaded Optimization ON assumes about how the app uses GL. And it’s very hard to optimize an app to make best use of a driver feature when they don’t tell you how to do it.
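For context, the kind of buffer streaming referred to above looks roughly like this. It’s only a sketch with illustrative names showing the GL_MAP_UNSYNCHRONIZED_BIT usage being discussed; the caller still has to guarantee the mapped region isn’t in use by the GPU (typically by cycling regions guarded with fences).

#include <cstring>

// Sketch: write vertex data into a region of a buffer without the driver
// waiting on pending GPU reads of that buffer (GL_MAP_UNSYNCHRONIZED_BIT).
void writeStreamingVertices(GLuint vbo, GLintptr offset, GLsizeiptr size, const void* src)
{
	glBindBuffer(GL_ARRAY_BUFFER, vbo);
	void* dst = glMapBufferRange(GL_ARRAY_BUFFER, offset, size,
		GL_MAP_WRITE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
	if (dst)
	{
		std::memcpy(dst, src, static_cast<size_t>(size));
		glUnmapBuffer(GL_ARRAY_BUFFER);
	}
}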