Framebuffer copy of MS textures on integrated graphics

Hi everyone,

I have no experience writing OpenGL code for Intel integrated graphics, and I was surprised by the amount of GPU time spent copying framebuffer contents, in particular multi-sample textures.

This is the result I got with a simple performance test:

// Timestamp query just before the blit (queries are double-buffered per frame)
IssueTimestampQuery(timeQueries[_frameCnt % 2 * 2]);

// Copy color + depth between two multisample FBOs of identical size and sample count
glBindFramebuffer(GL_READ_FRAMEBUFFER, fbo0);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fbo1);
glBlitFramebuffer(0, 0, SCR_WIDTH, SCR_HEIGHT, 0, 0, SCR_WIDTH, SCR_HEIGHT,
                  GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT, GL_NEAREST);

// Timestamp query just after the blit
IssueTimestampQuery(timeQueries[_frameCnt % 2 * 2 + 1]);
Copy fbo time on Intel Intel(R) UHD Graphics
--------------------------------------------------------------------
>> color and depth size:    1080x1080
>> color format:            GL_RGBA8
>> depth format:            GL_DEPTH_COMPONENT
>> color and depth samples: 4
>> time elapsed:            7786.49 us averaged over 234 frames

Copy fbo time on Intel Intel(R) UHD Graphics
--------------------------------------------------------------------
>> color and depth size:    1080x1080
>> color format:            GL_RGBA8
>> depth format:            GL_DEPTH_COMPONENT
>> color and depth samples: 8
>> time elapsed:            16437.7 us averaged over 149 frames

Copy fbo time on Intel Intel(R) UHD Graphics
--------------------------------------------------------------------
>> color and depth size:    1080x1080
>> color format:            GL_RGBA8
>> depth format:            GL_DEPTH_COMPONENT
>> color and depth samples: 16
>> time elapsed:            36195.7 us averaged over 103 frames

Copy fbo time on NVIDIA Corporation NVIDIA GeForce GTX 1650 Ti/PCIe/SSE2
--------------------------------------------------------------------
>> color and depth size:    1080x1080
>> color format:            GL_RGBA8
>> depth format:            GL_DEPTH_COMPONENT
>> color and depth samples: 4
>> time elapsed:            510.5 us averaged over 306 frames

Copy fbo time on NVIDIA Corporation NVIDIA GeForce GTX 1650 Ti/PCIe/SSE2
--------------------------------------------------------------------
>> color and depth size:    1080x1080
>> color format:            GL_RGBA8
>> depth format:            GL_DEPTH_COMPONENT
>> color and depth samples: 8
>> time elapsed:            1437.05 us averaged over 462 frames

Copy fbo time on NVIDIA Corporation NVIDIA GeForce GTX 1650 Ti/PCIe/SSE2
--------------------------------------------------------------------
>> color and depth size:    1080x1080
>> color format:            GL_RGBA8
>> depth format:            GL_DEPTH_COMPONENT
>> color and depth samples: 16
>> time elapsed:            2151.57 us averaged over 591 frames

  1. Are those values reasonable, or should I suspect something is wrong (e.g. outdated drivers, although I did update them recently)?

  2. Does this mean that copy operations on multi-sample resources are not feasible during the render loop on Intel UHD Graphics? Are there techniques or perf tips I could use to make the copy faster?


What is IssueTimestampQuery doing? Is it calling glGetInteger64v(GL_TIMESTAMP, …)? If so, that's not effectively timing GPU commands. Getting the timestamp this way only gives you the GPU time at which prior commands have been issued, not completed. So it's entirely possible that your timer is including the execution of rendering your frame or other things.

Use query object based timestamp queries via glQueryCounter.

This is how I’m retrieving the data I’ve shown:

GLuint64 GetTimeDelta(GLuint qStart, GLuint qEnd)
{
    // Busy-wait until the result of the later timestamp is available
    GLint ready = 0;
    while (!ready) glGetQueryObjectiv(qEnd, GL_QUERY_RESULT_AVAILABLE, &ready);

    // Both timestamps are reported in nanoseconds
    GLint64 start_ns, end_ns;
    glGetQueryObjecti64v(qStart, GL_QUERY_RESULT, &start_ns);
    glGetQueryObjecti64v(qEnd,   GL_QUERY_RESULT, &end_ns);

    return end_ns - start_ns;
}

void IssueTimestampQuery(GLuint q)
{
    glQueryCounter(q, GL_TIMESTAMP);
}
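
And this is roughly how they're wired up each frame (a sketch, not the exact code; InitTimers and TimedBlit are just illustrative names, while timeQueries and _frameCnt are the same ones as in the first post):

GLuint timeQueries[4];   // two (start, end) pairs, alternated per frame
int    _frameCnt = 0;

void InitTimers()
{
    glGenQueries(4, timeQueries);
}

void TimedBlit()
{
    IssueTimestampQuery(timeQueries[_frameCnt % 2 * 2]);        // before the blit

    // ... the glBlitFramebuffer call from the first post goes here ...

    IssueTimestampQuery(timeQueries[_frameCnt % 2 * 2 + 1]);    // after the blit

    // Read back the pair issued on the previous frame, so the
    // availability wait in GetTimeDelta is short.
    if (_frameCnt > 0)
    {
        int prev = (_frameCnt - 1) % 2 * 2;
        GLuint64 blit_ns = GetTimeDelta(timeQueries[prev], timeQueries[prev + 1]);
        // ... accumulate blit_ns into the reported average ...
    }
    ++_frameCnt;
}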

Depends. You haven’t really spec’ed your system for us.

Saying “Intel UHD” is like saying “GeForce”. So what Intel GPU is it? And it uses the system’s DRAM. Which is … what … with what peak read and write bandwidths?

Off-the-cuff, looks like you’re getting ~64.6 GB/sec on NVIDIA GeForce 1650 Ti (with a peak of 192 GB/sec), and ~3.84 GB/sec on the heretofore unspecified Intel UHD Graphics embedded GPU backed by heretofore unspecified DRAM. That is assuming your queries got flushed and processed fairly promptly. At issue here is also going to be how the driver implements the Blit, how it stores the render targets in memory, and (in the case of the Intel) what else it has to contend with for that memory B/W.
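
For reference, the back-of-the-envelope math here is just bytes moved divided by measured time; the per-sample byte sizes and whether you count both the read and the write are assumptions, which is why any such figure is only a rough indicator:

/* Sketch of the estimate for the 4x NVIDIA case. The assumed byte sizes and
   the read+write factor are guesses; the driver may store MSAA color/depth
   with padding or compression, so the real figure can differ substantially. */
double pixels      = 1080.0 * 1080.0;
double samples     = 4.0;
double bytes_texel = 4.0 /* RGBA8 */ + 4.0 /* depth, assumed 32-bit in memory */;
double bytes_moved = pixels * samples * bytes_texel * 2.0;   // read + write
double seconds     = 510.5e-6;                               // measured blit time
double gb_per_sec  = bytes_moved / seconds / 1.0e9;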

With no more info, I’d say sure, seems reasonable. The much higher mem B/W is one of the reasons to go with a discrete GPU. Slow DRAM vs. fast VRAM. That said, to your use case…

Why are you benching this? That is, when in practice would you actually do what you’re talking about as part of a production rendering pipeline?

Typically, you render MSAA COLOR+DEPTH and then do a downsample blit of COLOR only for display. Even if you were going to do some kind of deferred shading on an MSAA G-Buffer, you'd just be reading from the MSAA COLOR buffer, not copying it without a resolve, much less copying MSAA DEPTH too.
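
For concreteness, that end-of-frame resolve is just a color-only blit into the window; a minimal sketch, where msaaFbo is a hypothetical MSAA framebuffer:

/* Downsample (resolve) only the color of the MSAA FBO into the default framebuffer. */
glBindFramebuffer(GL_READ_FRAMEBUFFER, msaaFbo);   // hypothetical MSAA source
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, 0);         // the window (non-MSAA)
glBlitFramebuffer(0, 0, SCR_WIDTH, SCR_HEIGHT,
                  0, 0, SCR_WIDTH, SCR_HEIGHT,     // same-size rects (required for MSAA reads)
                  GL_COLOR_BUFFER_BIT,             // COLOR only, no DEPTH
                  GL_NEAREST);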

So while you can Blit a large MSAA render target to another MSAA render target (COLOR and DEPTH), why would you ever want to?

Ultimately you’re limited by the speed of the memory backing the GPU. Plus driver voodoo you can’t control. Do you need to do this kind of copy? Or do you have a different use case in mind?

You haven’t really spec’ed your system for us.

Sorry, you are right. The system I'm testing on is a Dell XPS 15 9500. The CPU is an Intel i7-10750H (UHD Graphics 630) paired with 16 GB of DDR4 SDRAM at 2933 MHz.

Why are you benching this? That is, when in practice would you actually do what you’re talking about as part of a production rendering pipeline?

I’m working on a CAD component. In order to allow for smooth zoom/pan/rotate movements even when large CAD models are loaded into the scene (thousands of individual entities), only a fraction of the entities are rendered while the camera is moving. However, when the scene becomes static, I start drawing the scene progressively, by adding new entities on top of the previously drawn ones (to avoid the sudden popping of new entities after a significant delay, and to keep the scene responsive).
That doesn’t require copying FBO contents back and forth in itself; however, the scene also contains an overlay part (semi-transparent labels, dimensions, …) that should always be drawn on top of the polygonal part.

So the approach I’m using to draw progressively over N frames is:

  • draw the scene on a target that already contains the previously drawn geometry (without clearing it first)
  • copy the content of the target to the default framebuffer
  • draw the overlays
  • display the scene
  • repeat

This is why I need to copy MS textures (unless I switch to other AA methods).

I thought that if this copy operation is too expensive on some systems, maybe I can fall back to allowing MSAA only within the same frame: resolve the MS color texture used to store the accumulated geometry and draw that onto the default framebuffer, while also drawing only one sample of the depth buffer.

Ok. So a laptop with maybe just under ~50 GB/sec theoretical peak bandwidth, with considerably less in practice, … especially if the laptop is throttling back to meet power or temp targets.

Ok. That last would seem to radically reduce your GPU memory bandwidth needs … vs. full-res copying of MSAA COLOR and MSAA DEPTH.

What about:

Setup:

  • Create FBO A with MSAA COLOR and MSAA DEPTH.
  • Create FBO B with 1X COLOR-only.

Each frame:

  • Draw the scene into FBO A, which already contains the previously drawn geometry (without clearing it first)
  • Downsample Blit FBO A’s COLOR to FBO B
  • Draw the overlays on FBO B
  • Display the scene
  • Repeat

That’ll radically cut your memory bandwidth and space needs. And since the “overlay part … should always be drawn on top of the polygonal part”, you don’t need DEPTH from the CAD geometry to draw the overlays, much less MSAA COLOR or DEPTH.
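
A rough sketch of that setup and the per-frame resolve (the names, formats, and 4x sample count are just examples; completeness checks and the overlay/present code are omitted):

/* Setup (once): FBO A = MSAA COLOR + MSAA DEPTH, FBO B = 1X COLOR only. */
GLuint fboA, fboB, colorA, depthA, colorB;

glGenRenderbuffers(1, &colorA);
glBindRenderbuffer(GL_RENDERBUFFER, colorA);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_RGBA8, SCR_WIDTH, SCR_HEIGHT);

glGenRenderbuffers(1, &depthA);
glBindRenderbuffer(GL_RENDERBUFFER, depthA);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, 4, GL_DEPTH_COMPONENT24, SCR_WIDTH, SCR_HEIGHT);

glGenFramebuffers(1, &fboA);
glBindFramebuffer(GL_FRAMEBUFFER, fboA);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, colorA);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,  GL_RENDERBUFFER, depthA);

glGenTextures(1, &colorB);
glBindTexture(GL_TEXTURE_2D, colorB);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, SCR_WIDTH, SCR_HEIGHT, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST);

glGenFramebuffers(1, &fboB);
glBindFramebuffer(GL_FRAMEBUFFER, fboB);
glFramebufferTexture2D(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_TEXTURE_2D, colorB, 0);

/* Each frame: accumulate into A, resolve COLOR only into B, overlay on B. */
glBindFramebuffer(GL_FRAMEBUFFER, fboA);
// ... draw the next batch of CAD entities (no clear) ...

glBindFramebuffer(GL_READ_FRAMEBUFFER, fboA);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, fboB);
glBlitFramebuffer(0, 0, SCR_WIDTH, SCR_HEIGHT, 0, 0, SCR_WIDTH, SCR_HEIGHT,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);      // downsample resolve, COLOR only

glBindFramebuffer(GL_FRAMEBUFFER, fboB);
// ... draw the overlays, then present FBO B (e.g. blit it to the default framebuffer) ...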

If I understand what you’re suggesting correctly, then with this you’ll end up with depth/occlusion artifacts between old, resolved geometry (that was MSAA COLOR+DEPTH but is now 1X COLOR-only) and new geometry.

If you need MSAA for the CAD geometry, then you need to keep the original CAD geometry render MSAA. If not, drop FBO A back to 1X.

Also… It’s probably occurred to you already, but…

An alternate solution to your problem is to improve the drawing efficiency of your CAD geometry rendering so that it can always be rendered in one frame. Then everything gets drawn MSAA COLOR and DEPTH each frame (to a cleared FBO) with a simple COLOR downsample to the default framebuffer at the end.


Thanks @Dark_Photon for all the suggestions. Just another question:

If FBO B is only used to display the accumulated geometry + overlays and is never used as a shader input (no post-processing whatsoever), shouldn’t I use the default framebuffer instead (asking for a non-multisample format at initialization)?

Sure. If you have complete control of the format of the default framebuffer (so you can be sure it’s 1X), and you have no other need for FBO B, you can of course get rid of it.


You probably don’t care about this next part, but just in case…

One way the former can go awry is if the graphics driver allows the user to “force” the window framebuffer to have a different multisampling format than the application requests internally. This can result in the window actually having an MSAA format (with some user-specified number of samples) even though your application requested 1X (no MSAA). And this forcing can instigate Blit errors and a blank window when you try to Blit from FBO A (with X MSAA samples) to the window (which has Y MSAA samples, where possibly X != Y).

In this “force the window format” use case, a Blit path of FBO A (MSAA) → FBO B (1X) → Window (MSAA? or 1X?) should always work, assuming the same dimensions for each, because you’re never trying to Blit between two MSAA framebuffers with a different number of samples. However, this adds an extra Blit and thus likely reduces performance.

An alternate, possibly-better, way to handle this case (if you care) is to check the format of the window framebuffer your app has been provided with, verify that it is what it requested, and if not, terminate your app with a fatal error, telling the user to stop messing with the graphics driver override settings (and telling them how to fix the settings, for the GPU driver vendors you plan to support).
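
A minimal sketch of that check, for the simple case where your app asked for a 1X window (so anything multisampled means the driver overrode it):

/* Verify the default (window) framebuffer is not multisampled behind our back. */
GLint samples = 0;
glBindFramebuffer(GL_FRAMEBUFFER, 0);
glGetIntegerv(GL_SAMPLES, &samples);     // 0 when the window has no MSAA

if (samples > 1)
{
    fprintf(stderr,
            "Window framebuffer is %dx MSAA but 1X was requested.\n"
            "Please disable the driver's MSAA override for this application.\n",
            samples);
    exit(EXIT_FAILURE);
}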
