SLI AFR + RTT = disaster?

Madoc · April 14, 2009, 11:46am

Hi all,

SLI in AFR mode is completely killing performance in our application. We’ve narrowed this down to RTT (render-to-texture) operations. In a simplified rendering test, framerates go from ~200 fps in SFR to 6-8 fps in AFR with a single 4096^2 16-bit shadow map (lower resolutions perform better, but still horribly). Without rendering the shadow map, AFR behaves as expected. The same system also behaves as expected with a number of Direct3D applications that support SLI.
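For a sense of scale, here is a rough size calculation for that shadow map. It is only an illustration: the helper names are made up, and the idea that the driver copies the whole map between GPUs once per frame is an assumption, not something confirmed anywhere in this thread.

```cpp
#include <cstdint>

// Size in bytes of a square texture with the given side length and texel depth.
constexpr std::uint64_t textureBytes(std::uint64_t side, std::uint64_t bitsPerTexel) {
    return side * side * bitsPerTexel / 8;
}

// Traffic in bytes per second IF the full texture were copied once per frame
// (assumption for illustration only).
constexpr std::uint64_t bytesPerSecond(std::uint64_t bytesPerFrame,
                                       std::uint64_t framesPerSecond) {
    return bytesPerFrame * framesPerSecond;
}

// The 4096^2 16-bit shadow map from the post.
constexpr std::uint64_t kShadowMapBytes = textureBytes(4096, 16);
static_assert(kShadowMapBytes == 32ull * 1024 * 1024, "4096^2 x 16 bit is 32 MiB");
```

Under that assumption the map is 32 MiB, so 6-8 fps would correspond to roughly 192-256 MiB/s of inter-GPU traffic.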

The shadow maps are cleared and rerendered every frame (via FBO), the viewport is set to the entire map and scissoring is disabled. The map is not bound before it is updated or by the time SwapBuffers is called. I even tried removing the attachment from the FBO. No luck.

So, what’s the deal? I don’t see how else I could convince the driver that the other GPU does not require a copy of the texture’s current contents. Numerous searches on the net found nothing and I’m following what little documentation there is from nvidia to the letter.

Ilian_Dinev · April 14, 2009, 12:29pm

Allocate 2 such textures (and FBOs). The driver should be smart enough not to allocate 2x VRAM on each card if you never use both on a single card.
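A minimal sketch of that idea, alternating between the targets once per frame. ShadowTarget and PerGpuTargets are hypothetical names, the handles stand in for real FBO and texture names, and the sketch assumes the AFR driver moves to the next GPU once per SwapBuffers call, which is not guaranteed anywhere in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical handle pair; in a real app these would be the GLuint names
// returned by the FBO and texture creation calls.
struct ShadowTarget {
    std::uint32_t fbo;
    std::uint32_t depthTexture;
};

// Holds one shadow target per GPU and picks one round-robin by frame count.
// Assumption: the driver switches GPU once per SwapBuffers.
class PerGpuTargets {
public:
    explicit PerGpuTargets(std::vector<ShadowTarget> targets)
        : targets_(std::move(targets)) {
        assert(!targets_.empty());
    }

    // Target to render into and sample from during the current frame.
    const ShadowTarget& current() const {
        return targets_[frame_ % targets_.size()];
    }

    // Call right after SwapBuffers so the next frame uses the next target.
    void onSwapBuffers() { ++frame_; }

private:
    std::vector<ShadowTarget> targets_;
    std::size_t frame_ = 0;
};
```

With two targets, each one is only ever touched on every other frame, so in principle neither GPU needs the other's copy. Whether the driver actually takes advantage of that is a separate question.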

Madoc · April 14, 2009, 1:13pm

Ok, I’ll give it a try. Thanks.

However, if the driver is smart enough to do that then surely it should be smart enough to deal with the current implementation? From what I’ve understood of the nvidia documentation (which is mostly D3D-specific), what I’m already doing should be enough. I have to say I don’t like the idea of implementing specific code paths, especially when there doesn’t seem to be any good reason for doing so.

Also, seeing as this would obviously require it, how exactly do we keep track of which GPU we’re on? Is SwapBuffers what causes the switch (I’ve been assuming it is)? Is this 100% reliable?

The lack of documentation on this is surprising.

Madoc · April 14, 2009, 1:38pm

Hmm, just hacked this in and no change. Same performance problem. In the test application I guarantee a constant fixed rendering loop so I don’t see where any problems could occur.

Rereading the documentation I’m certain this shouldn’t be necessary anyway. Looks to me like OpenGL RTT in SLI is just completely broken. Anyone had any luck with it?

skynet · April 15, 2009, 4:40am

The trick is not to clear the offscreen-FBOs (like the shadowmap) before you start using them, but after having finished using them. This way the driver knows “ahh, this FBO/texture is cleared, I don’t have to transfer its data over to the other GPU” when you finally do the SwapBuffers() call.

http://developer.nvidia.com/object/sli_best_practices.html
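The ordering described above can be sketched as follows. CallLog is a made-up stand-in that only records call order so the sequence can be run without a GL context; in a real frame the calls would be along the lines of glBindFramebufferEXT, glClear and SwapBuffers. It is an illustration of the ordering, not a tested fix.

```cpp
#include <string>
#include <vector>

// Hypothetical recorder standing in for the real GL / WGL calls.
struct CallLog {
    std::vector<std::string> calls;
    void bindFramebuffer(unsigned fbo) { calls.push_back("bind " + std::to_string(fbo)); }
    void clear() { calls.push_back("clear"); }
    void draw(const std::string& pass) { calls.push_back("draw " + pass); }
    void swapBuffers() { calls.push_back("swap"); }
};

// One frame with the shadow map cleared AFTER its last use rather than before
// rendering into it. The map must be cleared once at setup so that the very
// first frame also starts from a cleared state.
void renderFrame(CallLog& gl, unsigned shadowFbo) {
    gl.bindFramebuffer(shadowFbo);
    gl.draw("shadow casters");      // fill the shadow map
    gl.bindFramebuffer(0);
    gl.draw("scene");               // last use: sample the shadow map
    gl.bindFramebuffer(shadowFbo);
    gl.clear();                     // mark contents as dead before the swap
    gl.bindFramebuffer(0);
    gl.swapBuffers();
}
```

The point of the late clear is that the texture holds nothing worth preserving at the moment of the swap.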

Personally, I don’t like SLI/crossfire at all. My hope is that NV_gpu_affinity and AMD_gpu_association will be finally available on all cards, so one can leverage the full potential of two or more graphics cards in multiple rendering threads.

Madoc · April 15, 2009, 6:14am

Didn’t think of that. The documentation (your link) talks about clearing even when not necessary so it would seem to imply clearing before rendering as usual. I would think rebinding the FBO and clearing after use is unusual enough that it should be described more explicitly. It does make sense though.

Unfortunately, the results are still identical. Tried clearing the texture once I’m done with it for the frame and no change at all, 6 fps.

I agree that SLI isn’t great, though I don’t understand what you’d do with the multiple cards anyway. With how slow moving data between them appears to be, I don’t see much practical use. In our case a real application could never perform satisfactorily under these conditions: with the quantity of RTT we do, and a lot of the textures having a longer lifetime than one frame, it is totally impractical.

The one possible application I see is rendering more distinct environments at once, which is something we support in our engine but so far we’ve never used this in an actual shipping application.

Madoc · April 15, 2009, 6:48am

I just dug out an old app using our previous engine. This uses (every frame) an FBO with multiple (high resolution) colour attachments which are never cleared, it copies the contents of the back buffer several times per frame and doesn’t use PFD_SWAP_EXCHANGE. And yet… SLI AFR performance is great, 400 fps in 1920x1200 and that’s nearly double the single GPU performance.

I must have hit something funny; this and the 6 frames per second for a much simpler app are very inconsistent. Yet I have another example of some simple RGB-only 512x512 RTTs that perform horribly in AFR…