NVidia multithreading problems

I’m currently researching methods for using multiple GPUs in one system. I now have a setup of two Quadro FX 1700 cards in a dual-core machine. Using NV_gpu_affinity I am able to set up two rendering threads, each rendering on a separate GPU.
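For reference, the per-thread setup looks roughly like this (a minimal sketch of WGL_NV_gpu_affinity context creation; error handling is omitted, and the extension entry points are assumed to have been loaded beforehand via wglGetProcAddress on a dummy context):

```c
#include <windows.h>
#include <GL/gl.h>
#include "wglext.h"  /* WGL_NV_gpu_affinity types and function typedefs */

/* Assumed: resolved via wglGetProcAddress after creating a dummy context. */
extern PFNWGLENUMGPUSNVPROC         wglEnumGpusNV;
extern PFNWGLCREATEAFFINITYDCNVPROC wglCreateAffinityDCNV;

/* Create a DC/RC pair restricted to the i-th GPU; one per render thread. */
static HGLRC create_context_for_gpu(UINT gpuIndex, int pixelFormat, HDC *outDC)
{
    HGPUNV gpu;
    if (!wglEnumGpusNV(gpuIndex, &gpu))
        return NULL;                      /* no such GPU enumerated */

    HGPUNV gpus[2] = { gpu, NULL };       /* NULL-terminated GPU list */
    HDC dc = wglCreateAffinityDCNV(gpus);

    PIXELFORMATDESCRIPTOR pfd = { sizeof(pfd) };
    SetPixelFormat(dc, pixelFormat, &pfd);

    *outDC = dc;
    return wglCreateContext(dc);          /* context bound to that one GPU */
}
```

Each thread then calls wglMakeCurrent on its own DC/RC pair and never touches the other thread’s context.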

I’m making heavy use of occlusion queries, with up to 512 queries in flight at the same time. Unfortunately, after a few seconds, the driver either crashes or locks up in glGetQueryObjectivARB(query, GL_QUERY_RESULT_ARB, &result). Without the occlusion queries, the program works as expected.
The lockup happens regardless of the “Threaded Optimization” setting in the driver settings.

A second phenomenon I’m seeing is that the whole thing slows down much more than expected. I was expecting that with two GPUs and two views it would run slightly slower than just one view on one GPU, but performance is much worse. An initial profile told me that a lot of time is spent in nvoglnt.dll (24%), ntoskrnl.dll (17%) and ntdll.dll (11%); only 6% is spent in my own code. Especially ntoskrnl.dll and ntdll.dll seem to indicate that a lot of thread synchronization is going on…
But my rendering threads do no locking/synching while rendering, so I assume it’s somewhere inside the driver :-/

Has anyone had similar experiences? How can I harness the full power of two separate GPUs rendering in two separate threads?

thanks in advance!

I suspect this is a driver issue - have you talked to nVidia about this?

One thing to try (if possible) is to use multiple processes. I’ve heard that this can be faster in certain cases.

I’ve tried something similar, with 3 GTX 260s running in parallel under Linux, and found that performance does not scale as you’d expect. Each GPU was controlled from a separate thread with no resource sharing between contexts. The best I could do was a carefully placed sleep or glFinish, which gave back most of the expected performance, but not all of it (glFinish is a bit heavy-handed).

There was a memory leak with occlusion queries in NVIDIA drivers; it may not be fixed in released drivers yet. As a possible workaround, try checking the query status with GL_QUERY_RESULT_AVAILABLE_ARB after issuing the query.
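A non-blocking poll along those lines might look like this (a sketch only; it assumes a current context with ARB_occlusion_query and that the entry point has been loaded, e.g. via wglGetProcAddress):

```c
#include <GL/gl.h>
#include <GL/glext.h>  /* ARB_occlusion_query tokens and typedefs */

/* Assumed: loaded via wglGetProcAddress. */
extern PFNGLGETQUERYOBJECTIVARBPROC glGetQueryObjectivARB;

/* Returns 1 and writes the sample count once the result is ready,
 * 0 if the query is still in flight -- never blocks inside the driver. */
static int try_get_query_result(GLuint query, GLint *samples)
{
    GLint available = 0;
    glGetQueryObjectivARB(query, GL_QUERY_RESULT_AVAILABLE_ARB, &available);
    if (!available)
        return 0;   /* come back later; render something else meanwhile */

    glGetQueryObjectivARB(query, GL_QUERY_RESULT_ARB, samples);
    return 1;
}
```

With up to 512 queries in flight, you could sweep this over the outstanding queries each frame and only collect the ones that report available, instead of issuing a blocking GL_QUERY_RESULT_ARB read.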

Instead of the heavy-handed glFinish(), maybe glFlush() can help?
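For what it’s worth, the placement being discussed is roughly this (a sketch; where exactly the call helps, if at all, will depend on the driver):

```c
#include <windows.h>
#include <GL/gl.h>

static void render_scene(void) { /* ... draw calls, queries ... */ }

/* End of frame in each render thread. */
static void end_frame(HDC hdc)
{
    render_scene();
    glFlush();        /* submit queued commands without stalling this thread */
    /* glFinish();    // heavier alternative: block until the GPU is idle */
    SwapBuffers(hdc); /* hdc: this thread's window DC */
}
```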

skynet, can you provide me with a repro case (source code preferable)? We’ll take a look.


(with my NVIDIA hat on)

I forgot to mention that I’m using the “Quadro Release 178” 178.46 drivers on WinXP64.

Some new findings:

  1. the occlusion-query lockup also appears on a single-GPU setup (no NV_gpu_affinity involved) where two threads (two separate contexts) render on one GPU at the same time.

  2. NV_gpu_affinity only enumerates both GPUs if the driver settings are set to “Multi-Display Performance Mode”. In all other modes, just one GPU is found. Why is that?

  3. I created a double-buffered window, took its pixel format ID (GetPixelFormat()) and created an affinity DC with this pixel format. Now glGetIntegerv(GL_DOUBLEBUFFER) returns ‘1’, even when this affinity DC is made current (and it has no window-provided framebuffer!). As soon as I call wglSwapLayerBuffers() or SwapBuffers() on the affinity DC, the application crashes inside the driver. This should not happen; I would expect the call to be ignored or to return an error, not to crash.

One thing to try (if possible) is to use multiple processes. I’ve heard that this can be faster in certain cases.

I prefer threads, because I need both render-threads to share resources in main memory.

Instead of the heavy-handed glFinish(), maybe glFlush() can help?

I do not understand how inserting one of these would improve performance or stability?