NVidia multithreading problems

I’m currently researching methods for using multiple GPUs in one system. I now have a setup of two Quadro FX 1700 cards in a dual-core machine. Using NV_gpu_affinity I am able to set up two rendering threads, each rendering on a separate GPU.
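For reference, the per-thread setup looks roughly like this (a minimal sketch of WGL_NV_gpu_affinity context creation; error handling is omitted, and the extension entry points are assumed to have been loaded beforehand via wglGetProcAddress on a dummy context):

```c
#include <windows.h>
#include <GL/gl.h>
#include "wglext.h"  /* WGL_NV_gpu_affinity types and function typedefs */

/* Assumed: resolved via wglGetProcAddress after creating a dummy context. */
extern PFNWGLENUMGPUSNVPROC         wglEnumGpusNV;
extern PFNWGLCREATEAFFINITYDCNVPROC wglCreateAffinityDCNV;

/* Create a DC/RC pair restricted to the i-th GPU; one per render thread. */
static HGLRC create_context_for_gpu(UINT gpuIndex, int pixelFormat, HDC *outDC)
{
    HGPUNV gpu;
    if (!wglEnumGpusNV(gpuIndex, &gpu))
        return NULL;                      /* no such GPU enumerated */

    HGPUNV gpus[2] = { gpu, NULL };       /* NULL-terminated GPU list */
    HDC dc = wglCreateAffinityDCNV(gpus);

    PIXELFORMATDESCRIPTOR pfd = { sizeof(pfd) };
    SetPixelFormat(dc, pixelFormat, &pfd);

    *outDC = dc;
    return wglCreateContext(dc);          /* context bound to that one GPU */
}
```

Each thread then calls wglMakeCurrent on its own DC/RC pair and never touches the other thread’s context.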

I’m making heavy use of occlusion queries, with up to 512 queries in flight at the same time. Unfortunately, after a few seconds, the driver either crashes or locks up in glGetQueryObjectivARB(query, GL_QUERY_RESULT_ARB, &result). Without the occlusion queries, the program works as expected.
The lockup happens regardless of the “Threaded Optimization” setting in the driver settings.

A second phenomenon I’m seeing is that the whole thing slows down much more than expected. I was expecting that with two GPUs and two views it would run slightly slower than just one view on one GPU, but performance is much worse. An initial profile told me that a lot of time is spent in nvoglnt.dll (24%), ntoskrnl.dll (17%) and ntdll.dll (11%); only 6% is spent in my own code. Especially ntoskrnl.dll and ntdll.dll seem to indicate that a lot of thread synchronization is going on…
But my rendering threads do no locking/synching while rendering, so I assume it’s somewhere inside the driver :-/

Has anyone had similar experiences? How can I harness the full power of two separate GPUs rendering in two separate threads?

thanks in advance!

I suspect this is a driver issue - have you talked to nVidia about this?

One thing to try (if possible) is to use multiple processes. I’ve heard that this can be faster in certain cases.

I’ve tried something similar, with 3 GTX 260s running in parallel under Linux, and found that performance does not scale as you’d expect. Each GPU was controlled from a separate thread with no resource sharing between contexts. The best I could do was a carefully placed sleep or glFinish, which gave back most of the expected performance, but not all of it (glFinish is a bit heavy-handed).

There was a memory leak with occlusion queries in NVIDIA drivers; it may not be fixed in released drivers yet. As a possible workaround, try checking the query status with GL_QUERY_RESULT_AVAILABLE_ARB after issuing the query.
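A non-blocking poll along those lines might look like this (a sketch only; it assumes a current context with ARB_occlusion_query and that the entry point has been loaded, e.g. via wglGetProcAddress):

```c
#include <GL/gl.h>
#include <GL/glext.h>  /* ARB_occlusion_query tokens and typedefs */

/* Assumed: loaded via wglGetProcAddress. */
extern PFNGLGETQUERYOBJECTIVARBPROC glGetQueryObjectivARB;

/* Returns 1 and writes the sample count once the result is ready,
 * 0 if the query is still in flight -- never blocks inside the driver. */
static int try_get_query_result(GLuint query, GLint *samples)
{
    GLint available = 0;
    glGetQueryObjectivARB(query, GL_QUERY_RESULT_AVAILABLE_ARB, &available);
    if (!available)
        return 0;   /* come back later; render something else meanwhile */

    glGetQueryObjectivARB(query, GL_QUERY_RESULT_ARB, samples);
    return 1;
}
```

With up to 512 queries in flight, you could sweep this over the outstanding queries each frame and only collect the ones that report available, instead of issuing a blocking GL_QUERY_RESULT_ARB read.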

Instead of the heavy-handed glFinish(), maybe glFlush() can help?
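For what it’s worth, the placement being discussed is roughly this (a sketch; where exactly the call helps, if at all, will depend on the driver):

```c
#include <windows.h>
#include <GL/gl.h>

static void render_scene(void) { /* ... draw calls, queries ... */ }

/* End of frame in each render thread. */
static void end_frame(HDC hdc)
{
    render_scene();
    glFlush();        /* submit queued commands without stalling this thread */
    /* glFinish();    // heavier alternative: block until the GPU is idle */
    SwapBuffers(hdc); /* hdc: this thread's window DC */
}
```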

skynet, can you provide me with a repro case (source code preferable)? We’ll take a look.


(with my NVIDIA hat on)

I forgot to mention that I’m using the “Quadro Release 178” 178.46 drivers on WinXP64.

Some new findings:

  1. the occlusion-query lockup also appears on a single-GPU setup (no NV_gpu_affinity involved) where two threads (two separate contexts) render on one GPU at the same time.

  2. NV_gpu_affinity only enumerates both GPUs if the driver settings are set to “Multi-Display Performance Mode”. In all other modes, just one GPU is found. Why is that?

  3. I created a double-buffered window, took its pixel format ID (GetPixelFormat()) and created an affinity DC with this pixel format. Now glGetIntegerv(GL_DOUBLEBUFFER) returns ‘1’, even when this affinity DC is made current (and it has no window-provided framebuffer!). As soon as I call wglSwapLayerBuffers() or SwapBuffers() on the affinity DC, the application crashes inside the driver. This should not happen; I would expect the call to be ignored or to return an error, not to crash.

One thing to try (if possible) is to use multiple processes. I’ve heard that this can be faster in certain cases.

I prefer threads, because I need both render-threads to share resources in main memory.

Instead of the heavy-handed glFinish(), maybe glFlush() can help?

I do not understand how inserting one of these would improve performance or stability?