Performance/Scalability issue when running multiple Windows OpenGL applications

Hi,

I have an OpenGL application running on the Windows platform, and its main architecture is as follows:

Main thread: does its own work (no window, no rendering).
Thread 1: creates the window and handles the window procedure.
Thread 2: does the OpenGL render work and calls SwapBuffers().
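
Stripped down, the render thread's setup follows the usual WGL pattern, roughly like this (a sketch, not the exact code; error handling and the actual draw calls are omitted, and the HWND is created by Thread 1 and passed in):

```cpp
#include <windows.h>

// Render thread (Thread 2): owns the GL context. Thread 1 owns the window
// and its message loop; the HWND is handed over when this thread starts.
DWORD WINAPI RenderThread(LPVOID param)
{
    HWND hwnd = (HWND)param;
    HDC  hdc  = GetDC(hwnd);

    PIXELFORMATDESCRIPTOR pfd = { sizeof(pfd) };
    pfd.nVersion   = 1;
    pfd.dwFlags    = PFD_DRAW_TO_WINDOW | PFD_SUPPORT_OPENGL | PFD_DOUBLEBUFFER;
    pfd.iPixelType = PFD_TYPE_RGBA;
    pfd.cColorBits = 32;
    pfd.cDepthBits = 24;
    SetPixelFormat(hdc, ChoosePixelFormat(hdc, &pfd), &pfd);

    HGLRC ctx = wglCreateContext(hdc);
    wglMakeCurrent(hdc, ctx);      // the context is current on this thread only

    for (;;)                       // never returns in this sketch
    {
        // ... issue GL draw calls ...
        SwapBuffers(hdc);          // present; this is the call in question
    }
}
```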

My problem is that if I run one such application, about 3% CPU is used, and it scales well up to 5 applications (15% CPU cost).
If I run 6 or more applications simultaneously, the CPU usage increases to 100% (each application instance costs about 16% CPU), and the whole Windows system becomes very slow.

One finding is that with up to 5 applications running, the system clock interrupt frequency is about 3k/sec; however, it increases to more than 10k/sec after launching the 6th one.
Another finding is that if I comment out the SwapBuffers() call, the issue goes away and scalability is very good (of course I can't see the buffer changes on screen then).

Any hint or suggestion for this issue?

Thanks!

Could it be that you have VSync enabled? In that case all the windows will try to synchronize and will slow each other down.
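
If you are setting it from code rather than the driver control panel, it goes through WGL_EXT_swap_control, roughly like this (a sketch; the extension may not be exposed on every driver):

```cpp
// Requires a current GL context on the calling thread.
// wglSwapIntervalEXT comes from the WGL_EXT_swap_control extension.
typedef BOOL (WINAPI *PFNWGLSWAPINTERVALEXTPROC)(int interval);

PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");

if (wglSwapIntervalEXT)
{
    wglSwapIntervalEXT(1);   // 1 = wait for vblank at each SwapBuffers (VSync on)
    // wglSwapIntervalEXT(0) presents immediately (VSync off)
}
```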

I tried with VSync both on and off; no difference, however…

[QUOTE=dongxiao;1291364]My problem is that,

[ul]
[li]if I run one such application, about 3% CPU is used, and it scales well up to 5 applications (15% CPU cost). [/li]
[li]If I run 6 or more applications simultaneously, the CPU usage increases to 100% (each application instance costs about 16% CPU), and the whole Windows system becomes very slow. [/li]
[/ul]
One finding is that with up to 5 applications running, the system clock interrupt frequency is about 3k/sec; however, it increases to more than 10k/sec after launching the 6th one.
Another finding is that if I comment out the SwapBuffers() call, the issue goes away and scalability is very good (of course I can't see the buffer changes on screen then).[/QUOTE]

We don’t have a whole lot of data to go on here.

From what I gather, when you go beating on the same GPU from multiple GL threads, there’s really not much benefit to be had here. It’s a shared resource, and you’re time-sharing it (parallel upload+render being one possible exception). Now that’s totally different from having separate GL threads each targeting a “different” GPU on the same system. With some vendors’ drivers, the latter can be very efficient.

As to why you’re seeing a discontinuity in your scaling, not sure. It could be a limitation of your GL drivers, or your GPU, or you could be overrunning GPU memory, or…

What does your GPU memory consumption look like with 1 instance running? With 2? With 5? With 6?
Have you verified that you’re not overrunning GPU memory? That’s step #1: to ensure that you’re not thrashing GPU memory with all of those instances. If you were, an increase in system interrupts and a performance degradation would be consistent with that.
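
If you don't have a vendor tool handy, you can also poll free video memory from inside the app via the vendor-specific extensions (a rough sketch; NVX_gpu_memory_info and ATI_meminfo are NVIDIA-/AMD-only and not guaranteed to be present, so treat the numbers as advisory):

```cpp
// Needs a current GL context and <stdio.h>. Both extensions report sizes in KB.
// Check the GL extension string first; unsupported enums leave the values at 0
// and raise GL_INVALID_ENUM.
#define GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX   0x9048
#define GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX 0x9049
#define TEXTURE_FREE_MEMORY_ATI                      0x87FC

GLint totalKB = 0, freeKB = 0, atiFree[4] = { 0 };
glGetIntegerv(GPU_MEMORY_INFO_TOTAL_AVAILABLE_MEMORY_NVX,   &totalKB);
glGetIntegerv(GPU_MEMORY_INFO_CURRENT_AVAILABLE_VIDMEM_NVX, &freeKB);
glGetIntegerv(TEXTURE_FREE_MEMORY_ATI, atiFree);   // atiFree[0] = total free pool (KB)

printf("NVX: %d KB free of %d KB; ATI: %d KB free\n", freeKB, totalKB, atiFree[0]);
```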

What GPU and GPU drivers are you using? What OS?

[QUOTE=Dark Photon;1291386]We don’t have a whole lot of data to go on here.

From what I gather, when you go beating on the same GPU from multiple GL threads, there’s really not much benefit to be had here. It’s a shared resource, and you’re time-sharing it (parallel upload+render being one possible exception). Now that’s totally different from having separate GL threads each targeting a “different” GPU on the same system. With some vendors’ drivers, the latter can be very efficient.

As to why you’re seeing a discontinuity in your scaling, not sure. It could be a limitation of your GL drivers, or your GPU, or you could be overrunning GPU memory, or…

What does your GPU memory consumption look like with 1 instance running? With 2? With 5? With 6?
Have you verified that you’re not overrunning GPU memory? That’s step #1: to ensure that you’re not thrashing GPU memory with all of those instances. If you were, an increase in system interrupts and a performance degradation would be consistent with that.

What GPU and GPU drivers are you using? What OS?[/QUOTE]

Thanks for your reply.

The GPU memory consumption scales well from 1 to 6 instances, and even with 6 instances running there is still a lot of free GPU memory.

I tried running my app (64-bit) on Win7/Win10 (64-bit) with Nvidia, AMD, and Intel GFX, and the issue could always be reproduced.

It seems to be a Windows OS-related issue. Does SwapBuffers() have some kind of usage limitation?

Thanks!

Ok, so it’s not GPU memory. It certainly does sound like you’re running up against some system or driver limitation. The trick is going to be teasing out what that limited resource is, and then determining if you can relax that limitation.

I’d dig around in Process Explorer and look at system resource utilizations when running 1, 5, and 6 processes. See what’s different besides system interrupts between 5 and 6. You might also take a look at GPUView to get a better handle on how data is being queued, pipelined, and executed on the GPU. It may help reveal the bottleneck.

Re SwapBuffers(), that’s just submitting all the GL work you’ve queued to the GL driver. On Windows Vista+, this is also when the compositor (DWM) overhead comes into play (unless you disable it). All of your app’s rendering is funneled through this annoying process. As you probably know, your app internally renders to an off-screen buffer (even when you’re targeting the window, a.k.a. the system framebuffer). This off-screen result is handed off to the compositor, which then re-renders it on its own terms. On Windows 7, you can bypass much of this overhead by using a Full-Screen Exclusive mode window. Allegedly on Win8.1+, you can use iFlip/iFlip immediate to bypass much of the overhead.

So if you haven’t already, try creating and rendering to Full-Screen Exclusive mode windows on Win7 and see if that changes your 5-vs-6 process results. Try this both with VSync on and off. For VSync off, time your frames (start-of-frame to start-of-frame) and verify that your frame rate is not being limited by the compositor’s VSync (i.e. your frame times should be much less than 16.6 ms/frame, and you should most likely see high CPU utilization with < 6 processes).
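
For the frame timing, something as simple as this around your render loop is enough (a sketch using QueryPerformanceCounter; hdc is your window DC):

```cpp
// Measure start-of-frame to start-of-frame time; if VSync or the compositor is
// throttling you, this will sit near ~16.6 ms no matter how little you draw.
LARGE_INTEGER freq, prev, now;
QueryPerformanceFrequency(&freq);
QueryPerformanceCounter(&prev);

for (;;)
{
    QueryPerformanceCounter(&now);
    double frameMs = 1000.0 * (now.QuadPart - prev.QuadPart) / freq.QuadPart;
    prev = now;

    // ... render ...
    SwapBuffers(hdc);

    printf("frame: %.2f ms\n", frameMs);   // or accumulate and report an average
}
```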

Are you targeting rendering to a window? You could try rendering to an FBO instead to see if that changes your bottleneck.
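
A minimal off-screen setup for that experiment might look like this (a sketch assuming GL 3.0+ / ARB_framebuffer_object; the width/height values are placeholders):

```cpp
// Render into an off-screen FBO instead of the window's default framebuffer.
GLuint fbo = 0, color = 0, depth = 0;
const int width = 1280, height = 720;   // placeholder size

glGenRenderbuffers(1, &color);
glBindRenderbuffer(GL_RENDERBUFFER, color);
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);

glGenRenderbuffers(1, &depth);
glBindRenderbuffer(GL_RENDERBUFFER, depth);
glRenderbufferStorage(GL_RENDERBUFFER, GL_DEPTH_COMPONENT24, width, height);

glGenFramebuffers(1, &fbo);
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, color);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_DEPTH_ATTACHMENT,  GL_RENDERBUFFER, depth);

// Each frame: draw into the FBO; keep or drop SwapBuffers() to isolate its cost.
glBindFramebuffer(GL_FRAMEBUFFER, fbo);
// ... render as usual ...
```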

How many cores does your CPU have? For what percentage of your frame time is your CPU busy? Have you run a profiler such as Very Sleepy on one of your app instances to see what’s different about running with 5 vs. 6 processes? How have you configured your driver to wait on VSync at end-of-frame?

Have you tried any GPUs/drivers from NVidia’s or AMD’s professional line? These may be better optimized for running with a large number of GL processes.

One other thought: If you can post a short stand-alone GLUT test program that illustrates your problem, others here could compile it and give you some feedback on other systems/GPUs/drivers. If there’s something just odd about the system(s) you’re testing on, this could help reveal it.
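
Something along these lines would do as a starting skeleton (a bare-bones sketch; add enough draw work to approximate your real per-frame load):

```c
// Minimal double-buffered GLUT test: clears and swaps as fast as allowed.
// Launch N copies of the executable to see whether the 5-vs-6 scaling cliff reproduces.
#include <GL/glut.h>

static void display(void)
{
    glClearColor(0.2f, 0.3f, 0.4f, 1.0f);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
    // ... add draw calls here to approximate your real workload ...
    glutSwapBuffers();        // exercises the same presentation path as SwapBuffers()
    glutPostRedisplay();      // keep rendering continuously
}

int main(int argc, char** argv)
{
    glutInit(&argc, argv);
    glutInitDisplayMode(GLUT_DOUBLE | GLUT_RGBA | GLUT_DEPTH);
    glutInitWindowSize(640, 480);
    glutCreateWindow("swapbuffers scaling test");
    glutDisplayFunc(display);
    glutMainLoop();
    return 0;
}
```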