multithreaded OpenGL WTF?

Originally posted by def:
Hmm, I always thought the underlying hardware was responsible for OpenGL being single threaded…
If “Multithreaded OpenGL” means I can do CPU work 80% faster (WoW), that’s great, but OpenGL is still the same as before.
Who is saying that actual raw OpenGL performance is getting better through Multithreaded OpenGL? Raise your hands, please. :wink:

Hand raised.

World of Warcraft can be up to twice as fast on OS X with the MT-GL mode on. No new threads on the application side.

Edit: But MT-GL can’t raise frame rates if you are running at the GPU limit already. Basically it raises the odds for any given scene that you will be able to keep the GPU running at its limits.

Originally posted by andras:
Well, just for the record, there’s a multithreading switch in the latest nVidia drivers for WinXP (only present if you are running on multi-core CPUs), but it actually made our app run a lot slower, so we had to turn it off! YMMV…

…
Where is this set? I have an 8800 GTX here on a dual-core WinXP machine and I can’t find the setting anywhere.

Could someone explain what parts of the driver could be multithreaded that would increase WoW’s framerate by 80%?

I personally don’t see how the driver could take more than 5 to 10% of CPU time in WoW or any commercial game. AFAIK the driver mostly does verification and checking and then sticks the information/GL commands onto a FIFO for the card to pull.

I can see that multithreading might help, but only in certain specific situations. For example, if you are supplying RGBA uint8 textures and the NVIDIA driver prefers BGRA uint8, the driver might do a CPU-side conversion to BGRA.
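For concreteness, a minimal sketch (not anyone’s actual engine code) of an upload call that sidesteps that swizzle by handing the driver data in the layout it prefers. GL_BGRA is core in GL 1.2 (EXT_bgra before that); the function name, dimensions, and pixel pointer are placeholders.

// Supplying texel data already in BGRA order so the driver can take it
// without a CPU-side conversion. 'width', 'height', and 'pixels' are
// placeholders for your own data.
#include <GL/gl.h>

void uploadPreSwizzledTexture(GLsizei width, GLsizei height, const void* pixels)
{
    // The internal format stays RGBA8; only the client-side layout is BGRA.
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, width, height, 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, pixels);
    // Passing GL_RGBA here instead could force the driver to convert every
    // texel on the CPU before the upload - the hidden cost described above.
}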

There seems to be something I’m not understanding here… does the CPU actually have to waste cycles feeding data from the FIFO to the card?

Originally posted by Korval:
“What evidence is there that there is benefit?”
Evidence? An 80% performance improvement in WoW isn’t good enough for you?

An 80% increase in FPS? I think that is really large, and it would be good to know the reason. Since lots of games are said to be GPU limited, and many others are CPU limited because of AI and physics, why is WoW GL-driver limited?

It makes me think something is not well coded.

I can say this: Rob Barris knows his stuff. So when he says it’s 2x as fast, you can bet on it.

So, picture a few different programs, all single threaded.

Program A is totally application CPU-bound. Say a fractal generator cranking out texture animations to play back on some spinning quad: 95% application work, 5% driver work. MT-GL won’t help (well, it might help by about 5%, by getting that driver work to run concurrently instead of in the main thread). Restated, the app was not being held back significantly by the driver work taking place.

Program B is the opposite of A: say a fancy Pong game using OpenGL, except it doesn’t do very good state control, switches shaders too often, and basically does a lot of stuff incurring CPU work in the driver. Say it has the opposite ratio of work: 5% application, 95% driver. MT-GL won’t help much here either; about a 5% benefit again.

http://en.wikipedia.org/wiki/Amdahl's_law

Now, consider Program C: say its work balance can vary drastically depending on what is going on - it might be 80% app and 20% driver, or in some really rough situations it might be 50% app and 50% driver. Scene dependent.

Program C’s benefit from MT-GL will therefore also vary, between a 20% and a 50% reduction in clock time, assuming the application thread avoids making any calls that result in synchronization between the app side and the driver side (queries, readbacks, a few other cases).

Thus the “up to 2X faster”… in some weird cases maybe even a little better than 2X when you have less cache contention going on between app-land and driver-land. “2X faster than what?” -> in comparison with the same scene rendered with MT-GL off.
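To make the arithmetic behind those numbers explicit, here is a back-of-the-envelope model, assuming the driver work overlaps perfectly with the application work and ignoring the cost of the command queue itself:

\[
T_{\text{serial}} = T_{\text{app}} + T_{\text{drv}}, \qquad
T_{\text{MT}} \approx \max(T_{\text{app}},\, T_{\text{drv}})
\]
\[
\text{reduction} = 1 - \frac{T_{\text{MT}}}{T_{\text{serial}}}
= 1 - \frac{\max(T_{\text{app}},\, T_{\text{drv}})}{T_{\text{app}} + T_{\text{drv}}}
\]

So an 80/20 split gives a reduction of \(1 - 0.8 = 20\%\), a 50/50 split gives \(1 - 0.5 = 50\%\) (i.e. roughly 2X the frame rate), and the 95/5 splits of Programs A and B only give about 5%.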

I haven’t seen any claim from Apple saying that this technique is novel or unique to OS X.

If you have WoW on OS X (Intel Mac dual core) you can flip the MT stuff on and off in-game:

/console glfaster X

where X = 0, 1, or 2

0 = off
1 = MT on but with a bit of frame throttling
2 = MT on, no throttling, some mouse lag can occur.

I don’t get this; doesn’t everyone have a dispatching draw thread anyway?
app/cull/draw, anyone?
The draw thread is just issuing GL commands from a queue of draw messages… or is that just me and SGI?
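To make that picture concrete, a minimal sketch of such a dispatching draw thread, using an ordinary mutex-and-condition-variable guarded queue. The DrawMsg fields and names are invented for illustration, not taken from any particular engine.

// App thread pushes small draw messages into a guarded queue; a dedicated
// draw thread pops them and is the only thread that talks to GL.
// Usage (app side): DrawQueue q;  std::thread drawer(drawThread, std::ref(q));
#include <condition_variable>
#include <mutex>
#include <queue>

struct DrawMsg {
    unsigned vbo;        // which vertex buffer to bind
    unsigned program;    // which shader program to use
    int      firstIndex; // draw range start
    int      indexCount; // draw range length
    bool     quit;       // sentinel to shut the draw thread down
};

class DrawQueue {
public:
    void push(const DrawMsg& m) {
        { std::lock_guard<std::mutex> lock(mutex_); queue_.push(m); }
        cv_.notify_one();
    }
    DrawMsg pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !queue_.empty(); });
        DrawMsg m = queue_.front();
        queue_.pop();
        return m;
    }
private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<DrawMsg> queue_;
};

// Draw thread body: owns the GL context and issues all GL commands.
void drawThread(DrawQueue& q) {
    for (;;) {
        DrawMsg m = q.pop();
        if (m.quit) break;
        // issueGLCommands(m);  // bind m.vbo / m.program, glDrawElements, etc.
    }
}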

Maybe they coded WoW completely in immediate mode…

Knackered: I don’t think many people are doing that. On single-core CPUs (which is what we were dealing with up to now) it doesn’t make sense. SGI, on the other hand, has certainly had multi-processor machines for a long time.

Jan.

Well, I suppose we’ve been delivering systems targeted at multi-processor x86 systems for literally years now. But I still can’t understand why someone would deliberately engineer a system that couldn’t be easily separated out into threads. You’re talking about a simple FIFO… pushing a few bytes into a queue for every draw op. It’s just proper modular programming.
For me, this sounds like the driver being forced into doing optimizations that the app should be doing - if it were not for the games industry’s addiction to hiring cheap graduates with no programming skill or experience. No offence, Rob.
Thank god it’s an optional feature, or my software would be paying a heavy price in doubled CPU synchronization for other people’s laziness/ineptitude.

WoW is not coded in immediate mode; it uses VBOs, PBOs, pbuffers or FBOs, DrawElements, ARB vertex/pixel shaders…

The idea of rolling our own “rendering thread” has come up before, but the subtleties involving two-way communication between low-level and high-level code, especially for titles where assets are being dynamically loaded using async I/O at almost all times, made it a tough sell.

Also, consider if you had three different games by three different teams with three unique engines. Re-writing each engine to have a rendering thread means duplication of work on the developer side. When the command queuing and parallel processing is provided as a standard feature by the implementation, this is less work for the developer(s) to have to worry about.

I really don’t see anything wrong with an OS level feature that allows single-threaded renderers to still utilize multiple cores and run faster. If some number of developers find that they can make a few changes and achieve the FPS benefit that we got with WoW, I’m not sure who should be bothered by that outcome.

On the topic of perceived extra CPU synchronization, how would an OS provided MT implementation have any more or less synchronization overhead than a thread you authored yourself?

Depending on the subtleties of a particular driver implementation to determine whether your application is interactive or not (say 80% slower than 60fps), when you have an alternative that will guarantee it, doesn’t strike me as good sense. But if it works for you, and I’m sure you’ve done your field work, then all’s well with the world… you’re a braver man than I.

“On the topic of perceived extra CPU synchronization, how would an OS provided MT implementation have any more or less synchronization overhead than a thread you authored yourself?”
If you’re referring to my last sentence, I was saying that if I detect I’m running on multiple cores, I’ll put my drawer in another thread, which has the very small (but real) overhead of a guarded message queue. Now the driver kicks in and decides to fork off its own thread for dispatching GL commands, with the very small (but real) overhead of a guarded message queue. Hence double the sync overhead for every GL command.

Originally posted by knackered:
Depending on the subtleties of a particular driver implementation to determine whether your application is interactive or not (say 80% slower than 60fps), when you have an alternative that will guarantee it, doesn’t strike me as good sense. But if it works for you, and I’m sure you’ve done your field work, then all’s well with the world… you’re a braver man than I.

“On the topic of perceived extra CPU synchronization, how would an OS provided MT implementation have any more or less synchronization overhead than a thread you authored yourself?”
If you’re referring to my last sentence, I was saying that if I detect I’m running on multiple cores, I’ll put my drawer in another thread, which has the very small (but real) overhead of a guarded message queue. Now the driver kicks in and decides to fork off its own thread for dispatching GL commands, with the very small (but real) overhead of a guarded message queue. Hence double the sync overhead for every GL command.

You’ve described a good rationale for making the new behavior opt-in, which is what Apple GL does. We’re glad the old single-threaded mode is still available, since it can make some debugging and profiling tasks easier, and it’s in fact the only mode available on single-core CPUs.

Edit: we’d be happy if every one of our users was getting 60 FPS, but due to the wide spread of machines that our games run on, it’s not uncommon to find users playing at 5-10 FPS and others playing at 120+ FPS. The attribute of “interactivity” is not a Boolean.

If you care to elaborate, I’m curious whether you did any profiling/testing to see which GL operations were using up most of your driver’s CPU time. Also curious: are OS X drivers any better/worse than, say, NVIDIA’s GL driver for chomping up CPU time?

gizza job, rob.

Originally posted by Stephen_H:
If you care to elaborate, I’m curious whether you did any profiling/testing to see which GL operations were using up most of your driver’s CPU time. Also curious: are OS X drivers any better/worse than, say, NVIDIA’s GL driver for chomping up CPU time?
Yes, we did quite a bit (of profiling). We didn’t do any comparative benchmarking against PC GL; closing the gap with Direct3D/WinXP when tested on the same hardware was higher priority.

PC GL presently lacks some of the extensions we’re now counting on with OS X, such as flush_buffer_range, so that would have been a bit of a skewed test as well.
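For readers who haven’t used it, a hedged sketch of the usage pattern that flush_buffer_range (the APPLE_flush_buffer_range extension on OS X) enables: write only part of a mapped buffer and flush only that sub-range, instead of having the whole buffer flushed and serialized at unmap time. The wrapper name and its arguments are placeholders.

// Assumes the GL_APPLE_flush_buffer_range entry points are available (OS X).
#include <OpenGL/gl.h>
#include <cstring>

void updateSubRange(GLuint vbo, GLintptr offset, GLsizeiptr size, const void* src)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);

    // Opt out of the default "flush and serialize the whole buffer" behavior.
    // With serialization off, the app is responsible for not touching bytes
    // the GPU may still be reading (e.g. by double-buffering the ranges).
    glBufferParameteriAPPLE(GL_ARRAY_BUFFER, GL_BUFFER_FLUSHING_UNMAP_APPLE, GL_FALSE);
    glBufferParameteriAPPLE(GL_ARRAY_BUFFER, GL_BUFFER_SERIALIZED_MODIFY_APPLE, GL_FALSE);

    char* dst = static_cast<char*>(glMapBuffer(GL_ARRAY_BUFFER, GL_WRITE_ONLY));
    std::memcpy(dst + offset, src, static_cast<std::size_t>(size));

    // Only the bytes actually written get flushed to the GPU.
    glFlushMappedBufferRangeAPPLE(GL_ARRAY_BUFFER, offset, size);
    glUnmapBuffer(GL_ARRAY_BUFFER);
}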

Rob, WoW being written in immediate mode was a joke. I didn’t want to offend anyone. I just wanted to point out that apps using immediate mode will most certainly benefit very much from a multithreaded driver, since that uses even more CPU resources.

Jan.

Originally posted by Jan:
Maybe they coded WoW completely in immediate mode…
Sounded unequivocal and unambiguous to me. :wink:

Nice. I learned two new words, and still don’t know how you actually meant that.

Originally posted by Jan:
Rob, WoW being written in immediate mode was a joke. I didn’t want to offend anyone. I just wanted to point out that apps using immediate mode will most certainly benefit very much from a multithreaded driver, since that uses even more CPU resources.

Jan.
(wasn’t offended, really!)

Good point about immediate-mode apps, though; I can see how the much higher API-call frequency could potentially give the command-queuing mechanism a real workout.

Though a tuned implementation could batch up everything between a glBegin and glEnd and then submit that to the command queue as a single blob; I have no idea which drivers might do this already.

Originally posted by knackered:
I was saying that if I detect I’m running on multiple cores, I’ll put my drawer in another thread, which has the very small (but real) overhead of a guarded message queue.

It occurred to me that there are ways to implement one-writer / one-reader FIFOs without explicit mutexing on every queue transaction; with a ring buffer and atomic fetch-and-adds you can do it in a lock-free style. You might already be doing that?
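A minimal sketch of one such lock-free single-producer / single-consumer ring, written with C++11 atomics rather than raw fetch-and-add intrinsics; the capacity and the Cmd payload type are arbitrary choices for illustration.

// One writer thread calls push(), one reader thread calls pop(); no mutex.
#include <atomic>
#include <cstddef>

template <typename Cmd, std::size_t N>   // N must be a power of two
class SpscRing {
public:
    bool push(const Cmd& c) {            // writer thread only
        std::size_t head = head_.load(std::memory_order_relaxed);
        std::size_t tail = tail_.load(std::memory_order_acquire);
        if (head - tail == N) return false;          // full
        slots_[head & (N - 1)] = c;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }
    bool pop(Cmd& out) {                 // reader thread only
        std::size_t tail = tail_.load(std::memory_order_relaxed);
        std::size_t head = head_.load(std::memory_order_acquire);
        if (tail == head) return false;              // empty
        out = slots_[tail & (N - 1)];
        tail_.store(tail + 1, std::memory_order_release);
        return true;
    }
private:
    Cmd slots_[N];
    std::atomic<std::size_t> head_{0};   // advanced by the producer
    std::atomic<std::size_t> tail_{0};   // advanced by the consumer
};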

If mutexes are the only way to go on a given platform, there’s also a way to set something like this up by incurring a little bit of command batching and only using the lock to obtain larger batch buffers from a pool… you can amortize the locking over a larger number of transactions that way, trading latency for speed.

It might get a little twisty with variable-sized commands, of course.
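And a rough sketch of that batch-amortized variant, under the assumption of fixed-size commands; a real version would recycle buffers from a pool rather than copying them, and variable-sized commands would need a byte-oriented layout instead of a fixed array.

// The producer appends commands to a local batch with no locking, and only
// takes the mutex when a full batch is handed off; the consumer drains all
// ready batches in one locked grab. Types, sizes, and names are arbitrary.
#include <array>
#include <cstddef>
#include <mutex>
#include <vector>

struct Cmd { int opcode; int args[4]; };

struct Batch {
    std::array<Cmd, 256> cmds;   // fixed-size for simplicity
    std::size_t count = 0;
};

class BatchChannel {
public:
    // Producer side: append without locking; flush a full batch under the lock.
    void submit(const Cmd& c) {
        current_.cmds[current_.count++] = c;
        if (current_.count == current_.cmds.size()) flush();
    }
    void flush() {
        if (current_.count == 0) return;
        std::lock_guard<std::mutex> lock(mutex_);   // one lock per ~256 commands
        ready_.push_back(current_);
        current_.count = 0;
    }
    // Consumer side: one lock to take everything that is ready.
    std::vector<Batch> drain() {
        std::lock_guard<std::mutex> lock(mutex_);
        std::vector<Batch> out;
        out.swap(ready_);
        return out;
    }
private:
    Batch current_;             // owned by the producer thread
    std::mutex mutex_;
    std::vector<Batch> ready_;  // guarded hand-off area
};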