(OT) Multithreading Performance revisited

Some time ago I wrote a test to evaluate the multithreading performance of an OpenGL program under Windows on single- and multiprocessor systems.

The results were “so-so”, and the benchmark program was not really suited for benchmarking (i.e. the number of drawn polygons per frame was proportional to the FPS count).

I have now made some updates that should improve multiprocessor performance (the geometry data is now double buffered), and I would like to collect results from this new test.

If anyone (with or without an SMP system) could try this test, I would be very happy: ParticleBench [EDIT: old bad link removed!]

(14 KB download, 8-minute benchmark)

The demo is a particle engine (rendering 3000 particles), running in fullscreen mode @ 640x480x16 (to avoid fill rate limitations).

[This message has been edited by marcus256 (edited 04-24-2002).]

Just some information concerning the link: I tried to download it by clicking the link, but the server told me it couldn’t find the file. When I copied and pasted the link into my browser’s address field, it worked. Just in case someone else has the same problem…

Marcus, I’ll try the program when I get home, so no results from me yet.

Bob, I also had serious trouble downloading the file (sometimes it seems the server does not recognize newly uploaded files). Now it seems to work all right, though (even from the link above).

Link doesn’t work for me either, I get what I think is an error page in a language I don’t know.

-Mezz

Originally posted by Mezz:
Link doesn’t work for me either, I get what I think is an error page in a language I don’t know.

It’s Swedish. It says the file does not exist (but I’m sure it does!).

Originally posted by marcus256:
It’s Swedish. It says the file does not exist (but I’m sure it does!).

I have the same problem, but just placing the cursor in the address field of the new window and pressing enter will start the download.

I think this is related to the way this forum handles the automatically created URL links in messages, especially if the link is to a non-browser type file (like this zip).

Jean-Marc

Go to http://hem.passagen.se/opengl/glfw/files/ and click the link in the file listing (it’s probably some anti-leeching protection).

WAY OFF TOPIC RESPONSE FOLLOWS

The odd link behavior is due to the semantics IE uses when submitting HTTP requests, combined with the semantics of the HTTP server. For example, some HTTP servers will not satisfy a GET request whose Referer header (the HTTP spelling of “referrer”) names a site other than the server itself, unless the requested item is the default item. When you click the original link above, the request includes a Referer header indicating the page where the link was clicked. When you paste the link into the address bar and click Go, no Referer header is sent.
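To illustrate the kind of server rule described above, here is a minimal sketch in C++. It is not the logic of any particular HTTP server: the function name should_serve and its parameters are hypothetical, and real servers express such anti-leeching checks in their own configuration.

```cpp
#include <string>

// Hypothetical anti-leeching rule: decide whether a GET request is served.
// referer_header is the value of the HTTP "Referer" header ("" if absent).
bool should_serve(const std::string& referer_header,
                  const std::string& own_host,
                  bool is_default_item)
{
    // The default item (e.g. the index page) is always served.
    if (is_default_item) return true;
    // No Referer: the URL was typed or pasted into the address bar.
    if (referer_header.empty()) return true;
    // Otherwise serve only if the link was clicked on a page of this server.
    const std::string prefix = "http://" + own_host + "/";
    return referer_header.compare(0, prefix.size(), prefix) == 0;
}
```

With a rule like this, clicking the link on a forum page fails, while pasting it into the address bar (no Referer) or clicking it in the server's own file listing succeeds.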

Ok - to resolve the problem I created an HTML page hosting the file, here: ParticleBench.

At that page you will also find results that people have sent to me (interesting reading)

MT was just a fraction slower on my system - probably due to context switching (I’m not sure exactly what you are doing) - but the difference is so tiny it isn’t noticeable visually (nice demo, by the way!).

However, I’m sure that, structurally, MT can help spread the CPU load - depending on the number of threads, of course.

Robbo,

Could you please send me the results? (I’m maintaining a results list - and I’m very interested in how different systems behave).

I will release the source some day. Unfortunately it is a bit ugly, since it is a one-file patchwork; I should split it into more logical sections. (What makes it a bit strange here and there is that I try to reuse as much code as possible between MT and ST mode.)

There are two things that cost in MT mode: threading overhead (minimal), and double buffering of the particle data. The latter should hopefully help on a dual-processor system, but at the same time it hurts cache performance and adds a few memory-move instructions, which is why the demo runs slightly slower in MT mode on single-processor systems.
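One possible layout for double-buffered particle data, as a hedged sketch: the demo's source was not released at this point, so Particle, ParticleBuffers and step below are hypothetical. The physics step reads the last published copy and writes the other one, so the renderer can keep drawing while physics runs, at the price of touching twice as much memory per frame.

```cpp
#include <array>
#include <cstddef>
#include <vector>

struct Particle { float x, y, z, vx, vy, vz; };

// Two copies of the particle array: front() is the last published frame
// (what the renderer draws), back() is where physics writes the next one.
class ParticleBuffers {
public:
    explicit ParticleBuffers(std::size_t n)
        : buf_{{std::vector<Particle>(n), std::vector<Particle>(n)}} {}
    const std::vector<Particle>& front() const { return buf_[front_]; }
    std::vector<Particle>& back() { return buf_[1 - front_]; }
    // Publish the back buffer. In a threaded program this must only happen
    // while the renderer is not reading front(); synchronization omitted here.
    void swap() { front_ = 1 - front_; }
private:
    std::array<std::vector<Particle>, 2> buf_;
    int front_ = 0;
};

// One simple Euler physics step: read front(), write back(), then publish.
void step(ParticleBuffers& pb, float dt, float gravity)
{
    const std::vector<Particle>& src = pb.front();
    std::vector<Particle>& dst = pb.back();
    for (std::size_t i = 0; i < src.size(); ++i) {
        Particle p = src[i];      // previous state
        p.vy -= gravity * dt;
        p.x += p.vx * dt;
        p.y += p.vy * dt;
        p.z += p.vz * dt;
        dst[i] = p;               // new state goes to the other buffer
    }
    pb.swap();
}
```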

Yeah - I emailed them to you this morning.

I’m interested in this too - the system I am currently working on is MT - with lots of stuff going on in the foreground and rendering happening in the background.

The problem I have found is that the gl drivers run at a much higher priority than application processes - so if your thread spends a lot of time in the drivers, it needs to be forced to give way (by sleeping every frame) to avoid starving other threads.

My rendering thread runs at low priority - other tasks in the system (data input from infrared cameras, for example) are time-critical, so they must be given more oomph (hence the sleeping in the render thread).

Anyway, it’s a complex issue, but for things like games I guess you’ll want rendering at a higher priority than AI (for example).
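Giving way once per frame, as described above, can be sketched with standard C++ threads instead of the Win32 calls an MFC app would likely use. render_loop and render_frame are hypothetical names; render_frame stands in for the real OpenGL drawing and buffer swap, and the yield time is a tuning parameter.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical render-thread main loop that sleeps after every frame so
// that time-critical threads get CPU time even while the driver is busy.
// render_frame returns false to stop. Returns the number of frames drawn.
int render_loop(int max_frames, std::chrono::milliseconds yield_time,
                const std::function<bool()>& render_frame)
{
    int frames = 0;
    while (frames < max_frames) {
        if (!render_frame()) break;
        ++frames;
        std::this_thread::sleep_for(yield_time);  // give way to other threads
    }
    return frames;
}
```

The cost is a hard cap on frame rate: with a 40 ms yield the loop can never exceed 25 frames per second, however fast the drawing itself is.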

[This message has been edited by Robbo (edited 04-25-2002).]

Robbo,

Sorry, did not recognize your name.

Well, for apps like games and “demos” (like the particle demo), which update things on a frame-by-frame basis, it is more or less automatically solved with thread synchronization.

E.g. when the particle renderer needs particle data, it checks whether the particle physics thread is done. If not, it waits for a signal from the physics thread. On a single-processor system this results in an immediate context switch from the rendering thread to the physics thread, regardless of thread priorities (a context switch takes about 2 microseconds under Win 2k, and over 30 microseconds under Win 9x). On a multiprocessor system, the physics thread is most likely already executing “in the background”.

When the physics thread is done, it signals the rendering thread, and continues to work with the next frame (unless it is already one frame ahead of the rendering thread).
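The hand-off described above can be sketched with standard C++ primitives. The demo itself uses GLFW mutexes and condition variables; std::mutex and std::condition_variable stand in for them here, and FrameSync and run_demo are hypothetical names. Plain integers replace the particle buffers so that the frame ordering is easy to check.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Physics may run at most one frame ahead: it fills the back buffer,
// publishes it, and waits until the renderer has swapped it to the front.
class FrameSync {
public:
    // Physics thread: block until the previously published frame was taken.
    void wait_until_consumed() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !ready_; });
    }
    // Physics thread: the back buffer now holds a finished frame.
    void publish() {
        {
            std::lock_guard<std::mutex> lock(m_);
            ready_ = true;
        }
        cv_.notify_all();
    }
    // Render thread: block until a frame is ready, swap buffers under the
    // lock, then wake the physics thread so it can start the next frame.
    template <class SwapFn>
    void acquire(SwapFn swap_buffers) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return ready_; });
        swap_buffers();
        ready_ = false;
        lock.unlock();
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool ready_ = false;
};

// Demo: returns the frame numbers the "renderer" saw, in order.
std::vector<int> run_demo(int frames) {
    FrameSync sync;
    int back = 0, front = 0;
    std::vector<int> seen;
    std::thread physics([&] {
        for (int f = 1; f <= frames; ++f) {
            sync.wait_until_consumed();
            back = f;                 // "simulate" frame f into the back buffer
            sync.publish();
        }
    });
    for (int i = 0; i < frames; ++i) {
        sync.acquire([&] { std::swap(front, back); });
        seen.push_back(front);        // "render" the front buffer
    }
    physics.join();
    return seen;
}
```

Neither thread spins: each blocks in the condition variable until the other signals, so on a single CPU the scheduler simply runs whichever thread is not blocked, independent of priorities.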

Originally posted by Robbo:
The problem I have found is that the gl drivers run at a much higher priority than application processes - so if your thread spends a lot of time in the drivers, it needs to be forced to give way (by sleeping every frame) to avoid starving other threads.

I believe this is primarily a result of the sucky, crappy, no-good NT task scheduler, which gives equal priority to all threads/processes within the same priority class, regardless of CPU load. The OpenGL renderer thread obviously consumes a lot of CPU, and thus gets the majority of the CPU time. You also have the problem that the task-switching granularity is usually on the order of 10-20 ms. If no explicit thread synchronization is done (wait for signal, sleep, etc.), that is what you get.

[This message has been edited by marcus256 (edited 04-25-2002).]

marcus256,

RE: running all threads in the same process at the same priority… when you create the thread, do you specify THREAD_PRIORITY_ABOVE_NORMAL for the physics engine? The thread’s priority will then be slightly higher than the original process’s priority, which should prevent it being “starved” by the hungry renderer thread. Mind you, this may result in the opposite problem - the engine slowing the renderer… You could also tinker with the priority of your original process… making it slightly higher would starve the rest of the system and feed all your threads…

Another common thing I’ve seen is a dependence on ::Sleep(0), rather than true signaling. I don’t know how you’ve implemented your signaling, but you may want to try ::CreateEvent() and ::WaitForMultipleObjects() if you’re not already. ::Sleep(0) causes a lot of context thrashing, while with ::WaitForMultipleObjects() the “sleeping” thread never gets switched to unless it’s been signaled or timed out…
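The difference between polling with ::Sleep(0) and true signalling can be sketched portably. The class below is a hypothetical stand-in for a Win32 auto-reset event (created with ::CreateEvent and waited on with ::WaitForSingleObject or ::WaitForMultipleObjects); it is not the Win32 API itself, just the same behavior built on a condition variable.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Auto-reset event: one set() releases one successful wait. A waiting thread
// blocks in the condition variable and is not scheduled again until it is
// signalled or times out, unlike a loop that repeatedly calls Sleep(0).
class AutoResetEvent {
public:
    void set() {
        {
            std::lock_guard<std::mutex> lock(m_);
            signalled_ = true;
        }
        cv_.notify_one();
    }
    // Returns true if signalled (and resets the event), false on timeout.
    bool wait_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(m_);
        if (!cv_.wait_for(lock, timeout, [this] { return signalled_; }))
            return false;
        signalled_ = false;   // auto-reset
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool signalled_ = false;
};
```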

Glad to see someone working on this - I’ve been thinking someone ought to look into the pros/cons of various approaches to multithreaded rendering. I believe my current approach relies too heavily on pBuffers - it works nicely for simple things, but anything complicated requires a ton of memory on the video board for the buffers. All I’m really interested in is a “progressive detail” mechanism that only uses whatever CPU cycles are left over after all the important things going on in the foreground are done. (Say you’re getting data from a sensor. When new data comes in, you send it to the renderer and get a “rough draft” of the updated view. In the meantime you’ve got other threads archiving the new data, forwarding it to clients over the net, responding to user requests, etc. Meanwhile, the renderer thread is munching away at iteratively better drafts of the updated view, until new data comes in.)
I’ve found that, particularly in systems that hit the disk or routinely handle network traffic, there’s plenty of spare “pure” cpu cycles being wasted for a simple renderer to get some time.
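The “progressive detail” idea can be sketched as a refinement loop that abandons its work as soon as new data arrives. This is a sketch under stated assumptions: refine_until_new_data, render_pass and new_data_arrived are hypothetical, and in a real system the loop would run on the low-priority render thread.

```cpp
#include <functional>

// Render a rough draft (level 0), then progressively better drafts, and stop
// early when new sensor data arrives so the caller can restart at level 0.
// Returns the last detail level that was completed.
int refine_until_new_data(int max_level,
                          const std::function<void(int)>& render_pass,
                          const std::function<bool()>& new_data_arrived)
{
    for (int level = 0; level <= max_level; ++level) {
        render_pass(level);
        if (new_data_arrived())
            return level;   // new data: stop refining the stale view
    }
    return max_level;       // fully refined, nothing new came in
}
```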

Well, didn’t mean to write quite so much - now I’m late for work. I’ll d/l the demo tonight and send results from both home and work…

-Chris Bond

Originally posted by ChrisBond:
marcus256,

RE: running all threads in the same process at the same priority… when you create the thread, do you specify THREAD_PRIORITY_ABOVE_NORMAL for the physics engine? The thread’s priority will then be slightly higher than the original process’s priority, which should prevent it being “starved” by the hungry renderer thread.
-Chris Bond

Hi Chris,

I found that altering priorities like that doesn’t make a huge difference, because most of the work is done in the driver, and who knows what it’s getting up to at that level.

Even when I set priority to below normal for the render thread, I get GUI lockup (windows, MFC) - perhaps the GUI runs on a very low priority or something - but if I give way in my threadmain (say, 40ms or so) per frame, everything comes back to life.

Originally posted by ChrisBond:
marcus256,

RE: running all threads in the same process at the same priority… when you create the thread, do you specify THREAD_PRIORITY_ABOVE_NORMAL for the physics engine? The thread’s priority will then be slightly higher than the original process’s priority, which should prevent it being “starved” by the hungry renderer thread.

Sorry, my post was perhaps a bit confusing. I was guessing what Robbo was experiencing.

My app (the particle demo) uses proper signalling (GLFW mutexes and condition variables), so thread scheduling priorities are not a problem. I think I could set one thread to maximum priority and the other to minimum priority without any difference in performance. That said, I don’t change the priorities of my threads (both run at NORMAL).

I’ve sent you my results; it would be nice if you could update your site :->

PS: I’ve not changed my FPS:
CPU: AMD Athlon 800 MHz, 1 CPU
Gfx Card: GeForce2 GTS
OS: Windows 2000 SP2
16245 frames in 60.01 seconds = 270.7 FPS (multithreading off)
14616 frames in 60.01 seconds = 243.6 FPS (multithreading on)

[This message has been edited by T2k (edited 04-26-2002).]

Tester: (chaney /
CPU: (Pentium III / 1 GHz / 2)
Gfx Card: (Elsa Gloria III / 64MB)
OS: (Windows 2000)


** Single Thread test **


17630 frames in 60.00 seconds = 293.8 FPS (multithreading off)
17832 frames in 60.00 seconds = 297.2 FPS (multithreading off)
17653 frames in 60.01 seconds = 294.2 FPS (multithreading off)
17926 frames in 60.00 seconds = 298.7 FPS (multithreading off)


** Dual Thread test **


19659 frames in 60.01 seconds = 327.6 FPS (multithreading on)
19632 frames in 60.00 seconds = 327.2 FPS (multithreading on)
19660 frames in 60.00 seconds = 327.6 FPS (multithreading on)
19662 frames in 60.00 seconds = 327.7 FPS (multithreading on)
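The FPS figures in these results are frames divided by elapsed seconds, and comparing MT against ST is a simple ratio. A small helper set (the names fps, mean and speedup are hypothetical) makes that comparison explicit:

```cpp
#include <numeric>
#include <vector>

// Frames per second for one benchmark run.
double fps(long frames, double seconds) { return frames / seconds; }

// Mean of several runs (v must not be empty).
double mean(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

// Ratio > 1 means multithreading was faster than single threading.
double speedup(double fps_mt, double fps_st) { return fps_mt / fps_st; }
```

Applied to chaney's dual Pentium III runs, the four single-threaded results average about 296.0 FPS and the four multithreaded ones about 327.5 FPS, a speedup of roughly 1.11. T2k's single-CPU Athlon gives 243.6 / 270.7 ≈ 0.90, i.e. about 10% slower with multithreading on.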