Strange loss of performance

Hi! My name is Morgan Johansson and I am working on (among other things) an OpenGL-based graphics engine for a game with a very high triangle count.

The last few days I have been optimizing the code and I found something strange - I seem to have a black hole in my code draining performance.

The program is multithreaded (through SDL) and rendering has its own thread. Profiling has shown me that 99.91% of the time in the demo program is spent waiting for rendering to finish. Nearly all of the CPU-intensive tasks of my own making are performed in the remaining 0.09%.

At first I simply thought the graphics card was the limiting factor, but that is not the case. Moving from an Intel 865 integrated chip to a GeForce 3 or an ATI FireGL X1 gives no more than twice the framerate (from 20 to 40 fps).

The scene is a rendering of 300 objects of 550 triangles each (though only 6 geometries). This is currently displayed using vertex arrays (glDrawElements with GL_TRIANGLES). Each vertex has position, normal and texture coordinates in floats. There is a single texture on the triangles.
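
Roughly, the draw path looks like this (heavily simplified; the struct and identifiers are just placeholders for what the engine actually does, and the 16-bit index type is an assumption):

#include <GL/gl.h>

// Simplified sketch of the per-frame draw path. Each of the 6 geometries is a
// set of vertex arrays; each of the 300 instances gets its own modelview matrix.
struct Geometry
{
    const float*          positions;  // 3 floats per vertex
    const float*          normals;    // 3 floats per vertex
    const float*          texcoords;  // 2 floats per vertex
    const unsigned short* indices;    // 3 indices per triangle (16-bit assumed)
    int                   indexCount;
};

void drawInstance(const Geometry& g, const float modelMatrix[16])
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, g.positions);
    glNormalPointer(GL_FLOAT, 0, g.normals);
    glTexCoordPointer(2, GL_FLOAT, 0, g.texcoords);

    glPushMatrix();
    glMultMatrixf(modelMatrix);   // or a glLoadMatrixf / glTranslatef variant
    glDrawElements(GL_TRIANGLES, g.indexCount, GL_UNSIGNED_SHORT, g.indices);
    glPopMatrix();
}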

So far I have tried the following with no or very little effect on the framerate:

  • Turned off texturing and blocked all calls that upload textures to graphics memory or enable them.
  • Decreased the number of triangles in each object to 260.
  • Used display lists for all the drawing (6 lists in total).
  • Switched between matrix loading and calls to glTranslate etc.

The only thing that seems to have an effect on the framerate is decreasing the number of instances drawn of the six meshes.

Some statistics I have:

  • I only get a vertex processing rate of 5-10M vertices/second on a GeForce 3 (Athlon 1.33 GHz). As I don't use strips, that is about 2-3M triangles/second.
  • I change material settings 9 times each frame.
  • I do one push, load, pop on the modelview matrix for each instance.

I realize that there are plenty of things I can do to boost performance. But what I would like to know is where I lose performance. It seems to me it is probably either some CPU-intensive task hidden in the drivers or some bus that isn't fast enough.

Any help with this problem is appreciated! Sorry about the lengthy post, I wanted to describe the problem in detail.

Cheers,
Morgan Johansson

VBO?

Check whether you are fill rate limited: render to a much smaller window and see if your frame rate increases.
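
If recreating the window is a hassle, one rough approximation is to shrink the viewport and scissor for a test run, something like:

#include <GL/gl.h>

// Quick fill-rate test: draw into a fraction of the window so far fewer pixels
// are shaded and cleared. If the frame rate jumps, you are fill limited.
void setTestViewport(int windowWidth, int windowHeight)
{
    glViewport(0, 0, windowWidth / 4, windowHeight / 4);
    glScissor (0, 0, windowWidth / 4, windowHeight / 4);
    glEnable(GL_SCISSOR_TEST);   // limits glClear to the same region
}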

If it's not fill rate, then using VBO or VAR/VAO should give you better performance.

In summary, you are drawing 300 * 550 = 165 000 triangles per frame @ 40 fps = 6.6 MTris/sec on a GF3 only using standard vertex arrays. The performance seems to be normal in that case.

You say that 99.91% of your time is spent in "waiting for rendering to finish". I'm assuming you're speaking of swapbuffers here. If your rendering code is quite simple with no CPU work, one theory is that all your calls are queued up by the driver; when swapping the buffers, the queue might be full and need to be emptied before control is handed back to the program.

It could be a bandwidth problem; try modifying your vertex format to only a position, and check whether the performance changes. Since you're not using any fancy shading or texturing, I don't think you're fillrate limited, but as it's easy to test (just decrease the resolution of your window), check that too.
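
For the vertex-format test, something along these lines should be enough (the identifiers are placeholders for your existing mesh data):

#include <GL/gl.h>

// Bandwidth test: submit positions only, 12 bytes per vertex instead of 32.
void drawPositionsOnly(const float* positions,
                       const unsigned short* indices, int indexCount)
{
    glDisableClientState(GL_NORMAL_ARRAY);
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, positions);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
}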

Y.

VBO support is planned for the future, but the question is not really what I can do to improve performance a bit, but rather what could possibly be wrong, as performance is so bad. I would have expected much higher vertex processing rates than I currently see. Display lists should give me fair performance, should they not?

Fill rate is probably not the problem: 800x600 renders at the same rate as 640x480. Also, I would have expected the different graphics cards to make more of a difference if fill rate were the problem.

I only get 20 fps with the GeForce 3 (the same as with the Intel 865). The 40 fps was with the ATI FireGL X1. As far as I can tell these are very low numbers. The Intel 865 should never be as fast as the GeForce 3 unless the CPU is the problem.

"Waiting for rendering to finish" is actually the thread synchronization. The main loop sleeps until there is work to do (mostly transforms in this case).
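
In outline, the hand-off looks something like this (SDL's mutex/condition calls; the single flag and the names are just for illustration, the real code tracks more state):

#include "SDL.h"
#include "SDL_mutex.h"

// Illustration only: the main loop sleeps until the render thread signals
// that there is work (transforms) to do for the next frame.
SDL_mutex* lock      = SDL_CreateMutex();
SDL_cond*  workCond  = SDL_CreateCond();
bool       workReady = false;

void waitForWork()                      // called by the main loop
{
    SDL_mutexP(lock);
    while (!workReady)
        SDL_CondWait(workCond, lock);   // releases the mutex while sleeping
    workReady = false;
    SDL_mutexV(lock);
}

void signalWork()                       // called once per frame by the render thread
{
    SDL_mutexP(lock);
    workReady = true;
    SDL_CondSignal(workCond);
    SDL_mutexV(lock);
}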

I will try modifying the vertex format, but as I have tried rendering everything using display lists, I wonder if it could really be a bandwidth problem? Aren't display lists always stored in graphics memory?

Yeah, if you were bandwidth limited, I would have expected a gain in performance when switching to display lists. How are you building them? You also said that you tried reducing the number of triangles per object and didn't see a difference, which suggests the problem is not geometry or bandwidth related. Although with standard vertex arrays, if you're pushing 10M vertices/second at 32 bytes per vertex (8 floats), that's a throughput of 320 MB/second, which is still quite high.

All of this suggests a CPU bottleneck. Are you sure you're not forcing the render thread to wait somewhere? You mentioned multiple threads; what happens if you only use one thread?

Y.

I am building the display lists like this:
* glGenLists
* activate textures and array pointers
* glNewList
* glDrawElements
* glEndList

Calling it works, and it does improve performance a little bit (from ~18 to ~21 fps).
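
In code, the build step is essentially this (simplified, placeholder identifiers):

#include <GL/gl.h>

// Simplified build of one display list. The vertex arrays are dereferenced
// while the list is compiled, so the geometry data is captured into the list.
GLuint buildList(const float* positions, const float* normals,
                 const float* texcoords, const unsigned short* indices,
                 int indexCount)
{
    GLuint list = glGenLists(1);

    // Texture and array pointer state is set up before glNewList, as above;
    // only the glDrawElements call ends up compiled into the list.
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, positions);
    glNormalPointer(GL_FLOAT, 0, normals);
    glTexCoordPointer(2, GL_FLOAT, 0, texcoords);

    glNewList(list, GL_COMPILE);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
    glEndList();

    return list;
}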

I do have one place where there could be some waiting, but that is a single mutex lock in the rendering (called once per frame), and it can only cause waiting during that 0.09% of the time. I'll have to check that further, though. But the way I see it, there is hardly any other way to write a multithreaded engine.

Good suggestions! Thank you! It will take me some time to check this.

neomind, you are using SDL’s multithreading?

There can be very strange things happening when using SDL multithreading and OpenGL. I tried it once and remember alpha blending just refusing to work. I spent days on this problem, but in the end SDL's threading turned out to be the cause, and when I removed it, everything worked fine.

Apart from that (obviously you are not having this problem), I think the SDL docs say something about mutex fairness between threads not being 100% guaranteed, or something like that. I also had some problems synchronizing my multithreaded engine, both with concurrency and speed.

So if I were you I'd try rendering without multithreading and see what happens. <EDIT> see my post below</EDIT>
If all else fails, you might want to try using Windows API threads instead. If you don't have to be cross-platform compatible, they work OK.

hope that helps
hoshi55

[This message has been edited by hoshi55 (edited 01-21-2004).]

I couldn't find the section about SDL concurrency imprecisions; maybe I read that on some mailing list. But this is from SDL's doc project site:

In general, you must be very aware of concurrency and data integrity issues when
writing multi-threaded programs. Some good guidelines include:

Don’t call SDL video/event functions from separate threads

Don’t use any library functions in separate threads

The SDL FAQ has some more specific advice:

Q:
Can I call SDL video functions from multiple threads?

A:
No, most graphics back ends are not thread-safe, so you should only call SDL video functions from the main thread of your application.

Didn't you mention that you call your rendering stuff from one thread and leave the transforms etc. to the main thread? Maybe you should try rendering from the main thread.

Do you have back-face culling enabled? What about depth testing?
When you do glClear, do you pass GL_DEPTH_BUFFER_BIT or (GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT)? Try just GL_DEPTH_BUFFER_BIT.

Anyway, you won't see all 550*300 polygons at once, because they occlude each other and some of them are off screen.
So…
You have to reject unseen polygons before submitting them to the video card if they are outside the camera's field of view. You can start at the object level: give each object a bounding box and do a fast check whether it is outside the viewing frustum.
Alternatively, if the scene is static I would build a BSP out of it. That allows you to draw back to front, do back-face culling on the CPU (so you can disable the depth test and back-face culling on the video card), and to cull unseen items against the frustum pretty fast.
Doing this on the processor side will improve your frame rate considerably.
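
A bounding-sphere variant of that per-object check is the cheapest version; a rough sketch (extracting the six planes from the matrices is assumed to happen elsewhere, and the planes must be normalized):

// Rough sketch: cull a whole object if its bounding sphere is completely
// outside any of the six frustum planes (normals pointing inward, normalized).
struct Plane  { float a, b, c, d; };
struct Sphere { float x, y, z, radius; };

bool sphereInFrustum(const Sphere& s, const Plane frustum[6])
{
    for (int i = 0; i < 6; ++i)
    {
        float dist = frustum[i].a * s.x + frustum[i].b * s.y +
                     frustum[i].c * s.z + frustum[i].d;
        if (dist < -s.radius)
            return false;   // entirely outside this plane: don't submit the object
    }
    return true;            // inside or intersecting: draw it
}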

[This message has been edited by mariuss (edited 01-21-2004).]

Yes, I am using SDL's multithreading. I am, however, using the render thread for all OpenGL calls. So far I have not had any problems (except this one, if that is the cause).

But given the strange nature of the problem it seems wise to try without the threading. It could very well be the cause.

I had not heard of the mutex fairness problem. That in itself should not cause this, I think, but it might still be wise to be cautious.

Thank you for your advice, hoshi!

Back face culling is enabled.
Depth test is enabled.

The scene is dynamic, so I am using view frustum culling and a kind of scene partitioning that allows me to cull multiple objects at once. Nothing fancy, but it works well. Anyway, the problem is that I cannot draw enough of the triangles that aren't culled.

[This message has been edited by neomind (edited 01-21-2004).]

Did you try interleaving the data? I also found that it made a noticeable improvement if the data was byte-aligned, i.e. 32 bytes per vertex as opposed to 24.

I also found glDrawArrays worked faster than glDrawElements in certain cases.
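
For example, an interleaved layout for the format described above (which already happens to be exactly 32 bytes) would be something like:

#include <GL/gl.h>

// One interleaved 32-byte vertex: position + normal + texcoords, so every
// vertex starts on a 32-byte boundary.
struct Vertex
{
    float px, py, pz;   // position
    float nx, ny, nz;   // normal
    float u, v;         // texture coordinates
};                      // sizeof(Vertex) == 32

void setInterleavedPointers(const Vertex* vertices)
{
    const GLsizei stride = sizeof(Vertex);
    glVertexPointer  (3, GL_FLOAT, stride, &vertices[0].px);
    glNormalPointer  (   GL_FLOAT, stride, &vertices[0].nx);
    glTexCoordPointer(2, GL_FLOAT, stride, &vertices[0].u);
}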

EDIT:
Just remembered: I read an article once which found the GeForce 3 to be only about twice as fast as integrated solutions when not dealing with programmable code.

[This message has been edited by maximian (edited 01-21-2004).]

Here's a quick speedup trick, not guaranteed for all apps but worth a try: move the buffer-swap call to the start of the frame rather than the end. If needed, add your own MP-safe flag to indicate that all client-side rendering has finished so you can start processing your next frame immediately (animation, physics, etc.).
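
As a sketch of the frame loop (the helper names are placeholders, and I'm assuming SDL's swap call since that's what you're using):

#include "SDL.h"

// Hypothetical helpers standing in for the engine's real per-frame work.
void updateAnimationAndPhysics();
void cullAndSubmitGeometry();

// "Swap first" frame loop: the potentially blocking swap happens before the
// CPU work for the next frame, so that work overlaps with the GPU finishing
// the previous one instead of sitting behind a full command queue.
void frame()
{
    SDL_GL_SwapBuffers();          // presents the frame submitted last time round

    updateAnimationAndPhysics();   // CPU-side work for the coming frame
    cullAndSubmitGeometry();       // issue the GL calls for the coming frame
    // no swap here; the next call to frame() swaps before doing more work
}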

Also, are you running windowed or full-screen? There's a difference in swap behavior, though I'm fairly sure you're right about being stuck in synchronization for some reason. This might mean you have plenty of rendering time left but the swap is blocking (hence the first suggestion). You might also test this with and without vsync turned on.

Avi

Ok, I’ve tried the engine without threading. Made no difference at all.

Haven't had the time to change the data format yet. It is quite deeply embedded in the engine, so it will take some time to change.

The program is running full screen 800x600 in 16 bit color depth.

Neomind, you seem to have missed a critical piece of information, so I'll quote what Ysaneya said:

In summary, you are drawing 300 * 550 = 165 000 triangles per frame @ 40 fps = 6.6 MTris/sec on a GF3 only using standard vertex arrays. The performance seems to be normal in that case.

To reiterate, your performance is pretty reasonable, even at 20fps, for a GeForce 3 without VBO or VAR.

Now, to answer your "where do I lose performance?": in the drivers.

The GPU can't directly access system memory unless it is AGP memory. So, when you set up your vertex arrays and call glDraw* to render them, the driver has to copy those vertices out to memory that the GPU can read directly. It must also do two other things:

1: Make sure that the vertex data is in a format the GPU can read. So, if the GPU can’t handle unsigned shorts, and you have positions as unsigned shorts, it must convert them into floats during the copy operation.

2: Make sure that the number of indices is below the hardware-defined limit on how many indices can be drawn at once.

You probably aren't hitting #2. But #1 you may be, if you're using a vertex format that is not supported by the hardware. Stick with floats if you want to guarantee support.

Note that the driver must do these copy operations each time you render an instance, since it can't guarantee that you haven't changed the vertices since the last call, even if you haven't called a gl*Pointer since then.

In short, the driver has to do a lot of copying. A GeForce 3 using VAR or VBO can get much better performance, largely because those copy operations no longer have to happen.
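
For reference, the ARB_vertex_buffer_object path looks roughly like this (entry points obtained through the extension mechanism, e.g. SDL_GL_GetProcAddress; the interleaved 32-byte layout is just an example):

#include <GL/gl.h>
#include <GL/glext.h>   // GL_ARRAY_BUFFER_ARB etc.; entry points loaded at runtime

// Upload once; after this the driver no longer copies the arrays on every draw.
GLuint vbo, ibo;

void uploadMesh(const float* vertices, int vertexBytes,
                const unsigned short* indices, int indexBytes)
{
    glGenBuffersARB(1, &vbo);
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, vertexBytes, vertices, GL_STATIC_DRAW_ARB);

    glGenBuffersARB(1, &ibo);
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, ibo);
    glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, indexBytes, indices, GL_STATIC_DRAW_ARB);
}

void drawMesh(int indexCount)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
    glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, ibo);

    // With buffers bound, the gl*Pointer "pointer" is a byte offset into the VBO.
    // Offsets assume interleaved 32-byte vertices: position, normal, texcoord.
    glVertexPointer  (3, GL_FLOAT, 32, (const GLvoid*)0);
    glNormalPointer  (   GL_FLOAT, 32, (const GLvoid*)12);
    glTexCoordPointer(2, GL_FLOAT, 32, (const GLvoid*)24);

    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, (const GLvoid*)0);
}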

I don't think it's that simple. He mentioned his vertex format (position, normal and tex coords in floats), which is very standard. Also, if it were that kind of problem, decreasing the polycount should affect the framerate, which he says it doesn't.

Y.

Note that when vendors say "twenty-hojillion triangles per second," you can only achieve that if you terminate every other thread on your computer, bake all of those triangles into a single display list, use a single modelview matrix and texture for all of them, and spend the entire second doing nothing but rendering.

If you have a regular application that draws many (say, hundreds of) objects per frame, with many (say, hundreds of) material, light, texture and modelview states, then your performance in triangles per second will be nowhere near the rated peak throughput. That's just a fact; learn to live with it.

Also, the first thing I thought was "fill-rate bound, probably?" – have you tried with a smaller window (say, 1/8 the size)?

Last, if you're bound on just waiting for the pipeline to flush in swapbuffers, you may be able to add a lot more CPU processing on your side with no drop in frame rate on HT&L cards – i.e., your CPU usage could go from 1% to 10%, the wait would go from 99% to 90%, and you'd still have the same frame rate. On non-HT&L cards (Intel Extreme, for example) that's less often the case, unless you're extremely fill-bound and your CPU work doesn't stall on memory much.

Thank you all for your advice on this subject.

To further mystify the problem, I have now tried (I should have done it long ago) using much more complex models (10,000 triangles and 30,000 vertices). Suddenly I am getting the performance I was expecting.

Displaying the same number of objects, I am now getting around 50M processed vertices each second. Still using nothing more than triangles in display lists. The vertex format remains the same.

Displaying the same number of objects, I am now getting around 50M processed vertices each second. Still using nothing more than triangles in display lists.

What this says is that the raw vertex count is clearly not the limit; it's the number of instances. Which leads to the following questions:

1: How much state are you setting up per instance?
2: How many different textures are you using?
3: Do you still get this performance if you don’t use display lists?