Performance Question

JotDot · March 23, 2004, 3:33pm

I feel a bit foolish asking this (and this might not be the correct forum) but here it goes:

In my spare time I am setting up an engine but I am confused about its performance (good/bad).
A bit of needed info: I have a GF4 ti4200 (x4 AGP 64Meg) and a 1.6 Gig P4.

I checked the Nvidia site and looked at the peak performances. I think I read somewhere that the nvidia performance statistics are extremely optimized and that I should not expect results close to it in my application. What I wondered is if I am “on the mark” or is there something I am doing wrong (since my results are much lower).

I set up my test app to generate “lots” of tris in order to check how well it is performing. It has a skydome with approx 21.6K tris, a patch of ground with 8K tris, and a “pillar” in the center with 8K tris as well. At this primitive stage I just sent everything to the card.

3 Textures totalling 38016 tris. Each texture is done in order (ie: Only 3 texture changes) using vertex lighting.
Static VBO’s for everything.
Indexes are GL_UNSIGNED_SHORT and I use glDrawRangeElements.
I am sending GL_TRIANGLES.
I only clear the Z buffer each frame.
My “hud” displays position, rotation, and FPS. This is the only part that is still done with VAs.

Currently the VBO indexes are sent row by row. I read an article from nvidia about sending everything in one tri strip and using degenerates. Does this help much or is it not worth the bother?

I set up some simple profiling code and included the percentages (in case it helps).

Performance (24 bit color, 8 alpha) fullscreen:
395 fps @ 800x600 Processing Time: glDrawRangeElements 38% SwapBuffers 24%
300 fps @ 1024x768 Processing Time: glDrawRangeElements 44% SwapBuffers 47%

At first glance I thought it was good but when doing the maths it does not seem that great. Opinions? Suggestions? Any help in this matter would be greatly appreciated as I am hesitant to continue until I feel this is resolved.
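For reference, the "maths" can be sketched as below. This is only a back-of-the-envelope calculation from the figures quoted in this post (38016 triangles per frame and the two measured frame rates); the function name `throughput` is made up for illustration, and the pixel figure ignores overdraw, so it is a lower bound on what the card actually fills.

```python
# Rough throughput implied by the numbers in the post.
TRIANGLES_PER_FRAME = 38016

# (width, height, measured fps) as reported above
runs = [(800, 600, 395), (1024, 768, 300)]


def throughput(width, height, fps, triangles=TRIANGLES_PER_FRAME):
    """Return (triangles per second, visible pixels per second).

    Pixels per second counts each screen pixel once per frame,
    so any overdraw is not included.
    """
    return triangles * fps, width * height * fps


for w, h, fps in runs:
    tris, pixels = throughput(w, h, fps)
    print(f"{w}x{h}: {tris / 1e6:.1f} Mtris/s, {pixels / 1e6:.1f} Mpixels/s")
```

That works out to roughly 15 million triangles per second at 800x600 and about 11.4 million at 1024x768.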

Jens_Scheddin · March 23, 2004, 3:52pm

I think you are experiencing normal/good results. You could further check your application with a performance analyzer like VTune or so. Otherwise it’s hard to say where you could speed up something without having your complete source.
I don’t think that triangle strips will help much because if you are sending the triangle indices in a cache friendly way, the driver automatically generates strips.
From an OpenGL point of view, I don’t think that there is much you can do in terms of optimization if you already do your own state caching in your application to keep state changes to a minimum.
My tip is to test everything out for yourself. This way you’ll get the optimal performance for your application, cause every app is different (well, almost).

God, i’ve got to go to bed now…

Hope this helps

Ysaneya · March 23, 2004, 11:17pm

You need to learn what a bottleneck is.

There is no point in trying to increase the data transfer rate or the transform rate if you are fillrate limited, which you are at these kinds of framerates. The fact that you lose 95 fps by going up from 800x600 to 1024x768 proves that.

Y.

SeskaPeel · March 23, 2004, 11:35pm

I agree with Ysaneya, and furthermore, your HUD may be causing some overdraw.

NVidia web site has tons of bottleneck tutorials. Try to catch an up to date one though.

And for strip optimization, I fear the hardware won’t generate strips automatically, but rather works with a pre-transform and a post-transform cache. Remapping the indices of your vertices in your array may drastically increase performance. The small tricky point is the pre-transform cache: when you “upload” a vertex to the pipeline, the hardware takes a bunch of its neighbours along with it and stores all of them in the pre-transform cache. Remapping the buffers will make good use of this cache. As a second step, you can try to use tri strips, but in all my tests, they gave poor results.
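The post does not spell out what the remapping looks like. One common reading, and it is only an assumption here, is to renumber the vertices in the order the index buffer first references them, so that vertex fetches walk through memory more or less front to back. A minimal sketch of that idea (the function name `remap_by_first_use` is hypothetical):

```python
def remap_by_first_use(vertices, indices):
    """Reorder vertices by first appearance in the index list.

    Returns (new_vertices, new_indices). The triangles drawn are
    unchanged; only the storage order differs. Vertices that are
    never referenced are dropped.
    """
    remap = {}          # old index -> new index
    new_vertices = []
    new_indices = []
    for old in indices:
        if old not in remap:
            remap[old] = len(new_vertices)
            new_vertices.append(vertices[old])
        new_indices.append(remap[old])
    return new_vertices, new_indices


# Tiny example: four vertices referenced in a scattered order.
verts = ["a", "b", "c", "d"]
idx = [3, 1, 0, 0, 1, 2]
new_verts, new_idx = remap_by_first_use(verts, idx)
print(new_verts, new_idx)
```

After the remap the index list reads 0, 1, 2, 2, 1, 3, so neighbouring indices point at neighbouring vertices in the buffer.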

SeskaPeel.

Jens_Scheddin · March 24, 2004, 3:27am

Just to clarify what I meant with “sending triangle indices in a cache friendly way”:
If your triangles look like the fine ASCII art below, you should send your indices in the order 0, 1, 2, 2, 1, 3 assuming clockwise order (next would be 3, 1, 5, 5, 1, 4). This will let the driver reuse the same vertex (here index 2 for the first two tris) without fetching it again from some memory. So it takes only two vertex transformations to draw a triangle following the first one (which takes three). This is equal to a triangle strip, IIRC. I think that’s what’s going on in nVidia and ATI drivers since Geforce days (dunno 'bout other vendors).

0-------1-------4
|     / | \     |
|   /   |   \   |
| /     |     \ |
2-------3-------5

SeskaPeel:
Well, of course the HUD will cause a bit overdraw but how would you change it other than not drawing it?
I do agree with you about triangle strips not being a good idea. They are causing too much work for the little speedup (if there really is one). Just think about breaking your geometry into pieces because of different textures/texture coordinates…bah.

JotDot · March 24, 2004, 12:50pm

Thanks for the tips

Currently I am sending the rows in 0-1-2-2-1-3 order. Floats all around except bytes for the color. Each array is tightly packed.

The strip optimization I was reading about is here:
http://developer.nvidia.com/object/devnews005.html
Scroll down to the “Coding Tip” section. (Don’t use the links near the top of the page.)

Jens Scheddin: As you were mentioning, the 0-1-2-2-1-3 order takes two vertex transformations to draw a triangle but the method in that link states “only 1 vertex for every 2 triangles needs to be computed”.

Has anyone tried the method described in that Nvidia link? I always was curious as to whether or not that method was worth the effort.

Ysaneya: I wasn’t sure about being purely fillrate limited. Let’s say that the 800x600 mode was fillrate limited - at 1024x768 I would assume that I would only be pushing about 241 fps. I get 213 fps @1280x1024 which (to me) looks like it’s more than a fillrate issue. (I’m not claiming to be even close to being an expert in this area by the way.) It can’t be AGP bound since everything currently resides in VBOs. Conversely, if the 1280x1024 mode was fillrate limited, then I should be getting up to 355 fps @ 1024x768 or up to 580 fps in 800x600. This is what got me to think that something else must be a limiting factor - which I am hoping to identify. That’s why I was wondering if tri strips (as mentioned in that link) was the way to go.
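The estimates above follow from assuming that frame rate scales inversely with the number of pixels on screen. A small sketch of that arithmetic, using only the resolutions and frame rates quoted in this thread (the helper `predict_fps` is made up for illustration, and the purely linear model is exactly the simplification being questioned):

```python
def predict_fps(known_res, known_fps, target_res):
    """Predict fps at target_res assuming a purely fill-limited app.

    Frame time is taken to be proportional to pixel count, which
    ignores any fixed per-frame cost.
    """
    known_pixels = known_res[0] * known_res[1]
    target_pixels = target_res[0] * target_res[1]
    return known_fps * known_pixels / target_pixels


# Scaling up from the 800x600 measurement
print(round(predict_fps((800, 600), 395, (1024, 768))))

# Scaling down from the 1280x1024 measurement
print(round(predict_fps((1280, 1024), 213, (1024, 768))))
print(round(predict_fps((1280, 1024), 213, (800, 600))))
```

The measured 300 fps at 1024x768 sits between the two predictions (about 241 and 355), which is what prompts the question.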

About the HUD: It’s only a bit of text at the moment. I was mentioning the fact I was using VA for it right now. I am planning to convert the HUD system over to VBO but was planning it for a later stage once I finalized other aspects of my code. I mentioned it just in case there were any issues in mixing VA and VBO.

I now did some searches on optimizing performances and am going to try out some tests they mention. I might be worrying for nothing - I just wasn’t sure if the results were good for VBOs. If not, where could I look to improve things.

Thanks for your time and effort.

Tom_Nuydens · March 24, 2004, 10:25pm

What is it that makes people want their apps to run at 500 fps? If anything, if your app runs at hundreds of frames per second, you should be looking for ways to make it slower, not faster (i.e. give the video card more work).

You’re never going to reach your card’s advertised peak triangle rates with a benchmark that runs at 500 fps, so if that’s what you’re interested in measuring, optimizing your current code won’t help unless you feed the card a lot more than 38000 triangles. I would suggest one million as a good starting point

– Tom

JotDot · March 25, 2004, 1:30am

Tom: My response to your first question is your own answer to it - I do want to make it slower. I am planning multiple passes and hope to get reasonable support on lower end cards. I just wanted to start off making a “basic pass” that is as efficient as possible. Although this equates to a higher fps - this is not my goal. The less time I spend on each pass, the more time is left for other things. I also figured nailing this portion down now would save me much grief later on as the project evolved.

I posted here since I felt my app wasn’t completely fillrate bound (contrary to Ysaneya’s opinion). For something as simple as this test - I thought it should be. fwiw: I realised today that the new monitor I got allows me to test 1600x1200 and I got 153 fps. Comparing that to the 213 fps @1280x1024 makes me think that around the 1280x1024 mark is when I get fillrate limited - not at 800x600. I simply wanted to see if I am correct in my assumption and what the possible cause is - and see if it was worth the bother to “fix” it.
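One way to look at the four measurements reported so far is to convert them to frame times and fit a straight line: a fixed per-frame cost plus a cost per megapixel. The sketch below does that with an ordinary least-squares fit in plain Python. The model and the helper name `fit_frame_time` are assumptions for illustration, and four data points at several hundred fps are far too few to draw firm conclusions from.

```python
# ((width, height), measured fps) as quoted in this thread
measurements = [
    ((800, 600), 395),
    ((1024, 768), 300),
    ((1280, 1024), 213),
    ((1600, 1200), 153),
]


def fit_frame_time(samples):
    """Least-squares fit of frame_ms = fixed_ms + ms_per_mpixel * megapixels.

    Returns (fixed_ms, ms_per_mpixel).
    """
    xs = [w * h / 1e6 for (w, h), _ in samples]      # megapixels
    ys = [1000.0 / fps for _, fps in samples]        # milliseconds per frame
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


fixed_ms, ms_per_mpixel = fit_frame_time(measurements)
print(f"fixed cost ~{fixed_ms:.2f} ms, fill cost ~{ms_per_mpixel:.2f} ms per megapixel")
```

Under this simple model the fit suggests a resolution-independent cost on the order of one millisecond per frame, which would explain why the frame rate does not scale strictly with pixel count even if fill cost dominates at higher resolutions.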

If it means 1 million+ polies to help me write optimal code - then so be it I’ll give that a shot too. (I thought 38K was good enough. Obviously I’m wrong then - I did say I’m no expert on this.) Thanks for the tip.

Ysaneya · March 25, 2004, 4:29am

I don’t believe it’s entirely fillrate limited, but mostly, yeah. The thing is, there are different kinds of bottlenecks in an application, and usually your framerate is limited by the most important one, fillrate in your case. And at these kinds of framerates I would take any benchmark with a grain of salt.

In addition your calculations are suspicious. Although good in theory, in practice there are a lot of things to consider, like GPU/CPU parallelism, and things “behind the scenes”. You just can’t apply a linear formula and expect to guess the framerate for a given resolution. For instance imagine the VSync issue (I know it’s not your problem here, but just to show my point). If you are running at 60.0001 fps, you will see a virtual framerate of 60 fps. Let’s say you add a single extra triangle, and your framerate goes down to 59.99999 fps. Suddenly you’ll see a virtual framerate of 30 fps. But you just can’t conclude that adding one polygon costs 50% of your performance each time. It’s pretty much the same at every level in the driver.
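The VSync example can be made concrete with a few lines. This is a simplified model, assuming a 60 Hz display, plain double buffering, and a swap that always waits for the next vertical retrace; the function name `vsync_fps` is invented for this sketch.

```python
import math


def vsync_fps(raw_fps, refresh_hz=60.0):
    """Displayed frame rate when every frame must wait for a retrace.

    A frame that takes slightly longer than one refresh interval
    occupies two intervals, so the rate snaps to refresh_hz / n.
    """
    intervals_per_frame = math.ceil(refresh_hz / raw_fps)
    return refresh_hz / intervals_per_frame


print(vsync_fps(60.0001))    # just fast enough: one interval per frame
print(vsync_fps(59.99999))   # just too slow: two intervals per frame
```

The step from 60 to 30 comes from the quantization, not from the cost of the extra triangle.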

Y.

Tom_Nuydens · March 25, 2004, 4:58am

Originally posted by JotDot:
I am planning multiple passes and hope to get reasonable support on lower end cards. I just wanted to start off making a “basic pass” that is as efficient as possible.
A noble goal, but the problem with premature optimization is that you may spend time optimizing something that will turn out not to be a bottleneck at all. If the 38K triangle scene you’re using now is indicative of what you’re aiming for in the long run, you’re unlikely to become T&L-limited even when doing multiple passes.

Indeed, adding more passes may only make you even more fillrate-limited than you already are. In this case, your optimization efforts should be focused on reducing overdraw, not on improving your vertex throughput. The two require very different approaches. You can do both, of course, but chances are 50% of your time will be wasted if you do

– Tom

Jens_Scheddin · March 25, 2004, 7:22am

Originally posted by JotDot:
Jens Scheddin: As you were mentioning, the 0-1-2-2-1-3 order takes two vertex transformations to draw a triangle but the method in that link states “only 1 vertex for every 2 triangles needs to be computed”.

hehe, right. Some time has passed since I worked with strips. I gave it up because it wasn’t better than plain triangles for me, IIRC.

SeskaPeel · March 25, 2004, 7:42am

To Jens :
1/ HUD optimizations could be

Alpha test instead of alpha blend
In case of full screen HUD, split in separate quads where alpha is not 0.0 (or under alpha test threshold)

2/ There are 2 vertex caches, a pre-transform and a post-transform. If you can optimize for the pre-transform, you optimize as well for the post-transform (makes sense, right?). Tri strips are only a post-transform optimization, remapping the buffer indices is a pre-transform optimization, and it made my test app frame rate rise by a factor of 2.

To JotDot :
Again, I agree with Ysaneya, but the bottleneck problem is even more complex, as you might have multiple bottlenecks in a single frame. The first example that comes to mind is a bad CPU / GPU parallelisation.
And Tom is right, you will never optimize anything at 500+ fps. The same test app with 200K polys (or a million as Tom suggested) could do it, and you should be aware that fog, lighting, normalization options and other OGL states can drastically reduce performance in such a test. Be sure you never switched them on, or if your final app needs to use these, be sure to enable them as soon as possible.

SeskaPeel.

JotDot · March 25, 2004, 10:28am

Ysaneya: I agree with you completely - plus I was oversimplifying it. I thought at the time I shouldn’t be too technical. For example, the glClear I make for the z buffer is a “fixed overhead per frame” - I can’t take a straight linear formula. As pointed out quite nicely there are many more issues than what I just mentioned.

Tom: Yes, for a fillrate limited application I should definitely be focusing on reducing overdraw. I was just surprised that it wasn’t clearly fillrate limited (in my view) - which I was aiming for. This is the first time I am really trying hard to push the card. In the past I have explored different avenues for reducing overdraw. Heck, years ago I even wrote a portal system with a software renderer back in the good old days when my 2 meg S3 Virge was a “leading contender”. Now that was a major exercise! Fun though

Jens: Thanks for your input! I was asking about strips since I was aware it might become a pain in the butt to use them. I didn’t want to bother with that route unless people thought there were good performance benefits.

SeskaPeel: Thanks! I never really thought about the hud optimizations. Of course a write is better than a read-modify-write any day. Now thinking of it - my simple test does have yet another “flaw”: I am blending the text, which needs a r-m-w and raises the sync issues that were pointed out (thus affecting results). I probably am more fillrate limited than I initially suspected. About your second point: I never thought about the pre-transform in that fashion. Thanks, I will keep that in mind.

I really appreciate everyone’s input. It has given me lots to think about

Jens_Scheddin · March 25, 2004, 12:42pm

Originally posted by SeskaPeel:
To Jens :
1/ HUD optimizations could be

Alpha test instead of alpha blend
In case of full screen HUD, split in separate quads where alpha is not 0.0 (or under alpha test threshold)

2/ There are 2 vertex caches, a pre-transform and a post-transform. If you can optimize for the pre-transform, you optimize as well for the post-transform (makes sense, right?). Tri strips are only a post-transform optimization, remapping the buffer indices is a pre-transform optimization, and it made my test app frame rate rise by a factor of 2.

Hmm, never thought about those HUD optimizations. I’ll try it for myself. About rendering strips: well, I have to say that my geometry is probably not optimal for triangle strips (approx. 5 triangles per surface due to BSP based indoor data), so for terrain rendering this might be a different situation. I really like this board because there’s always something to learn, like the fact that there are two vertex caches.

(Besides, one interesting thing i found out today is that far cry has a OpenGL renderer, too. just set r_driver to OpenGL…)

wimmer · March 26, 2004, 3:14am

I’m not happy with the expression “I’m fill limited” or “transform limited”. Most applications usually have several bottlenecks within the same frame, so you usually don’t get “free stuff” by increasing workload for the presumably non-bottleneck stages.

Take a normal view of a tessellated terrain, for example (not from above). The triangles near you will be fill-limited, and triangles away from you will be transform-limited (except if all triangles are smaller than the “ideal triangle”).

The reason is that the post-transform caches are still too small to provide proper load-balancing in most applications…

Michael

JotDot · March 26, 2004, 5:15am

The first thing I noticed about that article I mentioned above (pertaining to tri strips) - is that the row size specified is 16. I thought that was a bit small. (But I’m no expert in this field.)
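A toy simulation gives some feel for why a narrow row size might matter. The sketch below models the post-transform cache as a plain FIFO of 16 entries and counts how many vertices have to be transformed per triangle when a grid is drawn row by row in the 0-1-2-2-1-3 order discussed earlier. Both the FIFO behaviour and the size of 16 are assumptions made for illustration, not figures taken from the article or from the hardware documentation, and the function names are hypothetical.

```python
from collections import deque


def grid_indices(cols, rows):
    """GL_TRIANGLES indices for a grid of cols x rows quads, row by row.

    Vertices are numbered row-major; each quad is emitted as
    (tl, tr, bl) and (bl, tr, br), i.e. the 0-1-2-2-1-3 pattern.
    """
    stride = cols + 1
    indices = []
    for r in range(rows):
        for c in range(cols):
            tl = r * stride + c
            tr = tl + 1
            bl = tl + stride
            br = bl + 1
            indices += [tl, tr, bl, bl, tr, br]
    return indices


def acmr(indices, cache_size=16):
    """Average cache misses (vertex transforms) per triangle, FIFO cache."""
    cache = deque(maxlen=cache_size)   # oldest entry is dropped when full
    misses = 0
    for i in indices:
        if i not in cache:
            misses += 1
            cache.append(i)
    return misses / (len(indices) // 3)


print(f"64 quads wide: {acmr(grid_indices(64, 8)):.2f} transforms per triangle")
print(f" 4 quads wide: {acmr(grid_indices(4, 8)):.2f} transforms per triangle")
```

With these assumptions the wide grid costs about one transformed vertex per triangle, because the previous row has already been pushed out of the cache by the time it is needed again, while the narrow grid drops well below that since the shared row is still cached.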

Part of what I am doing is terrain with vlod. I realized that I should try to reduce the number of batches sent. I decided to start with somewhere around 16x16 patches and experiment from there. If I decided to use strips, I was surprised when I discovered that I would have to use degenerates inside a patch if I wanted anything larger - not simply to stitch patches together in one batch.
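To illustrate the degenerates inside a patch: a single strip covering several rows of a grid has to jump from the end of one row to the start of the next, and the usual trick is to repeat two indices so that the connecting triangles have zero area. A minimal sketch, assuming row-major vertex numbering (both function names are made up for this example):

```python
def grid_strip(cols, rows):
    """One GL_TRIANGLE_STRIP index list for a grid of cols x rows quads.

    Rows are joined by repeating the last index of one row and the
    first index of the next, which produces degenerate triangles.
    """
    stride = cols + 1
    strip = []
    for r in range(rows):
        row = []
        for c in range(cols + 1):
            row.append(r * stride + c)          # vertex on the upper edge
            row.append((r + 1) * stride + c)    # vertex on the lower edge
        if strip:
            strip.append(strip[-1])   # repeat end of previous row
            strip.append(row[0])      # repeat start of this row
        strip.extend(row)
    return strip


def strip_triangles(strip):
    """Expand a strip into triangles, skipping the degenerate ones."""
    tris = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i:i + 3]
        if a == b or b == c or a == c:
            continue                   # zero-area stitch triangle
        # Odd-numbered strip triangles have their winding flipped.
        tris.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return tris


strip = grid_strip(4, 3)
print(len(strip), "indices,", len(strip_triangles(strip)), "real triangles")
```

For a 4x3 patch this gives 34 indices for 24 real triangles, against 72 indices for the same patch as plain GL_TRIANGLES. Whether the saving matters in practice is the open question in this thread.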

Yes the expressions “fill limited” and “transform limited” are a bit over simplified. The GPUs are becoming much more complex which makes it tougher to simply put any single label on one problem.

This sure is different than when I played around with software rendering. Back then, all I needed to do is examine my code, rewrite a subroutine or two, and possibly bring out the assembler

SeskaPeel · March 26, 2004, 5:41am

Once and for all about tristrips: except in some specific cases, don’t expect much from them. The real fight is about the pretransform cache, and remapping buffer indices (should I explain what it is?) does the job nearly perfectly. What’s more, it’s a generic method, and works even better when you use a heavy vertex structure (position, normal, color4, tangent, 4 distinct texture channels - diffuse, detail, lightmap, normal map - morph targets, matrix skinning index, …). Still waiting for a cache that supports multiple rendering passes.

SeskaPeel.

JotDot · March 26, 2004, 9:19am

SeskaPeel: Yes, I am starting to think along the same lines regarding the tri strips.

I tried last night to search for more info on the pretransform cache but have not had much luck so far. If you have the time, any additional info/insight/links about it would be quite useful. Right now, I am unclear as to whether or not I am doing things as efficiently as possible in regards to the pretransform cache - but I am still at the point where I could easily adapt my code if needed.

So far I understand the pre-TnL cache is “between” the video memory and the T&L unit (makes sense). I don’t have any references to the size of that cache. Is this where the GL_MAX_ELEMENTS_VERTICES / INDICES come into play? Since the cache is smaller than what you can throw at the video card, it does make sense that the vertices should be re-ordered so that we would minimize cache misses.

Other than that, this is all I got so far. Any corrections / additions / hints would be quite useful.

Thanks for your time

SeskaPeel · March 28, 2004, 11:02pm

The one and only resource :
NVTriStrip.lib implementation.
I wish you good luck,

SeskaPeel.