2 performance questions

  1. Is it in general better (faster) to use the fixed-function pipe (if possible), or can I simply use vertex shaders for everything? I've heard Radeons implement the fixed pipe through their vertex shaders, so using your own simple vertex shader might even speed things up, no?

  2. At the moment my portal engine puts all data in one big VBO, binds it once and renders from that buffer all the time.
    Can drivers speed up rendering if I create a separate buffer per sector and therefore bind much smaller buffers, even though I'd have to switch the bound buffer frequently?

Thanks,
Jan.

  1. Dunno. If you have the cards, you can benchmark it.
    If you're not pushing insane amounts of polygons, I'd prefer vertex programs for everything. Much easier to maintain, IMO. I also seem to remember ATI reps stating that switching between fixed function and vertex programs is rather expensive.

  2. I’d imagine that the driver’s VBO memory management (in particular BufferData, BufferSubData and the locking stuff) can be made more efficient if you have multiple smaller VBOs instead of a single big one.

For #1, if you won't be doing anything custom with the vertices, you can use the position-invariant option (ARB_position_invariant; look it up)

In principle, it should tell the card to use the fixed-function circuits, and maybe you'll get a speedup.

I see a lot of people doing

DP4 oPos.x, m[0], iPos;
DP4 oPos.y, m[1], iPos;
DP4 oPos.z, m[2], iPos;
DP4 oPos.w, m[3], iPos;

for no reason.

Yeah, I use ARB_position_invariant; it's necessary for multipass stuff and it's less code to write :wink:

@zeckensack:
Can't benchmark it, I don't have a Radeon :frowning:
Yes, I asked because it's easier for me to simply use vertex programs for everything, so I was wondering whether it would be worth it to change some code to use the fixed pipe.

Hmm, I'll have to try that out with separate VBOs for each sector; hope that gives me a speedup.

Thanks,
Jan.

Originally posted by Jan2000:
@zeckensack:
Can't benchmark it, I don't have a Radeon :frowning:
Yes, I asked because it's easier for me to simply use vertex programs for everything, so I was wondering whether it would be worth it to change some code to use the fixed pipe.
Well, I kinda dropped you a hint: it won’t matter at all if you’re not transform limited. Fixed/VP switching will then be the only thing you have to worry about.

Hmm, I'll have to try that out with separate VBOs for each sector; hope that gives me a speedup.
If you’re doing what I think you’re doing (static geometry), there’s not much point in splitting it up.

Note that occasionally replacing ‘sectors’ (or nodes, cells, whatever) would make it non-static …

Pick one:

  - levels and loading screens
  - large, streaming world

The basic idea is, with purely static level geometry, you can just use one huge VBO and select sectors via indices. Switching VBOs will probably be more expensive because you’ll have to move your vertex pointers around all the time and the driver has to figure out what’s going on.
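Rebasing each sector's local indices by its first-vertex offset in the shared buffer is all the "select sectors via indices" idea amounts to. A minimal sketch in C (the function name and layout are illustrative, not from the thread):

```c
#include <stddef.h>

/* Rebase a sector's local indices so they address its vertices inside
 * one big shared vertex buffer. 'base' is the position of the sector's
 * first vertex within that shared buffer. */
void rebase_indices(unsigned short *dst, const unsigned short *local,
                    size_t count, unsigned short base)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = (unsigned short)(local[i] + base);
}
```

At draw time you'd then issue one glDrawElements (or glDrawRangeElements) per visible sector with its rebased index array, without ever rebinding the VBO.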

With dynamic geometry, things are very different. Whenever you respecify data for a VBO, the driver must make sure that it doesn’t change a memory region the GPU is currently reading from. The easy way out would be an implicit glFinish …

The trick is that you get a new memory region if you respecify data for a VBO that is (or is going to be) used, so that the GPU can keep working and your app doesn’t stall. The old mem region can be released back to the free memory pool (after finishing the draw commands that reference it cough NV_fence cough).

This is what’s getting easier with several smaller VBOs.
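The renaming trick described above can be modelled outside of GL: respecifying an in-flight buffer hands the CPU a fresh memory region while the GPU keeps reading the old one. A toy model in C (pure illustration of the idea, not actual driver code; a real driver would release the retired region behind a fence such as NV_fence):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    void *storage;   /* region the CPU writes next */
    void *retired;   /* old region the GPU may still be reading */
    int   in_flight; /* nonzero while pending draws reference 'storage' */
} Buffer;

/* Respecify the buffer. If pending draws still reference the current
 * region, park it in 'retired' and hand the CPU a brand-new region --
 * the "renaming" described above -- so neither side has to stall. */
void *buffer_respecify(Buffer *b, size_t size)
{
    if (b->in_flight) {
        free(b->retired);          /* previous retired region is done by now */
        b->retired  = b->storage;  /* GPU keeps reading the old data */
        b->storage  = malloc(size);/* CPU writes into fresh memory */
        b->in_flight = 0;
    } else {
        free(b->storage);          /* nobody is reading it: reuse in place */
        b->storage = malloc(size);
    }
    return b->storage;
}
```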

Originally posted by V-man:
I see a lot of people doing

DP4 oPos.x, m[0], iPos;
DP4 oPos.y, m[1], iPos;
DP4 oPos.z, m[2], iPos;
DP4 oPos.w, m[3], iPos;

for no reason.
DPH is sufficient (and reportedly faster on NV3x), unless you use 4D vertex positions.
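For reference, DPH takes the first operand's w as 1.0, so it matches DP4 whenever the incoming position has w = 1. Scalar stand-ins for the two instructions in C:

```c
/* scalar model of the vertex-program DP4 instruction: full 4D dot product */
float dp4(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}

/* scalar model of DPH: 3-component dot product plus b.w,
 * i.e. a.w is treated as 1.0 */
float dph(const float a[4], const float b[4])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + b[3];
}
```

With the usual (x, y, z, 1) positions the two agree; only genuine 4D positions make DP4 necessary.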

  1. Yes, the fixed function is faster on some cards. Much faster on things like the GeForce 2 and GeForce 4 MX, because vertex programs are software-emulated there. However, there is a cost to switching between fixed-function and vertex-program mode, so don’t switch frequently; one pass for everything fixed and one pass for everything shaded, per frame, would probably be best.

  2. Yes, smaller submissions are probably better in the case you describe, assuming your data is “typical”. If each chunk is, like, 50 polygons, then you probably want to batch them up into bigger chunks. It’s also important to keep material usage homogeneous; it’s best if you can use a single texture/program/material state per cell so that you don’t have to submit multiple chunks of static geometry for each cell.
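The batching advice above is easy to quantify: after sorting chunks by material, adjacent chunks with equal state collapse into one draw call. A small C sketch (the Chunk layout is made up for illustration):

```c
#include <stddef.h>

typedef struct {
    int    material;    /* texture/program/blend state id */
    size_t index_count; /* indices this chunk contributes */
} Chunk;

/* Count how many draw calls a material-sorted chunk list collapses to:
 * runs of chunks sharing a material merge into a single batch. */
size_t count_batches(const Chunk *chunks, size_t n)
{
    size_t batches = 0;
    for (size_t i = 0; i < n; ++i)
        if (i == 0 || chunks[i].material != chunks[i - 1].material)
            ++batches;
    return batches;
}
```

Six 50-polygon chunks spread over three materials become three submissions instead of six, which is the whole point of keeping cells homogeneous.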

As always, though, it comes down to what your specific geometry load is, what the capabilities of your scene graph actually are, what cards you’re targeting, where in the pipe you’re actually bottlenecking on which cards, …

Get what you’re targeting, and measure it. Improve. Repeat.

Well, in my engine one “sector” typically consists of 3000+ polys. My levels consist of only a few sectors, but those are “quite” big.
And this data is absolutely static, no moving stuff. So does it already make sense to put those sectors into a single VBO?

About vertex programs: when I had my GF2, I didn’t notice a real speed difference when I enabled VPs, but that was only for a small piece of test data. So please don’t tell me that T&L will be disabled completely when I use VPs??? That might be a problem.

Thanks,
Jan.

In such a case, I suggest one VBO per material per sector, with an eye towards reducing the number of materials per sector.

As far as I know, HT&L is fully turned off on a GeForce2 and GeForce 4 MX if you’re using ARB or NV vertex programs. This will reduce frame rate if you’re planning to use the CPU for something else (like physics/simulation) or if you’re planning to run on older machines with slower CPUs.

Get what you want to target. Measure it. Improve. Repeat :slight_smile:

What exactly do you mean by “materials”?

Originally posted by Jan2000:
What exactly do you mean by “materials”?
Combination of shaders, bound textures, blend mode and other tiny bits.
Think “state change”, that just about sums it up.
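One common way to act on that definition is to pack the state bits into a single sort key, so that sorting draw calls by the key groups equal state together. A hedged C sketch (the field widths are arbitrary assumptions, not anything from the thread):

```c
#include <stdint.h>

/* Pack the bits of a "material" -- shader, bound texture, blend mode --
 * into one integer key; drawing in key order groups equal state, which
 * minimizes state changes. Field widths here are purely illustrative:
 * 8-bit shader id, 16-bit texture id, 8-bit blend mode. */
uint32_t material_key(uint32_t shader, uint32_t texture, uint32_t blend)
{
    return (shader & 0xFFu) << 24 | (texture & 0xFFFFu) << 8 | (blend & 0xFFu);
}
```

Sorting by such a key makes “sort by textures, etc.” a single comparison per pair of draw calls.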

Ah, OK, now I get it.
Yeah, I sort by textures etc. to reduce state changes.

Thanks,
Jan.

The basic idea is, with purely static level geometry, you can just use one huge VBO and select sectors via indices. Switching VBOs will probably be more expensive because you’ll have to move your vertex pointers around all the time and the driver has to figure out what’s going on.

What happens if the vertex format for different pieces of static geometry changes? If that happens, you need to call gl*Pointer anyway, so you may as well have different VBOs per object. The cost of using different VBOs is (or should be) relatively cheap.

Also, static geometry is still static even if it’s streaming; you’re certainly not going to use a dynamic or stream vertex buffer type. As such, you’re going to need to have several VBOs around for such cases.

I find it much simpler to just have one VBO per object (if possible; sometimes it isn’t), with a gl*Pointer call for each material of the VBO.

Originally posted by Korval:
What happens if the vertex format for different pieces of static geometry changes? If that happens, you need to call gl*Pointer anyway, so you may as well have different VBOs per object. The cost of using different VBOs is (or should be) relatively cheap.
I somewhat assumed that all of this static geometry uses the same vertex layout, which appears to be true for most terrain renderers and, in general, the bulk of indoor geometry, too. Sure, this isn’t always the case, and you’re right to point it out.

Also, static geometry is still static even if it’s streaming; you’re certainly not going to use a dynamic or stream vertex buffer type. As such, you’re going to need to have several VBOs around for such cases.
I’ve mentioned that streaming in different parts of the world would make the world geometry as a whole non-static, for the purposes of the distinction I tried to make, and would require more buffers (= finer granularity).

But I’m not sure what you mean. Maybe I misunderstood you?

I find it much simpler to just have one VBO per object (if possible; sometimes it isn’t), with a gl*Pointer call for each material of the VBO.
Agreed for ‘objects’, and anyway for ‘general’ usage. But then …

For constrained indoor scenarios where the complete world, or rather its non-moving parts, fit into a reasonably sized VBO, you can save some overhead. Low-hanging fruit, so to speak.

Also, this paper indicates that pointer updates should be kept to a minimum, which is exactly what I had in mind here.

I really should add one thing, though: going with huge VBOs also means that the indices won’t be as handily representable as ushorts anymore. I should have remembered that there are certain cards that don’t like uints.
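The ushort limit is easy to check up front: a GL_UNSIGNED_SHORT index can address at most 65536 vertices. A small C helper (the GL enum values are hard-coded so the sketch stands alone):

```c
#include <stddef.h>

/* GL index-type enums, hard-coded so the sketch is self-contained */
#define GL_UNSIGNED_SHORT 0x1403
#define GL_UNSIGNED_INT   0x1405

/* Pick the smallest index type that can address 'vertex_count' vertices.
 * Cards that dislike uint indices effectively cap a single VBO at
 * 65536 vertices if you want to stay with ushorts. */
int pick_index_type(size_t vertex_count)
{
    return vertex_count <= 65536 ? GL_UNSIGNED_SHORT : GL_UNSIGNED_INT;
}
```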

Oh well

Also, this paper indicates that pointer updates should be kept to a minimum, which is exactly what I had in mind here.

In general, I find some of the points made in their paper to be… odd. Like the point about glDrawRangeElements providing “precious information for the VBO manager.” By the time you call glDrawRangeElements, any VBO management work ought to be long finished. Maybe there’s an issue with paging VBOs in and out, but you should page the whole thing in if it’s going in, not pieces of it.

And why is deleting a buffer with glDeleteBuffers a “heavyweight operation”? Is there any real need for it to be any heavier than calling glBufferDataARB with NULL?

Also, why are memory management operations being done in glPointer calls (which, for some reason, they call glVertexArray calls)? Indeed, what memory management needs to be done there? We’ve already selected the VBO to be used. If it isn’t in memory, it should have been paged in when we selected it. And, even if it is paged in with glPointer calls rather than binding, this operation is one that we understand is going to be needed. Like using a texture, we understand that we may incur a performance penalty when we request an infrequently used VBO.

Indeed, I feel that using lots of smaller buffers is better than one big one. If you can’t see your entire level at once, and you’re not drawing what you can’t see, then the memory for non-visible objects may not be in video memory; it may have been paged out to AGP or even system memory. Granted, if you have a sudden need for it, you incur a hit, but you’ll get a hit on textures too. With one giant VBO, the driver can’t page pieces of it out, so you have to keep the entire VBO in the appropriate memory. The saved memory can go to actually visible objects and textures, so you get less paging.

OK, an implementation could break up a giant VBO, but that would require knowing just how much of the VBO we are going to be using in any particular draw calls.

[This message has been edited by Korval (edited 11-27-2003).]

The NVIDIA whitepaper is a little weird. It’s talking about calling VertexArray() a lot, by which I believe it actually means VertexPointer(), for example.

Anyway, deleting a buffer is more heavy-weight than calling BufferData with NULL, because it can use re-naming of the buffer when you reallocate it with NULL, and doesn’t have to worry about whether the old copy of the buffer has finished drawing or not. When you delete the buffer, however, it probably has to synchronize up to the point where all geometry out of the buffer has been drawn. You would think that they’d use the same mechanism for both these cases, but there could be some implementation gotcha that doesn’t make that attractive.