VBO/VP/AttribArrays Horribly slow on GFFx 5200Ultra

WTF! I turn FSAA up from 2xQ to 8x and I get 76 fps. At 2xQ I get almost 10 fps less. 2xQ is supposed to be WAY faster (with less smoothing, of course), right? I also get 76 to 80 with FSAA off. Something is REALLY weird.

EDIT: OK, for some reason that's only the case in that VBO earth app. In Quake 3 I lost about 150 fps going from 2xQ to 8x FSAA.

-SirKnight

Originally posted by Csiki:
I had the same problem with a GeForce4 Ti 4200.
It seems the new 52.16 driver does some aggressive optimization on static data, but has a lot of problems when the data changes frequently (once per frame)…
Use DYNAMIC_DRAW instead. Unfortunately that path seems to be just a plain vertex array implementation…

I NEVER change the vertex data once it's initially loaded; I only change vertex program parameters (shader constants).
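
Roughly, the pattern is this (buffer/handle names here are just made up for illustration):

// load time: upload once, flagged static, never touched again
glBindBufferARB( GL_ARRAY_BUFFER_ARB, hVtxBuf );
glBufferDataARB( GL_ARRAY_BUFFER_ARB, NumBytes, pVerts, GL_STATIC_DRAW_ARB );

// per frame: only the program constants change
glBindProgramARB( GL_VERTEX_PROGRAM_ARB, hVtxProg );
glProgramEnvParameter4fvARB( GL_VERTEX_PROGRAM_ARB, 0, MatrixRow0 );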

mtm

Originally posted by Elixer:
So you can see if it is a driver issue, get this: http://www.codesampler.com/source/ogl_vertex_buffer_objects.zip

Running it, I get around 250 fps with VBO and 30 without (vertex arrays).
(this is just by running the program, and toggling VBO on/off with F1.)

I know it isn’t quite what you had in mind, but it may shed a crumb of info for you.

I get 120 with VBO, 20 without - however, that sample isn't using ARB_vertex_program and isn't using generic vertex attributes.

I'm reasonably sure a GFFX 5200 Ultra has hardware ARB_vertex_program support, so it shouldn't be AGP transfer screwing things up here…
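
For reference, the difference is roughly this (using a hypothetical interleaved Vertex struct):

// conventional path, what the codesampler demo uses:
glEnableClientState( GL_VERTEX_ARRAY );
glVertexPointer( 3, GL_FLOAT, sizeof(Vertex), (void*) 0 );

// generic attribute path, what my code uses:
glEnableVertexAttribArrayARB( 0 );
glVertexAttribPointerARB( 0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex), (void*) 0 );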

mtm

In my own experience Radeons are beating GeForceFX using VBO with separate static arrays, running Catalyst 3.8 against 52.16.

I put all my data into VBO memory and just call draw elements with buffer offsets (roughly the idiom sketched below). The speedup over system-resident arrays is huge on the Radeons (500%), and much smaller with the GeForceFX (~100%).
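
// bind the VBOs; the 'pointer' arguments then become byte offsets into them
// (handle names are illustrative, not my actual code)
glBindBufferARB( GL_ARRAY_BUFFER_ARB, vtx_vbo );
glVertexPointer( 3, GL_FLOAT, 0, (void*) 0 );
glBindBufferARB( GL_ELEMENT_ARRAY_BUFFER_ARB, idx_vbo );
glDrawElements( GL_TRIANGLES, num_indices, GL_UNSIGNED_SHORT, (void*) 0 );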

Two figures I can find at the moment (I've a lot more, but they are on my other PC):
GeForceFX 5600 - 25.2 MTri/s
Radeon 9800 Pro - 183 MTri/s
The 5900 is ~60 MTri/s IIRC. Those figures are from high-spec 3GHz+ P4s running XP Pro.

To the best of my knowledge I'm not doing anything daft, as I've followed the spec closely and tried to account for all other variations.

VBO appears to work well on ATI, but needs fixes or optimisations on Nvidia.

OK, I just ran the program again without changing anything and it goes from 71 to 135 fps. Strange how VBO all of a sudden makes a difference when a while ago it didn't. Still, this seems like it should be faster.

-SirKnight

I ‘hacked’ in VBO-based indices and edited the original post with the new code. This did not get me any FPS back ;<

mtm

I should also add that in my 44 fps vs. 8 fps (with VBO) benchmark, everything is batched reasonably: I'd estimate 7 tri-stripped batches per frame with about 3200 indices per batch.

mtm

First: Run VTune (or another sampling profiler) on your system while the program is running, both with VBO and without it. You'll probably find a BIG spike somewhere in the VBO case. That's where you're spending all your time. Look at the code (it may need disassembling) - what is it trying to do? Packing/unpacking values? Copying data? Calculating min/max? Whatever it's doing, figure out which part of the OpenGL API would need that performed, and make it unnecessary by adjusting how you call it.

For example, if it's unpacking, say, signed bytes to floating point (and this is just a wild example), and you're passing normals in as signed bytes, then you can conclude that this is not a supported data format, and you're better off passing normals as floats.
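
In code, that change would just be (normal_offset being a hypothetical offset to wherever your normals sit in the buffer):

// possibly unsupported format - the driver may unpack it on the CPU every draw
glNormalPointer( GL_BYTE, 0, (void*) normal_offset );

// safer bet - a format the hardware can pull directly
glNormalPointer( GL_FLOAT, 0, (void*) normal_offset );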

Second issue: if you get the “VPU Recover” alerts, then it's very likely that your motherboard, memory bus, or AGP bus is not quite up to spec, and there's either a chipset bug or a signal quality problem. Raising voltages a little bit may help if it's the latter; if it's the former, get a better mobo.

Originally posted by jwatte:

Second issue: if you get the “VPU Recover” alerts, then it's very likely that your motherboard, memory bus, or AGP bus is not quite up to spec, and there's either a chipset bug or a signal quality problem. Raising voltages a little bit may help if it's the latter; if it's the former, get a better mobo.

Well, that could be; then again, I think it's driver bugs, or very picky timing in the Cat drivers, since those same programs I tried work just fine with my old GF2 card.

Just curious: what is a ‘better’ mobo? I mean, what do you consider good?

According to the small demo:
60 FPS without VBO
150 FPS with VBO

FX 5600 (52.16, slightly overclocked), Athlon XP 1800+ (now reborn as a 2000+), nForce2, AGP 8X, fast writes enabled, sideband addressing disabled

Originally posted by tweakoz:
Why is the following code so slow on an NVidia GeForce FX 5200 Ultra?

With NVidia’s help I figured it out, and will post the results here in case anyone else runs into this:

//////////////////////////////////////////////////////////
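// (KVANRM_FALSE/KVANRM_TRUE are presumably app-side aliases for GL_FALSE/GL_TRUE)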
fast: glVertexAttribPointerARB( 2, 3, GL_SHORT, KVANRM_FALSE, 32, (void*) 16 );
slow: glVertexAttribPointerARB( 2, 3, GL_SHORT, KVANRM_TRUE, 32, (void*) 16 );

slow: glVertexAttribPointerARB( 3, 1, GL_UNSIGNED_SHORT, KVANRM_FALSE, 32, (void*) 22 );
slow: glVertexAttribPointerARB( 3, 4, GL_UNSIGNED_BYTE, KVANRM_FALSE, 32, (void*) 22 ); // I2
fast: glVertexAttribPointerARB( 3, 4, GL_UNSIGNED_BYTE, KVANRM_TRUE, 32, (void*) 22 ); // I2
fast: glVertexAttribPointerARB( 3, 2, GL_UNSIGNED_BYTE, KVANRM_TRUE, 32, (void*) 22 ); // I2

So shorts and bytes are opposites as far as normalization is concerned.

Here is an excerpt from an email from NVIDIA:

> At first blush, my guess is that in the VBO case, the use of
> an attrib array as UNSIGNED_SHORT is causing us to fall back
> to non-pulling paths; we can’t do vertex pulling with all
> types of data. I looked quickly and I can see that for
> generic attribs, SHORT, FLOAT, and HALF_FLOAT are supported
> for pulling unless the attrib is normalized and then
> UNSIGNED_BYTE is added and SHORT is removed. In fact, I don’t
> think we support UNSIGNED_SHORT vertex pulling in any circumstance.
>
> When we fall back to inline methods, if the VBOs are in AGP
> memory, we read the data from there - which is very slow.
> This is likely the reason why the performance drops so much.
> As for why it speeds up in non-VBO, the data is put in system
> memory so the read to place it in the pushbuffer is fast.
>
> If the user changed the UNSIGNED_SHORT to SHORT, FLOAT, or
> HALF_FLOAT, performance should be greatly improved.
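
So the fix is a one-liner; e.g. for attrib 3 above, something like:

// was UNSIGNED_SHORT (slow inline reads from AGP); SHORT is a pullable type
// (note: any values >= 32768 would need rebiasing to fit the signed range)
glVertexAttribPointerARB( 3, 1, GL_SHORT, KVANRM_FALSE, 32, (void*) 22 );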

mtm

Originally posted by tweakoz:
[snipped - the fast/slow glVertexAttribPointerARB results quoted above]

Here is more useful info from NVIDIA:

For HALF_FLOAT usage, you should simply be able to pass in GL_HALF_FLOAT_NV
for the data type in place of GL_FLOAT. However, this will only be
accelerated on FX or better hardware - NV30 and beyond.

If you choose to use this data type, one issue to keep in mind is that you'll need to format your data as half floats - you can't simply reuse the same 32-bit float or 16-bit short data. The spec defines the
format and conversion:
format and conversion:
http://oss.sgi.com/projects/ogl-sample/registry/NV/half_float.txt
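
For anyone who doesn't want to dig through the spec, a minimal float-to-half conversion looks something like this (truncates instead of rounding, flushes denormals to zero):

unsigned short FloatToHalf( float f )
{
    union { float f; unsigned int i; } u;
    u.f = f;
    unsigned short sign = (unsigned short)( ( u.i >> 16 ) & 0x8000 );
    int e8 = (int)( ( u.i >> 23 ) & 0xff );      // 8-bit exponent, bias 127
    unsigned int mant = u.i & 0x007fffff;
    int e5 = e8 - 127 + 15;                      // rebias to half's 5-bit exponent

    if( e8 == 255 )                              // inf or NaN stays inf or NaN
        return (unsigned short)( sign | 0x7c00 | ( mant ? 0x200 : 0 ) );
    if( e5 >= 31 )                               // finite overflow clamps to inf
        return (unsigned short)( sign | 0x7c00 );
    if( e5 <= 0 )                                // too small: flush to zero
        return sign;
    return (unsigned short)( sign | ( e5 << 10 ) | ( mant >> 13 ) );
}

Then the attrib setup just swaps the type, something like (stride/offset being whatever your layout uses):

glVertexAttribPointerARB( 1, 3, GL_HALF_FLOAT_NV, GL_FALSE, stride, (void*) offset );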

mtm