Bandwidth optimization

I was tweaking settings to get a better frame rate, and I discovered that what was slowing everything down was vertex data bandwidth:

I ran my app at 1600×1200, 4x FSAA, 32-bit color, texture compression disabled, trilinear filtering, versus 800×600, no FSAA, 16-bit color, texture compression enabled, and bilinear filtering, and I got nearly the same frame rate (Athlon XP 2000+ / 512 MB RAM / GeForce FX 5900 Ultra 256 MB)

A major boost appeared when I disabled the tangent and binormal vertex arrays (I use VBO). I read about computing the binormal vector in the vertex program, and I’ll surely implement that soon, but I was wondering if there are other ways to improve performance when the bottleneck is bandwidth …

I used the same vertex / fragment programs whether the tangent and binormal arrays were enabled or disabled. Any suggestions?

SeskaPeel.

Provide the right usage hints for vertex_buffer_object.

Use up-to-date drivers.

Use “short” rather than “float” for data.

Use good alignment on your data types and vertex arrays.

Interleave as much as possible.

Run VTune and make sure you’re not hitting a software unpack path, where the driver would read your data back and somehow reshape it before sending it to the card.

> Use “short” rather than “float” for data.

You can’t guarantee that ‘shorts’ are an optimized format. If they aren’t, he may as well not use VBO. Granted, ATI’s performance FAQ tells us that all GLshort formats are optimized, but that only holds for 9500+ cards. Lower-end cards, or FX cards, may not optimize shorts.
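If shorts do turn out to be an accelerated path on the target card, the usual approach is to quantize positions offline and undo the scaling per draw (via the modelview matrix or a vertex program constant). A minimal sketch, with a helper name of my own invention; nothing here is from the thread:

```c
#include <stddef.h>

typedef short GLshort;

/* Quantize float positions into 16-bit integers. The mesh is rescaled
   into [-32767, 32767]; the inverse scale and bias must be re-applied
   per draw, e.g. through the modelview matrix. */
void quantize_positions(const float *in, GLshort *out, size_t count,
                        float mesh_min, float mesh_max)
{
    float center      = 0.5f * (mesh_min + mesh_max);
    float half_extent = 0.5f * (mesh_max - mesh_min);
    for (size_t i = 0; i < count; ++i) {
        /* map [mesh_min, mesh_max] onto [-1, 1], then onto the short range */
        float normalized = (in[i] - center) / half_extent;
        out[i] = (GLshort)(normalized * 32767.0f);
    }
}
```

This halves position bandwidth (6 bytes instead of 12 per vertex), at the cost of quantization error proportional to the mesh extent.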

> Use good alignment on your data types and vertex arrays.

In detail, does it mean that I should pass 4 floats per vertex instead of 3? I mean 4 floats for position, 4 for normals (how is that possible?), 4 for vertex colors, etc.?

> Interleave as much as possible.

Does that mean one big vertex buffer object for all data, or is it something I missed?

> Run VTune and make sure you’re not hitting a software unpack path, where the driver would read you data back and somehow reshape it before sending it to the card.

I’ll run VTune … even if I didn’t understand what you advised …

SeskaPeel.

The mesh has a total of 8 buffers:

  • positions
  • colors
  • normals
  • tangents
  • binormals
  • mapping channel 0
  • mapping channel 1
  • mapping channel 2

As a matter of fact, mapping channels 2 and 0 are the same, but it becomes complicated to write different vertex programs for all possible cases.
And I could remove the binormal buffer too. By testing, I found that 6 buffers were OK, even though it’s still the bottleneck. I could certainly balance things by adding more geometry and increasing resolution.

Any thoughts on this ?

SeskaPeel.

Can you interleave some of the data?
I guess, at a minimum, position/normal/binormal/tangent could be lumped into a single VBO as they should all change together (if they change at all?).

That would yield the following set of VBOs:
1) interleaved position/normal/binormal/tangent
2) color
3) mapping channel 0
4) mapping channel 1
5) mapping channel 2

As 3 and 5 are equal, you should really try to work that out and use only a single VBO. You can use the same VBO and just change the gl*Pointer call to bind it to a different attribute, without rewriting vertex shaders.
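A sketch of what that gl*Pointer reuse could look like, assuming the duplicated data feeds texture coordinate sets 0 and 2. The GL entry points below are tiny stand-in stubs so the sketch compiles on its own; in a real app they come from <GL/gl.h> plus the ARB_multitexture and ARB_vertex_buffer_object extension loaders, and the buffer id is whatever you allocated:

```c
#include <stddef.h>

/* Stand-in types and stubbed GL entry points, just so this sketch is
   self-contained. The stubs record which VBO each texture unit ends up
   sourcing from. */
typedef unsigned int GLuint;
typedef unsigned int GLenum;
typedef int          GLint;
typedef int          GLsizei;
#define GL_ARRAY_BUFFER_ARB 0x8892
#define GL_FLOAT            0x1406
#define GL_TEXTURE0_ARB     0x84C0
#define GL_TEXTURE2_ARB     0x84C2

static GLuint currently_bound;
static GLenum client_unit = GL_TEXTURE0_ARB;
static GLuint source_vbo_for_unit[8];

static void glBindBufferARB(GLenum target, GLuint buffer)
{ (void)target; currently_bound = buffer; }
static void glClientActiveTextureARB(GLenum unit)
{ client_unit = unit; }
static void glTexCoordPointer(GLint size, GLenum type, GLsizei stride, const void *offset)
{ (void)size; (void)type; (void)stride; (void)offset;
  source_vbo_for_unit[client_unit - GL_TEXTURE0_ARB] = currently_bound; }

/* Feed texture units 0 and 2 from the one shared UV buffer: both
   gl*Pointer calls happen while the same VBO is bound, so no data is
   duplicated and the vertex programs stay untouched. */
void bind_shared_uv_channels(GLuint shared_uv_vbo)
{
    glBindBufferARB(GL_ARRAY_BUFFER_ARB, shared_uv_vbo);
    glClientActiveTextureARB(GL_TEXTURE0_ARB);
    glTexCoordPointer(2, GL_FLOAT, 0, NULL);
    glClientActiveTextureARB(GL_TEXTURE2_ARB);
    glTexCoordPointer(2, GL_FLOAT, 0, NULL);
}
```

Whether the card then fetches the shared data once or twice per vertex is exactly the open question in this thread; at minimum you save the memory for one copy.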

Zeckensack :

First, great idea for the pointer; it can be done outside the vertex program and that will save me a full load of program rewriting. The problem is that it still passes the same buffer twice (this time pointing to the same memory) to the vertex program. Will that still be optimized (bandwidth-wise) if it comes from the same buffer, compared to coming from two separate buffers?

Second, I really don’t get what the deal with interleaving is … all my memory is STATIC at this point. Will there be some boost if I pass a single VBO instead of a separate one for each buffer?

Thanks,
SeskaPeel.

Originally posted by SeskaPeel:

> First, great idea for the pointer; it can be done outside the vertex program and that will save me a full load of program rewriting. The problem is that it still passes the same buffer twice (this time pointing to the same memory) to the vertex program. Will that still be optimized (bandwidth-wise) if it comes from the same buffer, compared to coming from two separate buffers?

Not sure. I’d rather not rely on it. I think I remember earlier VBO-supporting drivers could even crash on this kind of stuff.

> Second, I really don’t get what the deal with interleaving is … all my memory is STATIC at this point. Will there be some boost if I pass a single VBO instead of a separate one for each buffer?

A vertex from an all-interleaved array can be fetched in one piece, while your current setup requires fetches from seven scattered locations.

Interleaving will primarily give you long coherent bursts from memory. DDR memory has this funny behaviour where bandwidth efficiency drops dramatically unless you use long bursts.

There may also be an upper limit on concurrent transactions the memory controller can handle, so you really make life easier for the GPU if you aggressively interleave data.

If you’re not familiar with setting up interleaved arrays, just ask.

Another thing regarding interleaving:
Even if your data is completely static, there are cases where interleaving is bad. E.g., taking your current setup: if you frequently render using only position and color, the remaining vertex attributes are useless baggage. If that’s the case, you shouldn’t lump position and color together with the other stuff.
This is just a general suggestion for vertex-transfer-limited scenarios; it of course depends on the workload and should be benchmarked.

zeckensack :
I’m not familiar with that, so consider this a question.

I use those buffers for several purposes: for geometry rendering, and for silhouette rendering too (silhouette strokes for NPR, and stencil shadow volumes).
I’ll soon use a z-only first pass, and maybe I’ll need an additional pass for shader effects.

At this point, the workload on silhouette strokes is negligible, and on shadow volumes too. The scene is “low poly” for the video card the app is running on (< 100k polys), so I don’t think using interleaved arrays where I should use non-interleaved ones could be slowing anything down.

Any hints for interleaving?

SeskaPeel.

Okay,
I’ll do this with system memory arrays now, as the difference should be more apparent.

Setup code for non-interleaved arrays probably looks something like this:

//allocate the arrays
vec3* position=(vec3*)malloc(vertex_count*sizeof(vec3));
vec3* normal=(vec3*)malloc(vertex_count*sizeof(vec3));
<…>

//fill the arrays
for (int i=0;i<vertex_count;++i)
{
position[i]=<…>;
normal[i]=<…>;
<…>
}

//prepare the arrays for rendering
glVertexPointer(3,GL_FLOAT,sizeof(vec3) /* * */,position);
glNormalPointer(GL_FLOAT,sizeof(vec3) /* * */,normal);
<…>
glEnableClientState(<…> );

//render
glDrawElements(<…> );

* The stride could be zero, too, because zero tells the GL the data is tightly packed. This isn’t possible for interleaved arrays, as we’ll see shortly.

Now we’ll turn that into interleaved arrays:

//we need a packed vertex structure
struct
Vertex
{
vec3 position;
vec3 normal;
<…>
};

//allocate the single, interleaved array
Vertex* vertex=(Vertex*)malloc(vertex_count*sizeof(Vertex));

//fill the array
for (int i=0;i<vertex_count;++i)
{
vertex[i].position=<…>;
vertex[i].normal=<…>;
<…>
}

//prepare the array for rendering
glVertexPointer(3,GL_FLOAT,sizeof(Vertex),&vertex[0].position);
glNormalPointer(GL_FLOAT,sizeof(Vertex),&vertex[0].normal);
<…>
glEnableClientState(<…> );

//render (unchanged)
glDrawElements(<…> );

Setup code for interleaved data from a VBO is slightly trickier. Worth another post, I guess

Awww, sh*t, no color tags for me
I hope you still get the meaning of the supposedly red star.

Yes, I got the meaning.
Eagerly waiting for the VBO code; the plain vertex array version was not a problem.

SeskaPeel.

The first part is unchanged from the system memory interleaved version.

struct
Vertex
{
   vec3 position;
   vec3 normal;
   <...>
};

//allocate the single, interleaved array
Vertex* vertex=(Vertex*)malloc(vertex_count*sizeof(Vertex));

//fill the array
for (int i=0;i<vertex_count;++i)
{
vertex[i].position=<…>;
vertex[i].normal=<…>;
<…>
}

//allocate a vbo and copy the data into it
GLuint vbo=0;
glGenBuffersARB(1,&vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB,vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB,vertex_count*sizeof(Vertex),vertex,GL_STATIC_DRAW_ARB);

//we can discard the system memory copy now
free(vertex);

//prepare the array for rendering
glBindBufferARB(GL_ARRAY_BUFFER_ARB,vbo);
glVertexPointer(3,GL_FLOAT,sizeof(Vertex),BUFFER_OFFSET(offsetof(Vertex,position)));
glNormalPointer(GL_FLOAT,sizeof(Vertex),BUFFER_OFFSET(offsetof(Vertex,normal)));
<…>
glEnableClientState(<…> );

//render (again, unchanged)
glDrawElements(<…> );

offsetof(Vertex,position) evaluates to zero, and offsetof(Vertex,normal) to sizeof(vec3). You could hardcode those values, but it’s safer this way: you won’t have to change this once you start twiddling with your vertex layout. (offsetof comes from <stddef.h>.)

The BUFFER_OFFSET macro is given in the ARB_vbo spec, in case you don’t have it defined.
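For reference, here is that macro as the spec gives it:

```c
#include <stddef.h>

/* From the ARB_vertex_buffer_object specification: turns a byte offset
   into the pointer-typed argument that the gl*Pointer calls expect when
   a buffer object is bound. */
#define BUFFER_OFFSET(i) ((char *)NULL + (i))
```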

You could also ditch the system memory copy by glMapBufferARB-ing the VBO and writing the data directly into the buffer. Maps, however, can fail according to the specification, while glBufferDataARB is guaranteed to work.
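A sketch of that glMapBufferARB variant. The entry points below are stand-in stubs (the stub “maps” plain heap memory) so the control flow can be exercised without a GL context; in a real app they come from the ARB_vertex_buffer_object extension loader. Note the NULL check on the map result, which is exactly why glBufferDataARB is the safer default:

```c
#include <stdlib.h>
#include <string.h>
#include <stddef.h>

/* Stand-in types and stubs; real apps get these from the extension. */
typedef unsigned int  GLenum;
typedef unsigned char GLboolean;
typedef ptrdiff_t     GLsizeiptrARB;
#define GL_ARRAY_BUFFER_ARB 0x8892
#define GL_WRITE_ONLY_ARB   0x88B9
#define GL_STATIC_DRAW_ARB  0x88E4

static void *mapped_storage;
static void glBufferDataARB(GLenum t, GLsizeiptrARB size, const void *d, GLenum usage)
{ (void)t; (void)d; (void)usage; free(mapped_storage); mapped_storage = malloc(size); }
static void *glMapBufferARB(GLenum t, GLenum access)
{ (void)t; (void)access; return mapped_storage; }
static GLboolean glUnmapBufferARB(GLenum t) { (void)t; return 1; }

/* Fill a VBO without keeping a separate system memory copy: allocate
   uninitialized storage, map it, write the vertices straight in, unmap.
   (The memcpy stands in for generating vertices directly into dst.) */
int fill_vbo_directly(const void *src, size_t bytes)
{
    glBufferDataARB(GL_ARRAY_BUFFER_ARB, bytes, NULL, GL_STATIC_DRAW_ARB);
    void *dst = glMapBufferARB(GL_ARRAY_BUFFER_ARB, GL_WRITE_ONLY_ARB);
    if (!dst)
        return 0;  /* map failed: fall back to the glBufferDataARB path */
    memcpy(dst, src, bytes);
    return glUnmapBufferARB(GL_ARRAY_BUFFER_ARB) ? 1 : 0;
}
```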

edited for UBB markup …

[This message has been edited by zeckensack (edited 09-21-2003).]

I’ll implement this tomorrow morning (it’s Sunday 9:30 pm in France, time to go eat and calm my wife down about me working on a Sunday), and I’ll let you know if it improves rendering. Thanks for the sample code.

SeskaPeel.

“Use good alignment on your data types and vertex arrays.”

What does this mean ?

SeskaPeel.

It means try keeping the arrays aligned to a natural boundary, like 32 bits.

Say you have an array of colors (8 bits/channel, RGB, unsigned byte). That only aligns to 24 bits, so to keep the hardware happy you could just add in the alpha component (even if it isn’t used) to align it to 32 bits. That may or may not help performance; some drivers will do this for you anyway.
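A minimal sketch of that padding, with illustrative struct names of my own:

```c
#include <stddef.h>

/* 3 unsigned bytes: 24 bits, so consecutive array elements start at
   awkward, non-word offsets. */
struct ColorRGB  { unsigned char r, g, b; };

/* Padding with an (unused) alpha byte rounds each color up to a 32-bit
   word, which is friendlier to the hardware. */
struct ColorRGBA { unsigned char r, g, b, a; };
```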

Here’s my understanding of good alignment:

The start of the array should be aligned on 128 bytes. (Pentium 4 L2 cache line size)

Each data component should be naturally aligned (floats on 4, shorts on 2, etc)

If possible, each vertex should be a power of 2 in size, or nicely divide into powers of two.

Note that NVIDIA documentation for GeForce 2 claims that shorts and floats are accelerated for geometric data, and ubytes for colors. I believe modern cards accelerate at least as much. Smaller data is better.

Some cards have limits on how many buffers they can efficiently use. For some popular cards, that number is 2. For some cheap, off-brand cards, that number is 1. Thus, interleave like crazy.

5-6 different buffers isn’t going to cut it: each vertex read may then require 6 separate DRAM set-up latencies. If you interleave everything into a single buffer, you only pay the DRAM line-fetch penalty once per vertex. On-card caching may reduce, but not remove, this penalty.

What is the effect of interleaving the vertex attributes on the CPU?

Say I have to run an algorithm on a certain set of objects on the CPU, and then use a VBO with a DYNAMIC flag.

According to one old Intel document, it is better NOT to interleave. I guess it was written back in the P2/P3 era.

What a pain in the **** this is.

I use :

vertex = 3 floats
normal = 3 floats
tex0 = 2 or 3 floats or not present
tex1 = 2 or 3 floats or not present
tex2 = 2 or 3 floats or not present
tex3 = 2 or 3 floats or not present
color = 4 floats or not present
tangent = 3 floats or not present
binormal = 3 floats or not present
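Assuming the “everything present” case of that layout with 2-component texture coordinates (my assumption; the list says 2 or 3), a fully interleaved vertex could look like this sketch. Every field is made of 4-byte floats, so all the attribute offsets come out naturally aligned with no padding:

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float x, y, z; }    vec3;
typedef struct { float u, v; }       vec2;
typedef struct { float r, g, b, a; } col4;

/* One fully interleaved vertex: 3+3+2+2+2+2+4+3+3 = 24 floats = 96 bytes.
   Each attribute offset (the value to feed offsetof/BUFFER_OFFSET) is a
   multiple of 4, so every float is naturally aligned. */
typedef struct {
    vec3 position;   /* offset  0 */
    vec3 normal;     /* offset 12 */
    vec2 tex0;       /* offset 24 */
    vec2 tex1;       /* offset 32 */
    vec2 tex2;       /* offset 40 */
    vec2 tex3;       /* offset 48 */
    col4 color;      /* offset 56 */
    vec3 tangent;    /* offset 72 */
    vec3 binormal;   /* offset 84 */
} Vertex;            /* sizeof == 96 */
```

For the variants where some attributes are absent, you would define a separate struct per layout (or drop to a smaller stride); the stride passed to the gl*Pointer calls is always sizeof the struct in use.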

> The start of the array should be aligned on 128 bytes. (Pentium 4 L2 cache line size)

That is meaningless. The whole point of (static) VBOs is to get the CPU out of the equation. As such, if the line size of the CPU cache matters in terms of performance, then the CPU is part of the equation, and the implementation or user has already failed in terms of performance.

For dynamic VBOs, this might mean something.

> Some cards have limits on how many buffers they can efficiently use. For some popular cards, that number is 2.

I’ve never heard anything about this before. There is a good reason to interleave based on linearity of data (and the fact that DRAM likes linear reads), but I’d never heard anything about a numeric limit on the number of VBOs to use.