A little help with VAR....


I’ve been playing with VAR to see what the speed-ups are like. Unfortunately, I can’t seem to get it working properly.

My test case is this:

1024 particles, 2 triangles each, drawn by glDrawElements(GL_TRIANGLES…). Textured and coloured, no lighting or normals. Particles change their position every frame, and so the vertex array is refilled, sequentially.

If I allocate the vertex array with new[…], I get about 250fps. If I allocate the vertex array with
wglAllocateMemoryNV(…,0.2,0.2,0.5) then the frame rate drops to about 120fps. The vertex array is never read from. The texture and colour arrays are in normal memory. Vertex types are 3 * float per vertex, 4 floats for colour and 2 floats for texture.

Furthermore, if I call glVertexArrayRangeNV(…) with the correct range, and then enable GL_VERTEX_ARRAY_RANGE_NV, everything goes a bit wrong - textures are messed up, not bearing any real relation to the originals.

Anybody got any idea what I’m doing wrong?



Try putting your vertex location, texture coordinates, and color all in the same block of memory allocated by AllocateMemory. You should separate the parts that change per frame from the parts that don’t, and you should update the part that does change in sequential (linear) order.

It may be that VertexArrayRange doesn’t work well when the range you specify spans multiple kinds of memory (say, memory you got from malloc and memory you got from AllocateMemory). Also, it may be that the VAR extension doesn’t like it if you try to pull some data from the range, and some other data from outside the range.

You should definitely place all your vertex information (color, texcoords, vertex positions, etc.) in the same memory area - VAR memory. If you place the data in different places, the GeForce has to go collect it for every particle, which is slow.
Indices should only be placed in ordinary sysmem - allocated by new or malloc().

Also, remember to allocate only one buffer, and don’t call glVertexArrayRangeNV more than once - only at startup.

So you are saying that interleaved arrays are much faster?

Whether the arrays are interleaved is almost irrelevant.

In many cases, it is easier/faster to use non-interleaved arrays. For example, if you want static texture coordinates but dynamic vertices and normals, by separating out the static and dynamic data, you can use sequential writes (necessary for AGP performance).

  • Matt

Please explain sequential writes.

On x86 platforms, memory is either cached or uncached, i.e., you can mark a page as capable of being stored in the CPU’s cache or not.

Almost all memory is cached.

PCI uses cached memory for all its transactions. Now, consider what happens if a PCI device wants to DMA data from cached memory. It’s possible that the data in memory is out of date and that the “real” copy of the data is actually sitting in the CPU’s cache.

Likewise, if a PCI device writes data to memory, the CPU needs to mark the cache lines containing that data as invalid.

These problems are rather irritating for high-speed bus protocols, so AGP uses uncached memory instead.

Reading from uncached memory with the CPU is, of course, very slow. Every time you read from it, you get a cache miss and a bus transaction.

Writes to cached memory would be slow, except that CPU designers put in a trick called write combining.

If your writes are aligned and sequential, the CPU will buffer up writes until it has a chunk large enough to do an efficient bus transaction.

So if you write 4-byte chunks to hex addresses 1000, 1004, 1008, 100C, 1010, etc., you will get efficient writes – the CPU will batch up a bunch of the writes, usually around 64 bytes of data.

If you write to 1002, 1006, 100A, 100E, …, write combining falls apart – it’s not aligned.

Likewise, writing to 1000, 1008, 1010, etc. looks like it is less work, but it screws up write combining, so it is slower than writing to every location.

Write combining also works with 8-byte MMX writes. In my experience it also works with 2-byte writes (i.e. mov [edi],ax) on my P3 with a BX chipset, but 2-byte writes are generally a bad thing.

Unfortunately, the Pentium 4 has a much worse write-combining unit than the P3. Specifically, any cache miss will flush the write combiner. This can be very problematic if you’re not extremely careful.

  • Matt

Okay, so how exactly do sequential writes fit into what we’re talking about with vertex arrays and such? I’ve heard it said that you should use vertices that have an x, y, z and a w. And I’ve heard it said that the reason for this is byte alignment. So is it considered a write if, in C++, I write something like:
vert.x = val;

So what considerations do I need to take into account to ensure I’m adhering to these sequential-aligned restrictions?

No, having a W that is always 1 is simply wasteful.

What it means is that if you write out interleaved data, for example, you should be doing something like:

for (i = 0; i < n; i++) {
    buf[0] = vertices[i].x;
    buf[1] = vertices[i].y;
    buf[2] = vertices[i].z;
    buf[3] = normals[i].x;
    buf[4] = normals[i].y;
    buf[5] = normals[i].z;
    buf[6] = texcoords[i].s;
    buf[7] = texcoords[i].t;
    buf += 8;
}
That would be an example of reading from several arrays and writing using sequential writes.

This is an example of something that would be bad:

for (i = 0; i < n; i++) {
    buf[0] = vertices[i].x;
    buf[1] = vertices[i].y;
    buf[2] = vertices[i].z;
    buf += 4;
}

…because it skips 4 bytes each vertex.

  • Matt

When you call wglAllocateMemoryNV, try passing the following args:


check out my little sample at

maybe that will be of some help
or see the NVIDIA sample Cass wrote
(look for VERTEX_ARRAY_RANGE on developer

Thanks for all the replies. I fixed it eventually. The problem was that I wasn’t allocating AGP memory for the vertex coords, texture coords and RGB colours, and that I wasn’t calling glVertexArrayRange over the whole block, only over the vertex coords.

Doesn’t seem to be an awful lot faster at the moment, but I think I’m upsetting the card somewhere else - drawing 2048 particles at two triangles each, textured, coloured and not lit is really pretty slow (~30fps, Athlon 900, GeForce2 Ultra).

Actually it seems to me something might be more fundamentally broken, since just clearing the screen in an 800x800 window yields about 64fps (this is with GLUT). VSync is definitely off, 6.50 drivers. A more substantial engine, rather than this testbed thing, draws 12000 triangles at 120fps@1024x768, without using GLUT, and that’s only 1.4MTris/s, a far cry from the card’s potential performance.

Bit of optimisation to do, by the look of things.



Another point about write combiners (Matt, please correct me if I’m wrong): there are multiple write-combine buffers, so if you write sequentially to multiple arrays (non-interleaved arrays), you should still get write combining.

Thanks -

I think the P4 only has one WC buffer. (?) The P3 has 4, I think.

  • Matt

Just wanted to check something Matt: when you said
“Writes to cached memory would be slow” - should that be uncached memory? (I think so, otherwise why use write combiners - surely the memory writes then depend on the caching mechanism.)



Matt, from the software side of the application, aren’t interleaved arrays still faster due to efficient cache-line usage? There are a limited number of active cache lines. Is a write combiner doing a straightforward cache-line write-back?

My Intel architecture knowledge is a little out of date; I stopped asm programming on the PI.


No, what is faster all depends on your data structures.

“Cache lines” do not apply to write-combined memory, since write-combined memory is by definition uncached.

  • Matt