Strange NV_vertex_array_range bug!

Is it necessary to align vertex data in VAR memory to 32 byte boundaries? I ask because I have encountered a very strange behavior in VAR, which I cannot trace to any bug in my code, so I assume there might be a bug in the VAR implementation. I use VAR in the following way:

allocate large chunk of VAR (16 MB)
for every frame:
    for all objects (all static):
        copy vertex data to VAR memory (memcpy)
        draw object

The objects are simply copied consecutively, i.e., if object 1 goes to address a1 in VAR and has s1 bytes, then object 2 will go to address a2=a1+s1, until there is a wraparound.
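The placement scheme just described can be sketched as a small ring allocator. This is only a minimal sketch under assumptions, not the code from the original program: the names VarRing, var_ring_init and var_ring_alloc are hypothetical, objects are addressed by byte offset into the VAR block, and the per-object alignment is a parameter so that a packed layout and a padded layout can be compared.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one large VAR block, addressed by byte offset. */
typedef struct {
    size_t capacity;  /* total size of the VAR block in bytes */
    size_t head;      /* offset at which the next object will be placed */
    size_t align;     /* per-object alignment in bytes, a power of two */
} VarRing;

static void var_ring_init(VarRing *r, size_t capacity, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    r->capacity = capacity;
    r->head = 0;
    r->align = align;
}

/* Returns the offset for an object of `size` bytes and advances the head.
 * Sets *wrapped to 1 when placement restarted at offset 0; at that point the
 * caller has to make sure the GPU is done reading the old data. */
static size_t var_ring_alloc(VarRing *r, size_t size, int *wrapped)
{
    /* Round the footprint up to the next multiple of the alignment. */
    size_t padded = (size + r->align - 1) & ~(r->align - 1);
    size_t offset;

    assert(padded <= r->capacity);
    *wrapped = 0;
    if (r->head + padded > r->capacity) {
        r->head = 0;
        *wrapped = 1;
    }
    offset = r->head;
    r->head += padded;
    return offset;
}
```

With an alignment of 4, objects whose sizes are multiples of 4 are packed back to back exactly as described; a larger alignment rounds each object's footprint up to the next multiple.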

Now here is the problem:

If I don’t align vertex data on a 32 byte address, some triangles in the scene will just go wild. I can reproduce that with a very simple case, just drawing a rectangle in one triangle strip with 4 vertices, no colors/normals/texcoords, and 3 floats for each vertex (which gives 48 bytes for the object). I have made screenshots of the problem (here is the intended image, this is what it looks like every second frame). Padding the object to 64 bytes fixes the problem. If I don’t use VAR, or if I step through the program with the debugger, or do something else between copying and drawing (like writing some debug output to the console), the problem goes away.

This is strange because the VAR-spec says that there are no alignment requirements on the NV20 (where the problem appears), only “<pointer>” needs to be 32 byte aligned, and <pointer> I assume to be the pointer to the whole VAR-memory allocated with wglAllocateMemoryNV.

I have no problem with aligning my data to 32 bytes, but this has cost me quite some time debugging, and (assuming it’s not a bug in my code) such a behavior should be documented.

I have one suspicion:

Could the problem lie with some caching issue? If the last chunk of vertex data does not fill a 32 byte cache line, it is maybe not yet written to memory when the GPU already tries to access it (because I immediately draw the object after copying). This would still be quite a strange, non-sequential effect, but it would at least explain this strange behavior…

Any suggestions are welcome,



[This message has been edited by wimmer (edited 03-09-2002).]


Do the 32-byte cache lines refer to the GeForce3? I’ll add some processor info, in case you didn’t already know (as you didn’t mention your CPU):
Pentium II/III - 32bytes per cache line
Pentium IV - 128 bpcl
Athlon/XP/Duron - 64 bpcl

So, if you run a K7 or P4 core, you can already discard your theory. If not, I’ll shut up now.

Yes… every ‘partition’ that you make in the memory returned by AllocateMemoryNV has to be aligned. I really don’t know about NV20, but I’ve tested it with GF2 and it’s the right way to do it.

Anyway, aligning the pointers to 32 bits doesn’t waste a lot of memory, only a few bytes, so I think it doesn’t matter at all.

_> Royconejo.

I’ve read your post again and it seems that there is a misunderstanding…

The memory (pointer) returned by AllocateMemoryNV is always aligned to 32 bits and it’ll work fine as is. But if you intend to partition it, you have to follow the same alignment for every pointer taken from that buffer.

I mean… do something like this:

GLubyte* MyBigVARBuffer;   // start of the block returned by AllocateMemoryNV
GLuint MemUsed = 0;        // bytes handed out so far (always a multiple of 4)

void InitVAR (GLuint bytes, GLfloat usage) {
    MyBigVARBuffer = (GLubyte*)AllocateMemoryNV (bytes, 0, 0, usage);
}

GLfloat* GetVARMem (GLuint bytes) {
    GLfloat* mem = (GLfloat*)&MyBigVARBuffer[MemUsed];

    // this keeps every pointer aligned by rounding the memory used up so that
    // it is always 32 bit aligned (every time this function is called, it
    // returns an aligned pointer)
    GLuint mem_align = bytes % 4;
    if (mem_align > 0) bytes += 4-mem_align;
    MemUsed += bytes;

    return mem; // guess what? it’s aligned
}

_> Royconejo.

[This message has been edited by royconejo (edited 03-09-2002).]

The extension spec states (for NV20):

For all enabled arrays, all of the following must be true:

  • the pointer must be 4-byte aligned
  • the stride must be less than 256
  • the stride must be a multiple of 4
  • the type must be FLOAT, SHORT, or UNSIGNED_BYTE

For NV-10’s, it gives a list of byte alignment for different types of pointers, but none of them are 32-byte aligned. Now, calls to glVertexArrayRangeNV have to use 32-byte aligned pointers.
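To make the quoted rules concrete, here is a minimal sketch that checks them for one array. It is an illustration only and not part of any driver or GL API: the function names var_array_ok and var_range_pointer_ok are hypothetical, and the TYPE_* constants are local stand-ins that mirror the values of GL_UNSIGNED_BYTE, GL_SHORT and GL_FLOAT so the code compiles without GL headers.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the GL type enums (assumed to mirror the GL values). */
enum { TYPE_UNSIGNED_BYTE = 0x1401, TYPE_SHORT = 0x1402, TYPE_FLOAT = 0x1406 };

/* Per-array rules as quoted above: 4-byte aligned pointer, stride below 256,
 * stride a multiple of 4, type FLOAT, SHORT or UNSIGNED_BYTE.
 * Returns 1 when all rules hold, 0 otherwise. */
static int var_array_ok(const void *pointer, int stride, int type)
{
    if (((uintptr_t)pointer & 3u) != 0) return 0;
    if (stride < 0 || stride >= 256) return 0;
    if (stride % 4 != 0) return 0;
    return type == TYPE_FLOAT || type == TYPE_SHORT || type == TYPE_UNSIGNED_BYTE;
}

/* The range pointer handed to glVertexArrayRangeNV has to be 32-byte aligned. */
static int var_range_pointer_ok(const void *pointer)
{
    return ((uintptr_t)pointer & 31u) == 0;
}
```

Under these rules, an object that starts 48 bytes into a 32-byte aligned range passes the per-array check even though it is not itself 32-byte aligned, which is why the packed layout looks legal on paper.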

The cache lines on the P-IV are 128 bytes for L2, but 64 bytes for L1. Of course, the cache line size doesn’t matter for AGP memory, which is un-cached; instead you want the size of the write combiners/line fetch buffers to avoid partial eviction stalls/read-backs. Think of them as L1-cacheline-sized.

As the spec says: 4-byte alignment, minimum. If you’re doing copy/draw/copy/draw, then you probably want to align on a LFB to minimize your partial stalls (and pad out with 0 so you overwrite the entire LFB after each block).
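A rough sketch of the "pad out with 0" idea follows. It assumes a 64-byte buffer size (the constant LFB_SIZE is a placeholder, pick whatever matches your CPU), the helper name copy_padded is made up for illustration, and the destination is assumed to already sit on an LFB_SIZE boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed size of one write-combine / line-fill buffer; a placeholder value. */
#define LFB_SIZE 64u

/* Copies `size` bytes to `dst` and zero-fills the rest of the last LFB_SIZE
 * block, so every block that is touched gets written completely.
 * Returns the number of bytes written (a multiple of LFB_SIZE). */
static size_t copy_padded(unsigned char *dst, const void *src, size_t size)
{
    size_t padded = (size + LFB_SIZE - 1) & ~(size_t)(LFB_SIZE - 1);

    assert(dst != NULL && src != NULL);
    memcpy(dst, src, size);
    memset(dst + size, 0, padded - size);
    return padded;
}
```

For the 48-byte quad from the first post this writes 64 bytes, and the next object would start at the following 64-byte boundary.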

This has nothing to do with write combining; we flush the write combine buffers ourselves.

- Matt

Matt, how can this behavior then be explained? Any ideas?

Btw, the problem appears on a GeForce3 Ti500 on Athlon XP (VIA) as well as MP (760MP). On a GeForce 2 + Celeron the problem does not appear.

royconejo: what you mean is having the pointers 32 bit aligned. I do that anyway (I even have them 16 byte = 128 bit aligned), however the spec states that even that would not be necessary on an NV20 (there are no alignment restrictions for NV20)! However, I need to keep pointers 32 byte aligned for it to work! Which is why I think it’s a bug.

I have the impression that the reason has got something to do with the fact that I immediately render the vertices as soon as I have copied them…


[This message has been edited by wimmer (edited 03-10-2002).]

And what about a GPU/CPU synchronization problem? Are you using NV_fence?


Yes, but NV_fence is not relevant here. As I said, I allocate a 16MB buffer, and the problem appears long before this buffer gets filled up.


Looking at the screenshots, it sort of looks like the w-coords of some of your vertices are 0. But I’m just guessing.

Well, I don’t assign w-coordinates, I use only 3 coordinates in the vertex array.

Hmmm… it’s not cache, it’s not write combining, so where does it come from? Nobody ever experienced a similar problem?


I had a similar problem on a Radeon 8500 on Win98 with the VAO extension. I finally found out that it was a problem in the driver. I installed a previous driver and it worked ok, so maybe you can try that too.

Apart from that, I would say it’s a synchronization problem, but if you guarantee you never write to a zone of memory that the video card is rendering from, then I have no idea.


Yes, I definitely guarantee that I never write to memory the GPU is rendering from, at least not in a way I could influence in any way. Looking at the memory in the debugger, the vertices are all correct.

So I guess you are right on both counts, it must be a synchronization problem in the driver.

Matt, cass?