I am implementing a vertex array manager, and I want it to take full advantage of VAR. I also want it to handle dynamic data, so it has to upload data into AGP memory every frame.
BUT: nVidia's VAR description says “write sequentially to AGP memory”. What exactly does this mean? Should I split the data into smaller chunks instead of uploading everything in one go? Or exactly the opposite? And if I should use chunks, what size should they be? I think the paper could be a bit more specific about this.
However, I still have you guys, so maybe you can explain to me how best to use VAR with dynamic data.
It just means write byte 0, then byte 1, then byte 2, then …
Also, always write a multiple of 32 bytes. If you need to write 24 bytes, then write an additional 8 zeros. The AGP line size is 32 bytes; if you write all 32, it writes the full 32 bytes at once, but if you write less than that, it has to do multiple partial-writes. This means that writing 24 bytes takes about 3 (or more) times as long as writing 32 bytes, even if you don’t care about the last 8.
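To make the padding rule concrete, here is a minimal sketch in C. The helper names `padded_size` and `write_padded` are made up for illustration, the 32-byte line size is simply the figure quoted above, and `dst` stands in for a pointer into AGP memory that you obtained elsewhere.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_SIZE 32u  /* line size quoted in the post; an assumption, adjust per platform */

/* Round n up to the next multiple of LINE_SIZE (which must be a power of two). */
size_t padded_size(size_t n)
{
    return (n + LINE_SIZE - 1u) & ~(size_t)(LINE_SIZE - 1u);
}

/* Write n bytes from src to dst in strictly ascending address order,
   then zero-fill up to the next line boundary.
   dst must have room for padded_size(n) bytes. Returns the bytes written. */
size_t write_padded(unsigned char *dst, const void *src, size_t n)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t total = padded_size(n);
    size_t i;

    assert(dst != NULL && src != NULL);

    for (i = 0; i < n; ++i)      /* byte 0, then byte 1, then byte 2, ... */
        dst[i] = s[i];
    for (; i < total; ++i)       /* pad with zeros so the last line is complete */
        dst[i] = 0;

    return total;
}
```

With these definitions, a 24-byte payload ends up as one full 32-byte write: `write_padded(dst, vertex, 24)` returns 32. The byte loop is only there to show the ordering; a real copy would move wider units.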
The specific line size you need to worry about is the line fetch buffer/write combiner size between your CPU and main RAM, because AGP memory is not cached.
On Pentium III, this is 32 bytes. On Athlon, this is 64 bytes. On Pentium IV, I believe this is 64 bytes, even though L2 cache line size is 128 bytes.
The important pieces to remember are to always align the base of your buffer to the start of a line (i.e. (your_pointer & 63) == 0; note the parentheses, since == binds tighter than & in C), and to always write an entire line (all 64 bytes at a time). Don’t skip writing any bytes, even if they are “padding” or you don’t need to update the value of one piece within the line.
The reason for this is that if you write to parts of a line, but not all of it, then when the time comes to flush it out to RAM, the write controller has to read back in what was there before, compose it with the pieces that you wrote, and then flush it all back out. This is slow. Flushing out a line in which you overwrote ALL the bytes is fast, because nothing has to be read back first.
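A rough sketch of those two rules in C follows. The names `WC_LINE`, `is_line_aligned`, `align_up_to_line` and `write_full_lines` are hypothetical, and 64 is just the Athlon/P4 figure from above. The idea is to compose each line completely in ordinary cached memory (the "staging" buffer) and then overwrite whole lines in the destination, so no line is ever left partially written.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WC_LINE 64u  /* assumed write-combine line size; 32 on Pentium III per the post */

/* Nonzero if p sits on a line boundary. */
int is_line_aligned(const void *p)
{
    return ((uintptr_t)p & (WC_LINE - 1u)) == 0;
}

/* Advance p to the next line boundary (returns p unchanged if already aligned). */
unsigned char *align_up_to_line(unsigned char *p)
{
    size_t misalign = (size_t)((uintptr_t)p & (WC_LINE - 1u));
    return misalign ? p + (WC_LINE - misalign) : p;
}

/* Overwrite `lines` complete lines at dst from a fully composed staging buffer.
   Every byte of every line is written, padding included. */
void write_full_lines(void *dst, const void *staging, size_t lines)
{
    assert(is_line_aligned(dst));
    memcpy(dst, staging, lines * WC_LINE);
}
```

In use, you would align the start of the range once with `align_up_to_line`, size each vertex batch to a whole number of lines, and call `write_full_lines` per batch.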
>>>On Pentium III, this is 32 bytes. On Athlon, this is 64 bytes. On Pentium IV, I believe this is 64 bytes, even though L2 cache line size is 128 bytes.<<<
Yes, P4 is 64 bytes. Also, a new processor in the future may mean you have to update your configuration file.
Previously, there was talk about how to do the copy. Some people (I think it was you, jwatte) said that memcpy is OK, but others say that memcpy on some implementations does a byte-by-byte copy, which is slow.
Why not use MMX or SSE for doing this?
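For what it's worth, here is one way the SSE idea might look in C, as a sketch rather than a tuned routine. `stream_copy` is a hypothetical name; it assumes both pointers are 16-byte aligned and the size is a multiple of 64 bytes, and it falls back to plain memcpy when the compiler does not report SSE support. The intrinsics used (`_mm_load_ps`, `_mm_stream_ps`, `_mm_sfence`) come from `<xmmintrin.h>`. Whether this actually beats memcpy on a given system is something you would have to measure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Copy `bytes` from src to dst, 64 bytes per iteration, in ascending order.
   Assumes 16-byte-aligned pointers and bytes % 64 == 0 (checked by assert). */
void stream_copy(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15u) == 0);
    assert(((uintptr_t)src & 15u) == 0);
    assert((bytes & 63u) == 0);
#if HAVE_SSE
    {
        float *d = (float *)dst;
        const float *s = (const float *)src;
        size_t count = bytes / sizeof(float);
        size_t i;
        for (i = 0; i < count; i += 16) {
            /* Load one 64-byte block from cached memory... */
            __m128 r0 = _mm_load_ps(s + i);
            __m128 r1 = _mm_load_ps(s + i + 4);
            __m128 r2 = _mm_load_ps(s + i + 8);
            __m128 r3 = _mm_load_ps(s + i + 12);
            /* ...and write it out with non-temporal stores. */
            _mm_stream_ps(d + i,      r0);
            _mm_stream_ps(d + i + 4,  r1);
            _mm_stream_ps(d + i + 8,  r2);
            _mm_stream_ps(d + i + 12, r3);
        }
        _mm_sfence();  /* order the streaming stores before later stores */
    }
#else
    memcpy(dst, src, bytes);  /* portable fallback when SSE is unavailable */
#endif
}
```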