VAR Weirdness and Write Combining

JelloFish · April 15, 2002, 4:26pm

1st Question:
I’m having some problems getting VAR to work on my system. Basically I switch my memory allocation over to wglAllocateMemoryNV, then call glVertexArrayRangeNV, then glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV); after that, everything is left as normal. But alas, all the vertices are jumbled. What kind of VAR mistake could cause this?

2nd Question:
Which chipsets can write combine to multiple arrays? (I read previously that P3s can, but do Athlons as well?)
Also, can you get write combining with a loop such as
for all vertices
{
perform an action
write result to VAR array
}
Is there a good document on the web about the best techniques to do write combining?

imported_jwatte · April 15, 2002, 6:35pm

If your vertex data is jumbled, you’re generating it wrong. Try running the EXACT same code, but call malloc() instead of AllocateMemory(), and don’t enable/establish the vertex array range. I really do mean: comment out those three lines and replace the AllocateMemory line with a malloc() call, so you know that’s the ONLY difference. Then see if the data is still jumbled.

Regarding write combining: there are very limited write combining resources on any chip. On a Pentium III, any L1 cache miss is likely to cause contention for write combiner resources, and the CPU is really quite willing to partially evict your half-written line so that it can move an L2 cache line closer to the CPU.

Thus, pre-touch all data you will need for a block of verts, so that it’s all in L1. THEN write-combine out, making sure you overwrite ENTIRE cache line aligned blocks.
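One way to read that advice, as a minimal sketch rather than a tuned implementation: work in blocks, read one value from every input cache line first so the block is resident, then write the output strictly in order so every 32-byte line is filled completely before the next one starts. The names scale_to_wc, pretouch and BLOCK_FLOATS, and the "multiply by a scale" operation, are hypothetical stand-ins for whatever per-vertex work is being done. The sketch assumes dst points at 32-byte-aligned write-combined memory and that count covers whole lines; it does not check either.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_BYTES      32                            /* Pentium III line size quoted in this thread */
#define FLOATS_PER_LINE (LINE_BYTES / sizeof(float))  /* 8 floats per line */
#define BLOCK_FLOATS    256                           /* hypothetical block size: 1 KB of input */

/* Read one float from every cache line of the block so the input is cached
 * before any write-combined stores are issued. */
void pretouch(const float *src, size_t n)
{
    volatile float sink = 0.0f;
    size_t i;
    for (i = 0; i < n; i += FLOATS_PER_LINE)
        sink = src[i];
    (void)sink;
}

/* Hypothetical per-vertex operation: scale every component.
 * Output is written sequentially, so each destination line is
 * overwritten in full before moving on to the next. */
void scale_to_wc(float *dst, const float *src, size_t count, float scale)
{
    size_t base, i;
    assert(count == 0 || (dst != NULL && src != NULL));
    for (base = 0; base < count; base += BLOCK_FLOATS) {
        size_t n = count - base;
        if (n > BLOCK_FLOATS)
            n = BLOCK_FLOATS;
        pretouch(src + base, n);            /* step 1: no input misses below */
        for (i = 0; i < n; i++)             /* step 2: in-order, whole-line writes */
            dst[base + i] = src[base + i] * scale;
    }
}
```

Nothing in the loop ever reads from dst; reading back from write-combined memory is very slow, so results that are needed again should be kept in a local copy.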

Devulon · April 16, 2002, 2:21am

I know AMD has a really good article on a variety of prefetch methods. It covers the obvious one, using PREFETCH instructions in assembly, as well as C code that tricks/convinces the processor into loading multiple cache lines with your data. It’s on the AMD website in the developer section. I don’t know exactly where, but dig around and you should be able to find it. I would like to think that most of the non-AMD-specific methods would work quite well on Pentiums; the architecture really is quite similar. I definitely think that the C version of cache loading will work almost exactly the same on Intel. I am sure Intel has equivalent documents as well. I know they did a while back, but it’s been a while since I have been to their website.

Hope this helps.

Devulon

JelloFish · April 16, 2002, 11:58am

If I allocate memory using VAR but don’t enable VAR, everything seems to render correctly. (Just really slowly.)

knackered · April 16, 2002, 12:04pm

Are you changing the vertices every frame?

JelloFish · April 16, 2002, 12:19pm

Several times per frame. I am calling every form of flush before the data is written the second time (until I get fences going). Interestingly enough, the render is less jumbled with the flushes than without. And when I say jumbled, I mean some faces are there and some aren’t. None of the faces have the correct texture coordinates. Some faces go off into infinity, etc.

imported_jwatte · April 16, 2002, 6:47pm

PREFETCH has absolutely nothing to do with write combining.
Are you sure you enable the VAR correctly? Specifically, VertexArrayRange() takes a number of BYTES as parameter, whereas most other GL operations operate on larger units (pixels, vertices, what have you).
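To make the bytes-versus-vertices point concrete, here is a small sketch of the size calculation. The Vertex layout and the helper name var_range_bytes are made up for illustration; only the rule that the length argument is a byte count comes from the post above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex layout, for illustration only. */
typedef struct {
    float position[3];
    float normal[3];
    float texcoord[2];
} Vertex;

/* Size in BYTES of a range holding vertex_count vertices. */
size_t var_range_bytes(size_t vertex_count)
{
    /* Refuse counts whose byte size would overflow size_t. */
    assert(vertex_count <= (size_t)-1 / sizeof(Vertex));
    return vertex_count * sizeof(Vertex);
}

/* Intended use, assuming base was returned by wglAllocateMemoryNV:
 *
 *   glVertexArrayRangeNV((GLsizei)var_range_bytes(count), base);  // bytes: correct
 *   glVertexArrayRangeNV((GLsizei)count, base);                   // vertices: range far too small
 */
```

With this layout, passing the vertex count instead of the byte count would cover only 1/32 of the data, and every array pointer outside that range would no longer be valid VAR memory.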

knackered · April 16, 2002, 11:01pm

If you’re not using fences, then just call glDisableClientState(GL_VERTEX_ARRAY_RANGE_NV) every frame, which flushes the VAR stuff as a side effect.
You really should use a fence though (and when you do use a fence and want to disable VAR every frame, use glDisableClientState(GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV);)

wimmer · April 16, 2002, 11:43pm

Try to align all your arrays on 32-byte boundaries and let me know whether that works. I had a similar problem once, and this fixed it for me.

Michael
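For anyone carving several arrays out of one VAR allocation, a minimal sketch of 32-byte alignment follows. The helper names align_up and round_up_size are hypothetical; the idea is to round both each array’s start address and each array’s size up to the next multiple of 32, so that every following array also starts on a boundary.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a pointer up to the next multiple of alignment (a power of two). */
void *align_up(void *p, size_t alignment)
{
    uintptr_t addr = (uintptr_t)p;
    uintptr_t mask = (uintptr_t)alignment - 1;
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (void *)((addr + mask) & ~mask);
}

/* Round a byte count up to the next multiple of alignment (a power of two). */
size_t round_up_size(size_t n, size_t alignment)
{
    size_t mask = alignment - 1;
    assert(alignment != 0 && (alignment & mask) == 0);
    return (n + mask) & ~mask;
}
```

A sub-allocator would then place array k at align_up(base, 32) plus the sum of round_up_size(size_i, 32) over the arrays before it.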

JelloFish · April 17, 2002, 8:26am

That was it! Thanks.

Where is that documented anyways?

JelloFish · April 17, 2002, 10:13am

Hey, if I had to copy from non-write-combined memory to write-combined memory, what is the fastest way? Is memcpy any good?

wimmer · April 17, 2002, 11:20am

It’s not documented - I had to find out the hard way… (and I would really appreciate a good explanation for this)

memcpy actually performs quite well for copies to AGP or video memory (if fast writes are enabled). On some machines you might gain a few percent using SSE/MMX, but I doubt it’s worth it (your bottleneck is likely to be elsewhere)…

Michael

imported_jwatte · April 17, 2002, 5:32pm

Copying to AGP is one of the few cases where memcpy() is good. Copying regular memory to regular memory, memcpy() is pretty poor (as implemented in the MSVC 6.0 library and GLibc, anyway).

The documentation for VertexArrayRange says that the memory must be aligned on 4-byte boundaries, IIRC. If you weren’t aligned on 4-byte boundaries, then aligning to 32 will certainly fix that. You might want to go back and align to “only” 4 to see if that, too, helps.

32 bytes is the fetch buffer/write combiner/cache line size on a Pentium III. That shouldn’t have much to do with your AGP memory, except if you (or the driver) forget to use SFENCE properly and you don’t do complete-write-combiner overwrites.

wimmer · April 18, 2002, 6:04am

If I read the spec correctly, 4-byte alignment is only necessary for NV10 (Geforce2)… For NV20, there aren’t any pointer alignment restrictions (except that <pointer> must be 32-byte aligned, which I take to be the pointer to the beginning of the VAR memory range.)

Michael

JelloFish · April 18, 2002, 9:12am

I’m certain, and according to SourceSafe, I was already aligning to 4 bytes, but today when I switched back to 4-byte alignment everything seemed to work fine. I think it might be a deeper bug (or something that had more to do with another piece of code than with the alignment).
Hopefully the bug will recur and I can track it down.

wimmer · April 19, 2002, 5:02am

So are you on NV20 (GF3) or NV10 (GF2)?

Michael

Devulon · April 19, 2002, 5:09am

memcpy is fast but not the fastest. The main reason memcpy is slow is that it copies one byte at a time. If you are using floats for the verts (which you probably are) you need to copy 4 bytes at a time. Wait, I am on to something: 4 bytes = 32 bits, which is a float.

Learn to program assembly. It’s four lines to write a memcpy function that copies 4 bytes at a time. I wish I could give you source, but to be honest it’s been a while since I have done it. (Although I really should get back into the habit.) Let me go look at the Intel website and I will post the code.

Devulon

wimmer · April 19, 2002, 9:50am

There is source for plenty of memcpy routines available on the net.

As jwatte pointed out, memcpy is actually very fast for AGP/Vidmem, and I don’t think you will need anything faster for any real-world application (at least not until AGP8x or more comes out…)

Michael

system · April 19, 2002, 10:17am

I believe there are better versions of memcpy that copy 32 bits at a time and can handle array sizes that are not divisible by 4 bytes. I already have something like this.

I haven’t tried it, but I heard using MMX is better for this since you can move 64 bits at a time.

Does anyone know if there are instructions for copying large chunks of data? Something that can move 1 KB with a single instruction perhaps?

V-man
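For reference, a copy that moves 32 bits at a time and handles sizes that are not a multiple of 4 can be sketched in portable C as below. The function name copy_words is hypothetical. The fixed-size 4-byte memcpy calls are there only to keep the word accesses legal for any pointer type and alignment; compilers typically turn each one into a single 32-bit load or store. This is an illustration of the idea, not a claim that it beats the library routine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy bytes from src to dst, one 32-bit word at a time, then finish
 * the 0 to 3 leftover bytes individually. Buffers must not overlap. */
void copy_words(void *dst, const void *src, size_t bytes)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t words = bytes / 4;
    size_t i;

    assert(bytes == 0 || (dst != NULL && src != NULL));

    for (i = 0; i < words; i++) {
        uint32_t w;
        memcpy(&w, s + 4 * i, 4);   /* one 32-bit load  */
        memcpy(d + 4 * i, &w, 4);   /* one 32-bit store */
    }
    for (i = words * 4; i < bytes; i++)   /* tail: sizes not divisible by 4 */
        d[i] = s[i];
}
```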

imported_jwatte · April 19, 2002, 7:01pm

Please, people, read the fine source before posting on this forum. If you don’t, you’ll just end up perpetuating bad myths.

The MSVC implementation of memcpy() turns into a REP MOVSD, which copies 32 bits at a time, with minimal loop overhead. Any optimized UNIX libc will do a similar thing.

The issue is more that the CPU is so much faster than the memory subsystem these days, that copying longwords is not really faster than copying bytes :-/

When copying to cached memory, memcpy() wastes a lot of time write-allocating cache lines, which leads to pretty poor performance. Any “plain” instruction copy operation will have the same problem. The way to get copy to cached memory to go fast is to bypass the cache for the output buffer, or if you’re on AMD or PPC, to pre-clear the output buffer cache lines.
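One way to bypass the cache for the output buffer on x86 is with non-temporal (streaming) stores. The sketch below is an assumption-laden illustration, not the method used by any particular library: it uses the SSE2 intrinsics _mm_loadu_si128, _mm_stream_si128 and _mm_sfence when the compiler defines __SSE2__ (GCC/Clang convention), and falls back to plain memcpy otherwise. The function name copy_bypass_cache is hypothetical, and dst must be 16-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Copy bytes from src to a 16-byte-aligned dst without write-allocating
 * cache lines for the destination. Buffers must not overlap. */
void copy_bypass_cache(void *dst, const void *src, size_t bytes)
{
    assert(((uintptr_t)dst & 15) == 0);   /* streaming stores need aligned dst */
#if defined(__SSE2__)
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t chunks = bytes / 16;
    size_t i;

    for (i = 0; i < chunks; i++)
        _mm_stream_si128(d + i, _mm_loadu_si128(s + i));  /* store goes around the cache */
    _mm_sfence();                                         /* make the streamed data globally visible */

    /* Fewer than 16 bytes left: finish with an ordinary copy. */
    memcpy((unsigned char *)dst + chunks * 16,
           (const unsigned char *)src + chunks * 16,
           bytes - chunks * 16);
#else
    memcpy(dst, src, bytes);              /* no SSE2: ordinary cached copy */
#endif
}
```

The trade-off is that the destination is not left in the cache, so this only pays off when the output will not be read again soon.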

When writing TO AGP memory, you’re writing to un-cached memory, so the write allocation is not a problem. You can get some amount of speed-up by properly streaming DRAM pages and pre-warming the cache for the input buffer, but that’s about it. And it’s not like the ratio CPU : Memory speed will go DOWN anytime soon, so it’s only bound to get mooter.