I am using VAR/Fence to draw approximately 200,000 quads. I get around 16 frames per second or so (a bit over 3 million quads per second, or 6 million triangles).
The way I draw is: I bind a texture, check whether the previous fence has finished, copy the vertices for that texture to VAR memory with memcpy(), render with drawElements(), and finally set the fence. This happens in a for loop that cycles over all the textures.
I think that’s pretty bad in terms of performance, but I think I’m doing the VAR stuff right… Does anyone have any suggestions?
Note that I do enable all the state that’s required to draw the polygons (which are Gouraud-shaded with one texture).
I’m pretty sure that memcpy is not the most efficient way to copy to AGP memory, but I’m not positive.
In any case, you should be double-buffering your AGP memory. That is, you should be writing to one block of AGP while the graphics card is reading vertex data out of the other one.
If you’re going to be simply copying all your data every frame anyway (instead of keeping it static or generating it) you could probably just as well (or better) use regular vertex arrays and get the best possible copy performance NVidia managed to squeeze out of the machine.
I wouldn’t depend on DrawElements() being fast without VAR – in my timing, it isn’t.
memcpy() is surprisingly decent at moving data from cached memory to un-cached memory, because, well, it’s writing to un-cached memory and thus doesn’t have the write-allocation problems that memcpy() has on regular cached memory.
I think the problem might be more in checking the previous fence before starting again. Done that way, you are synchronous: the CPU waits for the card to finish each batch before copying the next one, so you get no overlap. I would suggest you go the double-buffering route by splitting your VAR into two areas and using one fence per area. First fill up the first area with the geometry being drawn; when that’s full, set its fence and move on to the next area. When you move into a new area, test the fence for THAT area. That way, you’ll get decent overlap in your transfers to the card.
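Here is a rough sketch of the book-keeping for that two-area scheme, in case it helps. It deliberately makes no GL calls: it only tells the caller which area to fence and which area's fence to wait on before copying. The names VarStream, VarSlot, var_stream_init() and var_stream_reserve() are made up for illustration, and it assumes every batch fits inside one area.

```c
#include <assert.h>
#include <stddef.h>

/* Book-keeping for a VAR block split into two equal areas.
   All names here are hypothetical, not part of any GL API. */
typedef struct {
    size_t area_size;     /* bytes in each half of the VAR block */
    size_t used;          /* bytes already handed out in the current area */
    int    current;       /* area being filled: 0 or 1 */
    int    fence_set[2];  /* non-zero if that area's fence is outstanding */
} VarStream;

/* What the caller has to do before copying one batch. */
typedef struct {
    int    set_fence_area;   /* area to fence now (e.g. glSetFenceNV), or -1 */
    int    wait_fence_area;  /* area whose fence must finish first
                                (e.g. glFinishFenceNV), or -1 */
    size_t offset;           /* byte offset into the VAR block to memcpy() to */
} VarSlot;

void var_stream_init(VarStream *s, size_t total_bytes)
{
    s->area_size = total_bytes / 2;
    s->used = 0;
    s->current = 0;
    s->fence_set[0] = s->fence_set[1] = 0;
}

VarSlot var_stream_reserve(VarStream *s, size_t bytes)
{
    VarSlot slot = { -1, -1, 0 };

    assert(bytes <= s->area_size);  /* assumption: a batch fits in one area */

    if (s->used + bytes > s->area_size) {
        /* Current area is full: fence it and move on to the other area. */
        slot.set_fence_area = s->current;
        s->fence_set[s->current] = 1;
        s->current ^= 1;
        s->used = 0;

        /* Only the fence for the area being entered needs to be tested. */
        if (s->fence_set[s->current]) {
            slot.wait_fence_area = s->current;
            s->fence_set[s->current] = 0;
        }
    }

    slot.offset = (size_t)s->current * s->area_size + s->used;
    s->used += bytes;
    return slot;
}
```

In the draw loop you would call var_stream_reserve() once per texture batch. If set_fence_area is not -1, set that area's fence; if wait_fence_area is not -1, finish that area's fence; then memcpy() the vertices to the VAR base pointer plus offset and draw as before. The CPU only ever waits on an area the card was given a whole area's worth of work ago, which is where the overlap comes from.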