Hardware T&L problem (VertexArrayRange)

I tried to implement hardware T&L in the graphics engine of Transport Magnat, which will be published in Europe and the States this coming October. In the GL version we are using glDrawArrays for rendering our landscape and our meshes. The performance is, in my opinion, much too slow: 15 frames if you are looking at a wood, 40 if you are looking just at the landscape with some flowers, trees and stones.
I experimented with glVertexArrayRange in a little demo program and the results there were absolutely great: 1.7 million polygons per second without VAR, 7.2 million using VAR, so very similar results to the demo offered by NVIDIA (learning_var). So far so good. First of all I was very surprised that it was impossible to allocate AGP memory; I debugged the learning_var demo and even there it was impossible. As an alternative, learning_var then uses video memory (at least the docs say that if your last parameter is 1.0, it's video memory). Ok, the docs also say that write access to this memory is very slow, so why can I write the vertex data for 7.2 million polygons rendered as triangle strips if it's officially slow? This is one of the facts that is very mysterious in my opinion, along with the fact that AGP memory is not allocatable on a single one of the PCs in our whole firm. So, however it works, we've allocated a block of memory now, and nobody knows where it really is. The experiment results were great anyway; everybody in the firm was saying "Wow, T&L is fast, I want a GeForce toooo!".
Then came the practical test of implementing it directly in the game engine. I tried to draw the trees with the help of VAR and the results were absolutely awful: without VAR 15 frames, with VAR 10, and my jaw dropped as if I'd seen a ghost. So far so bad. I looked for what was costing frames and found out that the enabling and disabling of GL_VERTEX_ARRAY_RANGE_NV is absolutely deadly. So I enabled it only once at the start of the program, commented out all source that could have problems with this, and tried again: 20 frames without T&L, 19 frames with. And this on all PCs here, which all use a GeForce 2 MX but otherwise have completely different configurations. =(

After two long nights of trying to reach great performance using VAR, I gave up: absolutely no performance increase; even with a lot of limitations, still the same framerate as without VAR. The only positive thing I noticed was that the Z-buffer test was faster, and that if I set the view range farther than the Z-far plane, the framerate really gets a bit higher, but… that's pointless for our engine, because we normally only draw objects in range anyway.
I am buffering all my objects and rendering them before the buffer swap through just a few calls to glDrawArrays.

So… is it really possible to increase performance by a factor of 4 through VAR? Is there a good solution for increasing performance at all, or is the learning_var demo just a fake that gets all of its performance from the fact that most of its polygons are Z-culled (as in my experiment program as well), which saves fillrate?

   Michael Ikemann / Virtual XCitement Software GmbH

I can remember such strange behaviors when I started to use VAR, but perhaps your init sequence (when allocating the vertex array range) is wrong!! For sure you can get a damn boost when using static vertex data!!
Here is a bit of code (as is, sorry), perhaps it could help ya…
VR_BOOL VROPENGL::allocateCacheVertex(VR_VOID)
{
	VR_LONG i;

	if (driverType != VR_OPENGLNV)
		return VR_FALSE;

	// Release any previously allocated range first.
	if (pFastMem != NULL){
		glDisableClientState(GL_VERTEX_ARRAY_RANGE_NV);
		wglFreeMemoryNV(pFastMem);
		pFastMem = NULL;
	}

	// First try video memory (priority 1.0), shrinking the request
	// in 1 KB steps until an allocation succeeds.
	for (i=0; (i < vrMaxCacheVerticeSize) && (pFastMem == NULL); i+=1024){
		cacheSize = vrMaxCacheVerticeSize-i;
		pFastMem = wglAllocateMemoryNV(cacheSize,0.f,0.f,1.f);
	}

	// Fall back to AGP memory if video memory is unavailable.
	if (pFastMem == NULL){
		for (i=0; (i < vrMaxCacheVerticeSize) && (pFastMem == NULL); i+=1024){
			cacheSize = vrMaxCacheVerticeSize-i;
			pFastMem = wglAllocateMemoryNV(cacheSize,0.25f,0.25f,0.70f);
		}
		if (pFastMem == NULL)
			return VR_FALSE;
	}

#ifdef _DEBUG
	sprintf(dbgtext,"\n(OPENGL:allocateCacheVertex) Info: VertexArray cache size: %04d Kb",cacheSize/1024);
	OutputDebugString(dbgtext);
#endif

	currentCacheSize = 0;
	currentVertex = 0;
	pCurrentCacheVertex = (VR_BYTE*) pFastMem;

	glVertexArrayRangeNV(cacheSize,pFastMem);
	glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

	return VR_TRUE;
}

gl

Hi Sylvain,
yes, I know that a lot depends on the way you allocate the memory, but the "0.25f,0.25f,0.75f" version never works, only 0.f,0.f,1.0f. Writing into this memory is very fast, but when I want to flush it, it's even slower than in software. I've got one big buffer, which changes at least once every frame. The polygons in it are sorted by texture and type (tristrips, tris, quads). I finally got it to be at least as fast as software, and it also seems that the transfer of the vertices is faster, because if some polygons are cullable, as said in my first post, then it's also faster; but if all of them are visible, the result is awful compared to the learning_var demo. Do you have any demo .exe or C code?
Here is a short piece of source code from this experimental unit; the client states (glNormalPointer … the glVertexArrayRange state) are enabled at initialization already.
Unfortunately the code is Delphi-based, because the inventor of the game began it in Delphi and it was not convertible, but I hope you can understand it nevertheless:

unit GRA_TL;
//1.0 MI 29.05.2001
//Experimental T&L-unit

interface
Uses DataBase{$IFDEF MAW},Infoscreen {$ENDIF} ;

Type TLVertex = Record
VCoord : TVector;
VColor : TColor;
VU,VV : Single;
F1,F2 : LongInt;
End;

Const MaxTLVertices = 20000;

Type PTLVertexData = ^TTLVertexData;
TTLVertexData = Array [0..MaxTLVertices-1] Of TLVertex;

Const MaxVPBlock = 108*5; //Maximum vertexcount per block
MaxTexBlocks = 32;

Var TLTexBlocks : Array [0..2000] Of Record
BlockCount : Array [2..3] Of LongInt;
Blocks : Array [2..3,1..MaxTexBlocks] Of Word; //Index
Used : LongBool;
End;

Const MaxTLBlocks = 750;

Type PTLBlock = ^TTLBlock;
TTLBlock = Record
ActOffset : LongInt;
InUse : LongInt;
End;

Var TLBlocksInUse : LongInt;
TLBlocks : Array [1..MaxTLBlocks] Of TTLBlock;

Var TLVertexInUse : LongInt;
TLVertexData : PTLVertexData;

Procedure Init_GRA_TL;

Procedure GRA_TL_BeginFrame;

Function GRA_TL_GetBlock(Const TexNr : LongInt; Const VHigh : Byte) : PTLBlock;

Function GRA_TL_NewBlock(Const TexNr : LongInt; Const VHigh : Byte) : PTLBlock;

Procedure GRA_TL_Flush(Clear : Boolean);

Procedure GRA_TL_FinishFrame;

Procedure Destruct_GRA_TL;

Var AGPBlock : Pointer;
AGPChunk : Byte;
AGPChunkIDs : Array [0..3] Of LongInt;
AGPChunkUsed : Array [0..3] Of Boolean;

implementation

Uses {$IFDEF OPENGL}GRA_GL, {$ELSE}GRA_DX,{$ENDIF}Show3D;

Procedure Init_GRA_TL;
Var Z : Integer;

Begin
TLVertexData:=Nil;
{$IFDEF OPENGL}
If Options.usetandl then
AGPBlock:=wglAllocateMemoryNV(SizeOf(TTLVertexData)*4,0,0,1);
If AGPBlock<>Nil Then Begin
glGenFencesNV(4,@AGPChunkIDs);
glVertexArrayRangeNV(SizeOf(TTLVertexData)*4,AGPBlock);
AGPChunk:=0;
TLVertexData:=Pointer(LongInt(AGPBlock)+AGPChunk*SizeOf(TTLVertexData));
End Else Options.usetandl:=false;
{$ENDIF}
If Options.usetandl=False Then
New(TLVertexData);
End;

Procedure GRA_TL_BeginFrame;
Begin
FillChar(TLTexBlocks,(TexCount+1)*140,0);
TLVertexInUse:=0;
TLBlocksInUse:=0;
// If Options.usetandl then
// glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
End;

Function GRA_TL_GetBlock(Const TexNr : LongInt; Const VHigh : Byte) : PTLBlock;
Begin
If TlTexBlocks[TexNr].BlockCount[VHigh]=0 Then Begin
Result:=GRA_TL_NewBlock(TexNr,VHigh);
End Else Result:=@TLBlocks[TlTexBlocks[TexNr].Blocks[VHigh,TlTexBlocks[TexNr].BlockCount[VHigh]]];
End;

Function GRA_TL_NewBlock(Const TexNr : LongInt; Const VHigh : Byte) : PTLBlock;
Begin
If (TLTexBlocks[TexNr].BlockCount[VHigh]=MaxTexBlocks)
Or (TLBlocksInUse=MaxTLBlocks) Or (TLVertexInUse+MaxVPBlock>MaxTLVertices) Then
GRA_TL_Flush(True); //If no block indices, free blocks or free vertices are left anymore, FLUSH!
Inc(TLTexBlocks[TexNr].BlockCount[VHigh]);
TLTexBlocks[TexNr].Used:=True;
Inc(TLBlocksInUse);
TLTexBlocks[TexNr].Blocks[VHigh,TLTexBlocks[TexNr].BlockCount[VHigh]]:=TLBlocksInUse;
TLBlocks[TLBlocksInUse].ActOffset:=TLVertexInUse;
TLBlocks[TLBlocksInUse].InUse:=0;
Inc(TLVertexInUse,MaxVPBlock);
Result:=@TLBlocks[TLBlocksInUse];
End;

var rexi : array [0..MaxTLVertices-1] Of TLVertex;

Procedure GRA_TL_Flush(Clear : Boolean);
Var Z,Z2,Z3 : Integer;
Typ : Cardinal;
{$IFDEF MAW} Count : LongInt; {$ENDIF}
F : File;

Begin
{$IFDEF OPENGL}
If TLBlocksInUse=0 Then Exit;
glVertexPointer(3,GL_FLOAT,SizeOf(TLVertex),@TLVertexData^[0].VCoord);
glColorPointer(3,GL_UNSIGNED_BYTE,SizeOf(TLVertex),@TLVertexData^[0].VColor);
glTexCoordPointer(2,GL_FLOAT,SizeOf(TLVertex),@TLVertexData^[0].VU);
{$ENDIF}
For Z:=0 To 2000 Do If TLTexBlocks[Z].Used Then Begin
GRA_SelectTexture(Z);
For Z2:=2 To 3 Do Begin
{$IFDEF OPENGL}
Case Z2 Of
2 : Typ:=GL_Triangles;
else Typ:=GL_Quads;
End;
For Z3:=1 To TLTexBlocks[Z].BlockCount[Z2] Do Begin
With TLBlocks[TLTexBlocks[Z].Blocks[Z2,Z3]] Do Begin
{$IFDEF MAW}
Count:=InUse;
if z=2 then infowin.add(@InfoWin.Gra_VertexUsed,Count div 3)
else infowin.add(@InfoWin.Gra_VertexUsed,(Count div 2));
infowin.add(@InfoWin.Gra_PolyUsed,Count);
{$ENDIF}
glDrawArrays(Typ,ActOffset-InUse,InUse);
End;
End;
{$ENDIF}
End;
End;
If Options.UseTAndL Then Begin
AGPChunk:=(AGPChunk+1) mod 4;
{$IFDEF OPENGL}
glSetFenceNV(AGPChunkIDs[AGPChunk],GL_ALL_COMPLETED_NV);
glFinishFenceNV(AGPChunkIDs[AGPChunk]);
AGPChunkUsed[AGPChunk]:=True;
{$ENDIF}
TLVertexData:=Pointer(LongInt(AGPBlock)+AGPChunk*SizeOf(TTLVertexData));
End;
If Clear Then Begin
FillChar(TLTexBlocks,(TexCount+1)*140,0);
TLVertexInUse:=0;
TLBlocksInUse:=0;
End;
//assignfile(f,'test.dat');
End;

Procedure GRA_TL_FinishFrame;
Var Z : Integer;
Begin
GRA_TL_Flush(True);
{$IFDEF OPENGL}
If Options.UseTAndL Then
GlFinish;
{$ENDIF}
// If Options.usetandl then
// glVertexArrayRange
End;

Procedure Destruct_GRA_TL;
Begin
{$IFDEF OPENGL}
If Options.usetandl Then
wglFreeMemoryNV(AGPBlock)
Else
{$ENDIF}
Dispose(TLVertexData);
End;

end.

Thanks,

 Michael / VX

p.s. We are allocating the memory with the parameters 0.f,0.f,1.0f.

p.p.s. All we need to do is flush the buffer the fastest way possible, but how?

MrCalab,

If you can’t allocate AGP memory, then the driver probably can’t either. I had this problem with a system of mine where the (i815 ?) chipset was a slightly different rev than my Win98 disk was expecting, and I was unable to allocate AGP memory. A friend found the right drivers, and solved that problem.

If you can’t allocate AGP memory, then make sure to get that corrected - overall performance (not just VAR) will suffer otherwise.

The other thing to note (that should be in the learning_VAR whitepaper) is that switching VAR on and off is very expensive (causes a full flush) on older drivers. You should stick to VAR and immediate mode only to get better performance.

Once drivers that reduce the cost of turning VAR on and off are available, we’ll provide details on how to use it.

Until then, just leave it on or use immediate mode for best performance.

Hope this helps -
Cass

MrCalab, you seem to be very confused about VAR. VAR is NOT T&L! You get T&L automatically if your hardware supports it, whether you use VAR or not. VAR is simply a modification of the specification that has much less bandwidth overhead (by not having to keep a copy of vertex arrays in memory until drawing).

First thing: if your application is not bandwidth limited, you'll see no gain from using VAR. Check whether your application is CPU or fillrate limited.

Second thing, VAR’s allocated memory is not cached, so you must access it sequentially or your performance will be disastrous.
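To illustrate the sequential-access point, here is a minimal sketch; the Vertex layout and varBase are made-up names, and varBase is assumed to point into memory returned by wglAllocateMemoryNV:

/* VAR memory is uncached/write-combined: fill it strictly front to back
   and never read it back. Hypothetical layout, for illustration only. */
typedef struct { float pos[3]; unsigned char color[4]; float uv[2]; } Vertex;

void fillSequential(Vertex *varBase, const Vertex *src, int count)
{
	int i;
	for (i = 0; i < count; ++i)	/* strictly ascending addresses */
		varBase[i] = src[i];	/* a read-modify-write here would stall */
}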

Finally, make sure your drivers are updated… I've had a problem with VAR that caused a crash on my system, and found it was a bug in the driver. Updating the driver fixed it.

Y.

Refer to the NV_vertex_array_range2 extension for enabling/disabling VAR w/o a flush.
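Roughly, a hedged sketch of how that extension is used: it adds a second client-state token that toggles VAR without the implicit flush (check the enum value against your own glext.h):

#ifndef GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV
#define GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV 0x8533
#endif

/* Same range setup as before, but enabling/disabling through this
   token does not force a flush of the pipeline. */
glEnableClientState(GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV);
/* ... draw from VAR memory ... */
glDisableClientState(GL_VERTEX_ARRAY_RANGE_WITHOUT_FLUSH_NV);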

  • Matt

Hi Ysaneya,
yes, of course you get T&L even without using VAR, but only software T&L, not hardware accelerated; at least this is written in every documentation I've read about it so far. And looking at the learning_var demo, the difference between having VAR enabled or disabled is unbelievably big, and both versions use DrawArrays, so I think this command offers an unbelievable number of possibilities. At the moment there are still some things I need to implement in the AI, but I'll try my best to write a little C demo rendering some hundred trees, once using immediate mode and once using DrawArrays, and… I hope one of you can optimize it using VAR. In any case, thanks for your support. I'll write it, upload it to a server as fast as I can, and post the URL here right after.

Thanks,

   Michael / VX


>>of course you get T&L even without using VAR, but only software T&L, not hardware accelerated<<

This is totally wrong. You should do your homework about how graphics accelerators work.

Hi Relic,
I don't know what you mean by homework, but I read every documentation I can get my fingers on, and the following is a part of the VAR/fence presentation Cass wrote:

What is NV_vertex_array_range (a.k.a. VAR)? (2)
- Compiled vertex arrays improve this somewhat
- Relaxes coherency requirements
- Lock/Unlock semantics
- More room to optimize
- Usually requires lots of redundant copying
- App could do better memory management
- Introduces index bounds
- But not explicit memory bounds
- For multipass rendering

And these are the lines I meant:
- Can re-use transformed vertices (!SOFTWARE! T&L)
- Can put data in AGP/video memory (!HARDWARE! T&L)

Michael

And Relic… about graphics accelerators:

As far as I know, you send a graphics accelerator triangle data, in the simplest version just three vertex coordinates, whose content is the position on the 2D screen plus a depth value for every vertex, which can range from 0.0f up to 1.0f. The vertices need to be transformed and lit by the driver so that the needed 2D screen and Z coordinates result. As a feature of the GeForce 2, this calculation, i.e. where on the 2D screen the triangle has to be rendered, is done on the card. And about my homework: the main feature of 3D cards is that they can blit textured polygons into a two-dimensional buffer unbelievably fast. This is what I have always thought until now; if something is completely wrong with this, please correct me.

  Michael / VX

Hi mcraighead,
what’s the numerical value for “NV_vertex_array_range2”?
Yes, the glFlush costs a lot, but one thing I was also a bit surprised about was that even glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV) really costs a couple of frames, even when it had already been enabled before the call. The docs say it's because a new DMA handler is created every time you enable this state; what sense does that make?

Thanks,

     Michael

I see, that's from VAR_fence.pdf.
That's only one possibility to make use of the data in the VAR, and I haven't seen it used yet. I also think that's not what's advertised in the learning_var program.

The description you gave of a 3D accelerator fits the class of chips which only do the triangle setup themselves (Permedia 2, TNT2, Rage, etc.).
That's pretty much outdated with the new GPUs, which know how to transform model data. Here you throw in the transformation matrix and the vertices in MODEL coordinates, and everything else, including the setup before rasterization described above, is done in HW.

That is why you seem to be confused. T&L has nothing to do with VAR. Software T&L means that transforming vertices and calculating lighting is done by the driver on the CPU. Hardware T&L means it is done on the GPU (i.e. the video card). Whether VAR is enabled or disabled, if you've got a T&L card such as a GF1/2/3, it will be done on the GPU.

VAR saves bandwidth by removing the need for the driver to keep a copy of the data. By using VAR, you are actually telling the driver: "do not copy the data, use the memory I'm providing; I'm giving you the assurance that I won't modify it while you use it". Again, it does not enable hardware T&L.

Y.

Moreover, with VAR you can also decide on the best vertex structure for the card, and perhaps you also get the most interesting feature: vertices can be stored onboard, which means there is no more traffic over this 'DAMN' slow bus!! (call it what you wish -> AGP, PCI, VLB, they are all the same!!)
When using this kind of feature (which is best for static data) you're far beyond display-list performance!!! So far! Anyway, as a clarification I would like to say that with VAR you get the 'T' of T&L because the GPU can access and transform the vertices directly onboard!! I think this is the main point compared to the other GL mechanisms!! Of course with display lists, for example, you get T&L, but mainly the 'L'ighting; don't tell me that lists are stored statically onboard just like VAR!! Because I can see the CPU sending the primitives!!!

Hi all,
thanks for all the information. Hm… so… if a GeForce is used for rendering, its T&L unit is used automatically, with or without VAR. So far so good. Ok, let me describe my problems:

There are four categories that eat our performance; depending on where you are on the map, this differs of course.
If you are in a wood, the framerate crushes down remorselessly, with LOD as without. This is the category trees; because there are not so many of them, there is surely a lot of room for optimization through static models stored directly on the card.

Category 2 are the houses. Unfortunately we've got too many 3D artists, just kidding =). No, in any case we have some hundred different houses and… correspondingly much memory they would take. Here is the question: how much time does it take to store a house in graphics card memory? If it's survivably fast, it would be possible to always hold the most recently displayed houses in this memory, and whenever a new one needs a place, the "oldest" one could be thrown out.

Category 3 are the vehicles, trains and wagons. Of these we don't have so many and also won't have many in the future, maybe 80 all in all and not many different ones at the same time; here there is surely room for optimization through static data.

Category 4 is the landscape. It changes every time the user moves, so there is no room for static optimizations, and also no optimization through shared vertices, because triangle strips are used as much as possible.

The only actual way I know to reserve memory is with the parameters 0.0,0.0,1.0, and the driver doesn't allow me any other way anyway. So how should static data be handled? How do I upload it to the card and display it?
As promised, I'll try my best to still upload a little demo today; in any case, thanks a lot for your information and support.

   Michael

>>If you are in a wood, the framerate crushes down remorselessly, with LOD as without<<

If it is independent of the number of vertices, it's probably fillrate or state-change bound.
Check how the performance at a low LOD behaves with a very small window.
If the performance increases, you won't have many choices for improvement other than reducing the number of pixels drawn (or reducing the color resolution).

>>Category 2 are the houses. … If it's survivably fast, it would be possible to always hold the most recently displayed houses in this memory, and whenever a new one needs a place, the "oldest" one could be thrown out.<<

The houses in the screenshots look like low geometry (more or less boxes), so it's texture memory you're talking about, right?
Some ideas for your own texture management: sort by texture and establish a working set of textures as texture objects. Use glTexSubImage for updating the contents of your working set (use the newest drivers).
If the driver starts thrashing textures because onboard memory is not sufficient for your working set, your performance will stutter. Check out the different OpenGL 1.2 internal formats you can use with glTexImage, and check texture compression (the S3TC extension).
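As a sketch of the working-set idea (names hypothetical): instead of creating and destroying texture objects, rebind an existing slot and overwrite its contents in place with glTexSubImage2D:

/* Recycle a resident texture object from the working set. Updating in
   place avoids reallocating onboard memory. Sketch; RGB8 data assumed. */
void recycleTextureSlot(GLuint texId, int w, int h, const void *pixels)
{
	glBindTexture(GL_TEXTURE_2D, texId);
	glTexSubImage2D(GL_TEXTURE_2D, 0,	/* mip level 0 */
	                0, 0, w, h,		/* replace the whole surface */
	                GL_RGB, GL_UNSIGNED_BYTE, pixels);
}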

>>Category 3 are the vehicles, trains and wagons<<

Sounds like perfect candidates for display lists and other static geometry methods.

>>Category 4 is the landscape. It changes every time the user moves<<

You mean it is redisplayed from a new viewer position, but the model geometry hasn’t changed, right? This is also a static case!

The best optimizations are always those, which eliminate the need to render something. Cull high geometry objects (including chunks of landscape patches) with bounding boxes, etc.
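A minimal sketch of the kind of coarse test meant here, culling a whole object (or landscape chunk) by a bounding sphere against the six frustum planes; the plane layout (a,b,c,d with inward-pointing normals) is an assumption:

/* Return 0 if the sphere is entirely outside any frustum plane. */
int sphereVisible(const float planes[6][4],
                  float cx, float cy, float cz, float radius)
{
	int i;
	for (i = 0; i < 6; ++i) {
		float dist = planes[i][0]*cx + planes[i][1]*cy
		           + planes[i][2]*cz + planes[i][3];
		if (dist < -radius)
			return 0;	/* whole object rejected with one test */
	}
	return 1;		/* at least partially inside */
}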

Hi Relic,
to answer some questions first: yes, we are using relatively low-geometry objects, let's say 40 quads = 80 tris per house, and they're also sorted by texture already, so the number of textures used per frame equals the count of texture switches; there's unfortunately no room for optimization anymore. The same about the culling: we're already culling all we can before "sending" the data to OpenGL (backface culling, frustum culling and range culling), so even there we are at the limit.

About the landscape:
No, it's not possible to keep it static. Our map is 1024x1024 quads, about one million fields, so as you can imagine it's impossible to keep all of it static. Here it's the same with the textures and the culling: everything cullable is culled, textures are sorted, so no room for optimization here either. I was once insane enough to compile the whole scene into a display list and just redisplay that on the following frame; it had no effect on the framerate.

About the LOD of the trees: no, I guess you misunderstood me a little. The LOD has a lot of effect, of course, so I don't think we are at our fillrate limit there. How can I store the trees directly in video card memory? The same goes for the vehicles.

As said, no further culling or texture optimization is possible; we do the lighting ourselves and it's nearly free, so no optimization there either. The only thing that could still help us would be S3TC, which could raise the framerate a bit, but in most scenes it's surely mainly the count of polygons we're painting, and mainly, of course, the count of vertices.

In the landscape engine I already optimized a lot through the use of triangle strips to eliminate that bottleneck as much as possible, so no more optimization there either.

The only thing that could really still help is to get the geometry, which is already at its minimum, to the GPU faster; that was also the reason for my experiments with T&L.

So… how can I "say": "this object shall be stored in video memory", "this one not", and so on?

Thanks,

    Michael

www.digitalprojects.com/way-x

First of all, if VAR isn’t letting you allocate AGP memory, then you may need to install your motherboard drivers.

Second, the basic idea with VAR is that you should allocate one large swath of memory and use all your vertex arrays from there.

I would suggest getting rid of some of the culling (backface culling, certainly; gross frustum culling is fine, but not at the polygon level), as that could certainly be taking up valuable CPU time.

To use VAR, you have to make sure that the current vertex array resides in the memory set by the VAR. For static data, you will want to simply load all your vertex arrays into the VAR memory at the beginning, and then just set the arrays and draw from there. You know beforehand how much memory you need for models, so you can allocate the range at the size you need.
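A hedged sketch of that static setup (varBase, varFloats and the model fields are invented bookkeeping): copy each model into the range once at load time, then per frame only set the pointer and draw:

/* Load time: one sequential copy of the model into the VAR block. */
float *dst = (float *)varBase + varFloats;		/* xyz positions only */
memcpy(dst, model->xyz, model->vertexCount * 3 * sizeof(float));
model->firstVertex = varFloats / 3;
varFloats += model->vertexCount * 3;

/* Per frame: no copying, just point into the range and draw. */
glVertexPointer(3, GL_FLOAT, 0, (float *)varBase);
glDrawArrays(GL_TRIANGLES, model->firstVertex, model->vertexCount);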

Or, if this is just too annoying, simply store these static objects in display lists and run them that way. It may not be quite as fast as VAR, but it's superior to what you are currently doing.
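For comparison, the display-list route for a static object is only a few lines (sketch; drawHouse() stands in for whatever code issues the geometry):

/* Build once at load time; the driver may keep it resident. */
GLuint list = glGenLists(1);
glNewList(list, GL_COMPILE);
drawHouse();			/* hypothetical: emits the static geometry */
glEndList();

/* Per frame, per instance: */
glPushMatrix();
glTranslatef(x, y, z);		/* place this instance */
glCallList(list);
glPopMatrix();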

For dynamic data, you will have to do some memory management. You will need to partition the memory as you see fit, and fill those partitions with vertex data as needed by your application. Now remember: do not attempt to use video memory for this task, writes to video memory are very slow. If you use AGP memory, you must write sequentially (i.e., block by block, in the order you wish it to go into the VAR memory).
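A sketch of that partitioning, essentially what the Delphi unit above already does with its four chunks and fences (all names hypothetical):

/* Rotate through N chunks of the VAR block; fence each chunk so the
   CPU never overwrites data the GPU is still reading from. */
#define CHUNKS 4
GLuint fences[CHUNKS];		/* filled by glGenFencesNV(CHUNKS, fences) */
int    fenceSet[CHUNKS];
int    cur = 0;

void *beginChunk(char *varBase, size_t chunkSize)
{
	cur = (cur + 1) % CHUNKS;
	if (fenceSet[cur])
		glFinishFenceNV(fences[cur]);	/* wait until the GPU is done */
	return varBase + cur * chunkSize;	/* write here, sequentially */
}

void endChunk(void)	/* call after issuing the draws for this chunk */
{
	glSetFenceNV(fences[cur], GL_ALL_COMPLETED_NV);
	fenceSet[cur] = 1;
}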

Also note: none of this may actually impact performance. If you're running at a high resolution (or with loads of textures and multiple passes), you are likely bandwidth or fillrate bound, so sending vertex data faster isn't going to help.