Carmack's .plan

Originally posted by opla:
yes, and it’s much better than VAR, you don’t have to manage and synchronize the AGP memory.
you just call glNewObjectBufferATI() with a byte size and the pointer to your data, and you have an ID for your data in fast memory.
Then you use glArrayObjectATI() instead of gl*Pointer().
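
For illustration, a minimal sketch of that call sequence in C, assuming the extension entry points have already been fetched via wglGetProcAddress (verts, numVerts and indices are placeholder arrays):

    /* Put the vertex data into an object buffer in "fast" memory. */
    GLuint buf = glNewObjectBufferATI(numVerts * 3 * sizeof(GLfloat),
                                      verts, GL_STATIC_ATI);

    /* Bind it as the vertex array, instead of calling glVertexPointer(). */
    glEnableClientState(GL_VERTEX_ARRAY);
    glArrayObjectATI(GL_VERTEX_ARRAY, 3, GL_FLOAT, 0, buf, 0);

    /* Draw as usual. */
    glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);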

This is your opinion, of course. Mine is different. In my opinion, ATI_vertex_array_object has a small design flaw: you need to have your geometry in system memory to use it. For static geometry that is not a big issue, but for dynamic geometry it means a double copy: you have to create the data in system memory and then call UpdateObjectBufferATI to use it (which probably copies the data to AGP memory).
With VAR you don’t have this problem as you write directly to AGP memory.
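
Sketched with the same placeholders as above (BuildDynamicVerts is a hypothetical vertex generator), the two dynamic paths compare like this:

    /* ATI_vertex_array_object, dynamic case: two passes over the data. */
    BuildDynamicVerts(sysmemVerts);                        /* copy #1: build in sysmem */
    glUpdateObjectBufferATI(buf, 0, byteSize,
                            sysmemVerts, GL_DISCARD_ATI);  /* copy #2: into fast memory */

    /* NV_vertex_array_range: one pass, written straight into AGP memory. */
    BuildDynamicVerts(agpVerts);   /* agpVerts points into wglAllocateMemoryNV memory */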

With position-invariant programs, you don’t calculate o[HPOS] in your vertex program, and that allows the driver to do it a little more efficiently. It actually works out to roughly 1-2 fewer instructions than if you computed it yourself.

Sorry, Nutty, the details of this are not something we expose publicly.

Thanks -
Cass

Are you saying that all I have to do is not write my transformed vertex position to o[HPOS], and I’ll get invariant results with mixed-function multipass?

Thanks,
Nutty

Nutty: He’s saying that when they use the “plain” OpenGL pipeline on the card, which is apparently NOT entirely composed of shader opcodes we have access to, the number of clock cycles or opcode equivalents a vertex transform takes is one or two fewer than if you were executing the semantically equivalent program, as defined by the vertex shader language available to us.

Opla:

yes, and it’s much better than VAR, you don’t have to manage and synchronize the AGP memory.

Except I can manage it more efficiently than they can, as I know what the pattern of writing and accessing is. If I allocate a VAR and split it in two, I only need to test a fence when I pass the end of a chunk, a la double-buffer. This happens maybe once every 10 or 20 meshes I render, depending on the size of the meshes.

Meanwhile, the ATI driver has to set a fence for each buffer I upload and render, as it can’t know whether I will soon ask to re-upload to that buffer or not. Or it will have to do some mumbo-jumbo switch-aroo behind the scenes, which degrades to very much the same thing in the end. They cannot do the double-buffering thing, because they don’t know the lifetime of each individual object upload I make.
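
A minimal sketch of that double-buffered VAR scheme, assuming the NV_vertex_array_range and NV_fence entry points are available (FillMeshes, DrawMeshes and VAR_SIZE are placeholders):

    GLsizei half = VAR_SIZE / 2;
    GLubyte *var = (GLubyte *)wglAllocateMemoryNV(VAR_SIZE, 0.0f, 0.0f, 0.5f); /* AGP */
    GLuint fence[2];

    glVertexArrayRangeNV(VAR_SIZE, var);
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
    glGenFencesNV(2, fence);
    glSetFenceNV(fence[0], GL_ALL_COMPLETED_NV);  /* so the first Finish returns at once */
    glSetFenceNV(fence[1], GL_ALL_COMPLETED_NV);

    int cur = 0;
    for (;;) {                                /* one iteration per half, not per mesh */
        glFinishFenceNV(fence[cur]);          /* only stalls if the GPU still reads this half */
        FillMeshes(var + cur * half, half);   /* write fresh vertices into this half */
        DrawMeshes(var + cur * half);         /* glVertexPointer + glDrawElements, etc. */
        glSetFenceNV(fence[cur], GL_ALL_COMPLETED_NV);
        cur = 1 - cur;                        /* flip to the other half */
    }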

Then there’s the problem of having to upload geometry to the buffer in the first place. If I’m dynamically generating the geometry, they impose an extra copy pass on me. I have no idea what the implementation is, but they may even blow my L1 cache when they upload the data, even if I’m conscientious about writing to memory with un-cached stores.

I’ve heard several times that the ATI people are amenable to adding an extension so you can get access to the buffer. If they do that, and relax synchronization so that I don’t have to synchronize per mesh, then they’ll be equivalent. Until then, the extension may appear simpler to use, but it’s simpler to use in the same way that glVertex3f() is simpler to use than glDrawRangeElements().

[This message has been edited by jwatte (edited 02-12-2002).]

Originally posted by jwatte:
I’ve heard several times that the ATI people are amenable to adding an extension so you can get access to the buffer. If they do that, and relax synchronization so that I don’t have to synchronize per mesh, then they’ll be equivalent. Until then, the extension may appear simpler to use, but it’s simpler to use in the same way that glVertex3f() is simpler to use than glDrawRangeElements().

I guess the new GL_ATI_map_object_buffer extension may be just this.
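
If so, usage would presumably look something like this sketch (buf is a dynamic object buffer and BuildDynamicVerts the hypothetical generator from above):

    /* Write the dynamic vertices straight into the object buffer,
       skipping the system-memory staging copy. */
    void *ptr = glMapObjectBufferATI(buf);
    BuildDynamicVerts(ptr);
    glUnmapObjectBufferATI(buf);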

Nutty, they are deliberately not saying how this is done.

It looks like they gave Carmack access to a back door they are not ready to reveal. There is a different vertex program which doesn’t compute this information, but it probably involves doing something else that they don’t want to tell everyone about, at least not yet.

Just to be clear, position invariant programs are defined in NV_vertex_program1_1. I won’t elaborate on that spec until it’s public - which should be Real Soon Now.

Cass

There is also the fact that ATI’s Vertex_Array_Object extension is only available on the Radeon 8500, not on the lower Radeons (7500, 7200 and previous models). So on those cards you have no ‘fast’ way to pass vertices to the GPU.
What does ATI think about that?
How do they suggest we send the geometry on those cards? (On a Radeon 7500, I made a simple test using OGL standard arrays, with and without CVA, and then using D3D with vertex buffers. The D3D version is about 30 times faster. With the VAR extension on a GeForce card, the OGL version is more or less the same speed as the D3D version.)
Anyway, it would be good for everyone to have a single, common way to do it for nVidia and ATI cards (and 3DLabs, Matrox, …). The lack of a common interface for this part of the pipeline, even though it is the same operation on every GPU, and the fact that it has been discussed at ARB meetings (as you can read in the ARB meeting notes), show how hard it is for the ARB to reach consensus on a single, badly needed feature of modern cards.
In my opinion, this just reflects the inability of two IHVs like nVidia and ATI to reach an agreement that would benefit OGL, and purely for marketing reasons. It seems they need a dictator (like MS with D3D) to do things right and useful for all of us.
Do you think having two ways of sending geometry to video cards benefits anybody?
Do you think having two vertex program APIs, for two different cards, to do exactly the same thing (with small differences) benefits anybody?
It doesn’t benefit developers, it doesn’t benefit OGL and, worst of all, it doesn’t benefit the IHVs (as you probably won’t use both ways, and maybe neither).
I liked it when people like Michael Gold from nVidia and Tom Frisinger from ATI sat together and created common extensions like texture_env_combine.

Zak, are you asking us or asking ATI?

You are saying that on some cards the fastest dispatch in D3D is 30 times faster than the fastest dispatch in OpenGL. I doubt it; did you try other methods?

Try using glDrawElements; at the very least the drivers will be optimized for it, because of Quake3 benchmarking.

Here’s what Carmack (I think) wrote on Quake3 dispatch when advising IHVs on optimization:
http://www.quake3arena.com/news/glopt.html

Try sticking to these rendering paths on the card you have performance issues with.
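
That path boils down to something like this sketch, assuming EXT_compiled_vertex_array is available (the arrays are placeholders):

    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glTexCoordPointer(2, GL_FLOAT, 0, texcoords);

    glLockArraysEXT(0, numVerts);   /* CVA: promises the arrays won't change */
    glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
    glUnlockArraysEXT();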

Believe me. It is not a ‘real’ application because it draws the same model (~40000 faces, ~47000 vertices) in 16 different positions, using just one texture with one directional light and an infinite viewer.
Using standard arrays, it has to send the geometry from sysmem to the card each time it draws the model (in each position).
QIII uses CVA, but with small chunks of vertices. As CVA ‘is not well defined’, it seems the IHVs have written the extension just for the QIII case, so it doesn’t seem to do anything for my ~47000-vertex model.
With D3D I create a static VB, so the model seems to be stored in video memory.
With VAR, I can put the model in video memory, and it is more or less the same speed as the D3D test. Or I can put the model in AGP memory, where, depending on the AGP configuration, it can be more or less the same speed (with good AGP 4x, fast writes, sideband addressing, …), a bit slower (a bad AGP 4x configuration) or roughly half the speed (an AGP 2x configuration).
Using standard arrays it is about 30 times slower.
Note that this is just a test. In a real application you don’t usually use a 47000-vertex model, so the speed difference between systems in ‘my game’ is not as noticeable.
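
For reference, the test loop described above is essentially this sketch (positions[] is a placeholder):

    /* Draw the same ~47000-vertex model in 16 positions. With plain
       vertex arrays, the vertex data crosses the bus on every call. */
    for (int i = 0; i < 16; i++) {
        glPushMatrix();
        glTranslatef(positions[i][0], positions[i][1], positions[i][2]);
        glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
        glPopMatrix();
    }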

But, as I said, on the Radeon 7500, 7200, … you have no way to send the geometry other than ‘standard’ arrays.
I tried the test with ATI_vertex_array_object but it locked up the computer. (I have to try again with the newest drivers, as it seems they have fixed some problems I saw with previous ones.)
I have to say that ATI’s drivers have improved a lot since the first Radeon. For the first time, with the latest drivers (and not using CVA), everything is working OK on my system. It is time to give their extensions another try. I’m happy about it.

Question for you programmers of games and other kinds of applications: would you like a common (across all IHVs) set of API calls to send geometry to the cards? Maybe the OGL2 proposal as an extension to current OGL?
Question for the IHVs: is it that difficult to create a common way to solve this? Haven’t you read these forums, with all the questions about the ‘best way to send geometry’, ‘using CVA’, ‘using display lists to send geometry’, ‘using VAR’, and similar? Doesn’t that mean anything to you?
Thanks.

[This message has been edited by Zak McKrakem (edited 02-15-2002).]

I 100% agree with Zak. Yes, trust him. I have a Radeon 8500. Without the VAO extension (just plain vertex arrays), the most I can get is 2 million tris/sec. In D3D on the same system/hardware, I can reach up to 40 million tris/sec.

With VAO I get better results (up to 13 MTris/sec), but it’s still pretty far from D3D’s peak.

Y.

ATI_vertex_array_object and ATI_map_object_buffer have recently been implemented for all Radeon family cards (including the 7500). It’s not in the current driver release, but it will appear in the next one. The only thing the 7500 cannot support is ATI_element_array, because the HW does not support it.

–Dan

Originally posted by dginsburg:
ATI_vertex_array_object and ATI_map_object_buffer have recently been implemented for all Radeon family cards (including the 7500). It’s not in the current driver release, but it will appear in the next one. The only thing the 7500 cannot support is ATI_element_array, because the HW does not support it.

Dan, it is good to hear that. If you will let me suggest an addition to the extension, I would suggest including something like OGL2’s Direct Access.
The extension is already very similar to the Vertex Array Objects that appear in the OGL2 white paper.
That way, for dynamic objects, you would not have to store the model in system memory before calling UpdateObjectBufferATI.
I may be wrong, but I think AcquireDirectPointer is very similar to D3D’s Lock, and ReleaseDirectPointer to D3D’s Unlock, and as you already have those functions implemented in your driver it should be easy to create the GL interface.
And, as this is not OGL2, you can relax the spec to match your current hw requirements. It could be a good base for a much-desired ARB extension and a good bridge to the future OGL 2.0.
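
Something like this purely hypothetical sketch - these entry points do not exist, the names just mirror the white paper’s Lock/Unlock-style pattern:

    /* Hypothetical direct-access interface for an object buffer. */
    void *ptr = glAcquireDirectPointerATI(buf);   /* hypothetical; like D3D's Lock */
    BuildDynamicVerts(ptr);                       /* generate the vertices in place */
    glReleaseDirectPointerATI(buf);               /* hypothetical; like D3D's Unlock */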

Thank you.

Originally posted by Zak McKrakem:
Believe me. It is not a ‘real’ application because it draws the same model (~40000 faces, ~47000 vertices) in 16 different positions, using just one texture with one directional light and an infinite viewer.

The best way to feed a graphics card with static models is using display lists. And if you embed a glDrawElements vertex array inside a display list, even better: that way you hint the driver that:
a) The model is not going to change (it’s a display list).
b) It can draw the display list using indexed primitives (the driver could “guess” this without the glDrawElements hint, but just in case).

The driver will choose the fastest method to display that, be it AGP memory or even video memory.
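
A sketch of that combination; note that glDrawElements dereferences and copies the arrays at compile time, so the client array state must be set up before glNewList (placeholder arrays again):

    GLuint list = glGenLists(1);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);

    glNewList(list, GL_COMPILE);   /* the driver copies, and can optimize, the data once */
    glDrawElements(GL_TRIANGLES, numIndices, GL_UNSIGNED_SHORT, indices);
    glEndList();

    /* Every frame afterwards: */
    glCallList(list);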

As a driver developer said in OpenGL gamedev discussion list:

When you create a display list, I really get to go to town, because I assume you’re going to want to use your list more than once.

The best way to feed a graphics card with static models is using display lists.

Not true. According to nVidia, the fastest way to send vertex data, static or dynamic, is with VAR. VAR, even in AGP memory, beats their own display list code.

Originally posted by Korval:
Not true. According to nVidia, the fastest way to send vertex data, static or dynamic, is with VAR. VAR, even in AGP memory, beats their own display list code.

According to nvidia http://developer.nvidia.com/view.asp?IO=ogl_performance_faq

  1. Should I use display lists for static geometry?
    Yes, they are simple to use and the driver will choose the optimal way to transfer the data to the GPU.

And the best thing is that you will get the best from every driver, not only from Nvidia’s.

That performance FAQ is out of date; there’s another one which says VAR is faster. Display lists take no advantage of the GPU vertex cache; each vertex is sent, lit, and transformed as a separate entity, and the GL driver really can’t optimise it easily without slowing down the initial processing of the display list fairly significantly (well, it might do that, but it might not - I suspect not). It’s “theoretically” up to nearly 4x faster to draw something using VAR thanks to the cache.

Cas

Originally posted by cix>foo:
That performance FAQ is out of date; there’s another one which says VAR is faster. Display lists take no advantage of the GPU vertex cache; each vertex is sent, lit, and transformed as a separate entity, and the GL driver really can’t optimise it easily without slowing down the initial processing of the display list fairly significantly (well, it might do that, but it might not - I suspect not). It’s “theoretically” up to nearly 4x faster to draw something using VAR thanks to the cache.

Cas

That’s why I suggested embedding a glDrawElements vertex array inside a display list: the driver doesn’t have to do any guesswork at all, it knows for sure it’s indexed geometry with no state changes in the middle. If the driver cannot be bothered to optimize that, it’s another matter.

Anyway, I still think there’s much more behind the scenes of a display list than what people think.

[Edit: Ooops, I thought you were Cas as in Cass ]

[This message has been edited by evanGLizr (edited 02-15-2002).]

Naw, I’m just plain old Cas wot doesn’t know much relatively

Interesting idea about glDrawElements inside the display list, but I suspect that this is such a rare path they haven’t bothered with it.

Cas

That’s why I suggested embedding a glDrawElements vertex array inside a display list: the driver doesn’t have to do any guesswork at all, it knows for sure it’s indexed geometry with no state changes in the middle.

Any decent display list optimizer can do the same thing for glBegin/End with glArrayElement.
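
That is, once it is compiled into a display list, the immediate-mode form below hands the driver the same indexed geometry a glDrawElements call would (same placeholders as before):

    glNewList(list, GL_COMPILE);
    glBegin(GL_TRIANGLES);
    for (int i = 0; i < numIndices; i++)
        glArrayElement(indices[i]);   /* same indices glDrawElements would see */
    glEnd();
    glEndList();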

And the FAQ you pointed me to specifically says that the fastest way to transfer geometry to the graphics chip on nVidia hardware is “DrawElements/DrawArrays Using wglAllocateMemoryNV(size,0,0,1)”, which means VAR in video memory. The second fastest is “DrawElements/DrawArrays Using wglAllocateMemoryNV(size,0,0,.5)”, which is VAR in AGP memory. The third is display lists.
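
In code, the difference between those top two paths is just the priority hint passed to the allocator:

    /* Priorities near 1.0 tend to return video memory, near 0.5 AGP memory. */
    void *vidmem = wglAllocateMemoryNV(size, 0.0f, 0.0f, 1.0f);  /* fastest per the FAQ */
    void *agpmem = wglAllocateMemoryNV(size, 0.0f, 0.0f, 0.5f);  /* second fastest */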

And I seriously doubt that encapsulating glDrawElements calls on VAR in a display list is going to be faster than calling them directly. Who knows what drivers have to do behind the scenes to make display lists work; it could take longer than a simple function call.