ARB meeting notes from Jun/Sept/Dec

But the difference will be negligible.
Do you have any specific evidence of that? Once again, have you seen a demo that compares the best OpenGL can do in instancing situations to the best D3D can do?

The onus isn’t on us to provide a reason for this extension; we already have one (worst case, it does nothing; best case, non-trivial performance gains; ergo, worthwhile). The onus is on you to provide some specific evidence that shows how it would not be useful.

Not sure what you’re saying “not really” about, as the data passing over the AGP bus is still the same regardless of pre-T&L cache utilization.
I don’t think I realized that you could reuse vertex data through indices, thus avoiding replicating the actual vertex data and potentially blowing your cache.

The other is that instancing already screws up the pre-T&L cache, as it requires two vertex streams.
It isn’t that bad (or bad at all). It simply requires a different kind of per-vertex fetch operation. It doesn’t screw up the cache unless the hardware has some horrible limitation.

The third reason, I’m afraid, involves some non-public information that I can’t disclose, which I believe is a good argument for why what you say isn’t the case in practice; of course, it’s hard to make this convincing without going into details.
Considering that you work for ATi, this means that it’s ATi’s problem, not a problem with the concept as a whole or hardware in general. They should have made a real graphics card this go around with real features, rather than a simple knockoff of the R300.

It has clogged up the API, since all extensions typically have to be made orthogonal to immediate-mode calls. There’s a good reason why both display lists and immediate mode were ditched in OpenGL ES.
And yet, it is immediate mode which gives OpenGL some semblance of instanced rendering.

I think there are other priorities that are more important right now.
Such as? Performance should always be priority #1. Just because ATi doesn’t see it as a priority doesn’t mean that it isn’t a priority. And, looking at what the ARB has cooking, it ain’t much. This is not a highly complex spec that requires 2 years (including a failed year) to make progress on. It is a spec of already known behavior that can, in good hardware, potentially improve performance.

It’s been some time since I used OpenGL for game engine work; that was in the GeForce2 era.

To test the batch performance, I implemented a “batch test mode” where, at the level of the glDrawElements() call, the primitive count would be reduced to 1 when batch test mode was enabled. So in batch test mode, each batch of a scene was rendered with 1 triangle.

When run at a sufficiently small resolution (say, 320x240) so that pixel work is negligible, the batch test mode could tell whether the rendering speed was dependent on the size of batches or not. If the rendering speed does not depend on batch test mode, this is evidence that batch overhead is the bottleneck.
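
A minimal sketch of how such a mode might look, assuming GL_TRIANGLES batches; the gBatchTestMode flag and drawBatch wrapper are my names, not from the engine described:

```cpp
#include <GL/gl.h>

// Illustrative debug toggle; in the engine described above this would be
// flipped from some console or key binding.
static bool gBatchTestMode = false;

// Wrapper around the glDrawElements() call described above: in batch test
// mode, only the first triangle of each batch is submitted, so the per-batch
// overhead stays identical while the per-vertex work becomes negligible.
static void drawBatch(GLsizei indexCount, const GLushort* indices)
{
    GLsizei count = gBatchTestMode ? 3 : indexCount;  // 1 triangle = 3 indices
    glDrawElements(GL_TRIANGLES, count, GL_UNSIGNED_SHORT, indices);
}
```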

What I can tell is that batch overhead was not the bottleneck: with one particular scene having some 5500 batches per frame and ~1,000,000 rendered triangles (that’s 200 triangles per batch, on average), FPS did improve significantly in batch test mode.

I think the instancing API is not strictly needed in OpenGL for simple object instancing, like placing clusters of trees or lanterns in the scene. glDrawElements() interleaved with glLoadMatrix() is already fast enough.
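
A sketch of that "fast enough" path, with illustrative mesh/instance structures (not from the post):

```cpp
#include <GL/gl.h>

struct Mesh     { GLsizei indexCount; const GLushort* indices; };
struct Instance { GLfloat modelview[16]; };  // precomputed view * model, column-major

// One glLoadMatrixf() + one glDrawElements() per instance: fine for moderate
// instance counts, but each iteration costs a little CPU per call.
static void drawInstances(const Mesh& mesh, const Instance* instances, int n)
{
    glMatrixMode(GL_MODELVIEW);
    for (int i = 0; i < n; ++i)
    {
        glLoadMatrixf(instances[i].modelview);
        glDrawElements(GL_TRIANGLES, mesh.indexCount,
                       GL_UNSIGNED_SHORT, mesh.indices);
    }
}
```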

However, some points:

The added semantics of The Instancing API ™ may offer some new programming techniques previously difficult to communicate to the driver.

If you go extreme and view a single quad as an instanced object, as in a particle system, clearly glDrawElements() isn’t going to cut it.

The CPU work consumed by each render call could be used for different purposes. In a game, there’s always too little CPU left. So, it may be true that the batch overhead is not the bottleneck at 200 tri/batch (the CPU can submit batches faster than the card can process them), but we would like to get rid of the CPU consumption altogether :)

Originally posted by Christian Schüler:
To test the batch performance, I implemented a “batch test mode” where, at the level of the glDrawElements() call, the primitive count would be reduced to 1 when batch test mode was enabled. So in batch test mode, each batch of a scene was rendered with 1 triangle.

When run at a sufficiently small resolution (say, 320x240) so that pixel work is negligible, the batch test mode could tell whether the rendering speed was dependent on the size of batches or not. If the rendering speed does not depend on batch test mode, this is evidence that batch overhead is the bottleneck.
Sorry, but where is the conclusion to this? Was glDrawElements() a bottleneck or not?
And since you say Gf2, why not use glDrawRangeElements()?

Sorry for the convoluted wording.

Actually, it was glDrawRangeElements().

No, glDrawRangeElements() wasn’t a bottleneck. Drawing 5500 batches with 1 triangle each (on a 600 MHz PC, with Detonator 44.something) resulted in frame rates of 30-40 Hz, and with full geometry it was much lower, which means OpenGL was pushing at least 150k batches per second.

EDIT:
Looking back at the first page, the example posted was 20000 instances @ 5 Hz, which is just 100k batches per second. Maybe the drivers have become more batch-unfriendly over time? (More stuff to do…)

“(worst case, it does nothing; best case, non-trivial performance gains; ergo, worthwhile)”

but the thing is, it doesnt do nothing (it does affect performance/stability)
(your answer to this: ‘but they can always choose a different path, ie ignore the instancing path’)
true (though having the driver make a choice is gonna have a slight impact)
the major problem though is the added complexity in the driver, which leads to less stability (fewer opportunities to optimize)
(your answer to this: ‘but implementing it is trivial, they do it once + forget it’)
how often have u seen, with new releases of drivers, old stuff that once worked become broken? adding instancing will create an extra burden on the driver writers, leading to worse drivers. nobody’s perfect.

where would u use it (instancing) in doom3?
where would u use it in 3dmax?

are u doing over 100 million tris/sec now with your app?
ie youre not even realising the potential of what u have to play with, yet u want more!

come on korval, give it up ;)

the major problem though is the added complexity in the driver, which leads to less stability
Then they shouldn’t implement the extension. Just like if a D3D implementation couldn’t handle instancing, then they don’t have to.

Equally importantly, it isn’t a difficult thing to implement. If the hardware supports it directly, then it is trivial. And, if it doesn’t, then don’t implement the extension or write a fairly short bit of code to convert one glDraw* call into many.
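
As a hedged sketch of that "fairly short bit of code" case: assuming a hypothetical glDrawElementsInstancedEXT entry point (no such extension existed at the time; the name and the choice of environment parameter 0 are mine), a driver or wrapper could expand it into a loop, exposing the instance index via ARB_vertex_program:

```cpp
#include <GL/gl.h>
#include <GL/glext.h>  // glProgramEnvParameter4fARB is loaded as an extension
                       // function pointer on most platforms

// Hypothetical fallback: expand one instanced call into N plain draws.
static void drawElementsInstancedFallback(GLenum mode, GLsizei count,
                                          GLenum type, const void* indices,
                                          GLsizei instanceCount)
{
    for (GLsizei i = 0; i < instanceCount; ++i)
    {
        // Hand the instance index to the vertex program so it can fetch or
        // compute its per-instance transform.
        glProgramEnvParameter4fARB(GL_VERTEX_PROGRAM_ARB, 0,
                                   (GLfloat)i, 0.0f, 0.0f, 0.0f);
        glDrawElements(mode, count, type, indices);
    }
}
```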

where would u use it (instancing) in doom3?
where would u use it in 3dmax?
Doom 3 is an indoor game. And, since this is not the actual Doom 3 (which is a finished game and therefore can’t use instancing), but instead a hypothetical Doom 3, I could still see uses for instancing. For example, imagine a cave. Now, imagine that the cave floor/walls are littered with rocks. Granted, it’d be the same rock, but with a different position/orientation for each, it would be difficult to notice this in a game environment. It’s a step beyond mere bump mapping and into what, in movie terms, would be set decoration.

In theory, you could have the walls themselves made of nothing but instances of a repeated material: bricks, planks of metal, etc. No need for bump mapping or the much slower displacement mapping; this is real, live geometry on the walls, created via step-and-repeat. You could imagine a wall made up of more interesting geometry this way.

In an outdoors game, there are even more uses for instancing.

3D Max isn’t even a performance-oriented program. This is a performance extension, so you shouldn’t expect them to use it. They probably don’t use VBOs either (since their vertex data changes rather frequently, and in ways that most game applications would consider unusual).

are u doing over 100 million tris/sec now with your app?
I certainly won’t be able to with my CPU taken up by sending a bunch of batches, rather than running the application.

Update to the test results:

I suspect the driver optimized the display list by detecting shared vertices and turning the submitted vertex data into an indexed mesh. I think so because I noticed performance dropped (by 20-40%, in both paths) when I prevented welding of vertices by modifying positions by a small random value in individual triangles.
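
A sketch of that welding-prevention trick (names are mine, purely illustrative), assuming the vertex array is a triangle soup with 3 vertices per triangle, duplicates included:

```cpp
#include <cstdlib>

// Nudge every position by a tiny random offset. Positions that were identical
// across adjacent triangles now differ, so a display-list optimizer can no
// longer weld them back into an indexed mesh.
static void jitterPositions(float* positions, int vertexCount, float scale)
{
    for (int i = 0; i < vertexCount * 3; ++i)  // 3 floats per vertex
    {
        float r = (float)std::rand() / (float)RAND_MAX - 0.5f;  // [-0.5, 0.5)
        positions[i] += r * scale;  // keep scale tiny, e.g. 1e-4f
    }
}
```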

Instead of “20 triangles * 3 verts * (…)” there should be “12 verts * (…)”

So the “yield” is actually 5 times lower than it seemed. Of course, the fps numbers (and hence the relative speedup value) are unaffected by this.

Originally posted by Korval:
Then they shouldn’t implement the extension. Just like if a D3D implementation couldn’t handle instancing, then they don’t have to.
damned if u do, damned if u dont

Equally importantly, it isn’t a difficult thing to implement. If the hardware supports it directly, then it is trivial. And, if it doesn’t, then don’t implement the extension or write a fairly short bit of code to convert one glDraw* call into many.
ive made an appointment for u to go and see nvidia/ati this saturday between 2-2.30pm, should be plenty of time for u to whip up an instancing implementation (they also mentioned, if u finish early, perhaps u could throw together a render-to-texture implementation for them as well)

In theory, you could have the walls themselves made of nothing but instances of a repeated material: bricks, planks of metal, etc. No need for bump mapping or the much slower displacement mapping; this is real, live geometry on the walls, created via step-and-repeat. You could imagine a wall made up of more interesting geometry this way.
how many vertices do u have to use to emulate a bumpmapped brick? perhaps if instancing was 1000x quicker than a non-instanced method u could do it, but mate, its not 1000x quicker, not even close.

I certainly won’t be able to with my CPU taken up by sending a bunch of batches, rather than running the application.
im working with scenes which are in the 100,000s of vertices, the number of verts aint the bottleneck

BTW, Zed, feel free to use appropriate punctuation and sentence capitalization in your posts.

ive made an appointment for u to go and see nvidia/ati this saturday between 2-2.30pm, should be plenty of time for u to whip up an instancing implementation (they also mentioned, if u finish early, perhaps u could throw together a render-to-texture implementation for them as well)
I presume you have secured their current driver codebase, as well as their hardware documentation and engineers, so that I may have various questions answered. After all, without these resources (among others), no one would be capable of writing any kind of functioning OpenGL driver for their cards.

Oh, and FYI: both nVidia and ATi have implemented instanced rendering in their D3D drivers. If it wasn’t oppressively hard to put it there, it can’t be that hard to put it in their GL codebase with an appropriate API. It is the same hardware, after all.

how many vertices do u have to use to emulate a bumpmapped brick? perhaps if instancing was 1000x quicker than a non-instanced method u could do it, but mate, its not 1000x quicker, not even close.
The reason it isn’t done (and I’m not talking about a detailed brick; I’m talking about a relatively simplified brick pattern) is not due to hardware issues. It is simply because of the brutal memory costs associated with such hyper-detailed terrain. A wall that could have been 2 polys can quickly become 3,000 with such detail. That’s a massive increase in the size of the vertex data, and it can easily get out of control. However, if you build it out of instanced pieces, you save memory by only storing the location/orientation of the instances.
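
To put rough numbers on that (mine, purely illustrative): a 3,000-triangle brick wall with ~1,500 unique vertices at 32 bytes each costs around 48 KB of vertex data, and every distinct wall pays that again. Built from instances of one shared brick mesh, each brick only needs its placement, say 16-28 bytes for a position plus packed orientation, so a few hundred bricks cost a few KB, and every wall in the level reuses the same brick vertex data.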

im working with scenes which are in the 100,000s of vertices, the number of verts aint the bottleneck
Vertex processing, of course, isn’t the bottleneck in question: vertex upload and CPU processing is the bottleneck that instancing is designed to mitigate.

(not wanting to interrupt the instancing debate) Why don’t they post the EXT_framebuffer_object in the registry? If it is ready, is there any reason not to post it?

interesting point, probably because of Xmas, and maybe wanting to give the IHVs time to get a driver with it sorted before letting us see it…

Originally posted by Korval:
BTW, Zed, feel free to use appropriate punctuation and sentence capitalization in your posts.
:)

People seem to be paralysed with fear of anyone adding anything to OpenGL at the moment, in case the nvidia/ati driver writers are unable to cope with the extra complexity. Is this a justified fear? Maybe with ATI it is, but NVidia are pretty savvy.
I don’t see the problem - like Korval says, it’s in d3d now: and I doubt very much that vertex streams were introduced just because of d3d’s drawprimitive call penalty…it’s a pretty dramatic change to the mechanism in d3d, a lot of work put into something that is obviously going to be useful. It doesn’t need to be a major change in OpenGL because of the nice way vertex arrays are handled already.
Also, I keep hearing this talk of the driver having yet another state to consider when issuing draws, but surely this is outweighed by the fact that, when instancing, it has far fewer states to consider, because thousands of draw calls are condensed into one.
Let me have this feature please.

maybe there will be a feature in the near future that will make this kind of instancing obsolete: a much more general mechanism, for example vertex generation and killing in the vertex shader.

I seriously doubt it. In any case, how would that be useful for instancing?

I seriously doubt it. In any case, how would that be useful for instancing?

This can be useful for a lot of things, such as fractal terrain generation (cf. one big quad can become a lot of small quads/triangles to “simulate” bump mapping).

OK, in this case we don’t destroy vertices but add a lot of other vertices; still, the idea is the same (the numbers of input and output vertices aren’t the same …).

It seems to me that ATI has already made something like this, named TruForm or something like that.

@+
Cyclone

a much more general mechanism, for example vertex generation and killing in the vertex shader.
It would be better to have an entirely new programmable stage (i.e., a primitive processor), rather than overloading vertex shaders. A decent primitive processor needs to do lots of memory accesses in order to do truly useful stuff. Plus, it helps with pipelining, as primitive processing can happen completely in parallel with vertex shading.

As to your point, yes, a primitive processor can do instancing. However, it will likely be slower than a hardware-based solution.

It seems to me that ATI has already made something like this, named TruForm or something like that.
TruForm was just a tessellation and mesh-smoothing mechanism ATi created, much like GeForce 3/4 hardware had some form of polynomial surface generation. Both of these were very hard-wired and non-trivially restrictive. I’m pretty sure that later hardware (R300+ and NV30+) doesn’t even have these features, though I may be mistaken. At the very least, nobody seems terribly interested in using them.

Originally posted by cyclone:

This can be useful for a lot of things, such as fractal terrain generation (cf. one big quad can become a lot of small quads/triangles to “simulate” bump mapping).

That’s a tessellation engine, and TruForm was one. TruForm was also a big failure and brought performance down by half. NVidia had evaluators, which were also dropped.

There is talk that DX10 will support a programmable tessellator, but it’s just rumours.

Instancing is about reducing API calls. You render the same thing, except you use another stream of data to replace all the colors, or all the normals, or all the texcoords, kept in another VBO. Of course, in a shader, use them however you want.
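
For reference, this is roughly the D3D9 mechanism the thread keeps comparing against: a hedged sketch using SetStreamSourceFreq, where the buffers, struct layouts, and parameter names are illustrative, and a two-stream vertex declaration plus an instancing-capable driver are assumed:

```cpp
#include <d3d9.h>

struct MeshVertex   { float pos[3], normal[3], uv[2]; };   // illustrative layout
struct InstanceData { float row0[4], row1[4], row2[4]; };  // e.g. a 3x4 transform

void drawInstanced(IDirect3DDevice9* dev,
                   IDirect3DVertexBuffer9* meshVB, UINT numMeshVerts,
                   IDirect3DIndexBuffer9* meshIB, UINT numMeshTris,
                   IDirect3DVertexBuffer9* instanceVB, UINT numInstances)
{
    // Stream 0: the shared mesh, walked once per instance.
    dev->SetStreamSourceFreq(0, D3DSTREAMSOURCE_INDEXEDDATA | numInstances);
    dev->SetStreamSource(0, meshVB, 0, sizeof(MeshVertex));

    // Stream 1: per-instance data, advanced once per instance.
    dev->SetStreamSourceFreq(1, D3DSTREAMSOURCE_INSTANCEDATA | 1u);
    dev->SetStreamSource(1, instanceVB, 0, sizeof(InstanceData));

    // One call draws all the instances.
    dev->SetIndices(meshIB);
    dev->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0,
                              numMeshVerts, 0, numMeshTris);

    // Restore normal (non-instanced) stream behaviour.
    dev->SetStreamSourceFreq(0, 1);
    dev->SetStreamSourceFreq(1, 1);
}
```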

The ARB meeting notes give an example of something instancing-like, but there are other ways, like having separate arrays for normals and texcoords.

I think that a lot of the ARB members are reluctant to add features, because it causes an explosion of complexity. Instancing was refused with one short sentence :)

At least GL ES is cleaned up. They want to eliminate fixed-function (FF) processing altogether in GL ES 2.0.

Originally posted by Korval:

As to your point, yes, a primitive processor can do instancing. However, it will likely be slower than a hardware-based solution.

Yes, a hard-wired solution will mostly be faster, but I doubt that it’s economical to have hardwired instancing, because it’s so specialised. How often do you need instancing? Maybe for some games (the boring ones without a new idea but very good graphics). Maybe somewhere there are people who have some new ideas about what to do with these new possibilities. But I’m pessimistic, because life is so much more interesting than games; you always have a real risk :)

Trees, grass, clouds, rocks.
Not teapots, granted - but most serious users of OpenGL are interested in more than buggering around with bumpmapped teapots and rabbits.
I refer you once again to the arguments for procedural textures and geometry. It’s the same argument for instanced geometry.

Originally posted by V-man:
At least GL ES is cleaned up. They want to eliminate fixed-function (FF) processing altogether in GL ES 2.0.
Interesting. I hadn’t heard that.

I haven’t been paying much attention to OpenGL ES, but if backward-compatibility cruft is becoming the hindrance that several posts in this thread seem to indicate, I wonder whether it might one day drop the “Embedded” and become what was once mooted as OpenGL 2.0 Pure.