ARB meeting notes from Jun/Sept/Dec

Why are you saying that state changes cause a stall in the pipeline?

Originally posted by knackered:
Why are you saying that state changes cause a stall in the pipeline?
State changes trigger reconfiguration. “Work” OTOH doesn’t.
If you reconfigure a pipelined hardware device, it’s very hard to maintain coherency without flushing.

I agree with Korval insofar as this is likely the cause of the performance difference between attributes (work) and uniforms (state).


Korval,
I think I’ve seen evidence that both ATI and NVIDIA don’t need to alter the geometry. Immediate mode attributes (that aren’t updated from array data) can just stick at the vertex fetch stage. No need for replicating them.

Originally posted by knackered:
There’s quite an overhead in calling glDraw* for 5000 beer cans, irrespective of API. It would certainly push an unnecessarily large stream of commands into the pipeline, when a single command would do.
It’s not nearly as dramatic in OpenGL as in D3D since you don’t have the context switch to ring0 and back to ring3 again for each draw call. I’m doubtful instancing will ever be particularly useful on the OpenGL side. It’s hard enough to motivate on the D3D side. If your object has > 100 triangles, the bottleneck has already moved to the vertex shader even with a simple shader. Perhaps future hardware will change the balance and an instancing feature can be considered then, but today it’s not really needed. 5000 draw calls, btw, aren’t going to stop you from running at smooth framerates. Now if you really need to draw 20000+ objects of the same kind you can still use the shader constant instancing method where you pack the instance data into uniforms and look it up in the vertex shader. It is pretty much equally good, sometimes even faster.

It’s not nearly as dramatic in OpenGL as in D3D since you don’t have the context switch to ring0 and back to ring3 again for each draw call. I’m doubtful instancing will ever be particularly useful on the OpenGL side.
What is the particular basis for your doubt of the usefulness of this technique? Have you seen a program that actually compares instancing in D3D to non-instanced draw calls in GL? Under both nVidia and ATi hardware?

20,000 draw calls may not be 20,000 switches to Ring0 and back, but they aren’t cheap.

It’s hard enough to motivate on the D3D side. If your object has > 100 triangles, the bottleneck has already moved to the vertex shader even with a simple shader. Perhaps future hardware will change the balance and an instancing feature can be considered then, but today it’s not really needed.
There are plenty of objects that we would like to draw 10,000 of that are less than 100 triangles in size. Tufts of grass, for example. A massive forest (of very simple trees). Or a field of fragment-based imposters.

Plus, if it is going to be needed in the future, what is the harm of adding it today? Indeed, considering the general slowness of the ARB, starting the process today would be a good idea. Worst case, it becomes a feature of GL that nobody uses; GL has plenty of those, so nobody will really notice. We’re talking about 1 or 2 entrypoints here, not a massive change to how vertex arrays work.

Now if you really need to draw 20000+ objects of the same kind you can still use the shader constant instancing method where you pack the instance data into uniforms and look it up in the vertex shader. It is pretty much equally good, sometimes even faster.
There aren’t anywhere near 20,000 uniforms, so you’re going to need a lot of state changes. And, as we’ve demonstrated here, the state change penalty for switching out uniforms is hardly trivial.

<edit: missed this from before>

I think I’ve seen evidence that both ATI and NVIDIA don’t need to alter the geometry. Immediate mode attributes (that aren’t updated from array data) can just stick at the vertex fetch stage. No need for replicating them.
Really? That’s pretty nifty if it does work that way. I think the best way to verify it is to render with the set attribute and then render the same scene with that attribute bound to a vertex buffer (replicated with the same value). If both methods are equally fast, or the vertex buffer method is slightly slower, then it means that their hardware does allow updating attributes directly through immediate mode. While it doesn’t eliminate the draw call overhead, it does mean that the only thing direct instancing would buy you is eliminating this overhead.

Originally posted by Korval:
What is the particular basis for your doubt of the usefulness of this technique? Have you seen a program that actually compares instancing in D3D to non-instanced draw calls in GL? Under both nVidia and ATi hardware?
I have not compared to non-instanced calls in GL, but I’ve done quite a bit of work on instanced vs. non-instanced on the DX side, both at home and at work. The first time it took me quite a lot of effort and even some help from one of our driver guys to even get it to run faster than the non-instanced path. And our hardware still sees larger benefits from using instancing than nVidia’s, at least the last time I checked.

Plus, if it is going to be needed in the future, what is the harm of adding it today?
We don’t really need more garbage hanging around in the API. If it’s going to be added, it should be proved to be useful first.

There aren’t anywhere near 20,000 uniforms, so you’re going to need a lot of state changes. And, as we’ve demonstrated here, the state change penalty for switching out uniforms is hardly trivial.
If you have 20,000 instances, already by packing two instances per batch you’re down to 10,000. With 4 you’re at 5,000 calls, etc. It doesn’t take many instances per batch to cut down the number of calls so that the bottleneck ends up elsewhere. Depending on how much instance data you have you’ll probably be able to pack 30-60 instances in a batch, getting the number of draw calls down to 300-600 for 20,000 instances. In that case, the bottleneck has long shifted over to the vertex shader, and the cost of draw calls and uploading uniforms is totally hidden.
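The batching arithmetic above is easy to check. A minimal sketch (function name mine, figures from the post; note that 30-60 instances per batch gives 334-667 calls exactly, so the post’s 300-600 is a round approximation):

```python
import math

def draw_calls(num_instances: int, instances_per_batch: int) -> int:
    """Draw calls needed when packing several instances into each batch."""
    return math.ceil(num_instances / instances_per_batch)

for k in (1, 2, 4, 30, 60):
    print(k, "per batch ->", draw_calls(20_000, k), "calls")
```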

Originally posted by Humus:
We don’t really need more garbage hanging around in the API. If it’s going to be added, it should be proved to be useful first.

It needs to be added before it can be tested and proved.

:)

I think there was a feature previously discussed here that should be in WGF 1 or 2: the multiple index streams thing (I don’t really remember the exact name). Having an extension to do this would cover (?) the needs of geometry instancing, maybe. Furthermore it is backward compatible, since the driver could do pretty much the same job as Humus describes above to render everything. On newer hardware it would even save memory bandwidth. What do you think about it?

Originally posted by Humus:
There aren’t anywhere near 20,000 uniforms, so you’re going to need a lot of state changes. And, as we’ve demonstrated here, the state change penalty for switching out uniforms is hardly trivial.
If you have 20,000 instances, already by packing two instances per batch you’re down to 10,000. With 4 you’re at 5,000 calls, etc. It doesn’t take many instances per batch to cut down the number of calls so that the bottleneck ends up elsewhere. Depending on how much instance data you have you’ll probably be able to pack 30-60 instances in a batch, getting the number of draw calls down to 300-600 for 20,000 instances. In that case, the bottleneck has long shifted over to the vertex shader, and the cost of draw calls and uploading uniforms is totally hidden.
I don’t get this - you seem to be using the word ‘instancing’ in the wrong context. If you’re talking about packing 30-60 ‘instances’ into a batch then you’re talking about replicating vertices to work around the absence of an instancing mechanism… which uses loads more memory, and memory bandwidth - the exact problem instancing attempts to address. Scale your example up, and you’re using a significant resource. So you’re arguing against the very idea of instancing? That seems odd: while everyone else in the industry seems to be pushing for things like procedural textures and geometry to address the increasing detail-versus-memory-constraints problem, you seem to be against instancing, which would help greatly in this area.

I’m not against the idea. If it proves useful in GL I’m all for it. I’m just saying I’m not so sure this is the case.

As for the instancing method I described, the idea of instancing is to be able to draw many instances with one draw call, as the main bottleneck is considered to be the actual draw calls (in DX anyway). This method solves that problem just as well as “real instancing”. Yes, your VBO needs to contain several copies of the model, but since we’re talking about < 100 triangle models, it will still be very small. Say vertex + normal + texcoord, 80 vertices and 50 copies: that’s only 125 KB. Hardly a problematic resource usage. The data that needs to pass the AGP bus every frame is also the same as in real instancing.
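The shader-constant instancing scheme described above can be sketched on the CPU. This is an illustrative model, not anyone’s actual implementation: the VBO holds N copies of the mesh, each vertex carries its copy’s index, and the “vertex shader” (here a plain function with names of my choosing) fetches that instance’s data from a uniform array. The memory figures are the post’s own:

```python
# Numbers from the post: pos + normal + texcoord as 32-bit floats.
VERTEX_BYTES = (3 + 3 + 2) * 4
VERTS_PER_MODEL = 80
COPIES_PER_BATCH = 50

vbo_bytes = VERTEX_BYTES * VERTS_PER_MODEL * COPIES_PER_BATCH
print(vbo_bytes / 1024, "KB")   # 125.0 KB, the figure in the post

def shade_vertex(position, instance_index, instance_offsets):
    """CPU stand-in for the vertex shader: look up this vertex's
    per-instance offset in the 'uniform array' and apply it."""
    ox, oy, oz = instance_offsets[instance_index]
    x, y, z = position
    return (x + ox, y + oy, z + oz)

# One batch's worth of per-instance data, e.g. world-space offsets.
offsets = [(float(i) * 10.0, 0.0, 0.0) for i in range(COPIES_PER_BATCH)]
print(shade_vertex((1.0, 2.0, 3.0), 7, offsets))   # (71.0, 2.0, 3.0)
```

Only the `offsets` array changes between batches, which is why the per-frame upload traffic matches real instancing.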

In that case, the bottleneck has long shifted over to the vertex shader, and the cost of draw calls and uploading uniforms is totally hidden.
But it still isn’t as fast as the truly instanced case.

The data that needs to pass the AGP bus every frame is also the same as in real instancing.
Not really. The reason that models of greater than 100 triangles (or thereabouts) are not faster with instancing is because these models blow the pre-T&L cache. If an instance fits into the pre-T&L cache entirely, then there’s no problem. Every call after the first will not provoke a hit on memory, save for the index load.

If you do your instancing mechanism, the likelihood is that you’ll blow the pre-T&L cache, and every vertex (depending on intra-instance sharing) will provoke a memory access.

Korval,

You’re talking about a pre-T&L cache - would you have any hints/links explaining what it is further? I’m interested… How do you know its size? Is it available on all 3D accelerators?

regards,

Hi

Just because OpenGL is already faster (on that point) doesn’t justify not making it any faster.

This is really a lame excuse and such an attitude will hurt OpenGL in the long run. I dare say it already has, for several years.

I am pretty sure that there are lots of apps that might benefit from instancing, even if the bottleneck it tackles might not be the most important one.

And it doesn’t seem to be THAT hard to implement, so why do we have to fight for it so hard?

Jan.

Originally posted by Jan:
[b]Hi

Just because OpenGL is already faster (on that point) doesn’t justify not making it any faster.

This is really a lame excuse and such an attitude will hurt OpenGL in the long run. I dare say it already has, for several years.

I am pretty sure that there are lots of apps that might benefit from instancing, even if the bottleneck it tackles might not be the most important one.

And it doesn’t seem to be THAT hard to implement, so why do we have to fight for it so hard?

Jan.[/b]
Speaking as someone who knows nothing about the internal workings of drivers, but:
A/ I’m not too sure a LOT of apps are going to benefit from it, thus it’s only going to benefit some people. E.g. how will it benefit Doom3/Maya?
B/ Nothing is free; adding this will make the driver more complicated (hence more prone to bugs, fewer opportunities to be optimized). If I had a choice between making everything go slightly faster or making a specialised case go a lot faster, I know which I’d choose.

You’re talking about a pre-T&L cache - would you have any hints/links explaining what it is further? I’m interested… How do you know its size? Is it available on all 3D accelerators?
It’s a memory cache, just like the one in your CPU. If a vertex index (when converted into one or more actual memory addresses) would provoke a fetch from a memory location that is already in the cache, then the hardware doesn’t fetch from memory again. It’s just an L1 (or maybe an L2, depending on the hardware) cache strapped to the vertex-reading apparatus of the hardware.
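Korval’s argument about blowing this cache can be made concrete with a toy model. The assumptions are mine (FIFO replacement, 64 cached vertex slots; real pre-T&L cache sizes and replacement policies are not public), but the qualitative difference between re-fetching one model and walking a replicated buffer holds regardless:

```python
from collections import deque

def cache_misses(addresses, cache_lines=64):
    """Count fetches that miss a small FIFO cache of recently read lines."""
    cache, n_miss = deque(maxlen=cache_lines), 0
    for addr in addresses:
        if addr not in cache:
            n_miss += 1
            cache.append(addr)
    return n_miss

VERTS, INSTANCES = 50, 100
# True instancing: every instance re-fetches the same 50 addresses.
instanced = [v for _ in range(INSTANCES) for v in range(VERTS)]
# Replicated VBO: every copy lives at its own addresses.
replicated = [i * VERTS + v for i in range(INSTANCES) for v in range(VERTS)]

print(cache_misses(instanced))    # 50: only the first instance misses
print(cache_misses(replicated))   # 5000: every single fetch misses
```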

And it doesn’t seem to be THAT hard to implement, so why do we have to fight for it so hard?
Look how long it took to get RTT (and, despite any notes to the contrary, we don’t have it yet).

A/ I’m not too sure a LOT of apps are going to benefit from it, thus it’s only going to benefit some people. E.g. how will it benefit Doom3/Maya?
Bad excuse: no new extension benefits already existing applications. Glslang doesn’t benefit Doom3 or the current version of Maya either; that doesn’t mean we shouldn’t have it.

B/ Nothing is free; adding this will make the driver more complicated (hence more prone to bugs, fewer opportunities to be optimized). If I had a choice between making everything go slightly faster or making a specialised case go a lot faster, I know which I’d choose.
Considering that ATi is perfectly capable of optimizing it in D3D for hardware that doesn’t even support instancing directly (R420), I don’t think it puts an undue burden on driver writers. Plus, it’s an extension; as such, it’s not required. We’re not asking for it to be included in the core, or even to be an ARB extension. Just get both ATi and nVidia to agree on it, so that those two can implement it.

B/ Nothing is free; adding this will make the driver more complicated (hence more prone to bugs, fewer opportunities to be optimized). If I had a choice between making everything go slightly faster or making a specialised case go a lot faster, I know which I’d choose.
Its existence would provide an opportunity for optimisations - that’s the whole point… Not having an explicit mechanism makes it virtually impossible to optimise for instancing. You need to be able to say to the driver “I’m instancing”. If the driver doesn’t want to optimise instancing, it can just drop to a slow path. That’s the nice thing about OpenGL… the matrix stack has always been part of the core, display lists have always been part of the core… if a driver had no way of optimising display lists, it would just behave like immediate mode, but at least you had the opportunity to say “this stuff is static”, just in case a future driver could do something with that valuable information. glDrawRangeElements is another one… some drivers may ignore the extra information, but most drivers are grateful for it.

Bad excuse: no new extension benefits already existing applications. Glslang doesn’t benefit Doom3 or the current version of Maya either; that doesn’t mean we shouldn’t have it.
You misunderstand me; I’m using a couple of examples of apps that (if instancing had existed before they were even dreamt of) would most likely not use it. E.g. where would you use instancing in Doom3?

B/ Display lists + range elements are different: they are more generic. Instancing is beneficial only to limited data sets (e.g. blades of grass). I liken instancing to point sprites: a waste of time (OK, there are some limited cases where they are beneficial), but personally it’s just added clutter to the API. I would prefer a lean, mean API instead of a slow, bulky API that does everything.

Of course it would provide an opportunity for optimisation, and in the long run adding an abstraction for instancing to OpenGL might be useful, but in the short run I know tons of other things I’d rather see worked on over this.

The draw calls in OpenGL are really lightweight as it is, FAR more so than in D3D, and I find being batch limited quite rare in OpenGL with sane engine design.

If one really wanted to, instancing might be achieved by a modification to MultiDrawArrays.
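One possible reading of that suggestion (my speculation; the post doesn’t spell it out, and the function name below is mine): glMultiDrawArrays already accepts arrays of `first`/`count` ranges, so a modified variant could let every range point at the single stored copy of the model, with the driver supplying a per-range instance index in place of replicated vertices:

```python
def same_model_ranges(verts_per_model, num_instances):
    """first/count arrays that all reference the one stored copy of the
    model, rather than ranges into a replicated buffer."""
    first = [0] * num_instances          # every range starts at vertex 0
    count = [verts_per_model] * num_instances
    return first, count

first, count = same_model_ranges(80, 3)
print(first)    # [0, 0, 0]
print(count)    # [80, 80, 80]
```

The missing piece, which is what the hypothetical modification would have to add, is a way for the shader to know which range it is currently processing.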

Originally posted by Korval:
But it still isn’t as fast as the truly instanced case.
But the difference will be negligible.

Not really. The reason that models of greater than 100 triangles (or thereabouts) are not faster with instancing is because these models blow the pre-T&L cache. If an instance fits into the pre-T&L cache entirely, then there’s no problem. Every call after the first will not provoke a hit on memory, save for the index load.

If you do your instancing mechanism, the likelihood is that you’ll blow the pre-T&L cache, and every vertex (depending on intra-instance sharing) will provoke a memory access.
Not sure what you’re saying “not really” about, as the data passing the AGP bus is still the same regardless of pre-T&L cache utilization. Anyway, I understand your argument, but I’m not sure I agree, for three reasons. The first is that in general memory access is seldom the bottleneck anyway, so it usually doesn’t matter that much. The second is that instancing already screws up the pre-T&L cache, as it requires two vertex streams. The third reason, I’m afraid, involves some non-public information that I can’t disclose, which I believe is a good argument why what you say isn’t the case in practice, but of course it’s hard to make this convincing without going into details.

Originally posted by Jan:
Just because OpenGL is already faster (on that point) doesn’t justify not making it any faster.
If it doesn’t make any difference in practice, then that’s certainly a good reason not to include it. Everything you do has some overhead. The question is: is it significant enough to motivate additional functionality to dodge it, particularly given the likelihood that it will actually be used by common applications? glDrawElements() has its overhead, but so do glEnable(), glBindTexture(), etc. There are probably some usage scenarios where putting texture changes directly into the index stream would let several meshes be drawn with a single glDrawElements() call, given some hardware support, but is it worth the effort and the API pollution? That’s the question.

Originally posted by knackered:
Its existence would provide an opportunity for optimisations - that’s the whole point… Not having an explicit mechanism makes it virtually impossible to optimise for instancing. You need to be able to say to the driver “I’m instancing”. If the driver doesn’t want to optimise instancing, it can just drop to a slow path. That’s the nice thing about OpenGL… the matrix stack has always been part of the core, display lists have always been part of the core… if a driver had no way of optimising display lists, it would just behave like immediate mode, but at least you had the opportunity to say “this stuff is static”, just in case a future driver could do something with that valuable information. glDrawRangeElements is another one… some drivers may ignore the extra information, but most drivers are grateful for it.
Well, you mention display lists - instancing in GL runs the risk of becoming another of those “seemed like a good idea when it was added, but turned out to be a mess in the long run” kind of features. Display lists take up a whole lot of code, and actually slow down the whole API, since pretty much each and every call has to check whether we’re currently compiling a display list or rendering as usual, and there are better ways to deal with what they’re mainly used for, namely storing geometry, like VBOs. Same with immediate mode: it has clogged up the API, since all extensions typically have to be made orthogonal with immediate mode calls. There’s a good reason why both display lists and immediate mode were ditched in OpenGL ES.
As much as I love new features, it’s hard to get excited about instancing, especially in OpenGL, where glDrawElements() is so lightweight already. I think there are other priorities that are more important right now.