Display Lists - The Next Generation (CBO)

I don’t understand one thing: why must VBOs used with bindless be made resident every time we change their content? The layout of the buffer does not change.

Because that’s the whole point. The only way you can resolve a buffer object into a GPU address is to prevent the buffer from being moved around.

Anyway, I think that the major advantage of CBO over DL is that it decouples commands from the data.

Yes, but a lot of the performance benefit of Display Lists comes from the freedom to optimize the data format.

Sorry Aleksandar, I think you’re talking a lot of sense, but I’m absolutely sure that’s what NVIDIA do (referring to the first bit of your reply). Maybe this is a Quadro-only thing; I’m not too sure as I rarely run on GeForces these days. Render the same thing for 4 frames (well, it depends on the complexity of the scene; 4 frames for a very heavy scene) and your frame rate goes up significantly. This is because they are optimising the frame in the background, and then uploading the optimised data.
None of this would be possible with CBOs. Dynamic data is a completely different topic, as far as I’m concerned. Not a particularly common thing either… do you do a lot of CPU vertex work each frame? I don’t. I barely touch my geometry after I’ve created the GL resources. I stream data in big blocks, which I have to stagger over a number of frames so I don’t drop one. If I could do this in another thread I’d be completely happy. I would literally consider the problem solved. Yay for the big black box that is display lists.

Would be interested in seeing this (and trying the code here). However, note that since bindless optimizes CPU/CPU-mem-limited issues, the CPU/CPU-mem should be more relevant for good bindless test cases than the GPU, so definitely cite that too. In the limit (large batches), I haven’t seen bindless cost you anything over static VBOs though.

A very interesting phenomenon is that Bindless is slower for small scenes than ordinary VBOs, and has a jump for middle-range scenes.

I’d be “very” interested in more details on this. I have never seen this. You are saying that bind-by-handle was faster than bind-by-address? And just to be clear, are we only talking about bindless for vtx attrs and index data (i.e. NV_vertex_buffer_unified_memory), not shader data? And static, unchanging VBOs? Interleaved attrs? Not repeatedly doing the buffer addr query and make resident (glGetBufferParameterui64vNV / glMakeBufferResidentNV)?

(Forgot those two in my list above. There are actually 4 APIs relevant to new-style vertex attribute batches, not 2.)
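To make the comparison concrete, here is a minimal C sketch of the one-time setup and the per-draw path with those entry points; the loader header, vertex layout, attribute index and helper names are placeholders of mine, not anything from the posts above:

#include <GL/glew.h> /* assumes a loader that exposes the NV bindless entry points */

typedef struct { float pos[3]; } Vertex; /* placeholder vertex layout */

/* One-time setup: pin the VBO in memory and query its GPU address. */
static GLuint64EXT make_vbo_resident(GLuint vbo)
{
    GLuint64EXT addr = 0;
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
    return addr; /* cache this; do not re-query it every frame */
}

/* Per-draw path: source attribute 0 by GPU address instead of by buffer name. */
static void draw_bindless(GLuint64EXT addr, GLsizeiptr size, GLsizei vertexCount)
{
    glEnableVertexAttribArray(0);
    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, addr, size);
    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}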

It’d be cool if we could collectively pull a test program together to illustrate it, and pass it around to try on different CPU/CPU-mem/GPU/driver combos to verify and get a better feel for when this oddity occurs.

Let’s go back to first principles and see where it takes us.

The heart of the rendering loop is something like this:

For each Material
  Select Shaders, UBO, Textures, Samplers etc. to use;
  For each Object
    Load transformation Matrix;
    Render triangles from VBO;
  Next Object
Next Material
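In the current API that loop expands to something like the following C sketch (the structs and handle names are invented for illustration; one UBO slot and one texture stand in for the full material state):

typedef struct { float matrix[16]; GLuint vao; GLsizei indexCount; } Object;
typedef struct { GLuint program, ubo, texture; GLint matrixLoc;
                 Object **objects; int numObjects; } Material;

void render_scene(Material *materials, int numMaterials)
{
    for (int m = 0; m < numMaterials; ++m) {
        glUseProgram(materials[m].program);                       /* select shaders      */
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, materials[m].ubo); /* material parameters */
        glBindTexture(GL_TEXTURE_2D, materials[m].texture);       /* textures, samplers  */
        for (int o = 0; o < materials[m].numObjects; ++o) {
            Object *obj = materials[m].objects[o];
            glUniformMatrix4fv(materials[m].matrixLoc, 1, GL_FALSE, obj->matrix);
            glBindVertexArray(obj->vao);
            glDrawElements(GL_TRIANGLES, obj->indexCount, GL_UNSIGNED_INT, 0);
        }
    }
}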

The setup for each material (skin, cloth, metal, wood, etc.) currently involves several calls to select a shader program for the material, which textures it uses, and possibly a UBO.
However, once a material is defined it doesn’t change, so this would be better done with an immutable material object.
The pipeline state for this object will be pre-validated when it’s created, so switching materials will be simpler and faster.
(This is very similar to the Longs Peak ‘Program Object’)

When switching between materials, some state changes are more expensive than others, so the material rendering order should be sorted to minimise rendering time by grouping together materials that share that slow-to-change state.

Now we could run our own tests to find the most expensive state changes, but that could change in future hardware or differ between vendors.
Hence it would be better if the driver decided the rendering order, sorting the materials optimally whenever we add a new Material Object.

But if the driver is controlling the rendering order then it needs to know which VBOs contain objects of each material.
So let’s add an Object Buffer Object that contains a list of object records, each holding the position and rotation of an object as a transformation matrix, plus the VBO, offset, size, etc. of the actual triangle data.
Then give each Material Object a linked-list of the Objects that are made from that particular material.
This has the added benefit that the transformation matrix and VBO can be updated by an OpenCL physics engine without being shuffled to the CPU and back.
(I chose a linked-list for objects so that objects can be added and removed from the scene without expensive searching or data re-packing)
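To make the idea concrete, one record in this hypothetical Object Buffer Object might look like the struct below; nothing like it exists in GL, and every field name here is invented:

/* Hypothetical: one record in the proposed Object Buffer Object. */
typedef struct {
    float    modelMatrix[16]; /* position + rotation, updatable by OpenCL    */
    GLuint64 vboAddress;      /* GPU address of the triangle data (bindless) */
    GLuint   firstVertex;     /* offset of this mesh within the VBO          */
    GLuint   vertexCount;     /* size of this mesh                           */
    GLuint   nextObject;      /* index of the next object with this material;
                                 0 terminates the linked list                */
} ObjectRecord;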

To make a depth pre-pass more efficient we could use a separate linked-list of objects sorted in front-to-back order.

Now the GPU has all the information it needs to render the main scene in a single API call. No cache-misses, no table lookups, no Draw calls, and very few API calls.

But we still have one big problem, the lag between the CPU and GPU.
The API was designed for a CPU that directly controlled a graphics peripheral, but with modern hardware the GPU will stall if we try to control the render from information read back from it, forcing us to use out-of-date information from the previous frame to control the current frame.
With commands, display lists, or even CBOs, we are limited to a linear sequence of commands, similar to an old DOS batch file.
Conditional rendering was added in recognition of this problem, but it only provides a very basic if-then branch for occlusion queries.
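For reference, that basic branch is all the GPU-side control flow the current API offers (a minimal sketch; the two draw helpers are placeholders):

/* GL 3.0 conditional rendering: draw the expensive object only if the
   occlusion query on its bounding box passed. */
glBeginQuery(GL_SAMPLES_PASSED, query);
draw_bounding_box();                            /* cheap proxy geometry   */
glEndQuery(GL_SAMPLES_PASSED);

glBeginConditionalRender(query, GL_QUERY_WAIT); /* the one if-then we get */
draw_expensive_object();                        /* skipped if occluded    */
glEndConditionalRender();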

Display lists are said to be ‘Compiled’ to run efficiently on the GPU command processor, but why limit this to such simple programs? Why not go all the way and have a ‘Command Shader’?

This would be compiled in the same way as a GLSL program, but would have a single instance that would automatically run on the GPU command processor after each buffer swap.
Its purpose would be to move the main rendering loop from the CPU to the GPU, removing all lag, allowing more complex control of rendering, improving speed, and reducing CPU workload.
It would also allow proper synchronisation between OpenCL and OpenGL by directly scheduling OpenCL kernels to run when the rendering has completed and OpenGL is waiting for the buffer swap.
The CPU would now be responsible for changes to the game world, adding and removing objects, moving the camera and animating creature movements, while the GPU does all the repetitive processing that is the same every frame.

He’s a man of big ideas. Sounds good to me. By Friday, please.

I am not at all convinced that adding more objects is going to solve the problem.

Also, relying on the driver to optimize things like rendering order and so forth is, well, look at ATI’s display lists. If they don’t want to optimize this stuff, why should that inhibit your performance?

No cache-misses, no table lookups, no Draw calls, and very few API calls.

No cache misses? What, is this stuff somehow magically preloaded into the cache? What exactly do you expect the driver to be doing behind the scene when you say, “execute this rendering list?”

It’s going to have to read the array that stores those objects. Cache miss, for every N indices.

It then has to dereference this pointer and read the data for that object. Cache miss, every time.

It then has to read the various other objects (VAOs, programs, textures, etc) used by that object. Cache miss, cache miss, cache miss.

Putting the traversal of the scene graph on the driver does not magically make the problem go away. It’s still there; it’s just hidden from you.

It’s much better to give the programmer more ability to optimize rather than forcing it on the driver.

why limit this to such simple programs? Why not go all the way and have a ‘Command Shader’?

Because shaders run on the GPU. They can only read GPU memory.

The only way to do what you’re suggesting is to make a CPU thread that executes “shader” code compiled to CPU assembly.

Also, even if you could make the GPU do this, GPUs are not actually good at this stuff. They have caches too as well as cache misses, and their caches are optimized for graphics work, not general programming work (which is what you’re asking for). Building the rendering command list is not a highly parallel activity like actual shaders. It’s something best left to a CPU thread.

Plus, every GPU cycle spent on traversing the scene graph is a cycle lost to your actual rendering.

Lastly, it doesn’t even do what you want. Even if the GPU could build the command list, it wouldn’t “remove all lag.” This “command shader” would have to wait to do readback, just like the CPU. It would have to sit there and wait until the GPU has completed the operation before it could effectively do readback.

So there’s really no difference, except that you’re wasting precious GPU resources on a task that the GPU is highly unsuited to doing.

true, and if it were practical to do it on the GPU, then display lists would be executed entirely on the GPU - but they’re not.

I mean no cache misses on the CPU, because this would be run entirely on the GPU command processor.
The CPU would set up the VBOs, Material objects and Object data; then the GPU would render frames repeatedly.
The application/driver on the CPU would only be involved with changes to the game world.
Cache misses on the GPU can be avoided by the driver inserting prefetch instructions in the command stream, just as it does with compiled display lists.
Materials and objects are much smaller and accessed much less often than the actual VBO data. They are optimised when created, just like a display list, and internally they would use GPU addresses, not names, so cache misses would have a minor impact anyway.

Because shaders run on the GPU. They can only read GPU memory.
Which is why we put all the data it needs into the GPU memory first.

Also, even if you could make the GPU do this, GPUs are not actually good at this stuff. They have caches too as well as cache misses, and their caches are optimized for graphics work, not general programming work (which is what you’re asking for). Building the rendering command list is not a highly parallel activity like actual shaders. It’s something best left to a CPU thread.

Fermi does have a cache very like a CPU cache.
The command shader is not meant to run on either the CPU or the shader processors, it is meant to run on the command processor.
This is a separate processor in the GPU that receives a block of rendering commands from the CPU (or a compiled display list) and executes them, while managing the distribution of work threads to all of the shader processors.

Lastly, it doesn’t even do what you want. Even if the GPU could build the command list, it wouldn’t “remove all lag.” This “command shader” would have to wait to do readback, just like the CPU. It would have to sit there and wait until the GPU has completed the operation before it could effectively do readback.

A command shader does not ‘build’ a rendering command list, it replaces the rendering command list.
Yes, the command shader has to wait for the shader processors to finish a specific task before it can test and branch, but this response is almost instantaneous and nothing like a CPU/GPU synchronisation.
Inserting another task between the operation and the test for its result can ensure that the shader processors are kept busy.
Conditional rendering already does exactly this for occlusion query results.

Cache misses on the GPU can be avoided by the driver inserting prefetch instructions in the command stream, just as it does with compiled display lists.

There is no “command stream,” because your whole idea revolves around the specific removal of commands. Instead, you just have a “shader”. If you have a scene graph, with descriptions of how to render the various things, this will not be located in contiguous memory. Prefetching doesn’t help, because you’re essentially accessing data at random.

Furthermore, any such prefetching could just as easily be done on the CPU.

Materials and objects are much smaller and accessed much less often than the actual VBO data. They are optimised when created, just like a display list, and internally they would use GPU addresses, not names, so cache misses would have a minor impact anyway.

Do you think that buffer objects don’t use GPU addresses internally?

In terms of cache behavior, what matters is the access pattern. The scene graph, at best, would be an array of pointers to objects. These objects would, essentially, be random accesses. And since you only access these objects once (or relatively few times, at any rate) per frame, you get terrible cache behavior.

The problem isn’t what those objects store internally; the problem is that it takes two fetch operations. The first fetch is to get the pointer to the object itself. The second is to dereference the pointer to get the object’s data.

Bindless is designed to do an end-run around this.

Which is why we put all the data it needs into the GPU memory first.

Which will require CPU/GPU synchronization in order for the CPU to modify that memory. After all, you can’t go changing all of these heavyweight objects before the previous frame’s rendering with them has completed.

The command shader is not meant to run on either the CPU or the shader processors, it is meant to run on the command processor.

That’s not going to happen.

GPU hardware makers have spent a great deal of effort over the past years unifying their shader architecture, so that vertex, fragment, geometry and whatever all run on the same hardware. They’re not going to make a completely new shader stage with its own specialized logic hardware just for something you could do yourself.

why is your scenegraph not contiguous in memory?

All I am doing is shifting the scenegraph access from the CPU to the GPU; either way you will be randomly accessing objects in memory once per frame to get each one’s transformation matrix and VBO address.
If done on the CPU, then after reading the object (CPU cache miss can be avoided by prefetch) we first need to call the driver to put the matrix into a UBO, then call it again to draw from the VBO (both of which execute a LOT of instructions).
These commands get assembled into a buffer on the client side, which some time later gets flushed to the server side and passed to the GPU, which then starts reading vertices and scheduling vertex shader runs (with a GPU cache miss on the first access to the VBO).

If done on the GPU, then we copy object transformation matrix directly to UBO, read the address of the VBO, then immediately start reading vertices. MUCH less work overall.

There is potentially a cache miss on the object read and the first VBO access, but there are ways around this.
The loop that iterates through the objects can prefetch the VBO address a loop ahead, and the next object address a loop before that.
In many cases the entire object buffer would be small enough to remain in the 768 KB L2 cache anyway.
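For what it’s worth, the CPU-side version of that software-pipelined prefetch is the classic linked-list idiom below (a sketch using GCC’s __builtin_prefetch; the claim above is that the command processor could do the equivalent in its own stream):

#include <stdint.h>

typedef struct Node {
    struct Node *next;       /* next object in this material's list */
    float        matrix[16]; /* object transform                    */
    uint64_t     vboAddr;    /* GPU address of its triangle data    */
} Node;

void issue_draw(Node *n); /* placeholder: emit the draw for one object */

void walk_objects(Node *head)
{
    for (Node *n = head; n != NULL; n = n->next) {
        if (n->next)
            __builtin_prefetch(n->next, 0, 1); /* start fetching the next
                                                  record while drawing this one */
        issue_draw(n);
    }
}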

Bindless is designed to do an end-run around this.

Bindless only saves one table lookup, and hence one cache miss, per command, in the CPU case.
In the GPU case the name-to-address translation was done during compilation, so bindless is irrelevant.

Which will require CPU/GPU synchronization in order for the CPU to modify that memory. After all, you can’t go changing all of these heavyweight objects before the previous frame’s rendering with them has completed.

No synchronisation is required; commands to alter objects will simply be queued and executed between frames.
VBOs and other buffers will use the same ping-pong and orphaning techniques we use now.
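For anyone unfamiliar, the orphaning idiom referred to is just this, sketched in C with placeholder names:

/* Ask for fresh storage so the GPU can keep reading the old copy while the
   CPU fills the new one, avoiding a synchronisation stall. */
void update_dynamic_vbo(GLuint vbo, GLsizeiptr size, const void *newData)
{
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, size, NULL, GL_STREAM_DRAW); /* orphan old storage */
    glBufferSubData(GL_ARRAY_BUFFER, 0, size, newData);        /* fill the new copy  */
}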

GPU hardware makers have spent a great deal of effort over the past years unifying their shader architecture, so that vertex, fragment, geometry and whatever all run on the same hardware. They’re not going to make a completely new shader stage with its own specialized logic hardware just for something you could do yourself.
This is complete nonsense; in the Fermi block diagram I can see texture units, vertex fetch units, tessellators, viewport transform units, attribute setup units, stream output units, rasterisation engines, memory controllers and a GigaThread engine.
Some of these will be dedicated hardware, but the GigaThread engine not only executes the commands sent from the CPU, it also “creates and dispatches thread blocks to various SMs” and has to do load balancing between all of the SMs, so it is likely to be quite a capable processor.
“Individual SMs in turn schedule warps to CUDA cores and other execution units” so there could be other processors there as well.

I am not talking about adding a new stage; I am talking about using the existing NVIDIA GigaThread engine or AMD command processor in a slightly different way.
There may be limitations that prevent them executing arbitrary code in current GPUs, but only slight modifications would be needed for next-generation GL5 hardware to run command shaders.

I would at least like to see immutable Material objects added on July 25th (SIGGRAPH); after all, this is basically the same as was proposed for Longs Peak back in 2007.
A query that asks the driver for a list giving the most efficient ordering of the material state changes would be nice too…
The command shader may have to wait for GL5 hardware though.

true, and if it were practical to do it on the GPU, then display lists would be executed entirely on the GPU - but they’re not.
I find that quite surprising; do you have evidence for this? I would have expected more modern GPUs to keep a compiled display list in GPU memory and directly execute it.

why is your scenegraph not contiguous in memory?

In my case I have an entire planet as my game world, so most of it stays on the hard disk most of the time.
I am continuously streaming objects in and out of the scenegraph as the player moves towards them or away from them (not to mention the different level-of-detail versions that each object has).
This causes a lot of memory fragmentation, though I do try to keep it as compacted as possible.

If done on the CPU, then after reading the object (CPU cache miss can be avoided by prefetch) we first need to call the driver to put the matrix into a UBO, then call it again to draw from the VBO (both of which execute a LOT of instructions).

First, how can you avoid that cache miss? You don’t know what object you’re going to be reading next until you dereference the pointer.

Second, none of what you’re talking about involves the execution of a “LOT of instructions”. It involves a lot of work, due to the synchronization needed in updating a buffer object’s contents. But this is not a lot of instructions.

And the GPU version needs to do that synchronization too.

If done on the GPU, then we copy object transformation matrix directly to UBO, read the address of the VBO, then immediately start reading vertices. MUCH less work overall.

And where does this object transformation matrix come from? The GPU isn’t allowed to read arbitrary CPU data, so it must be coming from a buffer object or the parameter of some other object. Which the CPU must set. This requires CPU/GPU synchronization.

In many cases the entire object buffer would be small enough to remain in the 768 KB L2 cache anyway.

You’re making an assumption that the quantity of data used by the shaders is rather small. I imagine that the state graph for scenes of significance exceeds 1MB. At least, the ones that are state-change or drawing call bound, rather than shader bound.

Also, where are you coming up with this 768KB L2 cache from?

In the GPU case the name-to-address translation was done during compilation, so bindless is irrelevant.

Not if you’re doing what you’re talking about. So long as those buffer objects can be affected by the CPU and you expect the results of those changes to be reflected in rendering, the GPU-based scene graph code must still be using the buffer object’s name. Buffer objects can be moved around by the creation/destruction of other memory objects.

The reason bindless works is because of the MakeResident call, which explicitly forbids the implementation from changing or moving the buffer object’s location in memory.

in the Fermi block diagram I can see texture units, vertex fetch units, tessellators, viewport transform units, attribute setup units, stream output units, rasterisation engines, memory controllers and a GigaThread engine

All of which are fixed functionality, not arbitrary shader processors.

No synchronisation is required; commands to alter objects will simply be queued and executed between frames.

Queued by what? And if such queuing were possible, why isn’t it done now? How do you define a “frame”? When does the state actually change? And how big exactly are these objects? If I’m trying to render 10,000 copies of something, which would have been child’s play with instancing, is that going to require 10,000 objects?

And how does this interact with non-traditional rendering models, like rendering a GUI (typically ad-hoc, without a lot of formal objects for each element) or deferred rendering (the number of passes in the deferred part is based on the number of lights in the scene)?

I am not talking about adding a new stage; I am talking about using the existing NVIDIA GigaThread engine or AMD command processor in a slightly different way.

Except that the command processors in question do not execute arbitrary code. They are incapable of processing a state graph. Command processors are very simple pieces of hardware. They execute a FIFO whose commands are very limited: set registers, execute rendering, clear cache X, etc. All very trivial.

I would at least like to see immutable Material objects added on July 25th (SIGGRAPH); after all, this is basically the same as was proposed for Longs Peak back in 2007.
A query that asks the driver for a list giving the most efficient ordering of the material state changes would be nice too…

These two things act at cross-purposes. Immutable combinations of program objects and the particular set of uniforms they use are not the most efficient way to go.

For example, let’s say you have 7 objects. 3 use program A and 4 use program B. Even though they use different programs, they share a UBO between them all. And two of the objects that use program A share a UBO, as do 2 of the objects that use program B.

Ignoring all other state, this leads to the following sequence of bind and rendering commands:

1: Bind program B.
2: Bind common UBO to the common UBO slot (say, slot 7).
3: Bind shared UBO to slot 0.
4: Render object 1.
5: Render object 2.
6: Bind UBO to slot 0.
7: Render object 3.
8: Bind UBO to slot 0.
9: Render object 4.
10: Bind program A.
11: Bind shared UBO to slot 0.
12: Render object 5.
13: Render object 6.
14: Bind UBO to slot 0.
15: Render object 7.
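In concrete GL calls, with invented handle names, the first half of that sequence would be roughly:

glUseProgram(programB);                              /* step 1 */
glBindBufferBase(GL_UNIFORM_BUFFER, 7, commonUBO);   /* step 2 */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, sharedUBO_B); /* step 3 */
draw_object(obj1);                                   /* step 4 */
draw_object(obj2);                                   /* step 5 */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo_obj3);    /* step 6 */
draw_object(obj3);                                   /* step 7 */
/* ...and so on through step 15. */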

Now, let’s compare this to what you would have to do with immutable “material” objects:

1: Bind object 1’s material.
2: Render object 1.
3: Bind object 2’s material.
4: Render object 2.
5: Bind object 3’s material.
6: Render object 3.
7: Bind object 4’s material.
8: Render object 4.
9: Bind object 5’s material.
10: Render object 5.
11: Bind object 6’s material.
12: Render object 6.
13: Bind object 7’s material.
14: Render object 7.

Looks more efficient, right? You don’t have all of those separate binds that we did in the first one.

However, what you’re not seeing is one simple fact: those binds don’t go away just because we happen to be using an immutable material.

Every time you bind one of these material objects, one of two things has to happen. Either the driver has to be stupid, or it has to be smart.

If the driver is stupid, then it will internally bind all of the state, even if that state was previously bound. So in the above case, we get a performance penalty for binding a common UBO 6 extra times.

If the driver is smart, then it will examine the old material and the new, changing only the state that is necessary to change. The problem here is that there’s no need for that. The driver is wasting time doing something that the application could do much more easily.

The application knows that all of these objects share a certain UBO. The driver doesn’t have to check on every bind whether the incoming material uses a different UBO in that slot; we know it doesn’t. So why make the driver do the work?

Now, you could say that we would have to do the same thing on the CPU. Except that’s not true. The work that the driver does to detect whether there is a shared uniform buffer being used is a lot harder than it is on the client side. Objects that share UBOs likely have other traits in common. Traits that can be used to sort the rendering list properly. Traits the driver does not have.

Doing a sort operation on the list of rendered objects will buy you more performance than immutable materials, even in the case where drivers are written to avoid redundant state changes. And if the drivers are written stupidly, you’re in a world of hurt.

I would rather have low-level drivers that do exactly and only what they’re told, rather than drivers that have to figure out stuff I already know.

First, how can you avoid that cache miss? You don’t know what object you’re going to be reading next until you dereference the pointer.
You must have misunderstood me here; I’m just talking about iterating through my own scene-graph, so I certainly can prefetch each object.
I didn’t mention the OpenGL name-lookup cache miss because I am assuming the use of bindless.

none of what you’re talking about involves the execution of a “LOT of instructions”.
When I trace into an API call with my debugger it sure looks like a lot to me.

And where does this object transformation matrix come from? The GPU isn’t allowed to read arbitrary CPU data, so it must be coming from a buffer object or the parameter of some other object. Which the CPU must set.
Yes, this is a GPU buffer containing the matrix of each object. For static objects it never changes. If I am running an OpenCL physics engine then it is this that updates the matrix of a moving object. If I just want to move an object from the CPU then I just send that command (and the new matrix) to the GPU, just like a normal OpenGL command.
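That CPU-side path could be as small as this sketch (placeholder names; assumes one mat4 per object packed consecutively in the buffer):

/* Overwrite one object's matrix in the GPU-resident object buffer. */
void set_object_matrix(GLuint objectBuffer, GLuint index, const float m[16])
{
    glBindBuffer(GL_UNIFORM_BUFFER, objectBuffer);
    glBufferSubData(GL_UNIFORM_BUFFER,
                    (GLintptr)index * 16 * sizeof(float), /* this object's mat4 */
                    16 * sizeof(float), m);
}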

where are you coming up with this 768KB L2 cache from?
NVIDIA’s Fermi: The First Complete GPU Computing Architecture, A white paper by Peter N. Glaskowsky;
NVIDIA GF100 Whitepaper;
Whitepaper: NVIDIA’s Next Generation CUDA Compute Architecture: Fermi;

The reason bindless works is because of the MakeResident call, which explicitly forbids the implementation from changing or moving the buffer object’s location in memory.
So just use MakeResident for the object buffer. The VBOs could be made resident as well if you really want. You seem to be suggesting that Bindless is some sort of alternative to a GPU scenegraph or a command shader, but there is no reason not to have both.

Queued by what? And if such queuing were possible, why isn’t it done now?
It is; that’s just the OpenGL command buffer. The only difference is that it waits until the command shader finishes executing the frame (the equivalent of a SwapBuffers command) before being executed.

If I’m trying to render 10,000 copies of something, which would have been child’s play with instancing, is that going to require 10,000 objects?
No, you can still use instancing.
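That is, one call still covers all 10,000 copies (a sketch with placeholder names; per-instance transforms would be fetched in the vertex shader via gl_InstanceID):

glBindVertexArray(meshVAO);
glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, 0, 10000);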

how does this interact with non-traditional rendering models, like rendering a GUI (typically ad-hoc, without a lot of formal objects for each element) or deferred rendering (the number of passes in the deferred part is based on the number of lights in the scene)?

This is exactly why I introduced the command shader: to give you this flexibility. It’s a GLSL-like program, so you can have a for loop that triggers a full-screen rasterisation for as many lights as you want.
The Material/Object scenegraph idea by itself would simply let you have a single API call to draw all the opaque objects in the scene. (Though given some time to work on the idea I’m sure it could be extended to a depth pre-pass and transparent objects at least.)

the command processors in question do not execute arbitrary code. Command processors are very simple pieces of hardware. They execute a FIFO who’s commands are very limited.
Do you have references to back this up? They may have been this simple in older GPUs, but modern GPUs like Fermi are very complex systems that at least deserve something at the level of an 8086, which is all you would need.
It is certainly possible that they currently execute firmware from ROM or flash, but GPUs become more flexible and programmable every generation, so it should be easily done in the next generation.

Immutable combinations of program objects and the particular set of uniforms they use are not the most efficient way to go.
Longs Peak seems to have had all of the UBOs as part of the program object, which I agree is wrong, simply because the transformation matrix has to get to the vertex shader somehow, and it’s per object, not per material.
Perhaps allow UBOs to be attached to both Material objects and to object objects. (I really need another word for the object being rendered here; Thing object, segment object, section object, Mesh object? Any ideas?)

<Very long block of text that we won’t repeat here>
For a start you put a material bind between objects 1 & 2 and 5 & 6, which is obviously unnecessary as they are the same material, and in practical applications you would often have quite a few objects sharing the same material.
As for re-binding all the state on a material change, no driver writer is THAT stupid (well OK, maybe an Intel driver writer).
You say that the driver is wasting time doing comparisons between the old and new materials that can be done better by the application. I don’t agree at all; the material objects consist of a few numbers that reference some shaders and a UBO or two. These comparisons are trivial and would have to be done by the application anyway.
Furthermore, if we use my idea of a GPU scenegraph (or even just letting the driver pre-sort the material order), then we can pre-compile the state switching when we create the material objects. During the rendering we don’t need to test anything; we just replay the stored state-switching sequence.

Finally, you mentioned traits that objects can have that affect the ideal rendering order but which are not OpenGL state or anything the driver can know about.
Can you give an example of what you mean?

Btw guys, let’s not forget GL_ARB_draw_indirect on DX11 hw. Mix with instancing facilities like texture-arrays/etc.
Its unavailability in GL3.3 hints that maybe DX10 cards can’t support such command-buffers.

NVIDIA’s Fermi: The First Complete GPU Computing Architecture, A white paper by Peter N. Glaskowsky;

I must have missed the part of the paper that says what the L2 cache on a Cypress is.

Between you and Dark Photon, I’m starting to wonder if I stumbled onto the NVIDIA forums or something.

So just use MakeResident for the object buffer. The VBOs could be made resident as well if you really want. You seem to be suggesting that Bindless is some sort of alternative to a GPU scenegraph or a command shader, but there is no reason not to have both.

There are reasons not to have command “shaders”. I’m outlining them here.

And if we have “bindless” (or whatever form it eventually takes), why do we need command shaders? We’d already be getting performance nearly equivalent to NVIDIA display lists. And if NVIDIA can’t get much more performance than bindless, I don’t see command shaders doing any better.

Do you have references to back this up?

Do you? Besides Fermi, that is.

Why would an IHV waste the silicon and transistors on the command processor for a GPU? The only reason the Fermi might have a more complicated CP is because it’s designed for GPGPU first and as a renderer second.

The CP simply doesn’t do anything worth the extra die space in making it fully programmable.

For a start you put a material bind between objects 1 & 2 and 5 & 6, which is obviously unnecessary as they are the same material, and in practical applications you would often have quite a few objects sharing the same material.

You’re assuming that the only material properties are UBOs. In this example, I only showed the UBO properties, but there could just as easily have been shared UBOs but different textures.

As for re-binding all the state on a material change, no driver writer is THAT stupid (well OK, maybe an Intel driver writer).

Never underestimate the stupidity of drivers. It was not too long ago that NVIDIA’s GL drivers constantly recompiled shaders when you changed certain uniform state. This was considered an “optimization.”

You’re more likely to get consistently good performance if the specification is tight. A loose specification gives IHVs a lot of room to make things better, but it also allows them to make things worse. That’s why I prefer buffer objects to display lists; VBOs may be slower than DLs sometimes, but they’re consistent.

You say that the driver is wasting time doing comparisons between the old and new materials that can be done better by the application. I don’t agree at all; the material objects consist of a few numbers that reference some shaders and a UBO or two. These comparisons are trivial and would have to be done by the application anyway.

Here is the set of data that a material needs:

1: program.
2: textures and where they are bound.
3: UBOs and where they are bound.
4: non-buffer object uniform state (and no, not everything is or should be a UBO).

Some of this state is intrinsically per-instance state. Some of it is shared among several instances. Some of it is global.

The only way for a driver to know what state changes between materials is for them to actually do the test. However fast this may be (and it can’t be that fast) it is still slower than the 0 time that would be spent if the user simply sent the data properly.

The user is at a higher level than the driver. The user has more tools to know what state is global, what state is per-instance, and what is shared. The user does not have to check the basic material properties; it knows all of the “soldiers” share the same array texture.

Finally, you mentioned traits that objects can have that affect the ideal rendering order but which are not OpenGL state or anything the driver can know about.
Can you give an example of what you mean?

Shadow mapping. You render to a depth texture, then use that texture for rendering the scene. In that second pass, every shader uses this texture.

The driver doesn’t know this; it will have to check this at every material change for the second pass, even though you the user already know that it isn’t changing. It’s a waste of time.

No problem, as you more than compensate on the pro-ATI and NVidia FUD side. :wink:

P.S. I’d evangelize ATI too if we could get our apps running on their drivers (would be good to have them as an alternative, especially right now), but they keep locking the whole machine up randomly and crashing the app in the driver. And I don’t get bindless from them yet. So no surprise, I don’t relay much ATI experience, good or bad.

Never underestimate the stupidity of drivers. It was not too long ago that NVIDIA’s GL drivers constantly recompiled shaders when you changed certain uniform state. This was considered an “optimization.”

Now you’re dredging, dude. Yes it did happen, but in NVidia GPUs circa 2005-6. NVidia driver writers are still the tops out there in product stability. (Disclosure: No, I don’t work for them.)

Between you and Dark Photon, I’m starting to wonder if I stumbled onto the NVIDIA forums or something.
I have nothing against ATI; in fact there are some things, like their closer adherence to the spec, that I prefer over NVIDIA.
I used Fermi as an example simply because I am checking it out at the moment and have the documents to hand.
It would have been a more useful comment if you had actually told us what the L2 cache on a Cypress is.

why do we need command shaders? We’d already be getting performance nearly equivalent to NVIDIA display lists. And if NVIDIA can’t get much more performance than bindless, I don’t see command shaders doing any better.
Display lists, CBOs, or command shaders are all designed to reduce the number of API calls the application has to make. If your application is not CPU bound then you won’t see any difference; if your CPU has too much work to do and can’t keep up with the GPU, then it can make a big difference.
On the GPU side all that matters is that the shader processors are kept busy and don’t stall waiting for the CPU to catch up.
But the main advantage of command shaders is that you could do conditional branching in the rendering loop that depends on GPU state, which currently stalls the pipeline if you try to do it from a CPU rendering loop.

It was not too long ago that NVIDIA’s GL drivers constantly recompiled shaders when you changed certain uniform state. This was considered an “optimization.”

This wasn’t stupidity; this was NVIDIA trying to make certain benchmarks run faster so they got better scores and hence sold more cards.

However fast this may be (and it can’t be that fast) it is still slower than the 0 time that would be spent if the user simply sent the data properly.
Let’s see: the materials have a variable called “Vertex Shader” that contains the name of a compiled shader object (or maybe its GPU address), so to find out if the vertex shader changed we need to compare two numbers. Last I heard, CPUs are pretty good at that sort of thing. If the names match then we don’t need to do anything; if not, we bind the new shader. Repeat for 4 other shaders, a couple of UBOs and some textures: a few billionths of a second extra.
And how do you get “zero” time for your application to do the same thing? You either do the comparison yourself as part of your scenegraph logic, or create a display list for each material-to-material change.
But the real problem is that for each of your shader and UBO changes the driver needs to do some validation checks to ensure that what you are telling it to do makes sense.
With material objects the validation is done when they are created, so when we change material the driver simply changes the pipeline state without repeating these checks every time.
In your shadow mapping example the driver does have to check that each material has the same texture bound as the previous one did, but that’s a single CPU instruction, and how many materials per frame would you have anyway? It would take an awful lot for this time to even be detectable.
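The comparisons being argued about amount to nothing more than this (a sketch with an invented material struct, showing only a few pieces of state):

typedef struct { GLuint program, ubo0, tex0; } Mat; /* placeholder material */

/* Re-bind only the state that actually differs from the current material. */
void bind_material(const Mat *cur, const Mat *next)
{
    if (next->program != cur->program) glUseProgram(next->program);
    if (next->ubo0    != cur->ubo0)    glBindBufferBase(GL_UNIFORM_BUFFER, 0, next->ubo0);
    if (next->tex0    != cur->tex0)    glBindTexture(GL_TEXTURE_2D, next->tex0);
}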

If we can get back to what this thread was originally about:
minimising API calls when you have many thousands of meshes that are too different to use instancing, and I assume each has its own Model matrix that needs to be written to a UBO before another API call draws from its VBO.

I’ll list all the options I can think of, with my personal opinions. I would like to hear what everybody else thinks is the best option(s), or whether you can think of any other ways to do this.

Traditional Display Lists
Gives the vendor the most opportunities for optimisation, and can be lightning fast for fixed geometry, but useless for a world that is constantly changing, as they need to be recompiled from scratch and this takes too long.

Mark Kilgard’s Enhanced Display Lists (Siggraph Asia 2008)
-Compile commands into display lists that defer vertex and pixel transfers until execute-time rather than compile-time.
-Allow objects (textures, buffers, programs) to be bound “by reference”.
-Conditional display list execution.
-Relaxed vertex index and command order.
-Parallel construction of display lists by multiple threads.
This is a lot more flexible: objects can be animated, and building a modified display list does not stall rendering.
But if the world is changing too quickly, then the effort of continuously rebuilding display lists could exceed the gains from using them.

Aleksandar’s command buffer object
Similar to a display list but limited to commands.
The main difference being that it is organised as an array of command slots, allowing it to be edited.
This would allow some flexibility in changing an object’s parameters without having to re-compile it, but then you lose all the optimisations of that compilation, and adding/removing commands could cause a fragmentation problem.

Longs Peak Program Objects
Reduces several API calls to one, and pre-validates the state settings so less work needs to be done to switch materials.
All UBOs seem to be attached to program objects, but then how do you set the per-mesh transformation matrix?
When switching to a new program object the driver must check all attached shaders, UBOs and textures to determine which state needs to be set.

Material Objects
Similar to above, but the driver sorts the material objects into the most efficient rendering order.
This allows the state changes to be optimised and pre-compiled just like display lists.

GL_ARB_draw_indirect
Allows several meshes (from different parts of the same VBO) to be drawn (and multiple instances generated) from data stored in a structure in a GPU buffer object.
This puts all the objects into the same VBO, so they will become fragmented if you add & remove objects.
The main use of this seems to be to allow an OpenCL program to switch between different meshes on the fly or change the number of instances, though I can’t really see where I would use this; a physics engine would either animate an object by moving the vertices in the VBO, or change the ModelView matrix to be used with the mesh.
It doesn’t specify how you position each of these meshes in the world; I would assume you need a UBO containing a ModelView matrix for each instance of each mesh. (A sketch of the indirect command structure follows this list.)

Mesh Buffer Objects
Similar in concept to CBOs, it’s an array of slots used to draw meshes. But instead of arbitrary commands, each slot contains the Model (or ModelView?) matrix of a particular mesh in the game world, and a pointer to the VBO that describes its shape.
For efficient rendering there could be one MBO per material, allowing a single API call to draw the whole lot at once.
Could still have fragmentation problems as removed meshes leave holes in the array.
Could use a linked-list of mesh objects, but then that could cause cache misses unless the whole MBO is prefetched.

Command Shader
Move the entire main rendering loop from the CPU to the GPU.
Will probably require enhancements to the GPU command processor, hence is for future hardware only.
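As promised above, the indirect-draw sketch: GL_ARB_draw_indirect reads its parameters from a structure with this fixed layout in a buffer bound to GL_DRAW_INDIRECT_BUFFER (GL 4.0; the wrapper function and handle name are placeholders):

typedef struct {
    GLuint count;             /* index count                  */
    GLuint primCount;         /* number of instances          */
    GLuint firstIndex;        /* offset into the index buffer */
    GLint  baseVertex;        /* added to each fetched index  */
    GLuint reservedMustBeZero;
} DrawElementsIndirectCommand;

/* The command buffer could be written by an OpenCL kernel, then consumed
   without a round-trip to the CPU. */
void draw_indirect(GLuint cmdBuffer)
{
    glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuffer);
    glDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT, (const void *)0);
}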

Let’s see: the materials have a variable called “Vertex Shader” that contains the name of a compiled shader object (or maybe its GPU address), so to find out if the vertex shader changed we need to compare two numbers. Last I heard, CPUs are pretty good at that sort of thing. If the names match then we don’t need to do anything; if not, we bind the new shader. Repeat for 4 other shaders, a couple of UBOs and some textures: a few billionths of a second extra.

Doing comparison operations in a tight loop is a big no-no for performance. Branches, and branch misprediction, are bad.

Also, you are ignoring the non-buffer object based uniforms in the program. Not every uniform is UBO based, nor should it be.

And how do you get “zero” time for your application to do the same thing? You either do the comparison yourself as part of your scenegraph logic, or create a display list for each material-to-material change.

Because I’m doing a different comparison. It’s the reason why high-level logic can organize data faster than a low-level sorting algorithm: it knows what the data is for. I can write an optimal sorter because I know what data comes from where, what uses which shaders, what things are shared with other things, etc.

If I’m rendering a “soldier”, then I know that he is made up of a number of rendered objects that use certain programs. I know that he uses a texture array atlas that is shared among all soldiers. I know

From a single comparison of “entity type == soldier”, I have already done the equivalent work of 20+ comparisons by the driver.

But the real problem is that for each of your shader and UBO changes the driver needs to do some validation checks to ensure that what you are telling it to do makes sense.

Does it? And considering that it is possible to delete objects at pretty much any time, materials would need the same object validation.

If we can get back to what this thread was originally about:
minimising API calls when you have many thousands of meshes that are too different to use instancing, and I assume each has its own Model matrix that needs to be written to a UBO before another API call draws from its VBO.

The problem is this: we have insufficient evidence that there is a significant performance penalty coming specifically from function call overhead. Bindless graphics doesn’t get its performance from reducing the number of function calls; Dark Photon demonstrated that with his test where he used redundant glVertexAttribFormat calls.

Display Lists don’t get their performance on NVIDIA implementations because of lower function call overhead. They get their performance by properly optimizing the sequence of rendering steps for the hardware in question. Putting all of the data in a driver-controlled buffer object, reformatting the data to be optimal when read, etc.

You shouldn’t try to fix a problem unless you have evidence that the problem exists. Thus far, if such evidence exists, it has not been presented.