If done on the CPU, then after reading the object (a CPU cache miss can be avoided by prefetching) we first need to call the driver to put the matrix into a UBO, then call it again to draw from the VBO (both of which execute a LOT of instructions).
First, how can you avoid that cache miss? You don’t know what object you’re going to be reading next until you dereference the pointer.
Second, none of what you’re talking about involves the execution of a “LOT of instructions”. It involves a lot of work, due to the synchronization needed in updating a buffer object’s contents. But this is not a lot of instructions.
And the GPU version needs to do that synchronization too.
If done on the GPU, then we copy the object’s transformation matrix directly into the UBO, read the address of the VBO, then immediately start reading vertices. MUCH less work overall.
And where does this object transformation matrix come from? The GPU isn’t allowed to read arbitrary CPU data, so it must be coming from a buffer object or the parameter of some other object. Which the CPU must set. This requires CPU/GPU synchronization.
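To be concrete about what that implies, here is a minimal sketch of the CPU-side upload in question (identifiers like objectUbo are mine, purely for illustration):

```cpp
#include <GL/glew.h>

// Minimal sketch of the CPU-side upload under discussion; 'objectUbo' and
// 'modelMatrix' are illustrative names.
void uploadTransform(GLuint objectUbo, const float modelMatrix[16])
{
    glBindBuffer(GL_UNIFORM_BUFFER, objectUbo);
    // Before this write can land, the driver must make sure the GPU is no
    // longer reading the old contents. That implicit synchronization is the
    // real cost, not the raw instruction count of the call.
    glBufferSubData(GL_UNIFORM_BUFFER, 0, 16 * sizeof(float), modelMatrix);
    glBindBufferBase(GL_UNIFORM_BUFFER, 0, objectUbo); // visible to the shader
}
```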
In many cases the entire object buffer would be small enough to remain in the 768 KB L2 cache anyway.
You’re making an assumption that the quantity of data used by the shaders is rather small. I imagine that the state graph for scenes of any significance exceeds 1MB. At least for the ones that are state-change or draw-call bound, rather than shader bound.
Also, where are you getting this 768 KB L2 cache figure from?
In the GPU case the name-to-address translation was done during compilation, so bindless is irrelevant.
Not if you’re doing what you’re talking about. So long as those buffer objects can be affected by the CPU and you expect the results of those changes to be reflected in rendering, the GPU-based scene graph code must still be using the buffer object’s name. Buffer objects can be moved around by the creation/destruction of other memory objects.
The reason bindless works is because of the MakeResident call, which explicitly forbids the implementation from changing or moving the buffer object’s location in memory.
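In API terms, the contract looks roughly like this (this is NV_shader_buffer_load; the buffer name is illustrative):

```cpp
#include <GL/glew.h>

// Sketch of the bindless contract from NV_shader_buffer_load; 'buf' is an
// illustrative buffer name.
GLuint64 getResidentAddress(GLuint buf)
{
    GLuint64 gpuAddress = 0;
    glBindBuffer(GL_ARRAY_BUFFER, buf);
    // MakeResident is the promise: from here on, the implementation may not
    // move or relocate this buffer's storage.
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    // Only once that promise is made does a raw GPU address make sense.
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV,
                                &gpuAddress);
    return gpuAddress;
}
```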
In the Fermi block diagram I can see texture units, vertex fetch units, tessellators, viewport transform units, attribute setup units, stream output units, rasterisation engines, memory controllers and a GigaThread engine.
All of which are fixed functionality, not arbitrary shader processors.
No synchronisation is required; commands to alter objects will simply be queued and executed between frames.
Queued by what? And if such queuing were possible, why isn’t it done now? How do you define a “frame”? When does the state actually change? And how big exactly are these objects? If I’m trying to render 10,000 copies of something, which would have been child’s play with instancing, is that going to require 10,000 objects?
And how does this interact with non-traditional rendering models, like rendering a GUI (typically ad-hoc, without a lot of formal objects for each element) or deferred rendering (the number of passes in the deferred part is based on the number of lights in the scene)?
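And as for the instancing point: for reference, those 10,000 copies are a single call today. A minimal sketch (identifiers are mine):

```cpp
#include <GL/glew.h>

// 10,000 copies of one mesh, one draw call -- no per-copy objects.
// 'program', 'meshVao', and 'indexCount' are illustrative names.
void drawTenThousand(GLuint program, GLuint meshVao, GLsizei indexCount)
{
    glUseProgram(program);
    glBindVertexArray(meshVao);
    // Each instance picks its transform via gl_InstanceID (or an instanced
    // vertex attribute); the CPU issues exactly one command.
    glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                            nullptr, 10000);
}
```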
I am not talking about adding a new stage; I am talking about using the existing NVIDIA GigaThread engine or AMD command processor in a slightly different way.
Except that the command processors in question do not execute arbitrary code. They are incapable of processing a state graph. Command processors are very simple pieces of hardware. They execute a FIFO whose commands are very limited: set registers, execute rendering, clear cache X, etc. All very trivial.
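To illustrate the gap, here is a loose sketch of my own of what such a FIFO amounts to; this is not any vendor’s actual command format:

```cpp
#include <cstdint>

// Loose illustration of how limited a command-processor FIFO is: fixed
// opcodes, no branching, no pointer chasing. Not any vendor's real format.
enum class CmdOp : std::uint32_t {
    SetRegister,  // write a value into a hardware register
    Draw,         // kick rendering with the current register state
    FlushCache,   // flush/invalidate a named cache
};

struct Command {
    CmdOp         op;
    std::uint32_t arg0;  // register index, primitive count, cache id, ...
    std::uint32_t arg1;  // register value, etc.
};
// The processor pops Commands and dispatches on 'op'. Walking a
// pointer-linked state graph simply isn't in its vocabulary.
```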
I would at least like to see Immutable Material objects added on July 25th (Siggraph); after all, this is basically the same as what was proposed for Longs Peak back in 2007.
A query that asks the driver for a list giving the most efficient ordering of the material state changes would be nice too…
These two things act at cross-purposes. Immutable combinations of program objects and the particular set of uniforms they use are not the most efficient way to go.
For example, let’s say you have 7 objects: 3 use program A and 4 use program B. Even though they use different programs, all 7 share one UBO between them. And two of the objects that use program A share another UBO, as do two of the objects that use program B.
Ignoring all other state, this leads to the following sequence of bind and rendering commands:
1: Bind program B.
2: Bind common UBO to the common UBO slot (say, slot 7).
3: Bind shared UBO to slot 0.
4: Render object 1.
5: Render object 2.
6: Bind UBO to slot 0.
7: Render object 3.
8: Bind UBO to slot 0.
9: Render object 4.
10: Bind program A.
11: Bind shared UBO to slot 0.
12: Render object 5.
13: Render object 6.
14: Bind UBO to slot 0.
15: Render object 7.
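In actual GL calls, that comes out to roughly the following (drawObject is a stand-in for whatever glDraw* call each object needs; the buffer and program names are illustrative):

```cpp
glUseProgram(programB);
glBindBufferBase(GL_UNIFORM_BUFFER, 7, commonUbo);   // shared by all 7 objects
glBindBufferBase(GL_UNIFORM_BUFFER, 0, sharedUboB);  // shared by objects 1 & 2
drawObject(1);
drawObject(2);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo3);
drawObject(3);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo4);
drawObject(4);
glUseProgram(programA);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, sharedUboA);  // shared by objects 5 & 6
drawObject(5);
drawObject(6);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo7);
drawObject(7);
```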
Now, let’s compare this to what you would have to do with immutable “material” objects:
1: Bind object 1’s material.
2: Render object 1.
3: Bind object 2’s material.
4: Render object 2.
5: Bind object 3’s material.
6: Render object 3.
7: Bind object 4’s material.
8: Render object 4.
9: Bind object 5’s material.
10: Render object 5.
11: Bind object 6’s material.
12: Render object 6.
13: Bind object 7’s material.
14: Render object 7.
Looks more efficient, right? You don’t have all of those separate binds that we had in the first one.
However, what you’re not seeing is one simple fact: those binds don’t go away just because we happen to be using an immutable material.
Every time you bind one of these material objects, one of two things has to happen. Either the driver has to be stupid, or it has to be smart.
If the driver is stupid, then it will internally bind all of the state, even if that state was previously bound. So in the above case, we get a performance penalty for binding a common UBO 6 extra times.
If the driver is smart, then it will examine the old material and the new, changing only the state that is necessary to change. The problem here is that there’s no need for that. The driver is wasting time doing something that the application could do much more easily.
The application knows that all of these objects share a certain UBO. The driver doesn’t have to check on every bind whether the incoming material uses a different UBO in that slot; we know it doesn’t. So why make the driver do the work?
Now, you could say that we would have to do the same thing on the CPU. Except that’s not true. The work the driver has to do to detect that a shared uniform buffer is in use is a lot harder than the equivalent check on the client side. Objects that share UBOs likely have other traits in common. Traits that can be used to sort the rendering list properly. Traits the driver does not have.
Doing a sort operation on the list of rendered objects will buy you more performance than immutable materials, even in the case where drivers are written to avoid redundant state changes. And if the drivers are written stupidly, you’re in a world of hurt.
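Here is a minimal sketch of the kind of client-side sort and submission I mean (all identifiers are mine):

```cpp
#include <GL/glew.h>
#include <algorithm>
#include <vector>

// The application sorts its own render list by (program, UBO), so shared
// state lands adjacent and redundant binds fall out for free -- no
// driver-side diffing needed. All identifiers are illustrative.
struct RenderItem {
    GLuint program;
    GLuint ubo;  // the slot-0 UBO this object uses
    // ... mesh handle, transform, and so on
};

void sortRenderList(std::vector<RenderItem>& items)
{
    std::sort(items.begin(), items.end(),
              [](const RenderItem& a, const RenderItem& b) {
                  if (a.program != b.program) return a.program < b.program;
                  return a.ubo < b.ubo;  // group shared-UBO objects together
              });
}

void submit(const std::vector<RenderItem>& items)
{
    GLuint curProgram = 0, curUbo = 0;
    for (const RenderItem& item : items) {
        if (item.program != curProgram) {  // bind only on actual change
            glUseProgram(item.program);
            curProgram = item.program;
        }
        if (item.ubo != curUbo) {
            glBindBufferBase(GL_UNIFORM_BUFFER, 0, item.ubo);
            curUbo = item.ubo;
        }
        // issue the draw call for 'item' here
    }
}
```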
I would rather have low-level drivers that do exactly and only what they’re told, rather than drivers that have to figure out stuff I already know.