direct_state_access reloaded

Alfonse_Reinheart · September 28, 2009, 10:56am

Do not change API just because you think you cannot make better implementation of it.

The easier you make it to have implementations provide more performance, the more likely it will be that implementations will actually provide that performance.

The_Fiddler · September 28, 2009, 1:38pm

So says you. However note that DirectX changes APIs every single version and it’s the better API for it.

mfort · September 28, 2009, 2:03pm

So says you. However note that DirectX changes APIs every single version and it’s the better API for it.

So cheap argument. DirectX is not evolving. Every single version is a revolution. They do not need to keep compatibility. They don’t care. Their philosophy is to make new API from scratch all the time. Made it well done, as close as possible to current HW technology. They are doing this by purpose, not because they have poor programmers that cannot keep compatibility. They made their way and they are successful in it. OpenGL is different. Many small, evolving steps. As much compatible as possible.

Alfonse_Reinheart · September 28, 2009, 2:31pm

OpenGL is different. Many small, evolving steps. As much compatible as possible.

Which is why OpenGL is worse. It is more prone to driver bugs (due to the complexity of the vast API), contains innumerable ideas that made sense at the time but are ridiculous from a modern perspective (go ahead; explain how to attach a buffer object to a VAO to someone or how to attach a texture to a program), etc.

kRogue · October 14, 2009, 1:15pm

Bindless graphics is a horrible API. It breaks the basic model of vertex shaders (let alone uniforms), requiring you to write shaders specifically for it. It is also incredibly low-level, which makes widespread implementation difficult if not impossible.

I soooooooooo disagree on it being a horrible API. Also, Alfonse Reinheart has his wishes of doing VAO locks, so beware of not so neutral opinions.

Here is why I like the bindless API:

very, very straight forward to use: allocate buffer, make memory resident. That is it. If you re-allocate the buffer then you have to make it resident again. Pretty simple and straight forward in my eyes. For vertex attributes it does not require one to rewrite shaders or anything.
clearly breaks the glVertexAttribPointer call into two functions that
a. set format of data
b. set source of data
Pointers in shaders! Maybe this is what Alfonse Reinheart is saying about needing to rewrite your shaders? Without bindless graphics putting a complicated scene graph uploaded to the GPU requires a high level of trickiness and not something I’d want to do by hand!

As for it being to low level, come on people, how low level is really providing a GPU address for a block of GPU managed memory. If GLint64/GLuint64 bothers you, lets just do this:

typedef uint_64 GL_buffer_object_address;

and then it will look like the bindless API is abstract now(rather than calling it an address call it a lock-binding point or something, giggles). As for being too low level, buffer objects are all ready very low level, it is not like buffer objects are allowed to compress their data or anything, it is raw bytes whose memory management is done by GL (with all the horror of indirect rendering across different endianness one can imagine!). Additionally, UBO’s have much nastier packing rules than nVidia’s bindless API.

My 2 cents.

Alfonse_Reinheart · October 14, 2009, 2:59pm

Without bindless graphics putting a complicated scene graph uploaded to the GPU requires a high level of trickiness and not something I’d want to do by hand!

If I were doing something where I needed a “complicated scene graph,” I’m pretty sure I’d use OpenCL. This is why OpenCL exists; so that OpenGL’s shading language can be a shading language, not an arbitrary programming language.

As for it being to low level, come on people, how low level is really providing a GPU address for a block of GPU managed memory.

Very. When you deal with direct pointers to things, you are dealing at a low level. And pretending to hide the pointer doesn’t help.

If you can achieve the same performance effects of bindless without breaking the abstraction the way it does, then you should. That’s why mapping buffers is OK; it allows you to get performance that you otherwise couldn’t.

I have yet to see evidence that an alternate API, aimed at the particular gains of bindless graphics but while preserving the abstraction, would be unable to achieve similar results.

As for being too low level, buffer objects are all ready very low level, it is not like buffer objects are allowed to compress their data or anything, it is raw bytes whose memory management is done by GL (with all the horror of indirect rendering across different endianness one can imagine!).

It isn’t low level, because you are unable to directly access or affect this memory. The memory has a controlled interface, which is what gives drivers freedom.

Additionally, UBO’s have much nastier packing rules than nVidia’s bindless API.

That’s because UBOs are, get this, cross platform. Bindless graphics only has to work on NVIDIA hardware. There is a reason why this is an NV extension, and not an EXT extension like separate shader objects.

And the std140 packing rules are basically standard C. I fail to see how this is “nasty” in any way.

kRogue · October 15, 2009, 12:33pm

That’s because UBOs are, get this, cross platform. Bindless graphics only has to work on NVIDIA hardware. There is a reason why this is an NV extension, and not an EXT extension like separate shader objects.

And the std140 packing rules are basically standard C. I fail to see how this is “nasty” in any way.

cough hack. UBO’s have a funny packing when it comes to vec3, ivec3 and uvec3… they all take up the room of a 4 vector, i.e. simple 32-bit aligned packing rules are not enough to describe, or for that matter 64-bit packing rules. Though in truth the issue is mute: you can query GL for the offsets anyways. Additionally the bindless graphics API does define packing rules, and actually they are 9/10 easier than UBO.

Buffer objects are only quasi-cross platform, do indirect rendering between different endianess to see what I mean.

It isn’t low level, because you are unable to directly access or affect this memory. The memory has a controlled interface, which is what gives drivers freedom.

um cough again. Have you really read the bindless extensions at all really, or even tried to use them. Firstly bindless graphics is only about reading from buffers, there is nothing there about writing, the only part that is required is that GL needs to be told to make the buffer avialable and to give an address for the GPU to use it. Change the word address to handle and you are all set!

Just ot make sure we areon the same page, lets take a look at what it’s interface for the pointers in shaders is:


GLSL over-simple example:
uniform float **funkyness;
in ivec2 indexing;

//
funkiness_I_want=funkyness[indexing.x][indexing.y]

Nothing weird there, looks like can finally do lots of things we take for granted everywhere else. In fact this is much clearer than say packing data into several texture buffer objects once you get into more complicated structures (since a fixed texture buffer object can only return 1 type always). Now for the GL side:


GLuint *funky_buffers, funkyness;
GLuint64 *funky_buffer_addresses, funkiness_address;

funky_buffers=new GLuint[dimX];
funky_buffer_addresses=new GLuint64[dimX];
glGenBuffers(&funky_buffers, dimX);
glGenBuffers(&funkyness, 1);



for(int i=0;i<dimX;++i)
{
  //allocate them
  glBindBuffer(GL_ARRAY_BUFFER, funky_buffers[i]);
  glBufferData(GL_ARRAY_BUFFER, sizeof(float)*dimY, NULL, usage_enum);

  //make the buffer resident:
  glMakeNamedBufferResidentNV(funky_buffers[i], GL_READ_ONLY);

  //get the "address"
  glGetNamedBufferParameterui64vNV(funky_buffers[i],        GL_BUFFER_GPU_ADDRESS_NV, &funky_buffer_addresses[i]);
}

//fill funkyness with the "poitners" to each buffer object
glBindBuffer(GL_ARRAY_BUFFER, funkyness);
glBufferData(GL_ARRAY_BUFFER, sizeof(GLuint64)*dimY, funky_buffer_addresses, usage_enum);

glMakeNamedBufferResidentNV(funkyness, GL_READ_ONLY);
glGetNamedBufferParameterui64vNV(funkyness, GL_BUFFER_GPU_ADDRESS_NV, &funkiness_address);

//do whatever you like to fill the buffer data with
//glBufferSubData or transform feed back, or whatever
//just don't reallocate the buffer object with glBufferData
//also note that you can change what buffers funkiness uses
//by just changing the values.

GLint funkyness_uniform;

funkyness_uniform=glGetUniformLocation(GLSLProgram, "funkiness");

glUniformui64NV(funkyness_uniform, funkiness_address);

How does that break that break abstraction really? change the word address and the type GLuint64 to say “locked-buffer-id” and “GL_locked_buffer_id_type”.

Lastly:

If I were doing something where I needed a “complicated scene graph,” I’m pretty sure I’d use OpenCL. This is why OpenCL exists; so that OpenGL’s shading language can be a shading language, not an arbitrary programming language.

Unfreaking believable, really. If one can send the data in a more flexible way to the shader then that is SOOO much better. Simple things like skinning are much easier with bindless than without (MD5 skinning is much easier to write with bindless than without) Bindless graphics also gets rid of something that is so irritating in 3d graphics: the endless cleverly repacking of vertex data to fit into the simple vertex attribute model. An additional bit is this: alot of the need to stream vertex data to the GPU is no longer needed with bindless graphics, you can do all the calculation on GPU with the much of the flexibility one takes for granted on CPU. With bindless graphics, if you are sick enough, you can reduce rendering many different models to just one instanced draw call, not just models in different places, but models with different data sets entirely. Weather or not this is the best thing for performance is not clear since:

What are the performance characteristics of buffer loads?

RESOLVED: Likely somewhere between uniforms and texture fetches,

but totally implementation-dependent. Uniforms still serve a purpose

for "program locals". Buffer loads may have different caching

behavior than either uniforms or texture fetches, but the expectation

is that they will be cached reads of memory and all the common sense

guidelines to try to maintain locality of reference apply.

One more nasty bit:

Which is why OpenGL is worse. It is more prone to driver bugs (due to the complexity of the vast API), contains innumerable ideas that made sense at the time but are ridiculous from a modern perspective (go ahead; explain how to attach a buffer object to a VAO to someone or how to attach a texture to a program), etc.

Giggles, EXT_direct_state_access handles most of that quite well, in at this point the best thing to do for DSA is to jsut take the extension into the spec, for the compatibility profile as is, for the core profile just remove all the references to removed stuff. On the subject of writing drivers, take a look at slide 37 of Kilgard presentation:

Deprecation – Myths
-Feature removal will result in a faster driver
-Feature removal will result in a higher quality driver
-Feature removal will result in a cleaner API
-Not removing features means OpenGL will die
-Only useless features were deprecated
----Far from true

Considering who Kilgard is, I tend to take his word.

Alfonse_Reinheart · October 15, 2009, 1:46pm

UBO’s have a funny packing when it comes to vec3, ivec3 and uvec3… they all take up the room of a 4 vector, i.e. simple 32-bit aligned packing rules are not enough to describe, or for that matter 64-bit packing rules.

This is not “funny” packing. It’s quite common when dealing with low-level SEE-type math operations that vec3’s take up the same room as vec4’s.

Buffer objects are only quasi-cross platform, do indirect rendering between different endianess to see what I mean.

How would that even be possible? Wouldn’t that mean that you had a CPU with a different endianness than the GPU that you’re using? That would break one of the basic assumptions of buffer objects.

And using pointers instead of buffer objects wouldn’t improve this any. So I’m not really sure what your point here is.

In fact this is much clearer than say packing data into several texture buffer objects once you get into more complicated structures

I fail to see how this would not be achievable with simply a more flexible implementation of uniform buffer objects. NVIDIA is perfectly capable of, via pointers, making UBO accesses into pointer accesses behind the scenes. So why don’t they?

It would be even cleaner, since you would not need access to actual pointers.

the endless cleverly repacking of vertex data to fit into the simple vertex attribute model.

If your vertex data is actually vertex data, where is the “repacking” coming from?

If you need a general-purpose computation API, OpenCL exists. I see no need to make GLSL into that.

Considering who Kilgard is, I tend to take his word.

Considering that Kilgard works for NVIDIA, who does not have a vested interest in making OpenGL implementations easier to write (since they already have one. They want Intel’s job with Larrabee to be as hard as possible), I’ll go with the empirical data: ATI’s D3D implementation is more solid than their OpenGL implementation, and D3D implementations are easier to write than GL implementations.

Further, his statements make no sense. It is a verifiable fact that the least buggy code is the code that is never written. So while feature removal will not guarantee these things, not removing features certainly isn’t helping.

Unfortunately, Kilgard’s words can’t make ATI or Intel’s OpenGL implementations better. Making OpenGL implementations simpler has at least some chance of working.

kRogue · October 15, 2009, 11:42pm

This is turning into a flame war, but oh well, since you do not know what indirect rendering even is:

How would that even be possible? Wouldn’t that mean that you had a CPU with a different endianness than the GPU that you’re using? That would break one of the basic assumptions of buffer objects.

And using pointers instead of buffer objects wouldn’t improve this any. So I’m not really sure what your point here is.

SO jsut so you know: indirect rendering is where the process and the the GL-server are on different machines. One can with X-windows launch a process on one machine and have it render on another, this rendering includes GL. Now what happens when the endianess of where the process is running and the X-server don’t match? All hell breaks loose with respect to buffer objects. The endianness of the data inside a buffer object is the endianness of the server, not the process. So now you pack data into your buffer object, you have to make sure that you pack in the endianness of the X-server where it is being rendered. This is a big deal under a variety of circumstances. Before buffer objects the vertex data was considered client data and as such the transport mechanism and GL driver took care of the endianness issues, but with buffer objects it is server state and must be in the endianness of the server. My point on the buffer objects being server memory is that the abstraction leaks significantly anyways, you need to know that the endianness of the server and the machine running the process.

I fail to see how this would not be achievable with simply a more flexible implementation of uniform buffer objects. NVIDIA is perfectly capable of, via pointers, making UBO accesses into pointer accesses behind the scenes. So why don’t they?

You are really missing some critical bits:

UBO’s have a very, very fine limit on size.
There is a very hard limit on the number of UBO’s available
The rule of thumb of UBO’s is that it is slowed than a uniform access but faster than everything else subject to sequential caching rules.

Lets give a simple, simple, example where bindless graphics is definitely worthwhile.

We have two key frame meshes, call them A and B, separately animated, each with a different number of frames. You wish to create a mesh where some vertices are from A and some from B. Of critical importance is that some triangles have vertices from A and B. Texture co-ordinates however are not taken from A or B.

How would you do this without bindless graphics? The easiest, not to mention dumbest thing, is that one creates a new keyframe mesh which has number_frames=number_frames(A)*number_frames(B) and proceed directly from there. Another approach is to use transform feed back, but all of these answeres are actually silly, this is an example where API is getting in the way. Bindless graphics gives you this shader:


uniform mat4 *matrixTransformations;
uniform vec4 **meshVerticesFrame0, **meshVerticesFrame1;
in ivec2 which_vertex; // .x holds which mesh, .y holds which vertex
uniform float *t;

void main(void)
{
   vec4 v;

   v=matrixTransformations[which_vertex.x]*
     mix(meshVerticesFrame0[which_Vertex.x][which_Vertex.y],
         meshVerticesFrame0[which_Vertex.x][which_Vertex.y],
         t[which_Vertex.x]);

   //whatever more...
}

simple, easy to read and even supports an arbitrary number of meshes. This was just a quick simple job, and the GL code since we do not have extra steps at all is much, much easier.

Lets move onto md5 skinning, ok?

For md5 skinning a vertex v, is computed as


for(i=0, p=vec3(0,0,0); i<number_weights(v); ++i)
{
  p+= weight(v,i) * matrix[ which_joint(v,i) ]*weight_position(v,i);
}

the typical way to map that into GL without bindless graphics is to set a hard maximum number of the number of weights and then for each possible weight one uses an attribute. This incurs a memory waste since some vertices have lots of weights and some have very few. As an exercise write it with bindless graphics, and observe that less video memory is needed, the code is easier to read on both the GLSL and GL side.

Considering that Kilgard works for NVIDIA, who does not have a vested interest in making OpenGL implementations easier to write (since they already have one. They want Intel’s job with Larrabee to be as hard as possible), I’ll go with the empirical data: ATI’s D3D implementation is more solid than their OpenGL implementation, and D3D implementations are easier to write than GL implementations.

Further, his statements make no sense. It is a verifiable fact that the least buggy code is the code that is never written. So while feature removal will not guarantee these things, not removing features certainly isn’t helping.

Now you are really beginning to shovel it with such choice gems that nVidia is trying to make GL harder to implement, unbelievable. The reason why that up until a year or 2 ago that ATI had poor GL drivers was simple: they did not spend the man power on it, just enough to run Quake/Doom/Id Game. D3D drivers are not easy to write and can be quite hairy too, Have you written D3D drivers? Have you written GL drivers?

Alfonse_Reinheart · October 16, 2009, 10:48am

UBO’s have a very, very fine limit on size.

But they do not have to. If NVIDIA wanted, they could implement UBOs as actual pointers under the hood. Then, they could simply put a really big number on the size limit.

And if this is not possible (presumably because uniforms must have a definite size when defined in GLSL), they could simply have made an extension relaxing that limitation. That is, you could define a uniform like:


uniform mat4 myMatrixList[];

This would only be legal in a uniform block. The size then becomes whatever the user gives it. There would be specific grammar restrictions on how this can work (unbounded arrays must be the last thing in the block, etc), but there is nothing preventing this from being implemented.

This provides similar functionality while maintaining the abstraction. The only thing you lose is indirection: the ability to put a pointer inside a uniform and access it indirectly. Essentially uniform within a uniform.

Note: I’m not arguing against the utility of bindless. Yes, you can find uses for it. I’m arguing against the fact that it breaks a very useful abstraction. And it does so without needing to.

There is a very hard limit on the number of UBO’s available

See above.

How would you do this without bindless graphics?

If “bindless” was implemented as above, it would work just fine. It would also be cross-platform, rather than NVIDIA-specific.

the typical way to map that into GL without bindless graphics is to set a hard maximum number of the number of weights and then for each possible weight one uses an attribute.

The typical way this is done is to limit the number of weights to 4, so that the weights all fit into 1 attribute. Yes, this does waste memory for vertices with fewer than 4 weights. But it is certainly good enough.

I would also point out that you lose something with bindless. If your mesh data is no longer using attributes to get information, then you also lose the automatic conversion to the input type. You can pass unsigned bytes normalized on [0,1] as attributes.

But if you want to do that with bindless and avoid attributes, then your shader must specifically be written to use and expect unsigned bytes, and it must be specifically written to do the conversion, including normalization). If you have one mesh that uses unsigned bytes and one mesh that uses unsigned shorts, you must have and maintain two different shaders.

I’ll take the hardcoded, efficient, and free conversion logic for attributes over that.

Now you are really beginning to shovel it with such choice gems that nVidia is trying to make GL harder to implement, unbelievable.

I never said that. I said that they do not have a vested interest in making implementations easier. That is different from saying that they are actively trying to make them harder.

Not being interested in making things easier means that they provide no support for doing so. So when an NVIDIA spokesperson comes along and says that making OpenGL less complex will not improve buggy implementations, I will take this comment with the proper skepticism based on where it comes from.

Have you written D3D drivers? Have you written GL drivers?

It is still an order-of-magnitude easier to write a D3D10 driver than a full 3.2 compatibility OpenGL implementation. Just a few of the things you have to do in GL that you don’t in D3D10:

1: TexEnv.
2: Fixed-function T&L.
3: Selection.
4: The stupid form of feedback.
5: Partial fixed function interactions (when some things are shaders and others are FF).
6: GLSL compiler (D3D implementations only have to write to an assembly language).

These are not small tasks. TexEnv in particular is a highly complicated bit of shader building, as is the fixed function T&L.

Yes, it can be done; both NVIDIA and ATI have done this. But it is still non-trivial code that has to be written and tested. D3D does not have this.

kRogue · October 16, 2009, 12:24pm

But they do not have to. If NVIDIA wanted, they could implement UBOs as actual pointers under the hood. Then, they could simply put a really big number on the size limit.

But they probably NEED to. The expected access performance for UBO’s is higher than bindless. Additionally, if you take that view, the UBO’s should not have any practical size limits, just as samplerBuffer’s don’t. But the point is that each has a different expected usage pattern and as such they are implemented differently. Even over in D3D10 land the equivalents to UBO and Texture buffer objects are different and the MSDN article on them goes on (and on) about that too.

The typical way this is done is to limit the number of weights to 4, so that the weights all fit into 1 attribute. Yes, this does waste memory for vertices with fewer than 4 weights. But it is certainly good enough.

ROFL. Open up an MD5 mesh from Doom3, a game a generation old, and see that 4 is NOT enough.

I would also point out that you lose something with bindless. If your mesh data is no longer using attributes to get information, then you also lose the automatic conversion to the input type. You can pass unsigned bytes normalized on [0,1] as attributes.

Keep in mind that the driver does this conversion for you, and ahem, it is not at all for free. Additionally, like 99.99% of the time the form of the input data is pretty fixed, and you do not vary the form of the input data for a fixed shader, so the ability to change the format/interpretation of the data without changing the shader has a pretty weak use case, really freaking weak.

if “bindless” was implemented as above, it would work just fine. It would also be cross-platform, rather than NVIDIA-specific.

NO. Even allowing stuff like MyData because you have to put it at then end of an UBO you are not matching bindless at all, because you cannot do:


struct
{
  mat3 some3Mats[];
  mat4 some4Mats[];
}

but you can do in bindless:


struct
{
 mat3 *someMat3s;
 mat4 *someMats4;
}

Shoot with bindless you can do SO much more:


struct perThingy
{
  mat4 *funky1;
  float giggles;
  int foobar;
}

struct
{
  struct perThingy **holybatman;
  vec3 ***jumpyWillikirs;
  mat3 m;
};

The main issue is what is the data type of the “pointer” over on the GL side? Here again, you can abstract just as follows:


GLbuffer_binding v;

glMakeNameBufferResident(myBuffer);
v=glGetNameBufferGLSLHandle(myBuffer);

//later:
ptr[location]=v;
glBufferSubData(blah.blah);

How does that break the abstraction? It also clearly admits something we have to see immediately: all a buffer object is, is memory managed and manipulated through GL. That is it. There is no abstraction in that, nothing more. There is no abstraction there. The only possible bitch slap you can have is that different GL implementations may handle buffer data in a real wonked out way, maybe the “GPU address” is like 1024 bytes or something, in that case bindless is hosed, it assumes that GLSL could quickly access the memory of a buffer object by looking at something 64 bits wide, where maybe like in 2140 we’ll need 1024 bit wide addresses or something equally silly (actually 64bit OS’s don’t even use the full 64bits as an address anyways). The natural hack fix is then that one would have to query GL for the size of the thingy so GLSL could get to the memory faster, but this is hackyish and would be horribly awkward to use.

Just to keep harping on how much great bindless is consider a typical differed shading system: you draw to some offscreen buffers typical stuff:

diffuse color
normal
specular
positional data (typically just z)
material ID.

where material ID sets the differed shader to use. Now, if you wanted to support say per-mesh data, then you would have to start packing that data into one (or two) common buffer objects and bind them as texture buffer objects, and naturally fetch the offsets for that pixel into those buffers with another value (or pack an offset into the g-buffer which in turn refers to another texture buffer object which in turn holds the material ID’s etc). Now with bindless you don’t have to pack the data so awkwardly, you can do what you really want: POINT to the data, with it increased readability and much easier GL code to.

I am not even going to start harping on your driver comments, really I am not… must control myself… must… sighs have to say it:

glTexEnv is unfortunatley really two functions wrapped up into one:
a. controls multi-texturing for fixed function pipeline fragment uber-“shader”.
b. GL_TEXTURE_FILTER_CONTROL: affects choosing texture level of detail

For a. again, what we are seeing is that the fixed T&L is an “uber”-shader. For b. that is awkward, I cannot even defend it.

2.Fixed-function T&L. Not that it matters, if you look at it correctly it just means that the GL implementation provides a default uber-vertex and fragment shaders linked together (with name 0) and a variety of state from fixed T&L like GL_LIGHT, etc, are mapped to appropriate under the covers glUniform calls.
3. Selection: ahem, like 99.99% chance this is implemented all on CPU anyways nowadays.
4. GL_FEEDBACK, ie. stupid feedback, same story: 99% chance in software as well.
5. partial fixed function interactions: enter EXT_seperate_shader_objects and that fixed T&L vertex and framgnet stages are private shaders of the driver bound to name 0.

GLSL compiler (D3D implementations only have to write to an assembly language).

This is almost right and worth noting, except that the assembly has to be transformed into whatever the GPU really thinks in. It would be nice that the Kronos would update the asm style shaders (rather than just nVidia updating it and giving it NV extension status only) and then we could see GLSL compilers as external programs… well actually we already have that on nVidia platform anyways, it is called cgc -oglsl. But wait there is more! Different GPU architectures, just like different CPU’s, will want to schedule instructions and break them down differently. So now to take a generic assembly interface a D3D driver most likely has buried in it a dynamic recompiler. Joy.

Alfonse_Reinheart · October 16, 2009, 12:58pm

But they probably NEED to. The expected access performance for UBO’s is higher than bindless. Additionally, if you take that view, the UBO’s should not have any practical size limits, just as samplerBuffer’s don’t. But the point is that each has a different expected usage pattern and as such they are implemented differently. Even over in D3D10 land the equivalents to UBO and Texture buffer objects are different and the MSDN article on them goes on (and on) about that too.

The purpose of an abstraction is to abstract things. This frees the hardware to implement things how it wants, and exposes this functionality to the user via the abstraction.

This is part of the reason why performance is not a part of the OpenGL specification. NVIDIA is perfectly free to use pointers to implement uniform buffers. If they don’t do it, it is because they choose not so.

ROFL. Open up an MD5 mesh from Doom3, a game a generation old, and see that 4 is NOT enough.

I said the typical way. Doom3 is not a typical game.

Even allowing stuff like MyData because you have to put it at then end of an UBO you are not matching bindless at all, because you cannot do:

Just break the struct into two separate uniform buffers. Yes, it’s not as “pretty” as the single struct of pointers, but it gets the job done. And that’s what matters.

How does that break the abstraction?

Because you can have indirection in the shader.

The pointer value can be stored in a uniform. The shader can read that value, cast it to a pointer, and access it as just another pointer.

Once you have pointers in GLSL, they can go anywhere. That’s why it is so important to keep them out.

Furthermore, it breaks the careful packing rules that allow UBO to be cross platform.

2.Fixed-function T&L. Not that it matters, if you look at it correctly it just means that the GL implementation provides a default shader (with name 0) and a variety of state from fixed T&L like GL_LIGHT, etc, are mapped to appropriate under the covers glUniform calls.
3. Selection: ahem, like 99.99% chance this is implemented all on CPU anyways nowadays.
4. GL_FEEDBACK, ie. stupid feedback, same story: 99% chance in software as well.

The default shader has to be written and debugged. And, if it is meant to be used, it must run reasonably fast. Thus, it must be optimized. A single massive monolithic shader won’t run fast; you have to dynamically built it from pieces of shaders for optimal performance.

Things that run in software still have to be written. That means you now need to write, debug, and maintain a software renderer. This is not a trivial undertaking.

Different GPU architectures, just like different CPU’s, will want to schedule instructions and break them down differently. So now to take a generic assembly interface a D3D driver most likely has buried in it a dynamic recompiler.

I don’t know what you mean by a “dynamic recompiler”, but whatever you would do for the assembly language, you would do for GLSL. And you still have to implement the compiler part, which is a non-trivial thing that the assembly version makes fairly trivial.

kRogue · October 16, 2009, 1:26pm

The purpose of an abstraction is to abstract things. This frees the hardware to implement things how it wants, and exposes this functionality to the user via the abstraction.

This is part of the reason why performance is not a part of the OpenGL specification. NVIDIA is perfectly free to use pointers to implement uniform buffers. If they don’t do it, it is because they choose not so.

Sighs. The entire point of having all these different ways of accessing data is that they implicitly state how you intend to access it. GL is not just about abstracting, which is nice up to a point. It is about using the 3D hardware well without needed to write for a specific GPU or understand every freaking hardware’s internals well.

I said the typical way. Doom3 is not a typical game.

[sarcasm] Right, Doom3 is not typical at all, completely wierd architecture, nothing but corner use cases. Bad model formats, etc. [/sarcasm] Give me a break.

Because you can have indirection in the shader.

The pointer value can be stored in a uniform. The shader can read that value, cast it to a pointer, and access it as just another pointer.

so what. The abstraction that a buffer object is just this: bytes managed by GL. To be an ass, then casting pointers in C is also a horrible abstraction break right? Actually I don’t think that Bindless lets you cast between pointer types at all, but this I have to check.

Furthermore, it breaks the careful packing rules that allow UBO to be cross platform.

Bindless also has strict packing rules, they are specified in the specification. Those packing rules guarantee already it is cross platform.

The default shader has to be written and debugged. And, if it is meant to be used, it must run reasonably fast. Thus, it must be optimized. A single massive monolithic shader won’t run fast; you have to dynamically built it from pieces of shaders for optimal performance.

Funny that, quite some time ago, I think it was nVidia, published shader code that gave fixed functionality via a shader. And really, look at what fixed function is, writing one shader for it is NOT a big deal at all (you can make a case requiring 8 shaders for the number of texture combiner stages, 1 shader for each possible count). Not to be nasty, that is far from rocket science.

Things that run in software still have to be written. That means you now need to write, debug, and maintain a software renderer. This is not a trivial undertaking.

giggles: both old style feedback and selection are NOT rendering anything, they fully focus on the vertex processing stage! Shoot, it is not very hard to implement using transform feed back and reading back from GL the buffer data. Please.

I don’t know what you mean by a “dynamic recompiler”, but whatever you would do for the assembly language, you would do for GLSL. And you still have to implement the compiler part, which is a non-trivial thing that the assembly version makes fairly trivial.

ROLF. Ok, you need to look up what a dynamic recompiler is. The bone dead simple answer is this: you feed it compiled code for one architecture and it outputs compiled code for another architecture with the understanding that it will schedule and such for that output architecture, and guess what, epic high chance that is what the D3D drivers have to take the D3D assembly and feed something into the GPU. do you think a texture fetch is just “one instruction” on a GPU? Between filtering and all that love it is a set of instructions.

Alfonse_Reinheart · October 16, 2009, 3:09pm

The abstraction that a buffer object is just this: bytes managed by GL.

No, it isn’t. That’s the abstraction that NV_vertex_array_range provides.

Buffer objects abstracts the location of the memory as well as allocation. Individual buffer objects have no relation to one another. They cannot refer to one another. And they do not live in any one particular place.

To be an ass, then casting pointers in C is also a horrible abstraction break right?

Which is why every competent book on C++ will tell you that having to do an explicit cast is generally a design flaw that you should avoid. C can’t avoid it, but C is a low-level language used for writing general-purpose applications.

Bindless also has strict packing rules, they are specified in the specification. Those packing rules guarantee already it is cross platform.

No, they do not. These packing rules guarantee it works on all platforms that support the extension. Platforms that cannot support these packing rules would be unable to support the extension.

To have rules that guarantee true cross-platform support, you have to actually talk to people who make other platforms. The Uniform buffers packing rules are a compromise that was created so that all platforms could implement them.

you feed it compiled code for one architecture and it outputs compiled code for another architecture with the understanding that it will schedule and such for that output architecture

There needs to be a special term for a compiler that happens to compile for an instruction set that the compiler itself is not executing?

This is all part of the optimization stage of building programs. You cannot deny that it is harder to write a GLSL compiler+linker than it is to write one for, say, the highest version of NV assembly. They may share the optimizer underneath, but the front-end part of the compiler is much more complex in the GLSL case.

do you think a texture fetch is just “one instruction” on a GPU? Between filtering and all that love it is a set of instructions.

That assumes that filtering is not done by specialized filtering units and must be done by the shader implicitly. Most of the time, this is not the case. On ATI hardware, depth accesses must do the depth comparison in the shader, as they do not have dedicated hardware for that comparison. Otherwise, you can expect texture operators to be single instruction.

Of course, them being single cycle is another matter entirely.

kRogue · October 17, 2009, 12:24am

Sighs, the idiocy is too much, here goes, really this is the last freaking time:

No, it isn’t. That’s the abstraction that NV_vertex_array_range provides.

Buffer objects abstracts the location of the memory as well as allocation. Individual buffer objects have no relation to one another. They cannot refer to one another. And they do not live in any one particular place.

F’ing BS dude. That buffer objects can’t refer to each other is a missing feature and bindless provides that. Where the buffer is really located on the card is so heavily abstracted in bindless anyways: the “GPU address” is most certainly virtual, etc.

Which is why every competent book on C++ will tell you that having to do an explicit cast is generally a design flaw that you should avoid. C can’t avoid it, but C is a low-level language used for writing general-purpose applications.

GLSL is sooo much more like C, and it should be since it is executed on every fragment and vertex. But wait it gets better! HLSL lets you cast will-nilly too. The reason: it makes performance development easier and helps stop the API and language from getting in your way. What do you think happens when you write assembly anyway?

No, they do not. These packing rules guarantee it works on all platforms that support the extension. Platforms that cannot support these packing rules would be unable to support the extension.

To have rules that guarantee true cross-platform support, you have to actually talk to people who make other platforms. The Uniform buffers packing rules are a compromise that was created so that all platforms could implement them.

more BS: that packing rules provided by bindless provide a means so that it will work on other platforms, now that the UBO rules look the way they do is for those GPU’s that are SSE-ish in their behavior. You can make a case for that. But guess what, that really, really does not matter. It is not exactly brain surgery to have an option for bindless to use UBO packing rules instead. But wait! Why did nvidia make those kind of packing rules? The answer is so that struct made in a 32bit packingare almost the same as struct for bindless. That was done as an effort for cross-platform.

You could make this case:
UBO packing rules are what they are to allow for SSE-like behavior. All that means is that the packing on UBO’s then supports:

SSE 32-bit packing.

it is not really cross hardware to fantasy wacky hardware which can not support that. So with that mind, if I my head worked the same deficient way yours did, I would also say UBO is not cross platform.

There needs to be a special term for a compiler that happens to compile for an instruction set that the compiler itself is not executing?

This is all part of the optimization stage of building programs. You cannot deny that it is harder to write a GLSL compiler+linker than it is to write one for, say, the highest version of NV assembly. They may share the optimizer underneath…

Sighs, here we go: make a D3D shader:

HLSL source –> D3D compiler –> D3D assembly

That D3D compiler does do work and source analysis.

Over in the driver:

D3D assembly –> Driver dynamic recompiler –> GPU instructions

Now, that middle step is needed in the D3D driver.

…but the front-end part of the compiler is much more complex in the GLSL case.

Giggles: the folks that pushed GLSL, 3DLabs, released an open source, use it anyway you like it license, GLSL front end.

Otherwise, you can expect texture operators to be single instruction.

Giggles, rolf. Now I know you have no clue, really no clue. That it is presented as one operation has nothing to do with what goes on. Do you think Intel’s Laurabee will do it in one instruction? Get real.

As a side note, given that the thread topic is on direct state access, and for the last 2+ pages has been a debate on bindless graphics, I am no longer going to take the troll bait on this.

kRogue · October 19, 2009, 12:56pm

Just a quick FYI for those that read through my bile:

NV’s bindless graphics DOES support pointer casting.

knackered · December 16, 2009, 3:07pm

that was fun - can’t believe I missed it all these months. kRogue makes a good case.
I hate the whole vertex attribute/uniform bollocks.