OpenGL Siggraph BOF

Originally posted by elFarto:
[b]One little thing I couldn’t quite figure out how it works is the relationship between GLSL samplers, image objects and sampler objects.

Could you (Mr Gold) possibly write a short piece of example code to show the relationship of the above, perhaps showing how to ‘bind’ an image object to a GLSL sampler.[/b]
No guarantee this won’t change, but the current thinking is roughly as follows. This code assumes a few self-documenting utility functions as hinted in the BOF slides.

GLimage image = gluCreateImage2D(format, width, height, levels);
GLsampler sampler = gluCreateSampler2D(GL_LINEAR, GL_LINEAR_MIPMAP_LINEAR, GL_CLAMP_TO_EDGE, GL_CLAMP_TO_EDGE);
GLshader shaders[2];
shaders[0] = gluCreateShader(GL_VERTEX_SHADER, vshader);
shaders[1] = gluCreateShader(GL_FRAGMENT_SHADER, fshader);
GLprogram program = gluCreateProgram(2, shaders);
GLuniformBlock uniformBlock = gluCreateUniformBlock(program);
GLint location = glGetUniformLocation(uniformBlock, "mySampler2D");
glUniformSampler(uniformBlock, location, image, sampler);
glBindProgramObjects(program, 1, &uniformBlock);
DrawSomething();

Originally posted by Michael Gold: [snip]
Thank you very much.

Regards
elFarto

But I think I’m safe in saying that the ARB as a whole is comfortable with the general direction of the 3.0 proposals from ATI and NVIDIA.
Fair enough.

My question now is this: what is L&M that isn’t the new object model?

The more I look at the new object model APIs, the more it seems to me that they are the new API. They are L&M, in effect if not in fact. Here’s why.

The new object model effectively deprecates (and suggests layering) everything that doesn’t use it. You can’t use old-style objects with new-style ones, so there’s a specific pressure to use the new style exclusively. And, of course, there are the “minor” new features like (multiple) uniform buffers, the changes to VBO behavior, etc. Oh, and the obvious performance advantages from making objects immutable.

This sounds suspiciously like what L&M was supposed to provide. According to the slides:

On the older HW the API may expose a hard limit on the number of buffers, or it may merge buffers with an associated performance hit (probably small if the layout of the buffers that can be bound to individual “bind points” is known beforehand), whichever they choose.
That’s not good. People are going to want to actually use that functionality. They aren’t going to want to query a limit and see that it can only bind 1 uniform object; they’re going to want to design their code around having multiple uniform objects. I, for one, don’t want to have to write multiple paths just to set uniforms efficiently.

The API does need to expose future hardware, of course, but it should not do so in such a way as to suggest renderer design elements that work against performance in current hardware. Particularly so if there are easy ways to avoid it. It’s not as bad as requiring Vista in order to use the API, but it has similar effects. Either the feature goes unused for a good 3 years, or developers have to write multiple paths for performance reasons, or developers just accept slower performance on current cards.

It is possible to provide for multiple uniform buffers and so forth efficiently in current hardware. All it requires is a creation-time connection between an array of uniform objects and the program that this array will be used for. That would be when the mapping from each program uniform to the uniform in the individual blocks would be made.

Using Michael’s example code, my suggestion would be:

GLimage image = gluCreateImage2D(format, width, height, levels);
GLsampler sampler = gluCreateSampler2D(GL_LINEAR, GL_LINEAR_MIPMAP_LINEAR, GL_CLAMP_TO_EDGE, GL_CLAMP_TO_EDGE);
GLshader shaders[2];
shaders[0] = gluCreateShader(GL_VERTEX_SHADER, vshader);
shaders[1] = gluCreateShader(GL_FRAGMENT_SHADER, fshader);
GLprogram program = gluCreateProgram(2, shaders);
GLuniformBlock uniformBlock = gluCreateUniformBlock(program);
GLint location = glGetUniformLocation(uniformBlock, "mySampler2D");
glUniformSampler(uniformBlock, location, image, sampler);
GLprogramInstance instance = glCreateProgramInstance(program, 1, &uniformBlock, &vertexArrayBlock);
glDrawInstanceToFBO(frame_buffer_object, instance);

The glCreateProgramInstance call would formally bind the list of uniforms and the vertex array (possibly a list of vertex arrays) to the program. This would be a creation-time thing, so the cost is only incurred once. Oh, and it could use one of those attribute objects instead of a straight function call, for extensibility’s sake.

Binding a uniform block (particularly multiple uniform blocks) to a program is a CPU-intensive operation. It’s going to involve a lot of string copies and so forth. Sure, an implementation can speed things along for the simple case of one block and one program, where the block was created from the program (if the driver can detect that). But for the general case of several blocks, which is something that engine designers will want to incorporate at the engine level, this is heavy.

Obviously the spec is still in flux, but I would certainly appreciate it if they would consider making the binding of uniform arrays to programs a create-time thing rather than a runtime one (assuming it isn’t already). You should still be able to change the data in the uniforms as normal. It’s the string mapping overhead that I want to make sure never/rarely turns up at runtime.

Originally posted by Korval:
My question now is this: what is L&M that isn’t the new object model?
Removing redundant and obsolete paths. The entire fixed-function pipeline and all the machinery and state that goes with it. Immediate-mode geometry specification. And so on.

That stuff is to be supported by a compatibility layer running (conceptually, at least) above the L&M profile. So older apps that require the backwards compatibility that OpenGL has always offered would still work, but using the entire “3.0” API stack. New apps written to the L&M profile would not need the compatibility layer.

Jon

That’s not good. People are going to want to actually use that functionality. They aren’t going to want to query a limit and see that it can only bind 1 uniform object; they’re going to want to design their code around having multiple uniform objects.

I would be happier with a fixed limit I can query than with the driver doing some “clever” things behind my back (e.g. software emulation of GLSL shaders if a hidden limit is hit).

By the time the new API is reasonably usable (a finished API and solid driver support), I would likely target DX10-level features primarily, with a fallback for DX9 hardware, in which case I would have to write several code paths anyway. Or, if that happens really soon, I would target DX9-level hardware, in which case a single code path optimized for it would be sufficient.

Actually, on DX9-level hardware I would probably create only one uniform block shared by all shaders of the same type (vertex, pixel) and managed by me, like the global environment of the ARB_*_program extensions, which is something I greatly miss in GLSL shaders.
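
For illustration, roughly what I mean, sketched with the hypothetical glu* helpers from Michael’s example (the reuse pattern and names here are my own invention):

// One app-managed block shared by every vertex shader, much like the
// ARB_vertex_program global environment (sketch only, hypothetical API).
GLuniformBlock vertexEnv = gluCreateUniformBlock(firstVertexProgram);

// Update values in one place, then reuse the same block for each
// program of that type.
glBindProgramObjects(programA, 1, &vertexEnv);
DrawSomething();
glBindProgramObjects(programB, 1, &vertexEnv);
DrawSomethingElse();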

Of course your priorities may be different.


Binding a uniform block (particularly multiple uniform blocks) to a program is a CPU-intensive operation. It’s going to involve a lot of string copies and so forth.

There is no need for any string manipulation during block binding. The driver can create a global string_name->some_id mapping table during shader compilation or during uniform creation, and use the strings only at API entry points. The required fixups can then be stored in several fixed tables and hashes created at shader/block creation time.
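
A minimal sketch of what such a driver-side table might look like (all names and structures here are invented for illustration):

#include <string.h>

// Interned once at compile/creation time; binding later works purely on ids.
typedef struct {
    const char *name;  // uniform name, stored once
    int         id;    // small integer used internally from then on
} UniformNameEntry;

// Called only from API entry points that take strings, never at bind time.
int lookupUniformId(const UniformNameEntry *table, int count, const char *name)
{
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return table[i].id;
    return -1;
}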

If another block with the same format is bound in place of an existing block, which is likely the most common use case, no additional fixup should be necessary other than a copy of the uniform values on DX9 hardware.

DX10 avoids this overhead by exposing explicit constant buffer slots with an explicit variable layout within each buffer, so the final storage location of each variable is known at shader compile time. That would also make it easy to merge constant buffers on DX9-level hardware.
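
Conceptually, an explicit layout just means the buffer is a struct whose member offsets are fixed up front, e.g. (purely illustrative):

// With an explicit layout, every offset is known at shader compile
// time, so no name lookup is needed when the buffer is bound.
typedef struct {
    float localToWorld[16];   // bytes  0..63
    float lightDirection[4];  // bytes 64..79
    float materialColor[4];   // bytes 80..95
} PerObjectConstants;         // bound as a whole to one buffer slot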

It seems that the new object model is the biggest and most important change in OpenGL 3.0. However, I think one shouldn’t read too much between the lines. It’s not the time to ask for all the details, because for now we can only guess at how the final API will look, and I don’t think we should bother the guys working on it with too many complicated questions.

I, for one, am very happy that ATI and nVidia have taken this on. It seems that finally some big change is underway. However, we all know that the guys at the ARB and Khronos are very skilled, so I don’t think we need to be worried.

So, let the guys work (and let them enjoy their weekends :wink: ). We can discuss the details, when we actually got a spec.

Jan.

This isn’t MSN Messenger, they don’t have to keep replying. I’m finding this discussion really interesting, but have nothing to contribute.

Must… reply… can’t… resist…

On the question of uniform blocks: efficiency is the whole point. The layout of uniforms is fixed at program creation, so there are no string operations required at bind.

Flexibility is also a goal. You may create custom uniform blocks in order to share uniforms between programs, and/or swap a subset of uniforms without modifying the rest. This must be done prior to program creation in order to retain efficiency.

Binding will fail if the uniform block(s) is/are not compatible with the program.
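
As a rough sketch of the ordering, in the spirit of my earlier example (these entry points are placeholders, not final API):

// Create the shared block first, from some layout description...
GLuniformBlock shared = gluCreateUniformBlockFromLayout(layoutAttributes);

// ...then create each program against that layout, baking the offsets in.
GLprogram progA = gluCreateProgramWithBlocks(2, shadersA, 1, &shared);
GLprogram progB = gluCreateProgramWithBlocks(2, shadersB, 1, &shared);

// Either program now accepts 'shared' (or a clone); anything else fails.
glBindProgramObjects(progA, 1, &shared);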

Originally posted by Michael Gold:
Must… reply… can’t… resist…
Apparently this is MSN Messenger, then… :eek:

Are there any anticipated dates for releasing the first 3.0 L&M proposal/specification, and when should we expect the first driver implementations of that proposal/specification from NVidia and ATI?

Originally posted by Hampel:
Are there any anticipated dates for releasing the first 3.0 L&M proposal/specification…
Not on the slides, but stated at the BOF - Siggraph 2007 is the goal.

-mr. bill

The layout of uniforms is fixed at program creation, so there are no string operations required at bind.
Now I’m really confused…

As I understand it, in hardware, uniforms are just a flat “array” of registers, numbered 0 through N-1. When you link a program, it assigns each uniform variable name to one or more uniform registers. So, the mat4 declared with the name “localToWorld” gets, say, uniforms 0-3.

However, a second program may declare a mat4 uniform with the same name, but because the order of stuff may be different, it gets uniforms 6-9.

When you build your custom uniform block, you say that it has a mat4 named “localToWorld”. If you bind this uniform block to both programs (as it is a shared uniform), it can’t have the “localToWorld” matrix in both hardware uniforms 0-3 and 6-9. So it seems like one of two things needs to happen.

One: at bind time, you determine where each uniform defined in the uniform blocks gets assigned, based on the program. So, you do a lot of searching. You find the mat4 that has been named “localToWorld”. This isn’t onerous, but it isn’t free either.

Two: at bind time, you patch the program, rewriting its register references to match the layout defined by the uniform blocks. So you walk into the program and remap all the references from 0-3 to 6-9 or wherever the uniform blocks say things are laid out. But if programs are stored in GPU memory, this can’t be a quick operation.
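
To make option one concrete, here is the kind of bind-time search I mean (all types and names invented for illustration):

#include <string.h>

typedef struct { const char *name; int firstReg; } ProgUniform;
typedef struct { const char *name; const float *data; int nregs; } BlockUniform;

// Option one: resolve every block entry by name against the program's
// own table at bind time; lots of string compares on every bind.
static void bindBlockBySearch(const ProgUniform *prog, int nprog,
                              const BlockUniform *blk, int nblk,
                              float regs[][4])
{
    for (int i = 0; i < nblk; i++)
        for (int j = 0; j < nprog; j++)
            if (strcmp(blk[i].name, prog[j].name) == 0) {
                memcpy(regs[prog[j].firstReg], blk[i].data,
                       (size_t)blk[i].nregs * 4 * sizeof(float));
                break;
            }
}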

So, what exactly am I missing here that makes object binding not slower than it could be?

Binding will fail if the uniform block(s) is/are not compatible with the program.
How is compatibility defined?

BTW, as a matter of interest, how do you deal with uniforms that are structs? Since the program defines what the structs are, do you not need the linked program to build that uniform block?

Additionally, it might be a good idea if, instead of just creating a default uniform object directly, a linked program created a mutable attribute object from which a default uniform object could be created. That way, you could edit the attribute object (removing shared uniforms, for example, assuming things in an attribute object can be removed) before creating the per-instance uniform block.

One last thing: format objects.

This is something I didn’t notice on my first reading, but you were talking about objects for things like image formats, right? GL_RGB8, etc? Presumably this exists so that you can ask for an available image format that corresponds to some set of parameters, rather than just say, “Give me an RGB image of some kind.”

OK, one thing after the last: display list objects.

I’m thinking that, with the concept of geometry-only display lists as well as vertex array objects, what you really want is just a “derived” class of vertex array object: an object that is totally compatible with VAOs but has a different method of creation (rather than being built from buffers and so forth). That sounds like a really good idea.

This sounds like extension territory, though; it’s really complicated and is something that probably shouldn’t hold up the new object model.

[ edit Because I keep coming up with stuff based on the new object model ]

Something just occurred to me. Because all images are alike, is it therefore possible/reasonable to take a “renderbuffer” (an image created from a format that, I guess, suggests being a render target as its primary function?) and bind it as a texture to a sampler? Will there be combinations of these bindings that don’t work, whether binding a depth sampler to a non-depth texture or just the wrong format to an image?


One: at bind time, you determine where each uniform defined in the uniform blocks gets assigned, based on the program. So, you do a lot of searching. You find the mat4 that has been named “localToWorld”. This isn’t onerous, but it isn’t free either.

From what Michael said, I assume that when you create a program object that uses uniform blocks, you will need to specify the format of those blocks by providing the attribute objects used to create them, already-created instances of those blocks, or something similar. This way, program compilation can determine that the content of a given block (or part of it) will be stored starting at a specific offset in the array of registers (or in a specific buffer slot on DX10 hardware). That information never changes after the program has been created, so it can easily be stored in a table (for DX9 hardware, the table for a buffer with format id X for shader Y might contain entries like: copy range (Z-W) from the buffer to the constant array starting at offset V). On DX10 hardware the bind will associate a buffer with a slot; on DX9 hardware it will copy part of the buffer content to the specified location without needing to search for anything.
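
A sketch of the kind of fixup table I have in mind (names invented); it is built once at program creation, so a bind is just a few range copies:

#include <string.h>

// One entry per contiguous range: "copy range (Z-W) from the buffer
// to the constant array starting at offset V", as described above.
typedef struct {
    int srcOffset;  // start of the range within the uniform block (in floats)
    int dstOffset;  // destination offset in the register array (in floats)
    int count;      // number of floats to copy
} Fixup;

static void applyFixups(const Fixup *f, int n,
                        const float *block, float *registers)
{
    for (int i = 0; i < n; i++)
        memcpy(registers + f[i].dstOffset, block + f[i].srcOffset,
               (size_t)f[i].count * sizeof(float));
}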


But if programs are stored in GPU memory, this can’t be a quick operation.

If program modification is required for some reason, there is no need to do it directly on the GPU. The driver always has a system memory copy of the binary program. It can do all modifications in system memory and upload the entire new program to the GPU.

Binding will fail if the uniform block(s) is/are not compatible with the program.
How is compatibility defined?

Probably by having the same format of attribute object/uniform block as the one used during program creation.

Originally posted by Korval:
So, what exactly am I missing here that makes object binding not slower than it could be?
You are making a lot of assumptions. First off, the model you have described is not the only possible implementation.

The layout of the uniform block is known at the time the program is compiled, so the program can hard-code the relative offset of each uniform. Since the program is immutable and will only work with a uniform block of this layout, the code never needs to change. So our task at bind time simply becomes: put the uniform block where the program can find it.

If the hardware works as you describe, the driver simply copies the uniform block into the register bank. No searching is required, the uniform names are long forgotten at this point, everything is already in the proper order.

How is compatibility defined?
The uniform block must match the exact layout expected by the program. For simplicity’s sake, let’s assume you need to bind the original block used at program creation, or a clone of that object.

BTW, as a matter of interest, how do you deal with uniforms that are structs? Since the program defines what the structs are, do you not need the linked program to build that uniform block?
This is no different from any other data type; you need to create the uniform block from the program, or you need to create a uniform block which matches the data types expected by the program.

Additionally, it might be a good idea to be able to, instead of just creating a default uniform object from a linked program, that it creates a mutable attribute object that would create a default uniform object. That way, you can edit the attribute object (removing shared uniforms, for example. Assuming things in an attribute can be removed) before creating the per-instance uniform block.
Problem is we don’t want the layout of the uniform block to change after the program is linked, for the reason described above.

I’m not prepared to talk about format objects or display lists at this time.

Something just occurred to me. Because all images are alike, is it therefore possible/reasonable to take a “renderbuffer” (an image created from a format that, I guess, suggests being a render target as its primary function?) and bind it as a texture to a sampler?
An image may be used as a render target, or a texture, or both. This usage must be specified at creation time and will be strictly enforced. This is important because the implementation may make storage decisions based on usage, and we can do a better job if we don’t have to guess. :slight_smile:
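
For example, creation might take an explicit usage mask in the style of the earlier sketch (the names below are pure invention, not final API):

// Hypothetical usage flags, specified once at creation and enforced.
GLimage image = gluCreateImage2DWithUsage(format, width, height, levels,
                                          GLU_USAGE_RENDER_TARGET |
                                          GLU_USAGE_TEXTURE);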

The layout of the uniform block is known at the time the program is compiled, so the program can hard-code the relative offset of each uniform. Since the program is immutable and will only work with a uniform block of this layout, the code never needs to change. So our task at bind time simply becomes: put the uniform block where the program can find it.

This makes sense, but what about uniform block sharing? I’m getting the impression from some of the things you’ve said and from careful reading of the slide on uniforms that the way to do it is like this.

You build the shared uniform block before building the programs themselves. You then pass this shared uniform block (or blocks) to the linking function when you’re creating the program.

At which point, you have made a binding (no pun intended) contract with the programs that these particular uniform objects (or, as you say, a clone of those objects) will be bound whenever the program itself is to be used. When you use the program to generate its uniform block, what it generates are all the uniforms that are not satisfied by those uniforms used at creation time.

This mechanism seems like it solves all of the problems I described. Hardware that doesn’t natively support the construct would need a few block mem copies, but that’s hardly onerous.
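
Restated as code, in the style of the earlier examples (every call here is a guess at the eventual API):

// Build the shared block first, from some layout description...
GLuniformBlock shared = gluCreateUniformBlockFromLayout(layoutAttributes);

// ...and hand it to program creation, forming the "contract".
GLprogram program = gluCreateProgramWithBlocks(2, shaders, 1, &shared);

// The program's generated block then covers only the uniforms that
// the shared block does not satisfy.
GLuniformBlock perInstance = gluCreateUniformBlock(program);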

This usage must be specified at creation time and will be strictly enforced.
Hurray for strict enforcement of “hints”!

BTW, if the hoped-for timescale for the 3.0 API is SIGGRAPH '07, what’s the timescale for the new object model?

Your understanding is basically correct, modulo some minor details to be ironed out.

The new object model is integral to 3.0. We may roll out some of the new objects as extensions to 2.x before the final spec is complete; this will give us an opportunity to prove the functionality before the interface becomes “immutable”. :slight_smile:

Will 3.0 still support immediate mode rendering? Or something like the proposed “pseudo instancing”?

The slides mention that immediate mode will be available through a separate layer that works on top of 3.0’s Lean & Mean layer. So it won’t be “natively” supported, but it will be supported.

In fact, that’s essentially what the drivers do today anyway, if I am not completely mistaken.

I don’t know about instancing, but I do hope that feature will be available.

Jan.

Do not assume that a draw call will accept a list of objects. We have published no such API. I don’t like your proposed API any more than you do. :slight_smile:
Can you guys show us how a draw call would look in this new object model? Or maybe give us some hints?

Cheers on your great work.

Daniel

May I just add a question.

I’ve seen a lot of specifications and patterns for creating data in this new object model, but almost none on manipulating the data itself.

Although most objects will have an immutable structure, the contents are still mutable, right? If so, are there any patterns for manipulating OpenGL data?

Cheers.
Daniel