Pipeline Newsletter Volume 4

(could someone lend me their skills to keep posts short? :slight_smile: )

Rob, while your b) is true, blocking vs. non-blocking can be make-or-break depending on whether the user and the implementation agree on the interpretation. This particular feature is a performance promise, iff the user interprets the wording as “shall not block”. In that case the user “knows” that s/he can write data at around bus speed (whether the bus is local PCIe or Token Ring :slight_smile: ).

Michael, thank you for ACKing the “may” issue.

As for usability - being able to map/flush sub-ranges of a buffer is something I consider potentially very useful (from a performance POV, obviously), instead of having to deal with multiple buffers.

I might want to request consideration for the feature others have requested - rebased indices, so that one could use f.ex. one index buffer, and one vertex buffer containing many “frames” of some (on-CPU calculated) animation where at time-of-use one could say “index 0 refers to vertex[4711], normals[5472], colors[0]”. I don’t know how useful it’d be in reality, but I can indeed see uses for it. A precomputed “path” for a cannon-tower on a tank turning. A “walking” or “running” sequence for a character…

However (didn’t you expect it :slight_smile: ), for this sub-mapping/flushing to be truly useful I think I’d have to know about the alignment restrictions, since the article states “This option allows an application to assume complete responsibility for scheduling buffer accesses”. The only piece of software that can tell me about (optimal) buffer alignment is the implementation.

Consider the following case if I didn’t know about mapping alignment requirements:

I create a large buffer. Let’s say I only use it for geometry data. I write a batch of vertices that ends in the middle of a “page” (*). I “flush” (perhaps unmapping is even still a requirement?) this range to tell the implementation “you go ahead, I’m done with this range”, and issue a draw call. I then merrily continue to fill the buffer starting just after my previous end position (mapping that range first, if required) - that starts in the middle of the last “page” of the previously “flushed” area.

If this buffer (memory area) is truly “mapped” over a bus (PCI-ish), it means that either the implementation needs to take a private copy of this last page and place it somewhere else in the card’s memory (performance hit, not to mention requirement to fiddle with the GPU for this non-sequential “jumping around” in physical on-card memory when reading what should be sequential memory), or it needs to map the whole last “page” of the previous batch as writable again into my process’ address space - thereby giving me write-access to the data I already told it “I’m done with this” and allowing me to potentially interleave (bad) writes to an area the GPU is busy reading.

An even worse scenario would be something like:

  • batch 1 writing “pages” 0-1.5
  • batch 2 writing “pages” 3.5-5
  • batch 3 writing “pages” 1.5-3.5
    as it could require both “page” 1 and “page” 3 to be mapped for the third batch, while at the same time the GPU will be reading them (both).

I suspect this is the “room for programs to screw up” I read between the lines, but I think it can be improved to prevent this - while still providing maximum possible speed - by the simple addition of the following:

Had I on the other hand been able to query the implementation about alignment requirements, I could “line up” my next write to the next “page” boundary and start writing in a fresh “page”.
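
Just to illustrate what I mean - a minimal sketch, where the lpGetBufferParameter call and the LP_MAP_ALIGNMENT token are purely hypothetical (that query is exactly the thing I’m asking for):

/* Hypothetical query - this entry point does not exist; it only          */
/* illustrates the information I'd like to get from the implementation.   */
LPint alignment = lpGetBufferParameter(buffer, LP_MAP_ALIGNMENT);

/* Round the next write offset up to the next "page" boundary, so a new   */
/* batch never starts inside a page the GPU may still be reading from     */
/* (assumes the alignment is a power of two).                             */
LPint nextOffset = (prevEnd + alignment - 1) & ~(alignment - 1);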

I therefore consider this alignment information vital for “proper” (as in as-fast-as-possible, which seems to be the stated goal) use of this feature, so that it can be used without creating either full or partial stalls at any level. That is, assuming I haven’t misunderstood something in this short analysis.

As for the problem of not being able to save a binary blob of a compiled program: why not simply reserve some space towards the beginning of the blob for the implementation to play with, say 8 or 16 bytes (heck, save it like a PASCAL string, prepending the compatibility-verification data with a byte telling how large this “private” data is), where it can store e.g. the PCI ID and/or driver version? Such a small change shouldn’t take more than a few minutes to implement (for each vendor), it would allow freedom of implementation (256 bytes is likely more than enough to verify compatibility), and it would be a user-mode-only thing with no need to send this verification data over the bus. Compare it to prepending a TCP packet with an IP header if you like (and even that header can, if you really need it, be variable-sized). That way vendors can verify compatibility of the current hardware with the pre-compiled blob and simply report success/failure, and in case of failure I recompile the program. It seems so easy that I’m starting to fear I’m missing something obvious. Am I?
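
Just to make the idea concrete - a sketch of what I mean, purely illustrative and in no way a proposal for the actual byte layout:

/* Purely illustrative - not a proposal for the real layout. */
struct ProgramBlobHeader
{
    unsigned char privateSize;    /* PASCAL-string style length byte (0-255)   */
    unsigned char privateData[1]; /* privateSize bytes of vendor-private data, */
                                  /* e.g. PCI ID and/or driver version         */
};
/* The compiled program itself follows right after the private data. */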

(*) I used the word “page” in the loosest sense. For an implementation on NT-based systems this alignment would be a “section” (64KB alignment), for a GLX implementation it’d likely be a host page. For e.g. Linux or FreeBSD with local h/w, I haven’t got a clue what they use as alignment for mapping h/w to virtual memory. :slight_smile:

Oh, I almost forgot:
“We’ll try our best to get the wording right in the final spec. If an ambiguity slips through, it will be neither the first time nor the last.”

I know, that’s one of the reasons I brought it up. :slight_smile: Will we get a chance to have a look at a “release candidate” of the spec before it’s carved in stone? More eyeballs and such…

For the answers to these and other questions, please come to the OpenGL BoF at SIGGRAPH.
Why not have someone set up a microphone and record the presentation, compress it in an MP3, and put it online for people to download?

rebased indices, so that one could use f.ex. one index buffer, and one vertex buffer containing many “frames” of some (on-CPU calculated) animation where at time-of-use one could say “index 0 refers to vertex[4711], normals[5472], colors[0]”.
That’s not what they asked for. What was requested was a parameter to the “lpDraw*” functions that takes an integer offset to be applied to all indices before indexing into the various arrays.

while at the same time the GPU will be reading them (both).
If you map memory for writing that the GPU is reading from, you incur a stall (unless you map it using the all-purpose Get-out-of-jail-free card of “non-serialized access”). That’s what mapping means.

Now, because the write range you specify is in bytes, not pages, all the GPU needs to worry about is whether or not the bytes you’re mapping match bytes that it has been told to read from. So even if Batch 3 is writing to page 2 after Batch 1 started reading from it, the GPU doesn’t need to worry unless the address range for Batch 3 actually falls inside Batch 1.

That is, you never need to know about pages; that’s the responsibility of the implementation. The only thing you need to make sure of is that you never write outside the bounds that you mapped.
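
Concretely, the only test the implementation has to care about is a plain byte-range overlap check, something along these lines:

#include <stddef.h>

/* Two half-open byte ranges [aStart, aEnd) and [bStart, bEnd) overlap iff: */
int rangesOverlap(size_t aStart, size_t aEnd, size_t bStart, size_t bEnd)
{
    return aStart < bEnd && bStart < aEnd;
}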

Originally posted by tamlin:
[b]
As for the problem of not being able to save a binary blob of a compiled program;

It seems so easy that I’m starting to fear I’m missing something obvious. Am I?
[/b]
I do not think that there is a technical problem with implementing such functionality. I think it is more about the need to decide the best way to integrate the retrieval/set functionality into the API, deciding how it will interact with the rest of the API in various situations, which objects and what parts of them are blob-able, and so on. This needs to be well thought out because it might be part of the API for a long time.

Why not have someone set up a microphone and record the presentation, compress it in an MP3, and put it online for people to download?
I second that. Or go live!

BTW, what happened with the pod-cast poll?

Originally posted by knackered:
well it looks like they’re just like d3d vertex declarations…
http://msdn2.microsoft.com/en-us/library/bb206335.aspx

Yes, provided that the VAO also has some sort of equivalent of D3D Vertex Streams. It’s an essential part. The newsletter doesn’t give any hints about it, unfortunately.

Originally posted by Jon Leech (oddhack):
[quote]The only thing, that comes to mind right now, is that drawcalls do not include an “offset” parameter for the indices (offset added to each index, not the thing the “first” parameter is used for).
As currently defined, you specify an offset when attaching a buffer to the VAO (IOTW, it is a mutable VAO attribute). I didn’t have room to go very deeply into the individual object attributes and behaviors in that article.
[/QUOTE]What kind of entities serve as the buffer attachment points in VAO?

In particular, are they more like array names in today’s OGL?
Or are they like Vertex Streams in D3D9 Vertex Declaration?


Explanation for those readers who are less familiar with The Dark Side:

The “Vertex Stream” in D3D is a fancy name for a simple thing: a subset of the active vertex attributes. When you are defining a Vertex Declaration in D3D, each listed vertex attribute is assigned a number. The attribs to which you have given the same number together comprise a Vertex Stream. Note that in the most common usage you only have one stream per Vertex Declaration, which is the equivalent of using a single interleaved array in GL. Of course, multiple Vertex Streams do have their uses, just like non-interleaved arrays in GL.

In D3D9, you bind a Vertex Buffer to a Vertex Stream (of the currently bound Vertex Declaration). An offset is also provided by the user, to indicate where in the buffer the stream data starts. Note that with that functionality, another D3D9 feature - the infamous “base index” - is redundant, because instead of changing the base index you could just re-bind the VB at a different offset. It’s a trade-off of one API call for one API call. Unlike in GL, where we’d have to re-bind each attribute in a separate call.
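
For the GL-only readers, that trade-off looks roughly like this in D3D9 (the entry points are the real ones; device, vb and the frame/stride values are of course just assumed for the example):

// Option A: re-bind the VB at a different byte offset for this frame...
device->SetStreamSource(0, vb, frameIndex * bytesPerFrame, vertexStride);
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, 0, 0, numVerts, 0, numTris);

// ...or Option B: keep the binding and shift all indices via the base index.
device->SetStreamSource(0, vb, 0, vertexStride);
device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, frameIndex * vertsPerFrame,
                             0, numVerts, 0, numTris);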

Also, D3D9 instancing works by assigning “frequencies” or “dividers” to Vertex Streams. This is an example of a case where it is necessary to have more than one stream.

In D3D10, they did change some related terminology, but the Vertex Stream is still there.

I personally think the concept of Vertex Streams makes a hell of a lot of sense. Many things can be said about D3D, but in this part, I think, they just got it right. If the new VAO doesn’t get an equivalent of D3D Vertex Streams, I predict the resurrection of “I wanna base-index extension” threads…

Note that with that functionality, another D3D9 feature - the infamous “base index” - is redundant, because instead of changing the base index you could just re-bind the VB at a different offset.
No, that’s not how it works. If the impetus for the index offset feature was simply the number of API calls, it wouldn’t be an issue. It’s not the API for offsetting a buffer-bind point; it’s the internal stuff that the implementation needs to do to make it work.

The general argument is that, whenever you change an attribute pointer, you need to do some validation work to make sure that the pointer, stride, etc work. And presumably, this work is non-trivial. Thus the purpose of the offset is to avoid rebinding buffers.

Vertex streams don’t change this.

Also, D3D9 instancing works by assigning “frequencies” or “dividers” to Vertex Streams. This is an example of a case where it is necessary to have more than one stream.
A feature that is effectively dead. Nowadays, particularly with Longs Peak, the expectation is that you will use the instancing feature of the API. It will pass a number, 0 through n-1, where n is the number of instances, to your vertex shader. From there, you will figure out what you need to do.

It’s up to the implementation to decide how to make this work. If it can do D3D9-style instancing, then it will create a buffer with the numbers 0 through n-1, and the shader compiler will quietly turn the built-in variable into an attribute. If it works like D3D10 instancing, then it will natively handle this case. Otherwise, it will simply issue multiple draw calls, changing the uniform under the hood for each call in the most efficient way the implementation can.
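
In today’s GL 2.x terms, that worst-case fallback amounts to nothing more exotic than this (sketch only; instanceLoc stands for whatever uniform the compiler substituted for the built-in, and numInstances/indexCount are assumed):

// Worst-case fallback: one draw call per instance, bumping the
// "instance ID" uniform each time.
for (int i = 0; i < numInstances; ++i)
{
    glUniform1i(instanceLoc, i);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, 0);
}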

I personally think the concept of Vertex Streams makes a hell of a lot of sense.
In what way?

It is a perfectly meaningless concept. It gives no actual benefit, except for the kind of older hardware that actually had vertex streams. It’s exactly the kind of thing an abstraction API should abstract.

I can almost guarantee that LP will not expose streams to the user. From the user’s perspective, they’re just a layer of bureaucracy that provides no actual benefit that cannot be achieved in some other, more effective, way.

Korval,

About rebasing indices; OK, I stand corrected. Then let me add this request now. :slight_smile:

While it seems vertices and normals should indeed go hand-in-hand (and therefore it could be seen as reasonable that rebasing indices for one rebases the other pointer too), I see no reason for this to be true for either texture coordinates or colors, just to name two. Quite the opposite. I see much reason to be able to keep e.g. the texcoords or the vertex colors the same over a surface while deforming the surface.

Just think about (a precomputed animation) deforming any surface. The vertices move, and therefore the normals change. But is it equally obvious that texture coords change, or colors change? I don’t think so. And in the name of consistency, as I here only singled out two out of four attributes, would there be any harm in a design where all attribute “pointers” can be rebased? What about user-specified arrays (objects)?

Imagine a 100 frame animation with 10000 vertices (I just pulled those numbers), where you for simplicity had saved 1M verts and 1M normals in two LP array objects. What if I then had 1 color, a few texture-coords, and 3 other arrays for each vertex, but they could remain static for the whole animation.

What’s the point in forcing allocation and upload of 100 identical copies into 100 times as large buffers, when a single copy in each buffer [EDIT: was “a single buffer”] would suffice - if I only could rebase indices per-array?

Imagine 1 color (4 bytes), 3 texture coords (36 bytes) and let’s say 3 user arrays of 3 floats each (36 bytes) for each vertex. That’s 72 bytes/vertex. Now compare 72 * 10,000 = 720,000 bytes vs. 72 * 1,000,000 = 72,000,000 bytes (not counting overhead). Let’s round off and say we compare 720KB vs 72MB.

Sure, the vertex coords+normals would require 24MB, but why add 72MB on top of that if 720KB could suffice?

Again, I don’t know. Maybe I’m just dreaming up scenarios noone would ever use. Then again, maybe someone would…

Komat, while I think I understand your concern, isn’t there (going to be) but a single entry point to upload an already compiled program blob (of a specific type?), and isn’t that entry point going to return success/failure?

If there were/are other ways to upload already compiled programs I too would be wary, but are there?

I agree it needs to be well designed, as it would stay with us for (hopefully) 25+ years (just as OpenGL 1.0 can still be used, though perhaps 1.1 is the display of “1.0 was wrong, we didn’t think it through enough” - just to prove the point).

According to Pipeline Newsletter 4, Image data will be defined using lpImageData[123]D, and that’s a very bad idea IMHO.

I propose using a single lpImageData signature. Since the image dimensionality is part of the Image Format object, it is redundant to specify it again. By changing offset, width, height, depth to LPint* offsets and LPint* sizes, we can have one function only.

Another point: I added an ‘index’ parameter to specify which cubemap face (or array element) we are dealing with. I believe this would be somewhat similar to the ‘target’ parameter in OpenGL 2.1.

This is the result:

  
void lpImageData( LPimage image,
                  LPint index, // An integer, or: CUBE_POS_X, CUBE_NEG_X, CUBE_POS_Y, ...
                  LPint miplevel, 
                  LPint *offsets,
                  LPint *sizes,       
                  LPenum format,
                  LPenum type,
                  void* data )

It’s important to note that ‘index’ would be zero in most cases ( except for cubemaps and arrays ).
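
For example, uploading mip level 0 of a plain 2D image would look something like this (the LP_RGBA / LP_UNSIGNED_BYTE tokens and the myImage/pixels names are just made up for the example):

LPint offsets[] = { 0, 0 };       // x, y
LPint sizes[]   = { 256, 256 };   // width, height

lpImageData( myImage,
             0,                   // index: not a cubemap face / array element
             0,                   // miplevel
             offsets,
             sizes,
             LP_RGBA,             // made-up tokens, see above
             LP_UNSIGNED_BYTE,
             pixels );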

This would be cooler, more elegant, generic, lean & mean, KISS, whatever, IMHO.

If you guys find it absolutely necessary to specify a dimension on ImageData calls, I believe it’s a better idea to add a ‘dimension’ parameter instead of providing 3 (or more?) different functions:

  
void lpImageData( LPimage image,
                  LPenum dimension, // 1D, 2D, 3D, ... could be an integer instead. It's another option.
                  LPint index, 
                  LPint miplevel,
                  LPint *offsets,
                  LPint *sizes,       
                  LPenum format,
                  LPenum type,
                  void* data )

Well, this is not really cool but would do the trick.

Gimme feedback please.

Best regards,
Daniel

Michael Gold asked about API completeness.

One thing I just came to think of… Now, I haven’t really thought this through, so it may be irrelevant, but is there a (planned) facility to ask the implementation whether a specific range in a mapped (but flushed) buffer is completed (used and won’t be needed anymore by the implementation)? Also, is there a way to wait for it to be completed?

For comparison, as most here are familiar with Win32 API, think of it like TryEnterCriticalSection and EnterCriticalSection, only that this query would take a buffer name, an offset and a size.

An added bonus of using this approach (buffer name + stuff), as opposed to virtual address + size, is that one wouldn’t have to bring 64-bit pointers into the API.

Consider it a brainstorming idea.
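
Roughly what I have in mind, API-wise (both the names and the parameter types are obviously invented, this only shows the shape of the idea):

// Non-blocking query: is this flushed range done with on the GPU side?
LPboolean lpTestBufferRange( LPbuffer buffer, LPint offset, LPint size );

// Blocking variant: don't return until the range is no longer in use.
void lpWaitBufferRange( LPbuffer buffer, LPint offset, LPint size );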

EDIT: This would only be defined behaviour if querying/waiting from a single thread. Multiple threads waiting for such a resource would invoke undefined behaviour.

What if I then had 1 color, a few texture-coords, and 3 other arrays for each vertex, but they could remain static for the whole animation.
So what if you did?

LP doesn’t care one way or another.

Michael said you can change the offsets in a live VAO for any particular bound array. Isn’t that enough? I mean, I assume that being able to alter the offset means that it will be reasonably performant, so I don’t see what the problem is.

And if you need to actually change one of the buffer objects, then make a new VAO. They’re small, light-weight, and it is expected that an application will be creating thousands of them.

According to Pipeline Newsletter 4, Image data will be defined using lpImageData[123]D, and that’s a very bad idea IMHO.
IMNSHO, it’s a much worse idea to cross-post.

Korval wrote:
Michael said you can change the offsets in a live VAO for any particular bound array.
Oki. I must have missed that (perhaps it wasn’t in this thread). As it provides the functionality, I’m cool with that.

Originally posted by tamlin:
[b] Michael Gold asked about API completeness.

One thing I just came to think of… Now, I haven’t really thought this through, so it may be irrelevant, but is there a (planned) facility to ask the implementation whether a specific range in a mapped (but flushed) buffer is completed (used and won’t be needed anymore by the implementation)? Also, is there a way to wait for it to be completed?
[/b]
Not in the buffer object API; but just as you can use fences under GL2.x to sort out these kinds of issues, you can use sync objects under LP to meet the same goal.
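
E.g. under GL2.x today the pattern looks something like this with NV_fence (a sketch; sync objects in LP fill the same role):

// Guard a buffer range you just told the GPU to read from.
GLuint fence;
glGenFencesNV(1, &fence);

// ... issue the draw calls that source from the flushed range ...
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);

// Later, before rewriting that range:
if (!glTestFenceNV(fence))     // non-blocking "is it done yet?"
    glFinishFenceNV(fence);    // or block until it is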

Can you describe some usage patterns you would be likely to employ in real world code?

After reading the newsletter again, I stumbled upon this piece:

When an image is bound to an FBO attachment, the format object used to create the image and the format object associated with the attachment point must be the same format object or validation fails. This somewhat draconian constraint greatly simplifies and speeds validation.

Well, I can live with the fact that the format object needs to be the same. However, I think this should be an implementation detail that is handled behind the scenes and not something a developer needs to worry about.

If I need to pass the same format object when I set up the FBO and the texture to render to, this makes my code much more complicated. In the end I will simply write some layer that handles format-object creation: for each format object to be created it calculates a hash value and checks whether such a format object has already been created. If so, it returns the same handle (thus the same object).
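
Something along these lines (a rough sketch - hashFormatDesc, lpCreateFormat, LPFormatDesc and LPformat are placeholders, since the real LP names aren’t public; hash collisions ignored for brevity):

#include <map>

std::map<unsigned int, LPformat> g_formatCache;    // hash -> existing handle

LPformat getFormat(const LPFormatDesc& desc)
{
    unsigned int h = hashFormatDesc(desc);         // hash over all format fields
    std::map<unsigned int, LPformat>::iterator it = g_formatCache.find(h);
    if (it != g_formatCache.end())
        return it->second;                         // equal description -> same object

    LPformat fmt = lpCreateFormat(desc);           // create it exactly once
    g_formatCache[h] = fmt;
    return fmt;
}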

Now, since we all agree that object creation is allowed to be a bit slower while object usage should be as fast as possible, this approach is OK. However, I don’t know why I should have to care about this. If the driver wants to speed up validation by only accepting identical format objects (not merely “equal” ones), then, in my opinion, the driver should take care to return a handle to the same object whenever the app requests an object that is equal to an earlier-created one. It should be easy to implement and it would take the burden off the application writer.

How the driver internally does the validation is not my responsibility. I think the spec should say

When an image is bound to an FBO attachment, the format object used to create the image and the format object associated with the attachment point must define the same format or validation fails.

Since the objects will be reference-counted anyway, returning handles to the same (immutable) object several times should not introduce any problems.

Jan.

Another two thoughts for some future version of OpenGL. Although the second one could be added right now.

------- #1 -------
A debug context was mentioned in this newsletter. I started to wonder if there will be some kind of “pure hardware” context - something that would guarantee that I won’t hit a software fallback. Or some kind of “performance” context where I can hit software emulation, but not one with a big performance impact.
For full emulation I would use Mesa anyway, because an “emulation” context would still be vendor-specific (NVIDIA - no ATI extensions emulated and vice versa).
Better yet - instead of emulation just buy the cheapest GPU from the generation you’re interested in. It would still run faster than emulation :slight_smile:

Note that it’s not easy to define a “pure hw” context. For example, the Radeon X800 emulates gl_FragCoord. It’s an emulation, but it’s also pure hw emulation. On the other hand, it’s not as precise as a built-in gl_FragCoord.

------- #2 -------
My second thought is on the GL_RENDERER thing. I have no idea how it’s going to be in LP, but I think it would be good to have it “reversed”.
Instead of asking for the renderer, you would give a GL_RENDERER string and receive an answer as to whether it’s compatible. Such a string would be rather general (“GeForce 6” or “NV40” for example - no “6800 LE” or “9800 Pro” thing).
Well, the classic GL_RENDERER is of course still required - you have to name a renderer when you display it to the user in a combobox of available renderers.
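
Something like this (the name is invented, just to show the shape of the idea):

// Invented entry point: the app hands in a general renderer string,
// the implementation answers whether the current renderer matches it.
LPboolean lpIsRendererCompatible( const char* rendererClass );

// usage:
if ( lpIsRendererCompatible("GeForce 6") )
{
    // take the NV40-class code path
}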

In general, I think we should not ask for the driver version and other such details separately. It should all be put in one long string, so you can put that information into a log file or crash report. It could include the driver version, release date and other stuff.

When an image is bound to an FBO attachment, the format object used to create the image and the format object associated with the attachment point must define the same format or validation fails.
I agree. Indeed, I agreed to the point of assuming that this was what the newsletter was saying. I didn’t realize that it meant literally the same format object pointer.

Originally posted by Korval: [b] [quote]Note that with that functionality, another D3D9 feature - the infamous “base index” - is redundant, because instead of changing the base index you could just re-bind the VB at a different offset.
No, that’s not how it works. If the impetus for the index offset feature was simply the number of API calls, it wouldn’t be an issue. It’s not the API for offsetting a buffer-bind point; it’s the internal stuff that the implementation needs to do to make it work.

The general argument is that, whenever you change an attribute pointer, you need to do some validation work to make sure that the pointer, stride, etc work. And presumably, this work is non-trivial. Thus the purpose of the offset is to avoid rebinding buffers.[/b][/QUOTE]Changing the binding offset, in a single API call, adds a value to a group of pointers.
Changing the base index, in a single API call, adds a value to a group of pointers.

In a situation with only a single Vertex Stream, these two actions are interchangeable, and that was the sole point of the text you’ve quoted. Your speculations about how big the implicit difference in validation work is don’t bring anything meaningful or relevant.

Originally posted by Korval:
[quote]Also, D3D9 instancing works by assigning “frequencies” or “dividers” to Vertex Streams. This is an example of a case where it is necessary to have more than one stream.
A feature that is effectively dead. Nowadays, particularly with Longs Peak, (…)
[/QUOTE]I haven’t postulated the idea you’re trying to dismiss here.

Originally posted by Korval:[b]

[quote]I personally think the concept of Vertex Streams makes a hell of a lot of sense.
In what way?

It is a perfectly meaningless concept. It gives no actual benefit, except for the kind of older hardware that actually had vertex streams. It’s exactly the kind of thing an abstraction API should abstract.

I can almost guarantee that LP will not expose streams to the user. From the user’s perspective, they’re just a layer of bureaucracy that provides no actual benefit that cannot be achieved in some other, more effective, way. [/b][/QUOTE]You are saying such rubbish that I can almost guarantee you have a false (if any) understanding of the concept you are commenting on. Vertex Streams are not related to any “older hardware”, because they are not a hardware feature at all. Vertex Streams are pure API logic, naturally reflecting the way we use vertex data.

If you consider all the kinds of data which we associate with active vertex attributes, you can distinguish two categories:

In one category we have: vertex buffer handle, vertex buffer offset, vertex stride, vertex frequency divider.
They are almost always used in such a way that multiple vertex attribs are given the same value.

In the other category we have: data type, data offset, semantics (and several other D3D idioms).
For these, in contrast, such value sharing wouldn’t make sense.

Let’s focus on the first category. Some of those shared properties also happen to be mutable. And when we want to change any of them, it is obviously preferable to do it in a single call, for all the vertex attribs in the group that share the property at once. In order to be able to do so, you need to be able to identify the group. A Vertex Stream is such an identifier.

Without this “meaningless” concept, you’d have to set the property for each vertex attrib in a separate call. You should at least be familiar with this part, since that’s what we do in GL today, using batches of glXXXXArrayPointer calls.
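
To make the contrast concrete - moving a group of non-interleaved arrays to a new base offset today (real GL2.x calls; vbo, stride and newBase are assumed values) versus the single stream-style call:

// Today's GL: every attrib that shares the buffer/offset/stride is re-specified.
glBindBuffer(GL_ARRAY_BUFFER, vbo);
glVertexPointer(3, GL_FLOAT, stride, (void*)(newBase + 0));
glNormalPointer(   GL_FLOAT, stride, (void*)(newBase + 12));
glTexCoordPointer(2, GL_FLOAT, stride, (void*)(newBase + 24));

// D3D9-style stream: the shared properties change in one call.
// device->SetStreamSource(0, vb, newBase, stride);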

I recommend you learn a bit about how the related parts of D3D9/10 work, and try to imagine the consequences for the API if you removed the Vertex Stream “burden” from it.

I believe it meant the same format object handle, otherwise they wouldn’t use the word draconian.
I agree that it would make the driver faster and simpler, and I should be able to take advantage of that - after all, FBO’s and renderable textures are closely tied in my renderer anyway, so it would be no problem for me to supply the exact handle.

In a situation with only a single Vertex Stream, these two actions are interchangeable, and that was the sole point of the text you’ve quoted. Your speculations about how big the implicit difference in validation work is don’t bring anything meaningful or relevant.
No, it does bring something meaningful and relevant: performance.

Maybe I’m not a lazy programmer, but as long as it’s fast, I don’t care if I make 1 function call or 7 to change the base offset for a bunch of VAO parameters. It is no different to me one way or another.

The impetus for the feature as described (an index offset in the glDraw* call) was performance, not convenience. If there is no longer a performance concern, then it’s merely a matter of API convenience.

I haven’t postulated the idea you’re trying to dismiss here.
Yeah, I was jumping ahead. See, the only real-world use for the D3D implementation of “frequencies” and “divisors” is for instancing. Since we can do instancing in a much better way now, there’s no point to the feature. It’s a feature in need of an application, and until one shows up, it is 100% irrelevant.

Vertex Streams are pure API logic, naturally reflecting the way we use vertex data.
So you admit that this feature is nothing more than syntactic sugar? Then how can you possibly describe it as “an essential part?”

In one category we have: vertex buffer handle, vertex buffer offset, vertex stride, vertex frequency divider.
They are almost always used in such a way that multiple vertex attribs are given the same value.
Maybe that’s the way you work. For someone who may not want to interleave some of his data (possibly for memory/packing concerns, possibly for others), vertex streams are merely a pain in the butt. If you have 6 attributes each in their own buffer, OpenGL makes it work no differently from having all 6 in a single interleaved buffer. D3D makes you go through some vertex stream nonsense to make this work.

And, as pointed out beforehand, the construct is entirely meaningless from a performance or functionality standpoint. So, without any overriding need for the feature, I don’t see the point in having it.

Originally posted by Jan:
Well, I can live with the fact that the format object needs to be the same. However, I think this should be an implementation detail that is handled behind the scenes and not something a developer needs to worry about.
Why would you ever need multiple copies of the same format? An application-level format cache is a good idea if you really can’t structure the code otherwise. We could even provide such a cache in a layered utility library, if the need is common.

The API is optimized for peak efficiency. Optimizing for apps with complex object management semantics adds overhead for apps which don’t need this level of assistance. We could implement a bunch of caches under the covers, but then we could just stick with the old state machine model, too.

We debated this very point and reached the conclusion described in the newsletter. If you feel strongly that we made a mistake, I’d love to hear your reasoning.