Pipeline Newsletter Volume 4

V-Man: From how I understood the previous newsletters, you can create an additional LP context in your existing GL app and then use both. The two contexts are separate; only a few functions will accept objects from the other context. For example, you could attach a texture from the old context in an RTT operation done with the new context, so you can use the new functions for some effects but use the result in your existing engine.
At least that is how I understood it. Whether it will actually work as expected is a completely different question.

Jan.

Yup, that’s my take on it too. Should make the transition less painful for large projects.

But personally, I’m going all in. Clean sweep, baby! Wahoooo!

Nothing beats a fresh start :slight_smile:

From what I understood, you can create a legacy GL context within LP, which would then emulate legacy GL using the LP API. It wouldn’t make much sense the other way round.

While I haven’t read all of the newsletter yet, I have some comments on some stuff I encountered.

Under “Buffer Object Improvements” the word “may” is used way too much for my liking - perhaps especially under “Non-serialized access:”.
The word “may” unfortunately has completely different meanings in English and in legalese. OpenGL is a software contract (which would be legalese-ish), but the words here are English. I’d prefer a more formal, perhaps RFC-like, language for those parts - keeping the language English, but making the function specifications formal and unambiguous.

If the specification for any function can’t be, well, specific, it should be reworded or pulled. If the intent is clear, and it can be reworded to remove any shadow of a doubt, that should be done - and only after that is done should it be reconsidered. For now, I consider this part void.

I find the “const” keyword nowhere. Not that I think the ARB missed this vital C language construct - I just wanted to point it out, as I’ve seen larger and more widespread libraries mess this up (e.g. MS for the longest time thought, and still thinks (!), that it should have write access to your in-system-memory indices, if one is to trust their API).
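
To illustrate what I mean, here is a tiny sketch; the prototypes below are invented purely for illustration, nothing here is from the newsletter:

// Invented prototypes, only to illustrate const-correctness in an API.
void lpDrawIndexed(unsigned count, const unsigned short *indices);   // promises read-only access
void lpDrawIndexedNC(unsigned count, unsigned short *indices);       // caller can't tell whether indices get written

static const unsigned short kIndices[] = { 0, 1, 2 };

void example()
{
    lpDrawIndexed(3, kIndices);        // fine: the data can stay const
    // lpDrawIndexedNC(3, kIndices);   // would not compile: const-ness would have to be cast away
}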

Another thing about buffers: partial invalidation. For this to be efficient, it would have to be aligned not only to the source platform’s physical page size (or greater - on Win32 you map on at least 64KB boundaries, whether or not you have 4KB pages), but also to the destination platform’s alignment requirement. Will LP provide functions, or enums, to query alignment requirements/enforcement?
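
What I’m picturing is something along these lines; every name below is invented, nothing here has been announced for LP:

#include <stddef.h>

// Invented declarations, for illustration only.
typedef unsigned int LPbuffer;
enum { LP_MAP_ALIGNMENT = 0x0001 };
void lpGetBufferParameter(LPbuffer buffer, int pname, ptrdiff_t *value);
void lpFlushMappedRange(LPbuffer buffer, ptrdiff_t offset, ptrdiff_t size);

// Round the dirty range outward to the queried alignment before flushing it,
// so the implementation never has to widen (and re-copy) it behind our back.
void flushAligned(LPbuffer buffer, ptrdiff_t wantedOffset, ptrdiff_t wantedSize)
{
    ptrdiff_t alignment = 1;
    lpGetBufferParameter(buffer, LP_MAP_ALIGNMENT, &alignment);

    ptrdiff_t begin = (wantedOffset / alignment) * alignment;
    ptrdiff_t end   = ((wantedOffset + wantedSize + alignment - 1) / alignment) * alignment;
    lpFlushMappedRange(buffer, begin, end - begin);
}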

If one idea is to have (the ability to have) everything in objects, and to be able to hand only object handles to the functions, wouldn’t it make sense to have an array [a 2-dimensional one] with every row holding a number of object handles (including vertex and index handles), and then have a call to “just draw this batch” that could eventually be evaluated entirely on the GPU?

I’m thinking of something like:
name = CreateArrayName();
AddArrayType(name, VERTEX3);
AddArrayType(name, NORMAL);

Create, either in system memory or in a buffer on the GPU, an array of [N][M], hand it off to the GPU, and it can then switch programs/states/textures/<whatever> as efficiently as it possibly can - especially if it can run many of these tasks in parallel (and let’s face it, we can’t predict the future, but we can look at history and say that parallelism will grow, since purely sequential speeds can’t be strained much more).
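
Fleshed out a bit more, to make the idea concrete (all names here are invented, nothing is from the newsletter):

#include <stddef.h>

typedef unsigned int LPhandle;

enum { COL_PROGRAM = 0, COL_TEXTURE, COL_VERTICES, COL_INDICES, COL_COUNT };

// Invented "draw this whole table" entry point.
void lpDrawBatch(const LPhandle *rows, size_t rowCount, size_t columns);

void drawScene(LPhandle prog, LPhandle texA, LPhandle texB,
               LPhandle vbo0, LPhandle vbo1, LPhandle ibo0, LPhandle ibo1)
{
    // One row per draw, a fixed number of object handles per row.
    LPhandle batch[][COL_COUNT] = {
        { prog, texA, vbo0, ibo0 },
        { prog, texB, vbo1, ibo1 },
    };

    // Hand the whole table over at once; the implementation is free to walk it
    // sequentially on the CPU today, or evaluate rows in parallel or on the GPU tomorrow.
    lpDrawBatch(&batch[0][0], sizeof(batch) / sizeof(batch[0]), COL_COUNT);
}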

Initially I’d expect such a thing to run mostly on the CPU, and mostly sequentially, but while we’re designing a new API anyway, why not think about what may be possible, or even the norm, tomorrow.

That last part was just an idea, but I think it could be useful and could improve performance.

Originally posted by tamlin:
I find the “const” keyword nowhere. Not that I think the ARB missed this vital C language construct - I just wanted to point it out, as I’ve seen larger and more widespread libraries mess this up (e.g. MS for the longest time thought, and still thinks (!), that it should have write access to your in-system-memory indices, if one is to trust their API).
“const” is not a keyword in C. It is a C++ type modifier.

Originally posted by tamlin:
Another thing about buffers: partial invalidation. For this to be efficient, it would have to be aligned not only to the source platform’s physical page size (or greater - on Win32 you map on at least 64KB boundaries, whether or not you have 4KB pages), but also to the destination platform’s alignment requirement. Will LP provide functions, or enums, to query alignment requirements/enforcement?
No need, the implementation can simply do the alignment itself and copy more than was flagged modified if necessary. It’s an implementation detail that should be abstracted away from the user.
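
Roughly like this, assuming a 4 KB page purely for the sake of example:

#include <stddef.h>

// What a driver could do internally: widen the flagged range to whole pages
// and copy the widened range. The app never notices; the driver just copies a
// little more than was flagged.
static const size_t kPageSize = 4096;   // assumed page size, illustration only

void widenToPages(size_t flaggedOffset, size_t flaggedSize,
                  size_t *copyOffset, size_t *copySize)
{
    size_t begin = flaggedOffset & ~(kPageSize - 1);                                 // round down
    size_t end   = (flaggedOffset + flaggedSize + kPageSize - 1) & ~(kPageSize - 1); // round up
    *copyOffset = begin;
    *copySize   = end - begin;
}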

Originally posted by tamlin:
If one idea is to have (the ability to have) everything in objects, and to be able to hand only object handles to the functions, wouldn’t it make sense to have an array [a 2-dimensional one] with every row holding a number of object handles (including vertex and index handles), and then have a call to “just draw this batch” that could eventually be evaluated entirely on the GPU?
You mean like a display list object? They’re working on it.

“const” is not a keyword in C. It is a C++ type modifier.
“const” was not a keyword in the original K&R C, but it is in the ANSI C89 standard. I think it’s safe to assume all compilers adhere to an 18-year-old standard, especially for such a widely used feature…

Under “Buffer Object Improvements” the word “may” is used way too much for my liking
The newsletter is not the spec. I’m sure in the spec they will formulate everything as unambiguously as possible :wink:

Originally posted by knackered:
“const” is not a keyword in C. It is a C++ type modifier.

Rubbish; it has been one for at least 18 years.

[About buffer alignment] No need, the implementation can simply do the alignment itself and copy more than was flagged modified if necessary. It’s an implementation detail that should be abstracted away from the user.
See, there the efficiency takes a real hit. This area was meant for the really hard-core, down-to-the-metal uses, from what I read. If my reading was correct, it was for those willing and able to go low, really low-level. As such, I’d expect host CPU, bus, and target CPU and GPU alignment requirements all to be able to be met without implementation intervention (which would of necessity decrease performance).

As noted, if you want to upload a subset and don’t care about alignment, BufferSub* is already there.

Perhaps I read too much into the performance thinking? Perhaps I didn’t. Let’s leave that for a comment from the ARB.

<snip>

You mean like a display list object?
I couldn’t have said it better myself (as is obvious! :slight_smile: ). Yes, almost exactly like a display list of objects (though with a constant number of objects per list entry - to allow for simplified array traversal).

They’re working on it.
Excellent!

Originally posted by Overmind:
The newsletter is not the spec.
I’d hope all of us involved find that obvious. :slight_smile:

Still, since the newsletter left so much open to interpretation in this area because of that seemingly innocent three-letter word, I wanted the ARB to be aware of it too.

History has (or should have) taught us that ambiguous wording creates incompatible implementations. I’d rather flag non-problems at the design stage than have to file bugs after implementation.

Tamlin, can you construct a hypothetical situation where two implementations might in fact be incompatible - where a correctly written program generates correct results on one but not the other?

I can see how the latitude expressed in the article leaves freedom to the LP implementor on a number of levels, but IMO that can lead to variance in levels of performance, not in correctness. If there’s a specific issue that’s been missed so far, let’s examine it in more detail here.

edit - in case it wasn’t clear I’m asking about the “mays” in the description of the new buffer object functionality, with respect to non-serialized access (or any other usage).

Well I never, const is a C keyword.

Rob,

With the current wording, I can’t, because it leaves too much room for interpretation. Not implementation freedom, but interpretation. Let me elaborate on the part I especially object to - the wording for non-serialized access:

“When this option is engaged, lpMapBuffer may not block if there is pending drawing activity on the buffer of interest”.

This can be read as “shall not block” or “will not block” (is forbidden to), “is not intended to block, but is allowed to”, or “usually blocks, but is allowed not to block”. Any of these behaviours is, AFAICT, a valid interpretation from both an implementor’s and a user’s POV.

If I now write a program with realtime demands (in this area) that expects the “will not block” behaviour, but the implementation interpreted it as “I’m allowed to block”, that difference in interpretation of “may” can and/or will break my program’s expected behaviour.

“Access may be granted without consideration for any such concurrent activity”.

Again, “may” can mean “will” or “shall”, “is allowed to”, or “is allowed not to” grant access.

In any case it’s so vague it basically reads “you can’t depend on the behaviour”. The result, to me, is that it’s so shaky one should really stay away from it - quite the opposite of what I expect the ARB’s intention was in committing time to designing it.

Anyway, as previously noted, this was an article and not an API spec, and I think I may (pun intended) have pushed this too far already. I expect the specification to be unambiguous, and I hope we get a chance to look at the final API draft before it’s carved in stone.

A few things to keep in mind here -

a) you are absolutely right, that sentence in the article could have been written a lot better. Instead of saying “When this option is engaged, lpMapBuffer may not block if there is pending drawing activity on the buffer of interest” - an improved phrasing would be “This option can eliminate the need for lpMapBuffer to block, if activity is pending on the buffer”. Note that an implementation may have any number of private reasons to block on this call, that’s not something that can be legislated away by the spec.

b) the spec makes no performance or timing guarantees. It is intended to specify behaviors and outcomes for correctly written apps. For this reason, discovering that some implementations run slower than others (for whatever reason, possibly including blocking when you don’t want it) doesn’t indicate a nonconforming implementation or bug - it is what it is, some implementations will be more aggressive than others. The flexibility in the language allows for that range of aggressiveness.

c) simply put, some drivers may not implement non-serialized access, and some workloads that ask for it will not run as fast - this is not a violation of the spec or a conformance failure - the apps will still run and generate correct results. The only kind of app that will generate different results is one that is not correctly scheduling/fencing its accesses in conjunction with the unserialized option, and that’s an app bug (there’s a small sketch of what I mean by fencing at the end of this post).

d) I don’t personally believe that high performing OpenGL apps are successfully written or delivered without some level of testing on the target configurations. That testing process should highlight any performance hot spots or issues. If you find that your app benefits greatly from non-serialized access on one vendor’s GL but suffers on another that is blocking more often, then you have every right to have a conversation with that vendor about the performance issues you are running into and what your options are. IMO this is not much different than using VBO or VAR today, there is a spectrum of implementations out there with varying performance characteristics.

A key issue here is that not every vendor has the same set of constraints or audience of developers to work with - and not all vendors will approach the task of implementing LP with the same level of aggressiveness w.r.t. performance. So there was a choice, to require true non blocking behavior on all conforming implementations, or to provide flexibility in implementation whereby an implementor could choose how far to go with it and still conform to spec.

It would be nice if something like the OpenGL spec could offer performance guarantees but at present this is not the case. The intent here was not to stick with a lowest-common-denominator approach (for example, not having the option at all), but to provide more performance headroom for aggressive implementations.

I’d also point out that the strict write-only, explicit flush, and invalidate-range options - used independently of non-serialized access - also open up a range of usages that wasn’t possible before and in a pretty efficient way. So if the correctness and testing cost of developing code using the unserialized-access option is too high to bear, it may make perfect sense for an author to avoid it.
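
Regarding c), here is a rough sketch of the kind of app-side discipline I mean; the entry points and flags below are placeholders, not actual LP names:

#include <stddef.h>
#include <string.h>

// Placeholder declarations - these are not actual LP entry points, they only
// illustrate the idea of the app fencing its own accesses.
typedef unsigned int LPbuffer;
typedef unsigned int LPfence;
enum { LP_MAP_WRITE = 0x1, LP_MAP_UNSERIALIZED = 0x2 };

void   *lpMapBuffer(LPbuffer buffer, unsigned accessFlags);
void    lpUnmapBuffer(LPbuffer buffer);
LPfence lpInsertFence(void);           // marker after the last draw that reads the region
void    lpWaitFence(LPfence fence);    // blocks until everything before the marker has completed

void updateRegion(LPbuffer buffer, LPfence lastUseOfRegion,
                  const char *src, size_t offset, size_t size)
{
    // With unserialized access the driver will not wait on the app's behalf,
    // so the app must make sure the GPU is done with this region before
    // scribbling over it.
    lpWaitFence(lastUseOfRegion);

    char *dst = (char *) lpMapBuffer(buffer, LP_MAP_WRITE | LP_MAP_UNSERIALIZED);
    memcpy(dst + offset, src, size);
    lpUnmapBuffer(buffer);
}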

tamlin,

You are correct that the word “may” needs to be used carefully. In fact this very issue came up last week during an internal spec review.

We’ll try our best to get the wording right in the final spec. If an ambiguity slips through, it will be neither the first time nor the last. Don’t despair; spec bugs can be fixed.

With respect to alignment - please consider that this functionality has been reviewed in great detail by individuals from a variety of companies who are familiar with the capabilities of their respective hardware. Let us worry about making our implementations efficient. The more interesting feedback is whether the described behavior is useful and complete.

I think feedback on API design is complicated by the fact that, other than plotting colored pixels on the screen, the rest is mostly about efficiency, and few have the interest, expertise and insider’s knowledge to make a holistic assessment of what’s required, or even makes sense, within the scope of LP.

On that note, one thing I’d be interested in hearing about is whether there will be an API to validate shader inputs (VAOs) against a particular vertex shader in advance, analogous to the InputLayout and VS signature pairing in d3d10, or whether that’s not really a serious performance concern in the current design for LP/ME. Not that I particularly care one way or the other; it’s just that I and others may have to build abstractions around this sort of detail.
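
For reference, the d3d10 pairing I’m alluding to looks roughly like this (the device and the compiled vertex shader blob are assumed to exist already):

#include <d3d10.h>

// The layout is validated against the vertex shader's input signature once,
// at creation time, rather than at every draw call.
ID3D10InputLayout *createLayout(ID3D10Device *device, ID3D10Blob *vsBytecode)
{
    D3D10_INPUT_ELEMENT_DESC elems[] = {
        { "POSITION", 0, DXGI_FORMAT_R32G32B32_FLOAT, 0,  0, D3D10_INPUT_PER_VERTEX_DATA, 0 },
        { "NORMAL",   0, DXGI_FORMAT_R32G32B32_FLOAT, 0, 12, D3D10_INPUT_PER_VERTEX_DATA, 0 },
    };

    ID3D10InputLayout *layout = NULL;
    device->CreateInputLayout(elems, 2,
                              vsBytecode->GetBufferPointer(),
                              vsBytecode->GetBufferSize(),
                              &layout);
    return layout;
}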

I understand that the spec describes functionality and not performance. However, the non-serialized access feature is all about performance. No one is going to use it just for fun, but only as a low-level, down-and-dirty way to squeeze the last bit of speed out of the GPU.

As such, it just doesn’t make sense to allow the driver to implement it no faster - or even slower - than the other options. Of course a spec cannot force a driver writer to optimize a feature, especially not a possibly rarely used one.

So my suggestion is this: the app should be able to query the driver as to whether this feature is “good”. And how to do that? Well, why not put it into an extension? The driver already needs to implement all the other ways to handle arrays, so why not make this one optional? If the extension is supported, one can expect it to be at least as fast as all the others, usually faster. If it is not supported, just use the default path.

This would remove the burden on the application writer of testing several graphics cards from different vendors and then hardcoding “if it’s NV, disable it; if it’s ATI, enable it, except for the mobile GPUs, …”.
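
Checking for it at startup could be as simple as this (the extension name is invented, of course; a real one would come from the ARB):

#include <string.h>
#include <GL/gl.h>

// "LP_fast_unserialized_access" is an invented name, purely to illustrate the query.
int hasFastUnserializedAccess(void)
{
    const char *ext = (const char *) glGetString(GL_EXTENSIONS);
    return ext != NULL && strstr(ext, "LP_fast_unserialized_access") != NULL;
}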

I don’t want features in the core API that will not be supported well on all hardware (again).

Another idea: I’d like to be able to query in more detail what hardware my app is running on.
For example:
Vendor: NV / ATI / INTEL (unique, not changing with every driver release!)

GPU Architecture: Geforce 6 / 7 / 8 … (the basic architecture, not detailed)

GPU Model: Geforce 8600 GTS SSE2 3DNow! … (the stuff that is in there today, usually)

Hardware acceleration: true / false

GPU Memory: x MB (yes, I know, those discussions…)

Driver Name: Forceware

Driver Version: 1.2.3 (only a number, no text in here)

This way, IF anyone ever wants to use a feature based on the hardware the app is running on, it will be much easier to distinguish between them.
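
Today the best we can do is sniff free-form strings, which is exactly the guesswork I’d like to get rid of:

#include <string.h>
#include <GL/gl.h>

// What apps are reduced to today: parsing free-form strings whose contents can
// change with every driver release.
int runningOnNVIDIA(void)
{
    const char *vendor = (const char *) glGetString(GL_VENDOR);
    return vendor != NULL && strstr(vendor, "NVIDIA") != NULL;
}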

Jan.

One thing I consider important is the ability to retrieve/set objects which can take a long time to generate (most notably compiled shaders) as some driver-dependent blob, so the application can store them on disk and avoid the compilation cost during the next run, unless the hardware or driver changes (in which case the driver will reject the blob and the application will regenerate the object in the ordinary way).

Is something like that planned for LP?
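
Roughly this kind of flow is what I have in mind; the two entry points below are placeholders for whatever LP might call them:

#include <stdio.h>
#include <stdlib.h>

// Placeholder declarations - only the flow matters, the names are invented.
typedef unsigned int LPobject;
size_t   lpGetObjectBlob(LPobject object, size_t bufSize, void *buf);  // returns the blob size
LPobject lpCreateObjectFromBlob(const void *blob, size_t size);        // returns 0 if the driver rejects it

LPobject loadCachedShader(const char *cachePath)
{
    FILE *f = fopen(cachePath, "rb");
    if (f == NULL)
        return 0;                       // no cache yet: compile from source as usual

    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    fseek(f, 0, SEEK_SET);

    void *blob = malloc((size_t) size);
    fread(blob, 1, (size_t) size, f);
    fclose(f);

    // The driver rejects the blob if the hardware or driver has changed since
    // it was written, in which case the app falls back to a normal compile.
    LPobject object = lpCreateObjectFromBlob(blob, (size_t) size);
    free(blob);
    return object;
}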

We are discussing solutions to the problem you describe. It’s unlikely to be solved for Longs Peak because of the schedule pressure, but this first release is just the beginning. :smiley:

It’s unlikely to be solved for Longs Peak because of the schedule pressure
Ahh, what a perfect segue into matters of scheduling.

Like, when will we see LP released? Are you guys still on track for a summer release (presumably in time for SIGGRAPH), or is it being pushed back to September?

Also, is there any indication from ISVs how long it will take to start seeing beta implementations (I have no faith that initial implementations will be anything more than beta quality) of LP?

For the answers to these and other questions, please come to the OpenGL BoF at SIGGRAPH. :slight_smile:

Originally posted by Michael Gold:
For the answers to these and other questions, please come to the OpenGL BoF at SIGGRAPH. :slight_smile:
…or wait for the presentations to be made available :smiley:

Originally posted by Korval:
…when will we see LP released?
I’m taking 2 to 1 odds on them releasing it at SIGGRAPH, 5 to 1 odds that NVIDIA will have an implementation, and 10 to 1 on ATI having an implementation, but maybe I’m just dreaming… :smiley:

Regards
elFarto