(could someone lend me their skills to keep posts short? )
Rob, while your b) is true, blocking vs. non-blocking can be a make-or-break depending on whether user and implementation agree on interpretation. This particular feature is a performance promise, iff the user interprets the wording as “shall not block”. In that case the user “knows” that s/he can write data at around bus-speed (whether the bus is local PCIe or Token Ring ).
Michael, thank you for ACKing the “may” issue.
As for usability - being able to map/flush sub-ranges of a buffer is something I consider potentially very useful (from a performance POV, obviously), instead of having to deal with multiple buffers.
I might want to request consideration for a feature others have requested - rebased indices, so that one could use f.ex. one index buffer, and one vertex buffer containing many “frames” of some (on-CPU calculated) animation, where at time-of-use one could say “index 0 refers to vertex[4711], normals[5472], colors[0]”. I don’t know how useful it’d be in reality, but I can indeed see uses for it. A precomputed “path” for a tank’s cannon turret turning. A “walking” or “running” sequence for a character…
However (didn’t you expect it ), for sub-mapping/flushing to be truly useful I think I’d have to know about the alignment restrictions, as the article states “This option allows an application to assume complete responsibility for scheduling buffer accesses”. The only piece of software that can tell me about (optimal) buffer alignment is the implementation.
Consider the following case if I didn’t know about mapping alignment requirements:
I create a large buffer. Let’s say I only use it for geometry data. I write a batch of vertices that ends in the middle of a “page” (*). I “flush” this range (or perhaps unmapping is still a requirement?) to tell the implementation “you go ahead, I’m done with this range”, and issue a draw call. I then merrily continue to fill the buffer starting just after my previous end position (mapping that range first, if required) - which starts in the middle of the last “page” of the previously “flushed” area.
If this buffer (memory area) is truly “mapped” over a bus (PCI-ish), then either the implementation needs to take a private copy of this last page and place it somewhere else in the card’s memory (a performance hit, not to mention the need to fiddle with the GPU for this non-sequential “jumping around” in physical on-card memory when reading what should be sequential memory), or it needs to map the whole last “page” of the previous batch as writable again into my process’ address space - thereby giving me write access to data I already said “I’m done with this” about, and allowing me to potentially interleave (bad) writes into an area the GPU is busy reading.
An even worse scenario would be something like:
- batch 1 writing “pages” 0-1.5
- batch 2 writing “pages” 3.5-5
- batch 3 writing “pages” 1.5-3.5
as it could require both “page” 1 and 3 be mapped for the third batch, while at the same time the GPU will be reading them (both).
I suspect this is the “room for programs to screw up” I read between the lines, but I think it can be improved to prevent this - while still providing maximum possible speed - by the simple addition of the following:
Had I on the other hand been able to query the implementation about alignment requirements, I could “line up” my next write to the next “page” boundary and start writing in a fresh “page”.
I therefore consider this alignment information vital for “proper” (as in as-fast-as-possible, which seems to be the stated goal of this feature) use, to be able to use it without creating either full or partial stalls at any level. That is, assuming I haven’t misunderstood something before this short analysis.
As for the problem of not being able to save a binary blob of a compiled program: why not simply reserve some space towards the beginning of the blob for the implementation to play with, say 8 or 16 bytes (heck, save it like a PASCAL string, prepending the compatibility-verification info with a byte telling how large the “private” data is), where it can save e.g. PCI ID and/or driver version? Such a small change shouldn’t take more than a few minutes to implement (for each vendor), it would allow freedom of implementation (256 bytes is likely more than enough to verify compatibility), and it would be a user-mode-side thing only, with no need to send this verification data over the bus. Compare it to prepending a TCP packet with an IP header if you like (even that header can, if you really need it, be variable size). That way vendors can verify compatibility of the current hardware with the pre-compiled blob and simply report success/failure, and in case of failure I recompile the program. It seems so easy that I’m starting to fear I’m missing something obvious. Am I?
(*) I used the word “page” in the loosest sense. For an implementation on NT-based systems this alignment would be a “section” (64KB alignment), for a GLX implementation it’d likely be a host page. For e.g. Linux or FreeBSD with local h/w, I haven’t got a clue what they use as alignment for mapping h/w to virtual memory.
Oh, I almost forgot:
“We’ll try our best to get the wording right in the final spec. If an ambiguity slips through, it will be neither the first time nor the last.”
I know, that’s one of the reasons I brought it up. Will we get a chance to have a look at a “release candidate” of the spec before it’s carved in stone? More eyeballs and such…