glGetTexSubImage

Sized internal formats only describe the sizes of Pixels

To be more precise, they describe the existence, size, and number format of a pixel’s color components. The order of their storage is also given. You do know that RGB means red, green, blue, in that order, don’t you? Any definition holes that might still be there, such as the byte order used internally, could be closed by one or two declaratory sentences in the spec file.

It’s hard to have a discussion when you keep jumping back and forth between different points.

That’s the dialectical course of a discussion. In my opinion, the term “basic functionality” could have pointed in that direction.

Noticing that I’m falling into the bad habit of quoting you and answering one point after another, I’ll simply refer you back to the beginning of the discussion and remind you of its dialectical nature. Don’t you notice yourself that questions like the one brought up at the end of your last post are simply ridiculous? You take different things mentioned as missing and form an either-or question out of them, as if the one logically contradicted the other. As far as I am concerned, the point has been made clear. Not that this is likely to matter at all…

The order of their storage is also given.

No it isn’t. The implementation is free to store the binary data in whatever component ordering it wants. If it wants to store the bytes with green first, that’s a legitimate implementation. When you fetch it in the shader, the red will be the first component automatically (barring any texture swizzling, of course).
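For illustration, the “texture swizzling” caveat is the component-swizzle state (GL 3.3 / ARB_texture_swizzle), which is the only reordering the application ever sees; a minimal sketch:

```c
/* Whatever byte order the driver uses internally, a GLSL texture()/texelFetch()
 * returns components in RGBA order. The only user-visible reordering is the
 * explicit swizzle state, e.g. swapping red and green on reads: */
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_R, GL_GREEN);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_SWIZZLE_G, GL_RED);
/* Without these calls the defaults (R->RED, G->GREEN, B->BLUE, A->ALPHA) apply,
 * regardless of how the texels are actually laid out in memory. */
```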

Any definition holes that might still be there, such as the byte order used internally, could be closed by one or two declaratory sentences in the spec file.

And by putting those “one or two declaratory sentences”, you’re basically saying that if their hardware works a different way, they cannot implement OpenGL. That’s a horrible idea; OpenGL should not enforce something like this when it doesn’t have to.

More importantly, my main point is that describing the storage of an individual pixel isn’t enough. There’s more to texture storage than an individual pixel. Most textures are stored swizzled, where pixels are stored such that locality is maximized. For example, if you have GL_RGBA8, that’s 4 bytes per pixel. Let’s say that a cache line is 64 bytes in size. So a single cache-line fetch will read 16 pixels.

If you stored the data linearly, each cache line would cover 16 horizontal pixels. However, as we know, textures are almost never accessed purely horizontally. A bilinear fetch from a fragment shader needs a 2x2 block of pixels. To get that from a linearly stored texture, you’d need to fetch two cache lines. However, if every cache line stored a 4x4 block of pixels rather than a 16x1 linear array, then you would only need one cache line for a bilinear fetch. Oh sure, some will need two or four, but if you’re covering the whole face of a primitive, the number of times you’ll need more than one is greatly diminished. Also, you’ll sometimes need four cache-line fetches in the 16x1 case too. Indeed, since you’re typically fetching a whole pixel-quad of texture samples (fragment shaders run in 2x2 groups), you really need to read a 4x4 block of pixels.
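A back-of-the-envelope sketch of that arithmetic, using the numbers assumed above (4 bytes per texel, 64-byte cache lines, so 16 texels per line):

```c
#define TEXELS_PER_LINE 16   /* 64-byte cache line / 4-byte GL_RGBA8 texel */

/* Cache lines touched by the 2x2 footprint of a bilinear fetch whose top-left
 * texel is (x, y), for a plain row-major ("linear") layout. For widths that are
 * a multiple of 16 the two rows always land in different lines, so the answer
 * is 2, 3, or 4 - never 1. A 4x4-tiled layout fits the whole footprint into a
 * single line whenever it does not straddle a tile edge (9 of 16 alignments). */
static int lines_touched_linear(int x, int y, int tex_width)
{
    long t[4] = {
        (long)y * tex_width + x,           /* top-left     */
        (long)y * tex_width + x + 1,       /* top-right    */
        (long)(y + 1) * tex_width + x,     /* bottom-left  */
        (long)(y + 1) * tex_width + x + 1  /* bottom-right */
    };
    int count = 0;
    for (int a = 0; a < 4; ++a) {
        int seen = 0;
        for (int b = 0; b < a; ++b)
            if (t[a] / TEXELS_PER_LINE == t[b] / TEXELS_PER_LINE)
                seen = 1;
        count += !seen;
    }
    return count;
}
```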

This is called “swizzling” of the texture’s storage. Rather than storing texel data linearly, it’s stored in these groups. Some swizzling is scan-like within the 4x4 block. Other swizzling will have sub-swizzles (each 2x2 block in the 4x4 is itself swizzled, and the 4 2x2 blocks in the 4x4 are swizzled). Different hardware has different standards, but virtually every piece of graphics hardware does swizzling.
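To make “swizzled storage” concrete, here is a hypothetical layout - not any particular vendor’s - with 4x4 tiles, one per cache line, and a Morton (Z-order) sub-swizzle inside each tile:

```c
#include <stdint.h>

/* Byte offset of texel (x, y) in a hypothetical tiled layout: 4x4 tiles stored
 * row-major, each tile holding its 16 texels in Morton (Z) order so that every
 * 2x2 quad is contiguous. Real hardware layouts differ; this only illustrates
 * the idea of non-linear ("swizzled") storage. */
static uint32_t swizzled_offset(uint32_t x, uint32_t y,
                                uint32_t width_in_tiles, uint32_t bytes_per_texel)
{
    uint32_t tile_x = x >> 2, tile_y = y >> 2;  /* which 4x4 tile           */
    uint32_t in_x   = x & 3,  in_y   = y & 3;   /* position inside the tile */

    /* Interleave the low bits of in_x/in_y -> index 0..15 within the tile. */
    uint32_t morton = (in_x & 1) | ((in_y & 1) << 1)
                    | ((in_x & 2) << 1) | ((in_y & 2) << 2);

    return ((tile_y * width_in_tiles + tile_x) * 16 + morton) * bytes_per_texel;
}
```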

A proper abstraction of textures, which OpenGL provides, allows different hardware variances on these issues. Different hardware can swizzle, or not, as it sees fit. And because the internal layout of pixels in the hardware is not exposed by the API, OpenGL is able to support any hardware via a simple black-box model. All the driver needs to do is swizzle the data the user provides from glTex(Sub)Image, and unswizzle it via glGetTexSubImage/glReadPixels.

That’s why the Intel map texture extension requires an explicit flag, set before storage creation, to say that the texture won’t be stored swizzled. And you can’t map the texture unless you force it to be linear. So if you want to use textures as buffer objects, you too would need some way to tell the implementation not to swizzle the image.
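Roughly what that looks like, assuming the enums and entry points from the GL_INTEL_map_texture extension spec (tex, width, height are placeholders):

```c
/* Sketch only. The layout has to be forced to linear *before* the storage is
 * allocated; otherwise the texture cannot be mapped. */
glBindTexture(GL_TEXTURE_2D, tex);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MEMORY_LAYOUT_INTEL,
                GL_LAYOUT_LINEAR_INTEL);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, width, height);

GLint  stride = 0;
GLenum layout = 0;
void  *ptr = glMapTexture2DINTEL(tex, 0, GL_MAP_WRITE_BIT, &stride, &layout);
/* ... read/write texels through ptr, using 'stride' bytes per row ... */
glUnmapTexture2DINTEL(tex, 0);
```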

If you were unaware of all this, perhaps you should spend some time learning how things currently work before suggesting how they ought to work.

Don’t you notice yourself that questions like the one brought up at the end of your last post are simply ridiculous?

If I had reason to think the question was ridiculous, I wouldn’t have asked it. You brought up each of those points, completely unbidden by anyone else mind you. So it’s not clear what exactly you’re talking about at any particular point.

Or more to the point, you went off-topic when you brought up “It would be nice to be able to bind the pixel-data of textures directly to some buffer”. I was just following your digression.

All of this is still ignoring the synchronization and pipeline draining needed to do such a readback. Here’s a test - every place in code where one would like to have a hypothetical glGetTexSubImage, instead put a glFinish call. Because that’s what it will be the equivalent of. Is it still acceptable?
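In code, the suggested test would be something like this (tex and cpu_buffer are placeholders; glGetTexImage stands in for the missing Sub variant and reads the whole mip level):

```c
/* Wherever the hypothetical glGetTexSubImage call would go: */
glFinish();   /* drain the entire pipeline - roughly what the readback costs */
glBindTexture(GL_TEXTURE_2D, tex);
glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, cpu_buffer);
```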

…unless the read is done to a buffer object (GL_PIXEL_PACK_BUFFER), in which case the flush is needed only when the buffer object is read.
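For reference, that pack-buffer path looks like this (pbo, width, height are placeholders):

```c
/* glReadPixels into a pixel pack buffer returns without waiting for the data;
 * the stall, if any, is deferred to the map, ideally issued a frame or two later. */
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);
glReadPixels(0, 0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, (void *)0);

/* ... later ... */
const void *pixels = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0,
                                      width * height * 4, GL_MAP_READ_BIT);
/* use pixels ... */
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
```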

And by putting those “one or two declaratory sentences”, you’re basically saying that if their hardware works a different way, they cannot implement OpenGL. …

Right. If their hardware is unable to read a few numbers out of memory in a given order, they cannot implement OpenGL (4.)5. What’s the problem? The further discourse on how cached memory access works points in that direction. How difficult would it be to write texture-accessing methods for modern GPUs that did not try to exploit cache lines? (I didn’t even bother to read the link you provided.) You claim to be talking about hardware issues all the time? Aren’t those things programmable? The sentence

Different hardware can swizzle, or not, as it sees fit.

is an example. If the hardware cannot randomly access its own memory, then there is a real problem. Otherwise it’s just one picture of how things ought to be done trying to exclude the other. As this thread and forum category are about the proposed target state, not the as-is state, I wouldn’t worry about optimizations in current implementations of the API not working exactly the same way. And then again, there is no problem with keeping those optimizations for the cases where they are applicable. I do not see a contradiction here. And I can, without a problem, write all this without knowing exactly how someone decided to optimize certain use cases - that is, the ones defined and/or implied by the API as it stands. And that is simply because I know: first come the definitions, then the implementations. Not the other way around.
The note about transparently buffered textures not being as optimizable as opaque textures is something that belongs in the programming guide, not the specification. The same goes for a warning that GetTexSubImage might lead to a read-back from GPU memory and hence consume some time. I don’t know the exact DMA timings these days, but I guess it cannot take more than a few hundred clock cycles before the data transfer becomes O(n), which means a delay on the microsecond scale. Such a delay can’t possibly be the reason to rule out functionality, on the grounds that it would lead people to use that functionality and hence write applications that cause such delays.
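To put a hedged number on that, assuming a PCIe-class bandwidth of about 8 GB/s (an assumed figure, and covering only the copy itself, not any wait for the pipeline to reach that point), reading back a 256x256 GL_RGBA8 region works out to

```latex
t \approx t_{\text{setup}} + \frac{256 \times 256 \times 4\ \text{bytes}}{8\ \text{GB/s}}
  \approx t_{\text{setup}} + 33\ \mu\text{s}
```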

About the off-topicness: you are aware of the course the discussion took, aren’t you? But if it eases your mind, I could open another thread specifically for transparently buffered textures and dedicate this one wholly to GetTexSubImage - although I would not know what there is to discuss about it. We’re not driver implementors who have to care about how this could be done as fast as possible. We’re users of the API wondering about missing functionality…

…which depends on when you read the buffer object. If you need it in the same frame - you’re screwed - now you have to wait for all pending GL calls to complete, as well as the transfer from texture to buffer object. If you can wait until a few frames later it’s OK, but I get the impression from the OP that he needs it in the same frame (otherwise he’s going to be performing physics/etc on out-of-date data) so reading to a buffer object seems a strawman in this particular case.

Could you elaborate in concrete terms on what “you’re screwed” means? When the results of a previous operation are needed, it is trivially clear that the operation has to be finished before going on. I guess “you’re screwed” only means “it is impossible to exploit (in)dependencies via multi-threading” in that case.

…which depends on when you read the buffer object. If you need it in the same frame - you’re screwed - now you have to wait for all pending GL calls to complete, as well as the transfer from texture to buffer object.

I’d imagine reading the buffer object after swapping buffers would be good enough… however, even that much waiting seems quite extreme. It is not as if an immediate-mode renderer waits for the buffer swap before doing anything. The best thing to do, I would guess, would be to use a sync object and query the sync to see when the operation is done. Once it is, do the buffer object read… If one needs the values at some point to continue, then one will bite the bullet and cause the stall, but if there are other rendering bits going on and the values are not needed by the CPU immediately, then I strongly suspect that the sync jazz will prevent a lot of stalls even if the values are used/needed in the same frame.

It all depends, though, on how much GL work there is between the height-map render and the point where the CPU needs the data.
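A sketch of that fence-then-poll pattern (pbo and size are placeholders; it assumes the glReadPixels into the pack buffer has already been issued):

```c
/* Right after issuing the readback into the GL_PIXEL_PACK_BUFFER: */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* Later in the frame: poll rather than block. */
GLint status = GL_UNSIGNALED;
glGetSynciv(fence, GL_SYNC_STATUS, sizeof status, NULL, &status);
if (status == GL_SIGNALED) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
    const void *data = glMapBufferRange(GL_PIXEL_PACK_BUFFER, 0, size,
                                        GL_MAP_READ_BIT);
    /* ... feed 'data' to the physics/CPU side ... */
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
    glDeleteSync(fence);
} else {
    /* Not ready yet: do other work. Only if the values are needed right now,
     * bite the bullet and block:
     * glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, UINT64_MAX); */
}
```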

Multithreading is irrelevant here. You are aware that the GPU and CPU are separate processors, aren’t you? And that they run asynchronously? And that there can be a ~3 frame latency between the GL commands you submit and the moment they make it all the way through the pipeline and onto the screen? And that if you do a readback - particularly a readback from something that needs to wait until a late stage in the pipeline - then you’re not just waiting for one operation to complete; you’re waiting for ~3 frames’ worth of operations to complete?

That’s what “you’re screwed” means.

What 3 frames are we talking about here, and 3 frames of what? Looking at the figures for the bus system does not suggest delays on that scale - it runs in the high-MHz range, after all. Assuming that the GPU is able to do 50-60 frames per second, how could that possibly be true? Flushing execution causes a wait - OK so far, this is done every frame. And then? The GPU is idle… Do you mean it takes >50 ms for a command to arrive and/or get recognized by the idle GPU? Are we rendering over a network?