As you know, calling glTexSubImage2D on a texture that is currently in use may cause the receiving texture to be ghosted: that is, a temporary copy is created to which the changes are applied until work using the original resource is finished.
Now my question is whether drivers can avoid ghosting if the source of the glTexSubImage2D is a pixel buffer. My rationale would be that since the driver has full control over who modifies that buffer and when, it should be able to just preserve that buffer until it can apply the changes to the original resource.
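For concreteness, this is the kind of PBO-sourced upload path I mean (a minimal sketch; `tex`, `pixels`, and the 256x256 RGBA size are just placeholders):

```c
// Sketch of a PBO-sourced glTexSubImage2D upload (GL 2.1+ / ARB_pixel_buffer_object).
GLuint pbo;
glGenBuffers(1, &pbo);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, 256 * 256 * 4, pixels, GL_STREAM_DRAW);

// With a buffer bound to GL_PIXEL_UNPACK_BUFFER, the last argument of
// glTexSubImage2D is interpreted as a byte offset into that buffer,
// not as a client-memory pointer.
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 256, 256,
                GL_RGBA, GL_UNSIGNED_BYTE, (const void *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```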
“Ghosting” is done so that the transfer operation can take place while the conceptual resource is still being used in its original form. Hardware cannot do that, so a “ghost” has to be created to be the destination of the transfer during usage. But the overall point is to allow overlap between the transfer and the usage of the resource.
PBOs change nothing about this. PBO-based uploads don’t have to first copy the pixel data into internal driver storage to transfer from, but they still have to do a transfer. And hardware still cannot overlap that transfer with usage. So if the implementation wants to pretend that it can, it will have to create a “ghost” image to do that with.
The only way to avoid this is for the implementation to choose to prevent the overlap: to have one operation wait on the other. But that means throwing away optimization opportunities.
I see. What about putting the upload image into its own texture, then using glCopyImageSubData or glBlitFramebuffer to copy it to the final texture? I vaguely remember that glBlitFramebuffer acts as a drawcall internally, not as a transfer operation that has to be processed immediately.
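Something like this, say, with glCopyImageSubData (a sketch, assuming GL 4.3 / ARB_copy_image; `staging`, `atlas`, and the sizes/offsets are placeholders):

```c
// 1. Upload into an idle scratch texture; nothing is reading it, so no ghost.
glBindTexture(GL_TEXTURE_2D, staging);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                GL_RGBA, GL_UNSIGNED_BYTE, pixels);

// 2. GPU-side copy into the (possibly in-use) destination texture.
glCopyImageSubData(staging, GL_TEXTURE_2D, 0, 0,    0,    0,
                   atlas,   GL_TEXTURE_2D, 0, dstX, dstY, 0,
                   w, h, 1);
```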
That’s just you implementing ghosting yourself. Except you don’t have the tools to be able to prevent the transfer command from executing until the commands that are using the texture have completed (not without the CPU waiting). So it will be a less-capable form of ghosting.
Why? Because glCopyImageSubData or glBlitFramebuffer ghost themselves? Or do you mean the additional texture I intend to use?
The reason why ghosting is a problem for me is that it creates a full copy of the receiving texture, which may be pretty large. My intent was to create a texture the size of the small source image, which can transfer immediately because it is not in use. Once it is in VRAM, copying it to the large texture should be a VRAM-to-VRAM copy without ghosting, right?
The general method to avoid texture ghosting instigated by “CPU-side” texture content updates (e.g. glTexSubImage2D()) is to multibuffer. That is, create a ring buffer of textures and never change a texture “from the CPU” until 2-3 frames have passed (or in general, until after the last GPU-side read of that texture has completed on the GPU).
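A sketch of that, assuming three frames in flight (`tex[]`, `updateTexture()`, and `drawWith()` are placeholder names):

```c
#define RING_SIZE 3           // covers the usual 2-3 frames in flight
GLuint tex[RING_SIZE];        // identical textures, created up front
unsigned frame = 0;

void perFrame(void)
{
    GLuint current = tex[frame % RING_SIZE];
    // This texture was last read RING_SIZE frames ago. Assuming the GPU
    // is never more than RING_SIZE-1 frames behind, it is idle now, so
    // updating it "from the CPU" should not trigger a ghost.
    updateTexture(current);   // e.g. glTexSubImage2D into `current`
    drawWith(current);
    frame++;
}
```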
Note that ghosting is typically not instigated by GPU-side texture content updates (e.g. rendering into them as attachments of the active DRAW_FRAMEBUFFER). However there, you end up multibuffering the FBO, because tiled rendering state (e.g. tile parameter buffers) is often associated internally with the FBO, and you want the draw queuing+rendering for multiple frames to be in flight at the same time. That’s essential for good performance on mobile / tile-based GPUs. Failing to multibuffer FBOs results in an implicit sync, with the CPU waiting for the GPU to “finish up” all the rendering previously queued for that FBO in the last frame. Use the GPU vendor’s profiler to see these implicit syncs in action.
It depends on how the driver handles deferred PBO → texture subloads.
If this happened in the GPU’s timeline, and if it knew that you hadn’t changed the relevant subset of the PBO’s contents, and if the destination texture was only changed in the GPU’s timeline, then theoretically it could avoid the ghost.
But you can see there are a lot of “ifs” here. So I wouldn’t expect this to be the case even on one specific GPU vendor’s driver unless their devtech guys have told you it is and you have verified it. But even then, you probably need to support more than one GPU, driver, and driver version. So you still end up not being able to depend on this.
Any operation which modifies a texture that is in use could cause ghosting. The only “guaranteed” (since we’re talking about deep driver behavior, nothing is ever guaranteed) way to avoid this is to modify the texture only after you know the texture is no longer in use. OpenGL doesn’t make it easy to know when that is without doing a CPU wait.
But your suggestion is to do that manually. So if your problem is memory allocation, you haven’t really solved that problem. You just put it more directly under your control.
As I’ve explained, the GPU cannot read an image while it is simultaneously being modified; one of those operations must wait for the other. So if you ask the driver to do that, in an attempt to prevent waiting and improve concurrency, the driver will create a copy of the image that it then transfers to.
Put simply, the thing that causes ghosting is asking OpenGL to write to the texture at a point in time when the GPU is reading from it. What “RAM” they are in is irrelevant to this question. So long as you issue a command that modifies an image while that image is in use, ghosting is a possibility.
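For what it’s worth, the closest portable tool for knowing when that point has passed without a blocking wait is a fence sync polled with a zero timeout. A sketch (GL 3.2+ / ARB_sync; `fence` would be something you track per texture):

```c
// After the last draw call that reads the texture:
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();  // ensure the commands (and the fence) actually get submitted

// ... later, before modifying the texture ...
GLenum r = glClientWaitSync(fence, 0, 0);   // timeout 0: poll, don't block
if (r == GL_ALREADY_SIGNALED || r == GL_CONDITION_SATISFIED) {
    glDeleteSync(fence);
    // All commands before the fence, including the reads, are done;
    // a glTexSubImage2D here should not need a ghost.
} else {
    // Still in use: defer the update rather than wait on the CPU.
}
```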
The idea is to delay the transfer to the in-use texture to a point where it is not in use anymore, by using a copy method that works like a drawcall and not like a transfer operation. If OpenGL can make sure that one drawcall doesn’t start before the preceding one ends, it should be able to provide a copy operation that doesn’t start until the preceding calls end, right?
To put it more bluntly:
The equivalent of attaching the receiving texture to an FBO and then drawing a rect to it with the source texture as input. Then OpenGL should be able to just delay this drawcall until the preceding drawcall that uses the receiving texture is done, right?
You want to do a transfer to a currently unused texture. Then you want to take the existing, potentially in-use texture, and attach it to a framebuffer object. Then you attach the newly uploaded texture to a different framebuffer object. And then you want to do a blit between the two. And you’re hoping that this blit will not perform a DMA but instead will use rendering resources.
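In code, that approach would look roughly like this (a sketch; `smallTex`, `atlas`, and the rectangles are placeholders, and whether the driver implements the blit as a draw or as a DMA is entirely up to it):

```c
GLuint readFbo, drawFbo;
glGenFramebuffers(1, &readFbo);
glGenFramebuffers(1, &drawFbo);

// Freshly uploaded small texture as the read source...
glBindFramebuffer(GL_READ_FRAMEBUFFER, readFbo);
glFramebufferTexture2D(GL_READ_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, smallTex, 0);
// ...and the potentially in-use texture as the draw destination.
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, drawFbo);
glFramebufferTexture2D(GL_DRAW_FRAMEBUFFER, GL_COLOR_ATTACHMENT0,
                       GL_TEXTURE_2D, atlas, 0);

glBlitFramebuffer(0, 0, w, h,                        // source rect
                  dstX, dstY, dstX + w, dstY + h,    // destination rect
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);
```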
Note that in Vulkan, vkCmdBlitImage is considered a graphics operation, but it requires that the images support the transfer src/dest usage flags, not the color attachment or sampled image flags that FBO attachments and textures use. It’s also not something you can do in a render pass. So there is absolutely no guarantee of what will happen.
And you’re doing this to reduce the amount of memory being consumed… how? Either way, you have two textures; whether they’re explicitly allocated by you or implicitly by the system, the memory consumption is the same.
You still have all of the difficulties of the original (two textures and the GPU memory associated with them), but now you have a brand new problem (all the performance you just threw away by doing two transfers, not to mention the FBO fiddling). Where is the gain here?
Yes, I would have two textures. I assume a scenario where I want to copy a small image into a large texture atlas. I don’t want the atlas to ghost, due to its size and strict memory limits, so I don’t want to use glTexSubImage2D. So I’m looking for a method that delays the write to the atlas until after already-issued drawcalls have been processed, in order to avoid ghosts. My assumption was that glBlitFramebuffer can use a small texture as source and a large texture as target, and isn’t executed until previous GL calls are done, so it wouldn’t ghost on the target.
So yes in a way I’m just trying to avoid the overlap between transfer and usage by not using glTexSubImage2D in a scenario where memory is my biggest problem.
For reference, what @unibodydesignn on the Imagination Tech PowerVR thread calls “twiddle and compress” and ImgTech sometimes calls “swizzling”, Vulkan calls “tiling”. That is, a texel re-ordering step that occurs during the texture subload which changes its layout from a LINEAR texture pixel layout (L->R, B->T) to the TILED layout it has when stored in GPU memory for fast access (…where texels close to each other in X or Y are “close” to each other in GPU memory, from an address and fast-access perspective; think something like Morton order or Hilbert curves; see Z-order curve).
The take-home is that a texture subload isn’t just a memcpy, but a texel reordering that may be performed by a completely separate GPU work queue (IIRC called a TRANSFER queue in Vulkan). With OpenGL, this is largely transparent to you.
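To illustrate the kind of reordering involved (only the idea, not any vendor’s actual scheme), a Morton/Z-order encode interleaves the bits of x and y so that 2D-adjacent texels land near each other in the 1D address space:

```c
#include <stdint.h>

// Spread the low 16 bits of v out to the even bit positions.
static uint32_t part1by1(uint32_t v)
{
    v &= 0x0000FFFF;
    v = (v | (v << 8)) & 0x00FF00FF;
    v = (v | (v << 4)) & 0x0F0F0F0F;
    v = (v | (v << 2)) & 0x33333333;
    v = (v | (v << 1)) & 0x55555555;
    return v;
}

// Texel (x,y) -> Morton address: x bits on even, y bits on odd positions.
// e.g. mortonEncode(2, 3) == 0b1110 == 14.
static uint32_t mortonEncode(uint32_t x, uint32_t y)
{
    return part1by1(x) | (part1by1(y) << 1);
}
```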
Related: If you actually end up trying a PBO-based technique, be aware that (in my experience with ImgTech PowerVR drivers at least), the PowerVR OpenGL ES driver will “block” (implicit sync) if you try to change the contents of a buffer object from the CPU (e.g. changing a VBO or PBO via glBufferSubData(), glMapBuffer*(), etc.) while a GPU-side read of said buffer object is still outstanding. This matters most for buffer objects read in the fragment stage, as those reads may be deferred for 1-2 frames, but it also matters for buffer objects (e.g. VBOs) read in the vertex stage if changed multiple times per frame, as this can force a TA flush, requiring the driver to catch up on pending vertex transform work before you can change the contents of the VBO.
This is all driver voodoo that just has to happen behind-the-scenes on OpenGL ES drivers on tile-based GPUs.
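One common way to sidestep that implicit sync (not something from the post above, just a standard streaming pattern) is to “orphan” the buffer before rewriting it, so the driver can hand you fresh storage while the GPU keeps reading the old one:

```c
// `pbo`, `size`, and `newPixels` are placeholders.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);

// Re-specifying the data store with NULL detaches ("orphans") the old
// storage; in-flight GPU reads keep using it until they finish.
glBufferData(GL_PIXEL_UNPACK_BUFFER, size, NULL, GL_STREAM_DRAW);

// This write lands in the new storage, so it cannot collide with the
// outstanding GPU read, and the driver has no reason to block.
glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, size, newPixels);
```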