Zero overhead texture upload via host-accessible VRAM?

Hi,

I have to upload very small texture data (32x32 px, A8) at high frequency - the texture data is read only once. However, drivers don't seem to cope with this very well: most are optimized for large uploads, and while most implementations of glTexSubImage2D seem to create a driver-internal copy of the data to avoid stalling, using explicit buffers has higher driver overhead due to all the explicit interaction with driver state.

Therefore I wonder - is it possible with OpenGL to create a texture whose texel data is stored in the host-accessible part of VRAM? This way I could write from the CPU directly to the texture data (uncached, write-combined) and could avoid any buffer-synchronization primitives. The only thing I would require is a fence as confirmation that a memory area has been read, so I know when I can safely overwrite it.

I wonder, is this at all possible with OpenGL (preferred) or Vulkan?
Some pointers would be really welcome.

Thanks and best regards, Clemens

There is no way to do this in OpenGL, and while Vulkan permits it, it does not require implementations to allow you to do it.

In Vulkan, what you have is a distinction between linear textures and tiled textures. A linear texture is one that exists in a well-understood format, such that the CPU can address individual texels of the texture in memory. A tiled texture cannot be addressed that way.

However, Vulkan does not require that linear textures be usable for much of anything beyond copy operations. Implementations are free to let you use linear textures as source images for texture fetching operations in a shader, or for other things, but they aren't required to do so. Nor are they required to do so for any particular format.

Also, Vulkan implementations don’t have to allow linear textures to be in device-local memory. They may allow that, but they aren’t required to do so.

So you can take advantage of this in Vulkan, but only if you do a bunch of queries beforehand to make sure that everything you need to do is actually supported by the implementation.
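For example, a minimal sketch of those queries in C (assuming you already have a VkPhysicalDevice from an existing instance; the format and properties checked here are just the ones relevant to an A8-style texture):

    #include <stdio.h>
    #include <vulkan/vulkan.h>

    /* Check whether R8 images with linear tiling can be sampled in shaders,
     * and whether any memory type is both device-local and host-visible. */
    void check_linear_sampling_support(VkPhysicalDevice phys)
    {
        VkFormatProperties fmt;
        vkGetPhysicalDeviceFormatProperties(phys, VK_FORMAT_R8_UNORM, &fmt);
        if (fmt.linearTilingFeatures & VK_FORMAT_FEATURE_SAMPLED_IMAGE_BIT)
            printf("R8_UNORM linear images can be sampled\n");

        VkPhysicalDeviceMemoryProperties mem;
        vkGetPhysicalDeviceMemoryProperties(phys, &mem);
        for (uint32_t i = 0; i < mem.memoryTypeCount; ++i) {
            VkMemoryPropertyFlags f = mem.memoryTypes[i].propertyFlags;
            if ((f & VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT) &&
                (f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT))
                printf("memory type %u is device-local and host-visible\n", i);
        }
    }

Even if both checks pass, you would still want vkGetPhysicalDeviceImageFormatProperties to confirm the exact image creation parameters (usage, tiling, dimensions) you intend to use.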

That being said, it’s really better to avoid this to the extent possible. If you’re uploading lots of tiny textures, bundle them together by putting all of these textures in a single texture atlas. Also, make sure to employ PBOs and persistently mapped buffers.
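As a rough illustration of the atlas idea (a sketch in C, assuming a GL 4.2+ context with a loader such as GLEW already initialized; the atlas dimensions and slot scheme are placeholders):

    #include <GL/glew.h>

    /* Create a 1024x1024 R8 atlas that holds a 32x32 grid of 32x32-texel tiles. */
    GLuint create_atlas(void)
    {
        GLuint atlas;
        glGenTextures(1, &atlas);
        glBindTexture(GL_TEXTURE_2D, atlas);
        glTexStorage2D(GL_TEXTURE_2D, 1, GL_R8, 1024, 1024);
        return atlas;
    }

    /* Upload one 32x32 tile into grid slot (sx, sy) of the currently bound atlas. */
    void upload_tile(int sx, int sy, const unsigned char *texels)
    {
        glTexSubImage2D(GL_TEXTURE_2D, 0, sx * 32, sy * 32, 32, 32,
                        GL_RED, GL_UNSIGNED_BYTE, texels);
    }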

Hi Alfonse,

Thanks a lot for the detailed answer.

There is no way to do this in OpenGL, and while Vulkan permits it

I already had the feeling that this is prevented intentionally - every step of the texture-upload path seems to give the driver an opportunity to re-organize the data into its internal layout.

However, reading OpenCL documentation, it seems most hardware (well, except NVIDIA) is really capable these days, with coherent memory sometimes even sharing the same virtual address space as the host CPU, so the GPU can dereference CPU pointers. How truly awesome! So somehow I thought there has to be a way to use this for texture uploads too…

That being said, it’s really better to avoid this to the extent possible.
If you’re uploading lots of tiny textures,
bundle them together by putting all of these textures in a single texture atlas.

Unfortunately I don’t have a lot of control and can’t manually batch uploads. The code in question is a legacy 2D-to-OpenGL library with immediate rendering. I could buffer each rendering request, but I guess that would eat up quite a bit of the savings again.

Thanks for the Vulkan pointers; I am curious to do some experiments and see how it works and performs.

Best regards, Clemens

Historical note: some platforms defined extensions to do it (APPLE_client_storage, APPLE_texture_range, APPLE_fence) and documented the usage patterns. Zero-copy texturing was pretty important in 2002 (e.g. accelerated window server, video playback, etc.)

Actually, that reminded me: Intel (of all people) made an extension to allow for mapping textures. Only Intel ever supported it, though.

Of course, all of Intel’s GPUs are integrated, so there really isn’t any distinction between GPU and CPU memory.

See also NV_pixel_data_range, NV_fence circa 2000.

Performing texture sampling from a GL texture backed directly by host-accessible memory has already been covered here.

Of course that’s not your actual goal, but rather a guess at the best solution for that goal.

The real goal being:

  • Maximum throughput uploading small images to the GPU and reading from them on the GPU

Here are a couple other ideas for you to consider.

1) “Texture” from a uniform array or buffer object (UBO)

Your images are tiny – 1024 bytes. With this idea, we don’t even use a GL texture. We just upload the texels into an array somewhere that a shader can read from very quickly and that we can update very efficiently.

Standard uniform array updates pipeline very well, and the data would typically be stored directly in the GPU multiprocessor’s local shared memory. No special sauce is needed to update these efficiently.

UBOs similarly reside directly in the GPU multiprocessor’s local shared memory. Because of this, read efficiency should be comparable to (though perhaps slightly less than) standard uniforms. Unlike standard uniforms/uniform arrays, however, you update these through the buffer object interface. This can be done efficiently with Buffer Object Streaming techniques (namely PERSISTENT|COHERENT maps, or UNSYNCHRONIZED mapping). Naive buffer object updates which do not use these techniques suffer from slowdowns due to implicit synchronization.
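A minimal sketch of the buffer-side setup (C, assuming GL 4.4+ for glBufferStorage and a loader such as GLEW; the binding index and the fencing policy are up to you):

    #include <string.h>
    #include <GL/glew.h>

    /* One persistently, coherently mapped UBO: the CPU writes texels straight
     * into the mapping; no per-update map/unmap and no implicit synchronization. */
    static GLuint ubo;
    static unsigned char *ubo_ptr;

    void create_streaming_ubo(GLsizeiptr size)
    {
        const GLbitfield flags =
            GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
        glGenBuffers(1, &ubo);
        glBindBuffer(GL_UNIFORM_BUFFER, ubo);
        glBufferStorage(GL_UNIFORM_BUFFER, size, NULL, flags);
        ubo_ptr = glMapBufferRange(GL_UNIFORM_BUFFER, 0, size, flags);
        glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);   /* uniform block binding 0 */
    }

    /* Write one 32x32 A8/R8 image (1024 bytes) into the mapped region. Making
     * sure the GPU is done reading the previous contents (e.g. via glFenceSync,
     * or by rotating through several buffer regions) is left to the caller. */
    void write_tile(const unsigned char *texels)
    {
        memcpy(ubo_ptr, texels, 32 * 32);
    }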

In both of these cases, you sample your “texture” in the shader using simple addressing for storing 2D data in a 1D buffer: y * width + x.
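On the shader side, a fetch along those lines might look like this (a GLSL sketch embedded as a C string, pairing with the buffer setup above; block and variable names are made up, and std140 is why the 1024 texel bytes are declared as 64 uvec4s rather than a plain byte array):

    /* Fragment shader reading the 32x32 "texture" out of the UBO with
     * y * width + x addressing; texel bytes are assumed packed little-endian
     * into the uvec4 array. */
    static const char *frag_src =
        "#version 430\n"
        "layout(std140, binding = 0) uniform Tile { uvec4 texels[64]; };\n"
        "in vec2 uv;\n"
        "out vec4 color;\n"
        "void main() {\n"
        "    ivec2 p = ivec2(uv * 32.0);\n"
        "    int idx = p.y * 32 + p.x;              // y * width + x\n"
        "    uint word = texels[idx >> 4][(idx >> 2) & 3];\n"
        "    float a = float((word >> uint(8 * (idx & 3))) & 0xFFu) / 255.0;\n"
        "    color = vec4(0.0, 0.0, 0.0, a);\n"
        "}\n";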

Storing your small texture image in other buffer objects is equally possible (TBO, SSBO, etc.), and these can also be updated efficiently with Buffer Object Streaming. However, access on the GPU end is less efficient, as these are typically stored in GPU global memory (VRAM), and so initial read latency is that of main GPU memory reads. That said, the buffer region storing your texture is small, and (at least on NVIDIA GPUs) some of the GPU multiprocessor’s local shared memory is devoted to use as a cache for GPU global memory. So storing your “texture” in these types of buffer objects might still perform well. But what’s the point when UBOs and standard uniform arrays are an option and likely faster to read from?

Note that with all of the above possible storage locations, you completely bypass the GPU’s need to tile your texel data (some vendors call this swizzling) before it can render with it, along with any cost of synchronization and internal buffer transfers that might go along with that tiling process. The tradeoff is that your texel data is arranged linearly so access to it might not be quite as efficient (definitely not for large textures). However, your images are so darn tiny that access may still be so efficient (due to caching well or by design being located so close to the GPU multiprocessor cores) that this non-optimal memory organization may not really make any difference performance-wise.

2) Pipeline GL texture updates via PBOs efficiently updated using Streaming Techniques

Texel data → PBO → GL texture(s)

Transferring your texel data through a PBO updated efficiently via Buffer Object Streaming techniques (that is, using PERSISTENT|COHERENT maps, or UNSYNCHRONIZED mapping) can largely, if not completely, eliminate the implicit synchronization involved in uploading the texel data from your app to the PBO (a GPU buffer object). And with vendor-specific methods, you can decide whether this PBO should be located in GPU global memory (VRAM) or in CPU pinned memory mutually addressable by both the CPU and the GPU. So the first stage can be made pretty efficient.
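A sketch of that first stage plus the PBO → texture transfer (C, assuming GL 4.4+ and a loader such as GLEW; offsets, ring-buffering, and fencing are simplified):

    #include <string.h>
    #include <GL/glew.h>

    /* Persistently, coherently mapped pixel-unpack buffer used as a staging area. */
    static GLuint pbo;
    static unsigned char *pbo_ptr;

    void create_streaming_pbo(GLsizeiptr size)
    {
        const GLbitfield flags =
            GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
        glGenBuffers(1, &pbo);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBufferStorage(GL_PIXEL_UNPACK_BUFFER, size, NULL, flags);
        pbo_ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, size, flags);
    }

    /* Copy one 32x32 R8 image into the PBO at 'offset', then start the
     * PBO -> texture transfer. With a pixel-unpack buffer bound, the last
     * argument of glTexSubImage2D is a byte offset into the buffer, not a
     * client pointer. Reuse 'offset' only after a fence shows the GPU is done. */
    void upload_via_pbo(GLuint tex, GLintptr offset, const unsigned char *texels)
    {
        memcpy(pbo_ptr + offset, texels, 32 * 32);
        glBindTexture(GL_TEXTURE_2D, tex);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 32, 32,
                        GL_RED, GL_UNSIGNED_BYTE, (const void *)offset);
    }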

With the data in PBO, the PBO → GL texture transfer/tiling should generally pipeline very well. But to maximize its efficiency, you need to think like a driver.

  1. First, if you tell OpenGL to update (write to) a texture from the CPU while a command already in the pipeline directs the GPU to read/use the texture’s current contents, the driver will have to do one of the following:
    a) synchronize (block) until the texture read(s) are done,
    b) ghost (duplicate) the old contents of the texture before the updates, or
    c) pipeline the update content in the command buffer (to defer it until later).
    To avoid all of these possibilities in the driver (especially the first two), don’t just repeatedly write to and read from the same GL texture each time. Use a ring buffer of GL textures. That is, if A, B, C, etc. are textures, instead of doing: write A, read A, write A, read A, do write A, read A, write B, read B, write C, read C. Allocating enough textures for 2-3 frames or so should be enough to avoid texture contention.

  2. Second, when possible you want to ensure that you are using a GL internal format for your texture that is natively supported by the GPU/driver (and thus will not suffer the cost of expensive texel format conversion inside the driver). To do this, use image format queries (glGetInternalformativ). This allows you to determine an optimal internal format for the GPU/driver (GL_INTERNALFORMAT_PREFERRED) as well as the GL format and type to use when subloading texel data into those internal formats (GL_TEXTURE_IMAGE_FORMAT, GL_TEXTURE_IMAGE_TYPE). In your case, you’d want to check on that GL_R8 format. (While you’re at it, you can check GL_ALPHA8 or GL_LUMINANCE8 too, …but I wouldn’t hold my breath.)
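For example (a sketch using ARB_internalformat_query2 / GL 4.3+; only the GL_R8 case from above is shown):

    #include <GL/glew.h>

    /* Ask the driver which internal format it actually prefers for R8 2D
     * textures, and which client format/type it wants for subloads into it. */
    void query_r8_upload_path(void)
    {
        GLint preferred = 0, img_format = 0, img_type = 0;
        glGetInternalformativ(GL_TEXTURE_2D, GL_R8, GL_INTERNALFORMAT_PREFERRED, 1, &preferred);
        glGetInternalformativ(GL_TEXTURE_2D, GL_R8, GL_TEXTURE_IMAGE_FORMAT,     1, &img_format);
        glGetInternalformativ(GL_TEXTURE_2D, GL_R8, GL_TEXTURE_IMAGE_TYPE,       1, &img_type);
        /* If preferred comes back as GL_R8 and format/type as GL_RED /
         * GL_UNSIGNED_BYTE, subloads in that layout should avoid any
         * driver-side texel conversion. */
    }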

Have fun!

Hi again,

Thanks for all the suggestions and sorry it took so long to come back to that topic.

1) “Texture” from a uniform array or buffer object (UBO)

I am currently experimenting with the UBO approach suggested previously; however, instead of UBOs I am using SSBOs - and at least on my AMD iGPU I get crazy fast results, approaching the shared-memory bandwidth of that system.

The buffer is mapped with WRITE_ONLY | COHERENT | PERSISTENT, so effectively I am getting uncached, write-combined memory access directly to the GPU’s VRAM.
What remains to be seen, however, is how efficient the CPU->VRAM updates are when writing to a card connected via PCIe; I’ll update this post once I have some benchmarks.

Thanks and best regards, Clemens