PBO + glCompressedTexImage2D async is slow (DXT)

Hello,
I want to load .dds DXT1/3/5 async, but I am unable to do it, as DXT1 2048x2048 takes 2-5ms to upload to gpu. Tried everything, googled all pages… Read tons of threads here, and still no idea. Maybe Dark Photon will help me :slight_smile:

My multithreaded mapping/unmapping/loading code works fine with uncompressed PNGs: the PBO→texture copy takes ~0 ms for a 2048x2048 image (measured with std::chrono).

So what I am doing:


int texWidth  = 2048;
int texHeight = 2048;
int blockSize = 8;   // 8 for DXT1, 16 for DXT3/DXT5
int dxtSize   = ((texWidth + 3) / 4) * ((texHeight + 3) / 4) * blockSize;

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, PBO);
glBufferData(GL_PIXEL_UNPACK_BUFFER, dxtSize, nullptr, GL_STREAM_DRAW);
void* PBOptr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
memcpy(PBOptr, imagePtr, dxtSize);

Once the loading thread is done, in the main thread I unmap the buffer and copy the contents from the PBO into the texture:


glBindBuffer(GL_PIXEL_UNPACK_BUFFER, PBO);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
glCompressedTexImage2D(GL_TEXTURE_2D, 0, GL_COMPRESSED_RGBA_S3TC_DXT1_EXT, texWidth, texHeight, 0, dxtSize, nullptr); //takes 2-5ms
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

I tried DXT5 as well, different compression tools, and both mipmapped and non-mipmapped textures; the texture loads correctly, but it takes a lot of time. I also tried glTexStorage2D + glCompressedTexSubImage2D, preallocating with glCompressedTexImage2D and nullptr, and even mixing everything with glTexImage2D (hah). It still takes >1 ms. Where is the problem? It looks like glCompressed(Sub)TexImage2D has to do some internal memcpy that takes time.

Maybe this helps:


OpenGL renderer string: Mesa DRI Intel(R) Ivybridge Mobile 
OpenGL core profile version string: 4.2 (Core Profile) Mesa 18.1.8
OpenGL core profile shading language version string: 4.20

EDIT:
Using Intel HD4000

EDIT2:


$ glxinfo | grep EXT_texture
    GLX_EXT_texture_from_pixmap, GLX_EXT_visual_info, GLX_EXT_visual_rating, 
    GLX_EXT_import_context, GLX_EXT_texture_from_pixmap, GLX_EXT_visual_info, 
    GLX_EXT_texture_from_pixmap, GLX_EXT_visual_info, GLX_EXT_visual_rating, 
    GL_EXT_texture_array, GL_EXT_texture_compression_dxt1, 
    GL_EXT_texture_compression_rgtc, GL_EXT_texture_compression_s3tc, 
    GL_EXT_texture_filter_anisotropic, GL_EXT_texture_integer, 
    GL_EXT_texture_sRGB, GL_EXT_texture_sRGB_decode, 
    GL_EXT_texture_shared_exponent, GL_EXT_texture_snorm, 
    GL_EXT_texture_swizzle, GL_EXT_timer_query, GL_EXT_transform_feedback, 
    GL_EXT_texture, GL_EXT_texture3D, GL_EXT_texture_array, 
    GL_EXT_texture_compression_dxt1, GL_EXT_texture_compression_rgtc, 
    GL_EXT_texture_compression_s3tc, GL_EXT_texture_cube_map, 
    GL_EXT_texture_edge_clamp, GL_EXT_texture_env_add, 
    GL_EXT_texture_env_combine, GL_EXT_texture_env_dot3, 
    GL_EXT_texture_filter_anisotropic, GL_EXT_texture_integer, 
    GL_EXT_texture_lod_bias, GL_EXT_texture_object, GL_EXT_texture_rectangle, 
    GL_EXT_texture_sRGB, GL_EXT_texture_sRGB_decode, 
    GL_EXT_texture_shared_exponent, GL_EXT_texture_snorm, 
    GL_EXT_texture_swizzle, GL_EXT_timer_query, GL_EXT_transform_feedback, 
    GL_EXT_texture_border_clamp, GL_EXT_texture_compression_dxt1, 
    GL_EXT_texture_filter_anisotropic, GL_EXT_texture_format_BGRA8888, 
    GL_EXT_texture_rg, GL_EXT_texture_sRGB_decode, 
    GL_EXT_texture_type_2_10_10_10_REV, GL_EXT_unpack_subimage,

[QUOTE=zipponwindproof;1292924]I want to load .dds DXT1/3/5 async, but I am unable to do it, as DXT1 2048x2048 takes 2-5ms to upload to gpu. …
My mapping/unmapping/loading multithreaded code works on uncompressed PNG fine, pbo->texture takes 0ms for 2048x2048 (measured using std::chrono)[/QUOTE]

A couple things that occur to me here.

First, timing specific GL calls doesn’t really tell you much. The driver might block or yield anywhere for any one of a number of reasons. If you’re going to use CPU timers, what I’d recommend is that you time entire frames. After calling SwapBuffers, call glFinish(), then snapshot the timer. Frame time is the delta between two adjacent timer snapshots. Disable VSync so that you’re timing pure rendering (and not VBlank waiting).
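Something like this, as a rough sketch (appRunning/drawFrame/swapBuffers are placeholders for whatever your windowing and render code already do, not real API calls; a GL header/loader is assumed to be included for glFinish):

// Whole-frame CPU timing: swap, glFinish(), then snapshot the clock.
#include <chrono>
#include <cstdio>

bool appRunning();    // placeholder: your loop condition
void drawFrame();     // placeholder: all GL work for the frame
void swapBuffers();   // placeholder: SwapBuffers / glfwSwapBuffers / ...

void renderLoop()
{
    using clock = std::chrono::steady_clock;
    auto last = clock::now();
    while (appRunning())
    {
        drawFrame();
        swapBuffers();
        glFinish();   // wait until the driver/GPU has really finished the frame

        auto now = clock::now();
        double ms = std::chrono::duration<double, std::milli>(now - last).count();
        last = now;
        std::printf("frame time: %.2f ms\n", ms);
    }
}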

With that tool in hand, you can enable and disable specific behavior and then get a frame time delta that resulted from adding specific behavior. Use that on your compressed and uncompressed subloads to get some new times. There’s no way that filling an uncompressed 2048x2048 texture is really taking 0 msec.

Now, to your problem…

I have no personal experience with Intel GL drivers, so I can’t advise you specifically here. However, here are a few general thoughts on what you’re doing.

First, are you making GL calls on multiple threads? If so, I’d stop doing that.

The way you’re mapping the buffer is likely to cause synchronization in the GL driver. See this page for details and alternatives: Buffer Object Streaming.

You’re also orphaning the PBO buffer every time you upload, possibly with different buffer object sizes. This probably isn’t a good idea (but it depends on your driver).
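For example, here's a minimal sketch of keeping one fixed-size PBO around and reusing it (maxDxtSize and pbo are assumed to come from your init code; this is just one of the options that wiki page describes):

// At init, once: allocate the largest size you'll ever need.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glBufferData(GL_PIXEL_UNPACK_BUFFER, maxDxtSize, nullptr, GL_STREAM_DRAW);

// Per upload: map with INVALIDATE so the driver may hand you fresh storage
// instead of synchronizing on whatever still references the old contents.
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
void* ptr = glMapBufferRange(GL_PIXEL_UNPACK_BUFFER, 0, dxtSize,
                             GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_BUFFER_BIT);
memcpy(ptr, imagePtr, dxtSize);
glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);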

Also, when you latch the data into the texture, your code uses TexImage, which reallocates the texture MIP. I know you said you tried other permutations, but you should preallocate your textures up-front, pre-render with them to force the driver to actually create them in GPU memory, and then at runtime just “subload” into them (TexSubImage). That way you don’t take the hit of allocation at runtime, just the cost of uploading the texel data into the texture (with the GPU silently tiling/swizzling the texels in the process). Note that you can subload less than one full MIP level at a time with DXT by subloading on 4x4 block boundaries. This is useful for tuning the amount of time you spend per frame uploading texture data to the GPU.
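As a rough sketch of the preallocate-then-subload idea (using a single MIP level here just to keep it short; with mipmaps you'd pass the full level count and subload each level):

// Init: immutable storage for the texture, allocated once up-front.
GLuint tex;
glGenTextures(1, &tex);
glBindTexture(GL_TEXTURE_2D, tex);
glTexStorage2D(GL_TEXTURE_2D, 1, GL_COMPRESSED_RGBA_S3TC_DXT1_EXT,
               texWidth, texHeight);

// Runtime, with the filled PBO bound as GL_PIXEL_UNPACK_BUFFER:
// no reallocation, just a subload of the texel data.
glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, texWidth, texHeight,
                          GL_COMPRESSED_RGBA_S3TC_DXT1_EXT, dxtSize,
                          nullptr);   // nullptr = offset 0 into the bound PBO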

Keep in mind the whole process that occurs at runtime:

copy data into PBO -> [upload to GPU mem] -> copy texels into GPU texture -> [tile texel data into texture] -> render with texture

The steps in brackets are things that happen behind-the-scenes and can take some time. If your program expects these to take zero time, then you end up with blocks in the driver because it’s waiting on some of these steps to complete. So it’s best to allow some time here between steps to allow these behind-the-scenes steps to happen while you’re doing other things. How much time is needed for each of these steps depends on your GL driver and system hardware.

Thanks for the formatting. I was on the mobile version when creating this thread and did not see how to add code tags.

Before I reply to your message, here are some new findings, still measured the old way.

So now I am loading and rendering a grid of 36 copies of the same ‘uncompressed’ 2048x2048 RGBA PNG.
(Preallocating with glTexImage2D or glTexStorage2D and then glTexSubImage2D is ~50us faster on the GTX1060 than a single glTexImage2D, but on the HD4000 a single glTexImage2D is 100x slower, i.e. ~5000us.)
First Linux machine: HD4000, 4.2 (Core Profile) Mesa 18.1.8, i7-3667u: the first image takes ~2000us, the rest ~40us.
Second Linux machine: GTX1060, 4.2 NVIDIA 396.51, i5-5400: the first image takes ~1000us, the rest ~250us.

And here is the same grid of 36 copies of the same ‘compressed’ DXT5 2048x2048 texture:
(Preallocating with glCompressedTexImage2D or glTexStorage2D and then glCompressedTexSubImage2D was a bit faster, by 50-100us, than a single glCompressedTexImage2D.)
First Linux machine: HD4000, 4.2 (Core Profile) Mesa 18.1.8, i7-3667u: the first image takes ~4000us, the rest ~4000us.
Second Linux machine: GTX1060, 4.2 NVIDIA 396.51, i5-5400: the first image takes ~200us, the rest ~200us.

So it seems the much older first machine with the HD4000 loads the first PNG image slowly, but then it ‘caches’ something and copies from PBO to texture even faster than the GTX1060. But it doesn’t handle compressed images properly: they all take ~4 ms, while the GTX1060 takes ~0.2 ms, which is the number I want. (Please note I am new to OpenGL, especially to PBOs, compressed textures and ‘profiling’, so I don’t know what numbers to expect, but the GTX1060 numbers look reasonable to me. I don’t even have a real ‘game’ test scene, only a standalone test scene.)

NVIDIA looks much better in all cases, and it is what I actually target (though I develop on Linux with the HD4000 and target a GTX on Windows, so I don’t know what the drivers/results look like there). It seems the old Intel part is unstable and not really meant for this kind of performance-sensitive work.

Side note: to make glTexStorage2D work I had to use GL_RGBA8, not GL_RGBA; it took me a while to figure that out.

All tests above are with a PBO bound.

Without a PBO I am getting:
HD4000: 13000us for glTexImage2D and 5000us for glCompressedTexImage2D.
NVIDIA: 5000us for glTexImage2D and 1700us for glCompressedTexImage2D…

Later I will ask a friend to test on his old GeForce 9500 GT/PCIe/SSE2 (OpenGL core profile 3.3.0, NVIDIA 340.107). I think he has the same issues with glCompressedTexImage2D as I have on the HD4000, but I’m not sure about glTexImage2D.


Now to reply to your suggestions:

  1. I will try timing the delta between whole frames later.
  2. I don’t call GL from multiple threads, only one. In the main thread (which owns the GL context) I create the PBO and get a pointer from glMapBuffer. I then use that pointer in a second thread without a GL context, where I load the texture and memcpy it into the pointer returned by glMapBuffer. Once the memcpy is done, I signal the main GL thread that the texture is loaded into the PBO; there I bind the PBO and call glTexImage2D/glCompressedTexImage2D with nullptr. (A rough sketch of this handoff is below, after this list.)
  3. I will read about Buffer Object Streaming once I have time, thanks.
  4. In my test case each C++ object has its own PBO (in this case 36 objects rendered with 1 texture, thus 36 PBOs), but I use only one thread to process the queue and load textures. Once a texture is marked ready I create its PBO, and after the unmap is done I delete the PBO, so in theory there shouldn’t be many PBOs allocated simultaneously, since they get unmapped (and deleted) ‘faster’ than the next one is allocated and filled from disk via the mapped pointer. What are better ideas for solving this? Two ping-pong PBOs, or a pool of 10-20 PBOs? What is optimal here, and how much memory does a PBO take?
  5. About preallocating: I now preallocate using TexImage and subload with TexSubImage, but I preallocate right before I bind the PBO and unmap, where I then subload. It helps. Preallocating before ‘runtime’ would be a bit hard since my application is a 3D game.
    “Note that you can subload less than one full MIP level at a time with DXT by subloading on 4x4 block boundaries. This is useful for tuning the amount of time you spend per frame uploading texture data to the GPU.” Wow, I didn’t know about this. Do you know where I can read more about it?
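Roughly the handoff I mean in point 2, as a simplified sketch (the Job struct and function names here are just illustrative, not my real code):

#include <atomic>
#include <cstring>

struct Job {
    GLuint            pbo       = 0;
    void*             mappedPtr = nullptr;   // from glMapBuffer on the GL thread
    const void*       srcPixels = nullptr;   // decoded image data in the worker
    size_t            byteCount = 0;
    std::atomic<bool> copied{false};         // worker -> GL thread signal
};

// GL thread: map the PBO and hand the pointer to the worker.
void prepareJob(Job& job)
{
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, job.pbo);
    job.mappedPtr = glMapBuffer(GL_PIXEL_UNPACK_BUFFER, GL_WRITE_ONLY);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

// Worker thread: no GL calls, only the memcpy into the mapped pointer.
void workerCopy(Job& job)
{
    std::memcpy(job.mappedPtr, job.srcPixels, job.byteCount);
    job.copied.store(true, std::memory_order_release);
}

// GL thread, later: once the worker has signalled, unmap and latch into the texture.
void finishJob(Job& job, GLuint tex, int w, int h, GLsizei dxtSize)
{
    if (!job.copied.load(std::memory_order_acquire))
        return;                              // not ready yet, try again next frame
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, job.pbo);
    glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    glBindTexture(GL_TEXTURE_2D, tex);
    glCompressedTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h,
                              GL_COMPRESSED_RGBA_S3TC_DXT1_EXT, dxtSize, nullptr);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}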

All I want is smooth movement in the world while loading 2K textures, but I did not find much about this while googling around. It seems the problem is simple and my ancient HD4000 is just not meant to handle such 3D game worlds, but since I use it as a dev machine I got disappointed by the ‘wrong’ results. It would be interesting to run this on other NVIDIA cards and also on Windows.

Sorry for exhaustive reply. Maybe someone will find it useful in future.

No problem! You can post source code with syntax colorization using [noparse]… and …[/noparse], or just post it without source colorization with [noparse]…[/noparse].

Lots of good information in your last post BTW!

So here it seems first machine much older with HD4000 loads first png image slowly, but then it ‘caches’ it somehow and copies from pbo to texture even faster…

5) About preallocating: Now I preallocate using TexImage and subload TexSubImage, but I preallocate it right before I bind PBO and unmap where I then subload it.

This “first subload slow” behavior is likely because nothing you had done up to that point had forced the driver to allocate the texture in GPU memory.

This is what I meant when I said you should “pre-render with [the textures during init] to force the driver to actually create them in GPU memory, …”. Just because you’ve created and populated the GL texture via the OpenGL API does not mean that the driver+GPU has actually done that work and created+populated a texture in GPU memory. I can tell you for sure that on NVidia GL drivers at least, the driver waits as long as it can before it actually does this allocation and initial upload. Pre-rendering with the texture, before you need access to it to be fast, forces the driver to perform this work.
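As a sketch of what that warm-up could look like (warmupProgram and quadVAO are assumed to be some trivial textured-quad program and geometry you already have; the point is only to draw something with each texture once during init):

#include <vector>

void warmUpTextures(const std::vector<GLuint>& textures,
                    GLuint warmupProgram, GLuint quadVAO)
{
    glUseProgram(warmupProgram);
    glBindVertexArray(quadVAO);
    glViewport(0, 0, 1, 1);                  // keep the draws themselves negligible
    for (GLuint tex : textures)
    {
        glBindTexture(GL_TEXTURE_2D, tex);
        glDrawArrays(GL_TRIANGLE_STRIP, 0, 4);
    }
    glFinish();                              // force the driver to do the work now
}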

Side note: to make glTexStorage2D work I had to use GL_RGBA8 not GL_RGBA, took me time to figure it out.

Yes, glTexStorage2D is a newer API that requires you to provide a sized internal format. GL_RGBA8 is one of these. On the other hand, GL_RGBA is an “unsized” internal format which leaves the driver guessing what you really want, so it’s not valid here.
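For example:

// Immutable storage needs a *sized* internal format:
glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA8, 2048, 2048);     // valid
// glTexStorage2D(GL_TEXTURE_2D, 1, GL_RGBA, 2048, 2048);   // invalid: unsized format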

  1. In my test case I spawn c++ object with own PBO … but I use only one thread to process queue and load texture and once its marked as ready I spawn PBO and after unmap is done I delete PBO, …
    What are better ideas about solving this

For recommendations, definitely see the Buffer Object Streaming wiki page. Also, I wouldn’t create and delete PBOs on the fly for best performance.
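For instance, here's a sketch of a small round-robin ring of preallocated PBOs (RING_SIZE and maxDxtSize are numbers you'd tune, not anything definitive):

constexpr int     RING_SIZE  = 4;
constexpr GLsizei maxDxtSize = 4 * 1024 * 1024;   // largest compressed image you expect

GLuint pboRing[RING_SIZE];
int    nextPbo = 0;

void initPboRing()
{
    glGenBuffers(RING_SIZE, pboRing);
    for (GLuint pbo : pboRing)
    {
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBufferData(GL_PIXEL_UNPACK_BUFFER, maxDxtSize, nullptr, GL_STREAM_DRAW);
    }
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
}

// Round-robin: with enough entries in the ring, by the time a PBO comes around
// again the GPU is normally done reading from it.
GLuint acquirePbo()
{
    GLuint pbo = pboRing[nextPbo];
    nextPbo = (nextPbo + 1) % RING_SIZE;
    return pbo;
}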

…how much memory does PBO take?

The driver can do anything it wants since it largely hides what’s going on behind the API (a partial exception being persistent/coherent mapped buffers), but in general the amount of GPU memory consumption seems to be about the number of bytes you allocate for the buffer object. That said, the driver can and does create extra instances of buffer objects (or pieces of them) for data transfer purposes, particularly when you explicitly orphan a buffer (see that last wiki page for details). But these are probably allocated from CPU memory.

Sure. I’d recommend the original DXT (aka S3TC) extension: EXT_texture_compression_s3tc.
There’s also a page in the wiki that mirrors some of the raw block encoding details from the spec: DXT (OpenGL Wiki)
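And as a rough sketch of what subloading on block-row boundaries could look like for DXT1 (rowsPerChunk is a knob you'd tune per frame; this assumes texture dimensions that are multiples of 4, and that texWidth/texHeight and the bound PBO come from your existing code):

#include <algorithm>

const int blockSize  = 8;                        // 8 for DXT1, 16 for DXT3/DXT5
const int blocksWide = (texWidth  + 3) / 4;
const int blocksHigh = (texHeight + 3) / 4;
const int rowBytes   = blocksWide * blockSize;   // bytes in one row of 4x4 blocks

// Upload 'rowsPerChunk' block rows starting at 'firstBlockRow', reading from the
// currently bound GL_PIXEL_UNPACK_BUFFER at byte offset 'pboOffset'.
void uploadBlockRows(int firstBlockRow, int rowsPerChunk, GLintptr pboOffset)
{
    int rows   = std::min(rowsPerChunk, blocksHigh - firstBlockRow);
    int height = rows * 4;
    glCompressedTexSubImage2D(GL_TEXTURE_2D, 0,
                              0, firstBlockRow * 4,           // x/y offset in texels
                              texWidth, height,
                              GL_COMPRESSED_RGBA_S3TC_DXT1_EXT,
                              rows * rowBytes,
                              reinterpret_cast<const void*>(pboOffset));
}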

All I want is smooth movement in the world, while loading 2k textures,

I know what you mean! That’s the same reason I dug into this too. This is completely doable (have done so on NVidia), as long as you’re sane with your GPU upload bandwidth requirements.

Would be interesting to try to run it under other nvidia cards and also on windows.

If you post a short, stand-alone test program, I can run tests on Windows and Linux with a different NVidia GPU. That said, I can tell you having done OpenGL dev work on both Linux and Windows with NVidia drivers, the driver quality is high in both cases, the drivers actually share a lot of code, and performance between them is very comparable in my experience.

Thanks a lot for the help and for keeping things alive here. It gave me a lot of information while reading other threads as well.

I’ve moved my code into a 3D world to play a bit and loaded 256 copies of the same model, each with 3 DXT5 2048x2048 textures. Each texture is 4,194,304 bytes, so 256 * 3 * 4,194,304 ≈ 3.2 GB. All rows of models load sequentially without freezing, but then, while the last row of models is loading, I get a slowdown. nvidia-smi reports my app uses 2956 MB; with Xorg that makes 3010 MB/3011 MB (I have a GTX1060 3GB).

What happens when my VRAM is used to the maximum and I run out of it? Does it swap to RAM? Do I get artifacts? This scenario is just for testing; in the real world I won’t have so many DXT5 textures, and I’ll use fewer models and also DXT1.

I still haven’t had time to read about buffer streaming, and I’m still doing all tests the naive way: generating a PBO for each texture and deleting it at unmap.

This is what I meant by you should "pre-render with [the textures during init] to force the driver to actually create them in GPU memory, …"

What if, at app init, I don’t know which models (and therefore which texture resolutions) will be needed in the future, depending on where the player moves in the world? Should I blindly generate textures (say 100x 2048x2048 and 100x 1024x1024) with null data, store them in some map, and during gameplay, when the player requests a model with 2048x2048 textures, check whether I have a ‘free’ unused GLuint handle of that resolution; if yes, write the data into it and mark the handle as used? Does this approach make sense? And does it gain me anything?

This is completely doable (have done so on NVidia), as long as you’re sane with your GPU upload bandwidth requirements

What I don’t fully understand is how to think about the GPU and my game in terms of bandwidth. How do I calculate it, and which numbers should I care about? For now the only number I’m checking is VRAM usage. I plan to use deferred rendering, and people say the G-buffer should be thin: they reconstruct positions from depth, reconstruct the normal Z component, and squeeze it as much as possible, even at the cost of final quality. With OpenGL 4+ I can use bindless textures and won’t have to put diffuse and other maps into the G-buffer, so it should be thin, but what do they mean by bandwidth? Isn’t PCIe bandwidth a few GB/s? Why is it a big deal for me then? How could I even hit that limit? This is a somewhat noob question after a few years of OpenGL, but answering it would help me a lot.

If you post a short, stand-alone test program, I can run tests on Windows and Linux with a different NVidia GPU.

I won’t bother you with testing code for me, but thanks for the offer :slight_smile: Sooner or later I will test it on Windows and other GPUs.

What if, at app init, I don’t know which models (and therefore which texture resolutions) will be needed in the future, depending on where the player moves in the world? Should I blindly generate textures (say 100x 2048x2048 and 100x 1024x1024) with null data, store them in some map, and during gameplay, when the player requests a model with 2048x2048 textures, check whether I have a ‘free’ unused GLuint handle of that resolution; if yes, write the data into it and mark the handle as used? Does this approach make sense? And does it gain me anything?

Streaming in OpenGL is always kind of a pain. Creating and destroying memory objects (buffers and textures) is never the right way to go if you want fast streaming. While you can dynamically move stuff around in buffers, since any buffer can have multiple usages to it, textures don’t work that way.

The only really effective way to stream textures is to recycle texture objects. That is, you make your streamed data conform to known expectations. A streaming chunk should have X number of textures of one size/format, Y number of textures of another size/format, etc. Every streaming chunk must use the same numbers of textures of those formats (or at least, of view-texture-compatible formats).

So in OpenGL, each block consists of a fixed number of texture objects of known sizes/formats. When a block streams out, those textures become available. When a new block needs to be streamed in, you load its textures into the now-available texture objects.
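A sketch of that recycling idea (the key and struct names here are just illustrative):

#include <map>
#include <tuple>
#include <vector>

struct TexKey {
    GLenum internalFormat;
    int    width, height;
    bool operator<(const TexKey& o) const {
        return std::tie(internalFormat, width, height)
             < std::tie(o.internalFormat, o.width, o.height);
    }
};

std::map<TexKey, std::vector<GLuint>> freeTextures;   // filled when a chunk streams out

GLuint acquireTexture(const TexKey& key)
{
    auto& pool = freeTextures[key];
    if (!pool.empty()) {                  // reuse a texture of this size/format
        GLuint tex = pool.back();
        pool.pop_back();
        return tex;
    }
    GLuint tex;                           // otherwise create one (ideally only at init)
    glGenTextures(1, &tex);
    glBindTexture(GL_TEXTURE_2D, tex);
    glTexStorage2D(GL_TEXTURE_2D, 1, key.internalFormat, key.width, key.height);
    return tex;
}

void releaseTexture(const TexKey& key, GLuint tex)
{
    freeTextures[key].push_back(tex);     // storage stays allocated for the next chunk
}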

Stuff like this is part of why Vulkan exists; to allow a low-level and therefore more effective streaming system. There, you can deal directly with memory. Your stream blocks can be of a fixed number of bytes, and you can create/destroy textures of arbitrary formats within that storage, without allocating/deallocating that storage.

Isnt bandwidth on pci few gb/s ? Why is it a big deal for me then?

Well just run the numbers. PCI/e 2.0 has a theoretical transfer speed of 8GB/sec. If you want to run at 60fps, that means that you have 1/60th of that number per frame. So that’s 136MB/frame.

Now, a 2048x2048x32bpp texture takes up 16MB of storage. So, if you want to transfer some data in one frame, and have it available by the time the next frame starts, you can transfer up to 8 such textures.

And that is the best case scenario; reality tends to be a lot less theoretical than that.

It doesn’t take much to saturate a PCI/e bus. Oh, it’ll get there. But if you need that data on the next frame, then you need to make sure that you’re not transferring too much data.
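The same arithmetic, written out (theoretical numbers only):

constexpr double busBytesPerSec   = 8.0 * (1 << 30);                  // ~8 GB/s, PCIe 2.0 x16
constexpr double bytesPerFrame    = busBytesPerSec / 60.0;            // ~136 MB per 60 Hz frame
constexpr double bytesPerTexture  = 2048.0 * 2048.0 * 4.0;            // 16 MB, 2048x2048x32bpp
constexpr double texturesPerFrame = bytesPerFrame / bytesPerTexture;  // ~8.5, so about 8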

[QUOTE=zipponwindproof;1292947]…loaded 256 copies of the same model, each with 3 DXT5 2048x2048 textures.
Each texture is 4,194,304 bytes, so 256 * 3 * 4,194,304 ≈ 3.2 GB.[/QUOTE]

First, does each model “really” have unique textures? That is, you’ve got 768 unique textures? Do they really need to be unique? (I don’t know your problem domain; maybe they do.)

Second, if you render the textures minified and care about visual quality, you’re going to want MIPmaps on those textures. That’d be 5.33 MB/texture * 768 = 4 GB of texture. Without MIPmaps, it’s 3GB (where GB = 2^30).

All rows of models load sequentially without freezing, and then the last row of model is loading and I get slowdown.
nvidia-smi reports me my app uses 2956MB + Xorg it makes 3010MB/3011MB (I have GTX1060 3GB).

That sounds about right. As you’ve discovered, you’re overrunning GPU memory.

If you care about performance, you need to scale back on the amount of GPU texture memory you’re using (either by reducing the max resolution of your textures, reducing the number of textures simultaneously in existence, and/or reducing the bytes/texel of the texture formats that you’re using).

What happens when my VRAM is used to the maximum and I run out of it? Does it swap to RAM?

You’ve got it. That’s exactly what happens, to the detriment of your rendering performance (**)

(**) …for textures that you are not managing the GPU residency of directly using bindless texture residency routines. If you were, you’d get GL_OUT_OF_MEMORY for some of your calls before you got to the end of your list. Making the texture GPU resident with glMakeTextureHandleResidentARB() is probably a decent alternative to pre-rendering with the texture to force it to be allocated on and uploaded to the GPU.
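For reference, the residency calls look like this (requires the ARB_bindless_texture extension):

GLuint64 handle = glGetTextureHandleARB(tex);   // handle for an already-created texture
glMakeTextureHandleResidentARB(handle);         // forces it to be allocated/uploaded on the GPU
// ... render using 'handle' in your shaders ...
glMakeTextureHandleNonResidentARB(handle);      // when streaming the texture back out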

What if I init app and do not know about the models (what textures resolutions will be needed) which will be loaded in the future

See Alfonse’s response.

If consistent frame rendering performance is important to you, you want to preallocate your GL resources and re-task them at runtime (i.e. just change the content). However, if lurches in your frame rate are not a problem for your app, then by all means dynamically create and destroy resources at render time.

What I dont fully understand is how to think about gpu and my game in terms of bandwidth. How to calculate it, what numbers should I take care of?

Ultimately what’s important to you isn’t bandwidth, it’s frame time. What you need to decide is: 1) how many msec of total frame time do I have to render a frame (typically < 16.66ms for 60Hz LCDs), and 2) how much of that can I afford to use for subloading textures to the GPU (e.g. 3 ms/frame).

Then you provide a knob in your application that allows you to tune up/down how much texel data per frame you will upload to the GPU (probably a number given in MB/frame – this is where the bandwidth comes in). That said, you’re going to tune this knob to get the frame time consumption that you’ve decided on (e.g. 3ms/frame).

In doing this, you’re going to get a good feel for what kind of GPU upload bandwidth you can get given your upload technique, the GL driver you’re using, and the hardware (CPU, GPU, PCIe bus, CPU memory, etc.).
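A sketch of what that knob could look like (PendingUpload and uploadToTexture are placeholders for however you queue and perform your subloads):

#include <cstddef>
#include <deque>

struct PendingUpload { GLuint texture; size_t bytes; /* PBO handle, offsets, ... */ };

void uploadToTexture(const PendingUpload& job);   // e.g. glCompressedTexSubImage2D from a PBO

size_t uploadBudgetBytes = 3u * 1024 * 1024;      // the knob; start at a few MB per frame

void serviceUploadQueue(std::deque<PendingUpload>& queue)
{
    size_t spent = 0;
    while (!queue.empty() && spent + queue.front().bytes <= uploadBudgetBytes)
    {
        PendingUpload job = queue.front();
        queue.pop_front();
        uploadToTexture(job);
        spent += job.bytes;
    }
    // Tune uploadBudgetBytes until the *measured* frame-time cost matches your
    // target (e.g. ~3 ms/frame), rather than trusting theoretical bus bandwidth.
}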

For now only number I am checking is VRAM usage. But I plan to use deferred rendering and people say how gbuffer should be thin.

Deferred rendering typically has fatter framebuffers (more bytes/pixel) than standard forward rendering framebuffers. So what people are saying is just a recommendation to make your GBuffer as thin as possible.

That said, the space consumption for your framebuffer is typically much less than the space you’ll be consuming on the GPU for textures and buffer objects (on a desktop GPU anyway). Just estimate it, and you’ll see that it’s pretty small by comparison.

Well just run the numbers. PCI/e 2.0 has a theoretical transfer speed of 8GB/sec. If you want to run at 60fps, that means that you have 1/60th of that number per frame. So that’s 136MB/frame.

Now, a 2048x2048x32bpp texture takes up 16MB of storage. So, if you want to transfer some data in one frame, and have it available by the time the next frame starts, you can transfer up to 8 such textures.

Ultimately what’s important to you isn’t bandwidth, it’s frame time. What you need to decide is: 1) how many msec of total frame time do I have to render a frame (typically < 16.66ms for 60Hz LCDs), and 2) how much of that can I afford to use for subloading textures to the GPU (e.g. 3 ms/frame).

Now it makes sense, thanks for clarification.

First, does each model “really” have unique textures? That is you’ve got 768 unique textures? Do they really need to be unique? (I don’t know your problem domain; maybe they do.)

Second, if you render the textures minified and care about visual quality, you’re going to want MIPmaps on those textures. That’d be 5.33 MB/texture * 768 = 4 GB of texture. Without MIPmaps, it’s 3GB (where GB = 2^30).

Yeah, it’s just a nonsense test app; in a real scenario I will bind each texture once and draw multiple models with it instead.

It works well, even with buggy test code. Here is a video preview: https://www.youtube.com/watch?v=qIEYCiR0XMc
I will add a pool of PBOs and reuse them. Thanks for all the suggestions.