What is the best way to load textures asynchronously?

I have a model with relatively large (~66MB compressed) textures. I’m loading these textures into RAM on a separate thread. However, my call to glTexImage2D() results in a noticeable stutter, blocking the application for around one second. I’d like to get rid of this, so these uploads don’t interfere with my UI.

Here’s what I’m currently doing:

// `data` comes from a second thread.
uint32 texture_id;
glGenTextures(1, &texture_id);

GLenum format = GL_RGB;
if (*n_components == 1) {
  format = GL_RED;
} else if (*n_components == 3) {
  format = GL_RGB;
} else if (*n_components == 4) {
  format = GL_RGBA;
}

glBindTexture(GL_TEXTURE_2D, texture_id);
glTexImage2D(
  GL_TEXTURE_2D, 0, format, *width, *height, 0, format, GL_UNSIGNED_BYTE, data
);
glGenerateMipmap(GL_TEXTURE_2D);

glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_REPEAT);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

free_image(data);

After some research, I’ve identified two possible approaches.

  1. Creating another OpenGL context on a new thread, and performing texture uploads on that thread as I am currently. I’m happy to do this, but this sounds like some effort, so I would like to confirm that this is the best way of solving my problem.

  2. I’ve read that PBOs could help with this issue, as they would not block on glTexImage2D(). However, as I understand it, my application would still effectively freeze, as I can’t draw anything while texture data is being loaded into the PBO.

If anyone has any tips about the best way of loading textures without blocking the UI or introducing stuttering, I’d appreciate it a lot!

An alternative approach is:

  1. The main thread calls glMapBuffer to map the PBO. It can unbind the buffer once it has been mapped (or, with OpenGL 4.5, it can use glMapNamedBuffer to avoid binding it at this stage).
  2. The loading thread loads the data into the mapped region.
  3. The main thread unmaps the buffer, binds it to GL_PIXEL_UNPACK_BUFFER and calls glTexImage2D.

This way, the loading thread doesn’t need a context. It just needs to be able to synchronise with the main thread.

Thank you @GClements for your reply, it’s been very helpful. I’ve successfully implemented the approach you described, and it makes a lot of sense. To summarise:

  1. On a loading thread, I am loading the texture data from the images on disk.
  2. When that is done, the main thread creates a PBO for each texture and saves a pointer to its memory.
  3. The loading thread copies the texture data it previously loaded to each respective PBO.
  4. When that is done, the main thread creates a texture (glGenTextures) and copies the image data to it from the PBO (glTexImage2D(…, 0)).

However, I still get a significant stutter on step 4 of up to 2.5 seconds (for 15 textures). I’ve traced this to the glGenTextures call, which can take up to 220ms each time!

Therefore, I ask a follow-up question: what is the best way to create these texture names? Creating them when I load each texture seems to be very slow. Is it common to create a large number of texture names when the program starts (glGenTextures(128, …)), which is presumably faster, and take one from that list when one is required?

I’m very grateful for any advice!

That suggests that the call is causing synchronisation, i.e. waiting for all pending commands to complete as if you called glFinish.

Is this with a core or compatibility profile? In the core profile, texture names must be allocated with glGenTextures; glBindTexture will generate a GL_INVALID_OPERATION error if you use a name not so allocated. In the compatibility profile (and pre-3.1), glGenTextures is just a convenience function so you don’t have to keep track of which names are in use.

So it’s possible that glGenTextures performs an implicit glFinish to allow for the possibility that a pending command creates a texture using a name not allocated with glGenTextures.

That would probably be the simplest solution. Bear in mind that glGenTextures doesn’t create textures (glBindTexture does that when called with a name which doesn’t refer to an existing texture object), it just allocates names which are guaranteed to be unused at the time of the call.

Alternatively, if you don’t actually need the compatibility profile (or to support pre-3.1 OpenGL), try creating a core profile context. That might avoid the issue as glGenTextures can assume that any name not returned by a previous call is available.

Note that CPU timing in OpenGL is somewhat difficult to meaningfully perform, particularly around individual functions. A call to glGenTextures could be stalling for reasons that have nothing to do with that particular function.

One way to test this is to remove the texture upload calls themselves and re-time how long is spent in the function.

Thank you @GClements and @Alfonse_Reinheart — you were absolutely correct, the “glGenTextures” call is taking so long because it’s waiting for other work to be done. My code looks like this:

uint32 TextureSetAsset::generate_texture_from_pbo(
  uint32 *pbo, int32 width, int32 height, int32 n_components
) {
  uint32 texture_id;
  glGenTextures(1, &texture_id);
  glBindTexture(GL_TEXTURE_2D, texture_id);
  GLenum format = Util::get_texture_format_from_n_components(n_components);

#if 1
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, *pbo);
  glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
  glTexImage2D(
    GL_TEXTURE_2D, 0, format, width, height, 0, format, GL_UNSIGNED_BYTE, 0
  );
  glGenerateMipmap(GL_TEXTURE_2D);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
  glDeleteBuffers(1, pbo);
#endif

  return texture_id;
}

This function runs in a loop, 15 times. If I replace “if 1” with “if 0”, glGenTextures takes no time at all. With the “if 1” block enabled, it takes up to 220ms.

According to my understanding, this block essentially copies data from the PBO into a texture. Is it possible that, when doing this for large textures 15 times in a row, this simply takes more work than I can do in one frame? Would the only way be splitting this work across multiple frames?

Alternatively, do you think this should all be able to run in one frame, and the problem is somewhere else? I am kind of surprised that copying image data from the PBO into a texture would take 200ms each time.

I’m very grateful for any tips. In the meantime, I’ll do my best to investigate more.

P.S.: I am, in fact, running in core mode.

Edit: To get a bit more specific, my textures total 103 megabytes. Would it be reasonable to copy that much from a PBO to a set of textures in a single frame?

You should not be allocating a single buffer object, uploading its data to a single texture, and then deallocating it. Keep that buffer around. In fact, allocate a single large buffer at the beginning of your application and keep using it for shuffling data to the GPU (with appropriate fences and synchronization, since it should be persistent-mapped). Never delete it until application shutdown.

Quite possibly. It isn’t necessarily just “copying”. Textures aren’t necessarily stored in raster (row-by-row) order, and 24-bpp textures are typically expanded to 32-bpp. The glGenerateMipmap call will have some overhead.

For asynchronous loading, you might want to store all of the mipmap levels (this only adds 33% to the size of the data), uploading the highest (lowest-resolution) levels first and limiting the rate at which you upload data. You can use glTexParameter(GL_TEXTURE_MIN_LOD) to avoid sampling levels which haven’t been loaded yet.

Also: does it make a difference if you wait a frame between uploading the texture and using it for rendering? There might be internal (GPU) synchronisation at play.

Also, if you want async uploading, it’s absolutely imperative that you ensure that the pixel format of your data exactly matches what the GPU expects. If the CPU has to get involved, you can kiss any hope of performance goodbye.

So no RGB textures. Make sure that you ask what pixel formats your GPU prefers for your chosen internal format.


Beyond the good suggestions you’ve gotten so far (matching pixel format, PBO), check out Buffer Object Streaming for efficient methods of transferring data from the CPU into that PBO that should avoid implicit synchronization.

If you’re not using a persistently mapped buffer, you should be mapping the buffer unsynchronized and/or orphaning to avoid implicit synchronization. I’d suggest avoiding any other form of MapBuffer, as they can lead to implicit sync (which can cause stuttering).

Also, don’t generate textures or allocate texture storage on-the-fly. Preallocate what you need at startup, force-render with them (to force the driver to really allocate the storage under-the-covers), and then at runtime, only subload into pre-created pre-allocated texture storage.

Be aware that the GPU cannot immediately render from the texel data you give it as-is. It needs to tile that texel data first. There is some time/cost associated with this. So if avoiding stuttering is a priority, I would suggest adding a budget for how much frame time (msec) you want to spend uploading texel data to the GPU and apply this using metrics you collect on how fast (e.g. bytes/msec) you can upload texel data to your GPU.

Thank you so much @GClements, @Alfonse_Reinheart and @Dark_Photon for your very helpful suggestions.

I’d like to summarise what I’ve done, both for myself and others reading this topic in the future:

  1. Instead of allocating a PBO for each texture, I now allocate a single large PBO at the start of my program. I keep this PBO around at all times, and I load all textures into/from it. This has sped things up and also simplified my code. I created its storage and mapped it with the flags “GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT”. As far as I understand, I should still implement some synchronisation, so that data is not being written and read at the same time.

  2. I’ve changed from a format of GL_RGBA to GL_BGRA. This has sped things up noticeably. For now I’m swapping R and B when I need to in my shaders, but I think it would probably be better/faster to read the data from the image directly as BGRA?

  3. Instead of creating texture names each time, I’ve created 128 at the start of my program, which I then use whenever I need a new texture. As some of you have helpfully pointed out, the large stutter on glGenTextures() was simply due to the fact that this call was blocking to wait for all the texture copying done before it. As a result, the stuttering has moved to my glfwSwapBuffers() call.

Overall, the stutter has reduced to around 700ms. As this now happens in the glfwSwapBuffers() call, I assume this is simply how long it takes to copy and prepare ~100MB of image data from the PBO to the textures. Annoyingly, this stutter still happens even if I don’t use the textures afterwards at all.

I’m loading 15 textures, so even if I were to load a single texture each frame, it would presumably still stutter for ~46ms, which would be visible. Therefore, I’m going to keep looking for ways to speed up this upload. I hope I don’t have to resort to creating another OpenGL context on a second thread!

If anyone has other suggestions, I’d naturally love to hear them! :slight_smile:

Edit: I’ve looked it up, and it seems like a normal texture upload speed should be around 5+GB/s, so I must still be doing something very wrong!

Edit: Some of my current code:

Creating a persistent PBO:

  glGenBuffers(1, &this->pbo);
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, this->pbo);
  GLbitfield flags = GL_MAP_WRITE_BIT | GL_MAP_PERSISTENT_BIT | GL_MAP_COHERENT_BIT;
  glBufferStorage(GL_PIXEL_UNPACK_BUFFER, this->total_size, 0, flags);
  this->memory = glMapBufferRange(
    GL_PIXEL_UNPACK_BUFFER, 0, this->total_size, flags
  );

Copying image data to the PBO:

    // For all 5 textures.
    unsigned char *image_data = ResourceManager::load_image(
      this->albedo_texture_path, &this->albedo_data_width,
      &this->albedo_data_height, &this->albedo_data_n_components, true
    );
    this->albedo_pbo_idx = persistent_pbo->get_new_idx();
    memcpy(
      persistent_pbo->get_memory_for_idx(this->albedo_pbo_idx),
      image_data,
      persistent_pbo->texture_size
    );
    ResourceManager::free_image(image_data);

Copying from the PBO to the texture:

  this->material_texture = global_texture_pool[global_texture_pool_next_idx++];
  glBindTexture(GL_TEXTURE_2D_ARRAY, this->material_texture);

  glTexImage3D(
    GL_TEXTURE_2D_ARRAY, 0, GL_RGBA,
    persistent_pbo->width, persistent_pbo->height,
    5, 0, GL_BGRA, GL_UNSIGNED_BYTE, 0
  );

  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_WRAP_S, GL_REPEAT);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_WRAP_T, GL_REPEAT);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);

  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, persistent_pbo->pbo);

  // This happens 5 times (once for each sub-texture).
  glTexSubImage3D(
    GL_TEXTURE_2D_ARRAY, 0, 0, 0, 0,
    this->albedo_data_width, this->albedo_data_height,
    1, GL_BGRA, GL_UNSIGNED_BYTE,
    persistent_pbo->get_offset_for_idx(this->albedo_pbo_idx)
  );
  
  glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
  glGenerateMipmap(GL_TEXTURE_2D_ARRAY);

Which GPU, driver, and driver version are you working with?

You appear to be allocating texture storage at runtime (both with glTexImage3D() and glGenerateMipmap()). You should move all of this storage allocation to startup. Don’t forget to prerender with the texture at startup, after the storage allocation request, to force the driver to actually perform the texture storage allocation; otherwise your first render with that texture after startup could be unexpectedly time-consuming.

You also appear to be subloading an entire MIP level all at once.
Depending on the size of a MIP level and your max effective upload rate, this may be too much.

The runtime MIPmap allocation and generation that you’re doing (glGenerateMipmap()) isn’t free either. Preallocate MIP storage on startup (via glTexStorage3D()). Then bench these 3 solutions:

  1. Subloading base MIPmap (level 0) only without glGenerateMipmap(),
  2. Subloading base MIPmap (level 0) only with glGenerateMipmap() (what you’re doing now), and
  3. Subloading all MIPmaps, without glGenerateMipmap().

Re #3, you may find that you only need to subload the first N MIPmaps rather than all of them for your purposes (that is, you don’t have to upload them all). You can then tell OpenGL which subset of the MIPmaps you uploaded via the base and max level texture parameters, and it will constrain texture sampling to just those levels.

Finally, if you post a short, standalone GLUT test program that illustrates your problem, folks can download and try it, to provide you further ideas and feedback.


FWIW, I don’t think a coherent mapping helps you. You may still need a call to glFlushMappedBufferRange or glMemoryBarrier between writing to the PBO and the glTexImage* call. The coherent flag only means that the writes will become visible to the GPU “eventually”.

If the data stored in the file is RGBA, you’re going to have to swap at some point: either in the application, or on upload, or in the shader. If the GPU can swizzle it “for free”, that would be the way to go. If it can’t, then it’s better to do it once than on every access.

The buffer swap is where you expect synchronisation to occur if you’re giving the GPU more work than it can handle. As mentioned earlier, texture upload requires some work from the GPU. It’s quite possible that the GPU can do that asynchronously, but that doesn’t help if you try to read the texture as soon as you’ve uploaded it. There’s a saying that “the best time to plant a tree is twenty years ago”; with graphics, the best time to upload a texture is a few frames before you need to read it.

Does removing the glGenerateMipmap call have much effect?


@Dark_Photon @GClements, thank you so much for your feedback. With your help, I’m pretty sure I’ve figured out the issue! I tried disabling mipmapping, and my textures loaded pretty much instantly. I decided something must be wrong with the glGenerateMipmap() call. I followed @Dark_Photon’s advice and allocated mipmaps on startup, up to a certain maximum level (4 right now). I’ve changed my code to look as follows.

Texture generation on startup:

glGenTextures(global_texture_pool_size, global_texture_pool);
for (uint32 idx = 0; idx < global_texture_pool_size; idx++) {
  glBindTexture(GL_TEXTURE_2D_ARRAY, global_texture_pool[idx]);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_WRAP_S, GL_REPEAT);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_WRAP_T, GL_REPEAT);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
  glTexParameteri(GL_TEXTURE_2D_ARRAY, GL_TEXTURE_MAX_LEVEL, global_texture_mipmap_max);
  glTexStorage3D(
    GL_TEXTURE_2D_ARRAY, global_texture_mipmap_max + 1, GL_RGBA8,
    state->persistent_pbo.width, state->persistent_pbo.height, 5
  );
}

Copying data:

glBindTexture(GL_TEXTURE_2D_ARRAY, this->material_texture);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, persistent_pbo->pbo);

glTexSubImage3D(
  GL_TEXTURE_2D_ARRAY, 0, 0, 0, 0,
  this->albedo_data_width, this->albedo_data_height,
  1, GL_BGRA, GL_UNSIGNED_BYTE,
  persistent_pbo->get_offset_for_idx(this->albedo_pbo_idx)
);
// ...and 4 more.

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
glGenerateMipmap(GL_TEXTURE_2D_ARRAY);

This code seems to be visually identical to my old code, but textures now load instantly. There’s no stutter whatsoever, which I’m very pleased about, since it’s exactly the result I’ve been chasing for the past few days!

I may still uncover another issue: the longest frame took 12ms, I don’t yet have synchronisation in place, and I’m still getting some “Pixel transfer is synchronized with 3D rendering” warnings from OpenGL. But performance-wise, everything is exactly as I wished for it to be.

I’m very grateful for your help, I would have struggled a lot without it, so thank you so much for taking the time out of your day to help me with this issue, it’s brightened my day a lot! :slight_smile: I’ll try to post a video of the result soon.

Here’s a video of the result — doesn’t look that impressive in the video but it’s buttery smooth. Thanks again! https://www.youtube.com/watch?v=I6EE1jo51fE&feature=youtu.be

Sweet! Glad you got it figured out!

Ah! NVIDIA GL drivers. I get that too with some of the ops I’m doing. Great performance, and it’s not immediately obvious what needs to change to clear up this GL Debug Output perf warning. So I’m currently just ignoring it. I suspect it has something to do with the GPU’s render and copy queues synchronizing, though I’m not certain.