Updating a texture in the background / slow upload?


I have an application that constantly displays a texture, which gets an occasional update. I update the texture through a persistently mapped PBO, which is modified from a second thread. In the “main” thread (I currently use a single GL context), the texture is updated using glTextureSubImage2D. That takes quite some time (GeForce 970, 15 ms for a 1920x1080 R16F texture, measured using Nsight). That leads to my first question: why is this so slow?

Second, I was thinking of “double-buffering” the displayed texture: rendering one, updating the other, and flipping once the upload is complete. But is there a way to know when to “flip” the textures? Changing the texture right after calling glTextureSubImage2D won’t help if the call is async (as I understand it), since the renderer would just wait for the upload to finish. Would a second context with sharing help here? Or is there another way to update a texture in the background?

Thanks for any hints.

One possibility is that the driver is blocking until any rendering commands using the previous texture contents have completed.
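
A quick back-of-the-envelope check suggests the raw transfer size alone is unlikely to explain 15 ms. The sketch below just does the arithmetic for the numbers in the question: a tightly packed 1920x1080 R16F image is 4,147,200 bytes, and moving that in 15 ms works out to about 276 MB/s. Assuming the card sits on a PCIe 3.0 x16 link (nominally around 16 GB/s per direction), that is far below what the bus can do, so a stall or a conversion somewhere is a more plausible suspect. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Size in bytes of a tightly packed 2D image (no row padding assumed).
constexpr std::uint64_t image_bytes(std::uint64_t width, std::uint64_t height,
                                    std::uint64_t bytes_per_texel) {
    return width * height * bytes_per_texel;
}

// Effective throughput in MB/s (1 MB = 1e6 bytes) for a transfer of
// `bytes` that took `milliseconds`.
inline double throughput_mb_per_s(std::uint64_t bytes, double milliseconds) {
    assert(milliseconds > 0.0);
    return (static_cast<double>(bytes) / 1e6) / (milliseconds / 1e3);
}

// Prints the estimate for the case in the question:
// 1920x1080, R16F (2 bytes per texel), 15 ms measured.
inline void print_upload_estimate() {
    const std::uint64_t bytes = image_bytes(1920, 1080, 2);
    std::printf("%llu bytes, %.2f MB/s\n",
                static_cast<unsigned long long>(bytes),
                throughput_mb_per_s(bytes, 15.0));
}
```

Calling print_upload_estimate() from main() prints both figures. If the measured time includes waiting for earlier draw calls to finish, the real copy is probably much shorter than 15 ms.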

glTextureSubImage2D() is synchronous. You’re guaranteed to be able to free or modify the memory (whether client-side or a buffer object) as soon as it has returned (with certain caveats, e.g. mapping buffers with the unsynchronised flag).

If you’re trying to perform updates without stalling either side, you need enough storage to match the depth of the pipeline. So two copies isn’t guaranteed to be enough if the interval between flips is short. If you want to modify GPU-side data without blocking the CPU thread, you need to use fences to ensure that the GPU has finished using the data you’re trying to modify.
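
To make the “enough storage plus fences” idea concrete, here is a minimal sketch of a ring of N slots, each guarded by a fence. It is not tied to any particular driver behaviour. `FencedRing`, `Backend` and `FakeBackend` are hypothetical names; the backend exists only so the logic can run without a GL context. With real OpenGL, `insert()` would wrap glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0), `signaled()` would wrap glClientWaitSync() with a zero timeout, and `release()` would wrap glDeleteSync().

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

// Ring of N slots (e.g. PBO regions or textures), each guarded by a fence.
// Backend is a hypothetical adapter around the sync-object API.
template <typename Backend, std::size_t N>
class FencedRing {
public:
    explicit FencedRing(Backend& backend) : backend_(backend) {}

    // Returns the next slot if the GPU is done with it, or nullopt if it is
    // still in flight. The caller can then skip the update instead of stalling.
    std::optional<std::size_t> try_acquire() {
        Slot& slot = slots_[next_];
        if (slot.fence && !backend_.signaled(*slot.fence)) return std::nullopt;
        if (slot.fence) {
            backend_.release(*slot.fence);
            slot.fence.reset();
        }
        const std::size_t index = next_;
        next_ = (next_ + 1) % N;
        return index;
    }

    // Call right after issuing the GPU commands that read slot `index`.
    void mark_in_use(std::size_t index) {
        assert(index < N);
        if (slots_[index].fence) backend_.release(*slots_[index].fence);
        slots_[index].fence = backend_.insert();
    }

private:
    struct Slot {
        std::optional<typename Backend::Fence> fence;
    };
    Backend& backend_;
    std::array<Slot, N> slots_{};
    std::size_t next_ = 0;
};

// Stand-in backend (an assumption for illustration): fences are counters,
// and a fence is signaled once `completed` has caught up with it.
struct FakeBackend {
    using Fence = int;
    int issued = 0;
    int completed = 0;
    Fence insert() { return ++issued; }
    bool signaled(Fence f) const { return f <= completed; }
    void release(Fence) {}
};
```

With N = 2, a third try_acquire() returns nullopt until the first fence has signaled, which is the “two copies isn’t guaranteed to be enough” case. With real sync objects, remember that a fence only signals once the commands before it have been flushed (glFlush, or GL_SYNC_FLUSH_COMMANDS_BIT on the first wait).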

Hi, thanks for the quick reply.

What do you mean by “enough storage”? My goal is to not stall the GPU; the CPU is not a problem right now (the CPU thread updates the PBO once a second). If glTextureSubImage2D is indeed synchronous, would I be able to flip the displayed texture right after it returns? I thought the upload was async, so that I could do some more CPU work and use a fence to ensure the upload has finished before using the texture.

Would it be possible / feasible to do the upload from a second shared context, use a fence there and use the result on the second texture?

The time between issuing a command (calling an OpenGL function) and the GPU completing execution of that command may be long. Like, several frames. Between those two points, the command is “pending”. If you try to modify something (e.g. a texture) that is used by a pending command, the driver may just block until it has finished executing the command. So if you create a texture, draw something using that texture, modify it, draw something else, modify it again, draw something else, …, and you don’t want any part of that process to stall, the hardware has to store every version of that data required by a pending command.

In some cases, it may automatically allocate storage for some of the intermediate versions. In other cases, it will just wait until any pending commands which were using the thing that you’re trying to modify have completed. Which could take a long time. If you don’t want that to happen, then one way is to simply never modify anything; just create a new object and use that instead.

Not stalling the GPU boils down to sending it commands at least as fast as it is executing them. So you may need to avoid stalling the CPU in order to avoid stalling the GPU.

Yes, you can flip the displayed texture right after glTextureSubImage2D returns. The texture won’t necessarily be updated on the GPU for some time after the call returns, but any commands which are enqueued after the glTextureSubImage2D command won’t be executed until after the texture has been updated. Also, if the data source is a PBO, any modification to the PBO’s contents needs to wait until the texture has been updated from the PBO.

If you replace the entire PBO with glBufferData(), the driver may choose to “orphan” the existing data store and allocate a new one. The data will immediately be copied from client memory into the new store; the old store will be freed automatically once any pending commands (e.g. glTexSubImage2D) using that data have completed. If you replace a portion with glBufferSubData(), orphaning is unlikely; the driver will wait until pending commands using the modified region have completed before glBufferSubData() returns. Similarly, if you map the buffer for writing, the driver should wait until pending commands have completed. With persistent mappings, you have to handle synchronisation yourself.

Copies from client memory to GPU memory are synchronous. Copies from GPU memory to GPU memory are appended to the command queue. The driver doesn’t wait for the copy to complete before the corresponding function returns, but commands enqueued by subsequent functions won’t start until the copy has completed (unless the driver can determine that the order doesn’t matter). If you’re using PBOs, it’s the gl[Get]Buffer[Sub]Data() or glMapBufferRange functions which will introduce CPU-GPU synchronisation issues, not the glTex[Sub]Image() functions.

If you’re using multiple threads, you may not need a fence; you can just let the uploading thread block until the upload has completed. A fence is more useful with a single thread where you’re trying to determine whether the GPU has finished using some data, so that you can modify it without blocking.
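
As a rough sketch of that multi-threaded variant: the uploader fills the texture that is not being displayed, blocks until the upload is done, and only then publishes the new “front” index for the render thread to pick up. Everything here is hypothetical naming (`TextureFlipper`, `upload_and_flip`, `blocking_upload`). With a second shared context, the callback could be something like glTextureSubImage2D followed by glFinish(), or a fence plus glClientWaitSync(). The sketch assumes exactly one uploader thread.

```cpp
#include <atomic>
#include <cassert>
#include <functional>

// Two textures: the render thread samples textures[front()],
// the uploader writes into textures[back()].
class TextureFlipper {
public:
    int front() const { return front_.load(std::memory_order_acquire); }
    int back() const { return 1 - front(); }

    // Called by the single uploader thread once the back texture is complete.
    void publish_back() {
        const int current = front_.load(std::memory_order_relaxed);
        front_.store(1 - current, std::memory_order_release);
    }

private:
    std::atomic<int> front_{0};
};

// Uploader side. `blocking_upload(slot)` must return only when texture
// `slot` is fully updated; the flip is published strictly after that.
inline void upload_and_flip(TextureFlipper& flipper,
                            const std::function<void(int)>& blocking_upload) {
    assert(blocking_upload && "an upload callback is required");
    const int target = flipper.back();
    blocking_upload(target);
    flipper.publish_back();
}
```

The render thread simply reads flipper.front() once per frame and binds that texture. One caveat: frames already queued may still sample the old front texture for a while, so the next upload into it may itself have to wait. With updates once a second, that is unlikely to matter in practice.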

By default, OpenGL behaves as if everything executes immediately. Any deferral (backgrounding) is transparent. Functions will wait if they need to wait. If you’re copying data back to client memory, that has to wait until the data is available. If you’re copying data from client memory, either the driver has to copy that to temporary storage, or whatever you’re overwriting must be “done with”, or the function will wait until that is the case.

[QUOTE=pettersson;1290378]I update the texture using a PBO (persistent mapped)… from a second thread. In the “main” thread (I currently use a single GL context), the texture gets updated using glTextureSubImage2D.

That takes quite some time (Geforce 970, 15ms for a 1920x1080 R16F texture, measured using Nsight). That leads to my first question: why is this so slow?[/QUOTE]

Post some code, detail your timing method, and describe what you’ve already done to try to get a line on the problem. There are all kinds of possible reasons.

Better yet, post a simple, stand-alone GLUT or GLFW-based test program that illustrates the problem.

If you haven’t already, I’d recommend that you read these:

- Buffer Object Streaming (GLWiki)
- Asynchronous Buffer Transfers (OpenGL Insights)

GPUView can give you a deeper look at how your workload is being processed by the GPU and layers of the driver. But let’s first establish whether your timing method is reasonable.