glTexXX3D upload speed from another thread

Hi All,

While trying to improve our volume rendering quality and increase the rendering speed (especially on older GPUs), I have added a gradient texture (instead of computing the gradient in the shader). The gradient texture is computed more accurately and is smoothed to give a better visual result.

I’m using a double-buffered 3D texture and have added a second thread, whose context is shared with the main rendering thread’s context, that updates the “back” texture.
This works but causes “hiccups” in the main rendering thread while the volume is being updated. These hiccups are not caused by any wait in my application; they happen inside the driver.

Reducing the size of the volume eliminates the problem. Also, the updater thread calls glFinish to make sure the data is completely uploaded, and only then signals the main thread.

We are using at least 4 contexts and they are all shared.

Do you know why this is happening?
Is there an internal copy between the shared contexts?
Any suggestions on fixing it?

Configuration:
Texture size: 256^3 - RGBA unsigned byte.
CPU: Intel quad-core Extreme.
GPU1: Nvidia Quadro FX 4600 (G80)
GPU2: Nvidia GTX 260
Windows XP 32-bit

Any help appreciated,
Ido

Did you try updating the “back” texture little by little with smaller subsets, using glTexSubImage3D each frame?

No, but I’ve tried loading a smaller texture several times in the uploader thread, with the same effect.

Maybe I can try uploading the texture over several cycles: right now it is updated every 300 ms, so I could upload 1/3 every 100 ms. But I would prefer a more robust way to do it, or at least to understand why uploading in a second thread causes the main thread to hang.
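
For reference, here is a minimal sketch of how the volume could be split into Z slabs, one glTexSubImage3D call per cycle. The names `Slab` and `splitIntoSlabs` are made up for illustration, the sketch assumes tightly packed RGBA8 data, and the GL call appears only as a comment because it needs a current context.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one Z slab of a tightly packed 3D texture.
struct Slab {
    int zOffset;             // first slice of the slab
    int depth;               // number of slices in the slab
    std::size_t byteOffset;  // offset of the slab in the source array
    std::size_t byteSize;    // size of the slab in bytes
};

// Split a width x height x depth volume into `parts` slabs of nearly equal
// depth. Any remainder slices are given to the first slabs.
std::vector<Slab> splitIntoSlabs(int width, int height, int depth,
                                 int bytesPerTexel, int parts) {
    assert(parts > 0 && parts <= depth);
    const std::size_t sliceBytes =
        static_cast<std::size_t>(width) * height * bytesPerTexel;
    std::vector<Slab> slabs;
    int z = 0;
    for (int i = 0; i < parts; ++i) {
        const int d = depth / parts + (i < depth % parts ? 1 : 0);
        slabs.push_back({z, d, z * sliceBytes, d * sliceBytes});
        // With a current GL context, each slab would be uploaded as:
        // glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, z, width, height, d,
        //                 GL_RGBA, GL_UNSIGNED_BYTE, volume + z * sliceBytes);
        z += d;
    }
    return slabs;
}
```

For the 256^3 RGBA volume split in three, this gives slabs of 86, 85 and 85 slices, a bit over 21 MiB each.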

Ido

Some weird info:
Using a PBO in the second thread helps a bit. The PBO is mapped, filled, unmapped and then passed to glTexsub(…,0), all in the update thread, yet it still gives slightly better results.

I will also try using a PBO only in the second thread (without context sharing and all the nice stuff).

Ido

This may be a stupid suggestion, but why do you update the texture data using PBOs in a second thread? The main point of this extension is asynchronous, fast pixel data transfer, so I don’t see any advantage in doing it from another thread.

About your problem, sorry, I do not have any idea right now… Does it also happen on ATI hardware, for example (if you have any)?

EDIT:

I am not used to programming multithreaded applications, but since you have several threads accessing the OpenGL driver concurrently, don’t you think this may be due to some synchronisation or priority problem?

Hi,

  1. My current implementation doesn’t use a PBO.
  2. I believe it is a sync problem, but it occurs in the driver as far as I can see.
  3. If I wanted to use a PBO for async transfer, I would have to do it in another thread, because map, copy, unmap is not an asynchronous operation. The main advantage of a PBO, as I see it, is that you can map in the rendering thread, fill the buffer in a different thread, and unmap (and load to the texture) in the main thread, which should then be fast since all the data is already in card memory. The spec says it can also save a second copy of the texture (I don’t know where).

Ido

Some update:
I’ve divided the texture upload into several cycles. What I’ve noticed is that I only get a reasonable result when uploading a small chunk (1/10 of the volume) every 50 ms (with a 500 ms delay).
I’ve tried several variants and saw that when I upload too much too fast, the main rendering thread chokes.
Using a PBO in the context/thread boosts performance by a factor of 2-3 (which I can’t explain).
I still have to try using a PBO as my “back buffer”.
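
To put these numbers in perspective, here is a small sketch of the arithmetic (the function names are made up for illustration). A 256^3 RGBA8 volume is 64 MiB, so 1/10 of it every 50 ms is about 128 MiB/s sustained, and a full refresh takes ten cycles, which is the 500 ms delay.

```cpp
#include <cassert>
#include <cstddef>

// Size in bytes of a cubic volume with `edge` texels per side.
constexpr std::size_t volumeBytes(std::size_t edge, std::size_t bytesPerTexel) {
    return edge * edge * edge * bytesPerTexel;
}

// Sustained upload rate in MiB/s when `chunkBytes` are sent every `periodMs`.
double uploadRateMiBps(std::size_t chunkBytes, double periodMs) {
    assert(periodMs > 0.0);
    const double mib = static_cast<double>(chunkBytes) / (1024.0 * 1024.0);
    return mib * 1000.0 / periodMs;
}

// Time in ms until the whole volume has been refreshed once.
double fullRefreshMs(int chunksPerVolume, double periodMs) {
    return chunksPerVolume * periodMs;
}
```

By the same arithmetic, uploading the whole volume every 300 ms would need roughly 213 MiB/s.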

I’m still having problems with this.
Any help guys?
Ido

“Using PBO in the context/thread boost performance by a factor of 2-3. (which I can’t explain)”

IMO this is not surprising, since with PBOs the texture data upload is performed through DMA rather than by the driver itself.

By the way, the main thread stall might be caused by the driver stalling while you upload huge pieces of texture data without PBOs.
Since this is an operation that takes time, it may use up all the running time the driver is allowed… but I am sure it is more complicated than that.

About the blocking map operation: you can call glBufferData on the bound PBO to allocate a new memory area, then map that one. This way the mapping will be fast, since the new memory area is not involved in any drawing operations the GPU may still be performing, which is what causes the driver to stall.

In the render thread, map the PBO and leave it mapped, then pass the mapped pointer to the loader thread. The loader thread loads the resource through that pointer and, when it finishes its job, posts a notification to the render thread. Once per loop, the render thread checks for notifications and unmaps the pointer. Then use the data from the PBO as you wish (glTexImage or glDrawPixels). Keep in mind that the operation is still async: if you try to map the same PBO again right after the glTexImage call, you can expect a pipeline stall until the driver is done with that PBO. You can use two PBOs and swap them after use, or discard the data after use with glBufferData(NULL).
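
The handoff described above can be sketched roughly as follows. This is only an illustration under stated assumptions: a std::vector stands in for the pointer returned by glMapBuffer, the names `UploadJob`, `loaderMain`, `pollUpload` and `runHandoffDemo` are made up, and the GL calls are shown as comments because they need a current context.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Hypothetical handoff record shared by the render and loader threads.
struct UploadJob {
    unsigned char* mapped = nullptr;  // pointer the render thread got from glMapBuffer
    std::size_t size = 0;
    std::atomic<bool> ready{false};   // set by the loader when the copy is done
};

// Loader thread: touches only CPU-visible memory, never makes a GL call.
void loaderMain(UploadJob* job, const unsigned char* src) {
    assert(job->mapped != nullptr);
    std::memcpy(job->mapped, src, job->size);
    job->ready.store(true, std::memory_order_release);
}

// Render-thread side, called once per frame. Returns true when the data has
// arrived and the PBO can be unmapped and used as the texture source.
bool pollUpload(UploadJob& job) {
    if (!job.ready.load(std::memory_order_acquire)) return false;
    // With a current context the render thread would now do:
    // glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER);
    // glTexSubImage3D(GL_TEXTURE_3D, 0, 0, 0, 0, w, h, d,
    //                 GL_RGBA, GL_UNSIGNED_BYTE, nullptr);  // offset 0 into the PBO
    job.ready.store(false, std::memory_order_relaxed);
    return true;
}

// Runs one complete handoff; `fakePbo` stands in for the mapped PBO memory.
// Returns the number of frames the render loop polled before data arrived.
int runHandoffDemo(const std::vector<unsigned char>& src,
                   std::vector<unsigned char>& fakePbo) {
    fakePbo.assign(src.size(), 0);
    UploadJob job;
    job.mapped = fakePbo.data();
    job.size = fakePbo.size();
    std::thread loader(loaderMain, &job, src.data());
    int frames = 0;
    while (!pollUpload(job)) {  // a real loop would render a frame here
        ++frames;
        std::this_thread::yield();
    }
    loader.join();
    return frames;
}
```

The point of the design is that only the render thread ever talks to GL, so no shared contexts are needed; the loader thread just writes into memory.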

Hi,
Thank you all for the help.

Yooyo, that is exactly what I was telling dletozeun. I’m going to try it in the next few days (many tasks at once).
The weird thing is that even if I use a PBO in the same thread, it is faster than plain glTexImage. I would have imagined the driver handles this behind the scenes (maybe it’s because the load is async and doesn’t block until the main thread uses the texture?).

One question:
In the rendering thread, will doing unmap, copy, glTex, unmap return immediately? If I call glBindTex on the loaded texture while the data is still being swapped/loaded from the PBO to the texture, will it stall?

Thanks,
Ido

No… it should not stall, because GL commands are queued.

You can hit a stall only if you try to map a PBO that is currently in use by the driver, or if you call glGetTexImage on a currently bound texture_id that is the target of a pending PBO transfer.

If you notify the driver that you don’t need the previous buffer data (discard it by calling glBufferData(…, NULL)), then the glMap call should not stall. In that case you can hit other barriers, such as the driver running out of memory, so sooner or later you will hit a stall. This depends on the usage scenario and on CPU/GPU speed. If your app pumps too much data to the GPU and the GPU cannot process it fast enough, you can expect a stall.

In my experience, the best approach is to create several PBOs (4-8) and use them dynamically to load data. The big question is when it is safe to map a buffer again without hitting a stall. My guess is the next frame. If you still hit a stall in the next frame, your app is pumping more data than the GPU can process. Static PBOs are good because you can allocate them at app startup, so they use a fixed portion of memory, avoid VRAM fragmentation, …
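
A minimal sketch of the bookkeeping for such a pool, using the "safe again next frame" rule of thumb. The class `PboRing` and its methods are made up for illustration; it only tracks indices and frame numbers, and the rule itself is a guess rather than a guarantee from the driver.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bookkeeping for a fixed pool of PBOs allocated at startup.
// Only indices and frame numbers are tracked here; GL calls are left out.
class PboRing {
public:
    explicit PboRing(std::size_t count) : lastUsedFrame_(count, -1) {
        assert(count > 0);
    }

    // Returns the index of a buffer that was not handed to the driver during
    // `frame`, or -1 if every buffer in the pool was already used this frame
    // (mapping one of them now could stall).
    int acquire(long frame) {
        for (std::size_t n = 0; n < lastUsedFrame_.size(); ++n) {
            const std::size_t i = (next_ + n) % lastUsedFrame_.size();
            if (lastUsedFrame_[i] < frame) {
                next_ = (i + 1) % lastUsedFrame_.size();
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Record that buffer `index` was the source of a texture upload in `frame`.
    void markUsed(int index, long frame) {
        lastUsedFrame_[static_cast<std::size_t>(index)] = frame;
    }

private:
    std::vector<long> lastUsedFrame_;  // -1 means never used
    std::size_t next_ = 0;             // where the round-robin search starts
};
```

If `acquire` keeps returning -1, that is the sign the app is pushing more data per frame than the pool (and probably the GPU) can absorb.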

Many thanks for this great information, yooyo. I will try it ASAP and let you all know how it works.

Ido

Hi All,

I wanted to update you for future reference:
I’ve implemented the PBO approach with no context sharing between the threads: the render thread does one unmap, draw, map, and another thread fills the buffer. I’m only using one PBO, but I see a tremendous speedup and much lower CPU usage while drawing and updating the texture. Currently I update the entire 256^3 RGBA texture with no problems on the FX4600. I will try using smaller chunks and see if it makes any difference (mostly on our aging FX1700).
The PBO method works very well.

Additional question:
I remap the buffer once I’ve finished drawing, so the data can be uploaded while the main thread is resting (DMA from memory to texture). Do map/unmap always return immediately (right now they do)? Should I invalidate the data first, using glBufferData(…,0), so that map always returns a new pointer?

Many thanks,
Ido

“does the map/unmap returns always immediately (now it does)? should I invalidate the data first, using glBufferData(…,0) so it always return with new pointer?”

As we already said, mapping a buffer object is not instantaneous: the driver has to wait until the GPU releases it, and this may stall the application if it is single-threaded.
The unmap operation should return immediately.

To avoid application stalls, you can use the ping-pong method like yooyo said, or invalidate the buffer object’s data by reallocating it with glBufferData.

The latter method is not very attractive if you only have small subsets of data to update, since all of the buffer’s data is lost and needs to be uploaded again.