Has anybody ever used pixel buffer objects with multithreading?

Hi, Everyone.

I’m actually a beginner in OpenGL, but I think this might be an advanced topic, so if I posted in the wrong part of this forum, please tell me.

I am doing performance measurements for my project on texture uploading with pixel buffer objects in multiple threads.
We used glMapBuffer on GL_PIXEL_UNPACK_BUFFER_EXT and did the memory copy to the pointer returned by glMapBuffer.
For a single thread it was good, but we need it to be faster.
So we thought about uploading 2 textures at the same time using the same method in 2 threads.

So far I got pretty bad results.
Using a machine with two Intel dual-core 2.8 GHz CPUs, 2 GB RAM and an NVIDIA Quadro FX3450, uploading a 1920 x 1080 texture every frame,
I got:

  • single thread: 114 - 120 FPS
  • 2 threads using only 1 pixel buffer object: around 20 FPS
  • 2 threads using 2 pixel buffer objects: around 2 FPS

Has anybody ever tried the same thing that I did here and succeeded? Or do you think I should scrap the idea?

Thanks,
Celios

Hi,

your approach is logical, but it is not the best way to do it.
In my experience, having two OpenGL threads (+ 2 respective contexts) always produces lower performance than a single OpenGL thread doing all the work (at least for 1 GPU).

My app also loads tons of textures and I get pretty good performance. My trick is this:
I have two (or more) threads:

#1: the OpenGL thread that does all the rendering and owns
all the PBOs and textures. Before rendering each frame
I check if there is any new data to load. If yes, then
I load it into a texture from a PBO (unmapping it first).

#2-N: another N threads that just copy data from the origin
location (disk, image decompression, video input) and store
the data into mapped PBOs. These threads have no OpenGL context,
so the 1st thread must do all the mapping/unmapping of PBOs.

The number of PBOs is about 10, so a thread never blocks waiting for a free PBO.

If you have a good framework for PBO queues, you get pretty good performance. For me, glTexImage takes less than 1 ms
even for really large textures (full HD).

Common mistake: do not map the PBO immediately after returning from
glTexImage, otherwise you ruin the asynchronous processing.

Hope this helps.

/marek

While I haven’t measured, I’d expect the fastest performance from a single thread (at a time) handling all data transfers to/from mapped buffers, for simplicity, or a home-made locking primitive (*) if multiple threads could otherwise try writing data simultaneously, so that a whole buffer is streamed fully before switching to another.

My reasoning is that with multiple writers to different areas on the gfx card, there would be frequent address changes on the bus (effectively on every bus transaction), and lots and lots of bus transactions (each with its own overhead).

I admit I don’t know how good/bad e.g. PCIe is when the CPU or chipset write-combines whole cache lines as batches but still interleaves them, versus streaming a full buffer before switching address and streaming another, but if trying to squeeze the last nanosecond out of it, I’d at least research this. Given the speed of today’s buses (PCIe x16) and RAM (both system and video), I’d expect transfer rates easily exceeding 2 GB/s, with over double that nothing to raise an eyebrow at.

x86-specific: I’d also compare the relative performance of code using modern features vs. 386/486/586-style instructions (mov vs. movnti+sfence is probably the most important of all, with the prefetch instruction also possibly helping with streaming copies).

(*) I’ve been meaning to look into the newer x86 monitor and mwait instructions where available - something I strongly suggest the IHVs also do, instead of their current busy-waiting spinlocks.

++luck;

Thank you mfort and tamlin for your replies.

To mfort, you said:
“#2-N: another N threads that just copy data from the origin
location (disk, img decompression, video input) and storing
the data into mapped PBO. This thread has no opengl context.
So the 1st thread must do all the mapping/unmapping of PBOs.”

How do you store data into a mapped PBO if you don’t have a GL rendering context in the thread? Did you map the PBO in the first thread, get the pointer returned by glMapBuffer, and use it to store data in the other threads?

Also you said :
“Common mistake: Do not map the PBO just after returning from
glTexImage, otherwise you ruin async processing.”

How long should I wait after glTexImage before I can call glMapBuffer?

Thanks,
Celios

Hi Celios,

Q1:
yes, you understood it right. You map the buffer in the OpenGL thread. (Once the memory is mapped in OpenGL, it is visible to all threads in the same process.) Then pass it (through some synchronized queue) to the “loader” thread, which just fills the mapped memory. When it is finished, send it back to the OpenGL thread to execute glTexImage (another queue). It is like a multistage pipeline.

Q2: If you do glTexImage and glMapBuffer together, it takes the same time as loading the texture without a PBO. So you must wait.
In my experience the best is to wait one iteration of the render loop. I store the PBO in a “waiting” list and map it back next time. Then I send it back to the pool of available PBOs for the loader threads.

/Marek

mfort, thank you very much.
Although I’m still tweaking here and there, I am getting great improvements already.

Best Regards,
Celios

Hello, I decided to post here rather than open a new thread, since my question is very similar:

When the N other threads (with no GL context) need to operate on the data rendered by the GL thread as soon as possible, what is the best course of action?

In that case I guess the mapping of the buffers needs to be done immediately after rendering to them, since the results are required by the other threads. Is there any other way to handle this kind of CPU-GPU dependency while multithreading?

Thanks in advance,
babis

Hi,

you should have more buffers than loader threads; it is good to have a pool of mapped PBO buffers.

Each loader thread gets one “free” buffer from the pool, loads data, and sends it to a thread-safe “render” queue.
The render queue is read by the OpenGL thread; the PBO is unmapped, used for rendering, and stored in a “wait” list.
On the next rendering iteration (frame), the OpenGL thread can map back all the PBOs in the wait list and return them to the pool of “free” buffers.

So it is more about CPU-CPU synchronization. Not a big deal for the GPU.

/marek

Hi!
I also decided to post here rather than create a new thread.

As I understand it, PBOs help when you upload many textures at runtime. I mean, if you load all your textures only once at program start, there’s no point in using a PBO, am I right?

I tried to test it myself and I got one problem.

The main question - has anybody got an FP exception in the OpenGL ICD driver when calling glTexImage2D() with a PBO enabled?

I create a single PBO buffer for all the textures, and copy memory there (into the mapped buffer) just before uploading. Everything works fine except for one texture - I get an FP exception on the 5th MIP level.


   // copying via PBO
   IsgBufferPtr pboBuffer = holder->GetSmartLoadingBuffer();
   size_t buffer_size = image->GetDataSize();
   if (buffer_size > pboBuffer->GetSize())
      pboBuffer->BufferData(buffer_size, NULL, GL_STREAM_DRAW); // (re)allocate PBO storage

   // fill the mapped PBO
   BYTE * data_ptr = (BYTE *)pboBuffer->Map();
   memcpy(data_ptr, image->GetData(), buffer_size);
   pboBuffer->Unmap();

   // upload via PBO - the data pointer is an offset into the bound PBO, so NULL
   HRESULT hr = UpLoadData(NULL);
   GraphicEngine->BindZeroBuffer(GL_PIXEL_UNPACK_BUFFER);

All alignments are set to 1. No GL errors. Unmap also succeeds.

Hi,

I also get strange errors or app crashes with PBOs; I found they
were caused by a combination of driver version and pixel format.

Try another pixel format. The best is to use a format supported
natively by the gfx card (like GL_BGRA).

/marek

Thank you for your reply, mfort!

Actually, I’m now using the 169.21 ForceWare drivers; they seem to be the most stable nowadays.
About the format - I have to use the format that is in the DDS file, and I can do nothing about that. I’ll make a test app tomorrow to check whether it’s a driver bug or not.

And a last question - am I right about PBOs being useless for one-time texture loading?

Hi Jackis,

one-time textures:
It depends on your engine and its demands. If you need a sustained frame rate and from time to time you load big textures, then the PBO is definitely a benefit. Without a PBO, glTexImage can take something like 5 ms and you can see a frame drop because of the loading. With a PBO, glTexImage takes almost no time and your engine does not drop a frame. But PBO handling brings some complexity into your code, so it is up to you.

If you load the textures at some init stage, then PBOs are not useful.

/marek

Not that I recall. However, I’d suggest calling glTexImage2D() with a null pointer and no PBO bound (to pre-allocate the texture MIP level) and then using gl*TexSubImage2D() with a PBO offset and the PBO bound to actually do the upload. The latter is what you do in a loop.
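That suggestion might look like the following call-sequence sketch (not runnable as-is: it assumes an existing GL context, a texture object tex, and a PBO pbo already filled with W x H BGRA pixels; all those names are placeholders):

```c
/* One-time allocation: no PBO bound, null pointer -> storage only. */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
glBindTexture(GL_TEXTURE_2D, tex);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, W, H, 0,
             GL_BGRA, GL_UNSIGNED_BYTE, NULL);

/* Per-frame upload: PBO bound, the "pointer" is an offset into it. */
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, W, H,
                GL_BGRA, GL_UNSIGNED_BYTE, (const GLvoid *)0);
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);
```

The separation matters because glTexSubImage2D never reallocates storage, so the driver can schedule the DMA from the PBO without stalling.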

Good point.
It is a must for texture streaming with good performance.

/Marek

Thanks all, great post and good idea with the “mapped-PBO-pool” (mfort).

I also found this PBO explanation (OpenGL Pixel Buffer Object (PBO)) and downloaded a snippet of code that I decided to mess around with.
What I noticed in the PIXEL_UNPACK example was that transferring smaller chunks of pixels using 2 PBOs had almost no benefit. Only when uploading (-> GPU) more than about 50k bytes (16000 pixels) did I begin to see a difference with PBOs. I guess there’s a slight overhead from the PBO, but it seems small enough to make this approach usable in most cases. This overhead obviously varies across hardware and implementations, so this number shouldn’t be taken for more than it is. I just wanted to share this in case other people only want to transfer a small number of pixels.

EDIT:
My little informal testing reveals that using 1 PBO for streaming a 4 MB texture seemed to decrease CPU usage by some 15-20%, and using 2 PBOs decreased it by about 50%. Using BGRA over RGBA also decreases CPU usage by a few percent. It looks like the PBO extension is quite worth using for large transfers, and the BGRA format recommended by NV helps a bit too if low CPU usage is a concern (no swizzling in the driver takes place, as I understood from the white paper).

@Tamlin:
The highest transfer rate I saw was 2.6 GB/s streaming a large texture to an 8800 GTS on PCIe x16 (PBO mode 2 in the above article). Granted, this card doesn’t have the fastest memory on the market.