Performance of texture upload with PBO

Yeah, use whatever your compiler sets to tell you SSE2 is available for this compile.

Or just for testing, replace this with “true” if you know your dev box supports SSE2. See this link:

All 64-bit boxes have it.
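
For reference, a minimal sketch of that compile-time check. The helper name `compiled_with_sse2` is made up for illustration; the macros are the ones GCC/Clang and MSVC predefine. It only reports what the compiler was allowed to emit for this build, not what the CPU that eventually runs the binary supports.

```cpp
// Compile-time check: was this translation unit built with SSE2 enabled?
// compiled_with_sse2 is a hypothetical helper name, not part of any library.
constexpr bool compiled_with_sse2() {
#if defined(__SSE2__)                        // GCC/Clang: set by -msse2 (on by default for x86-64)
    return true;
#elif defined(_M_X64) || defined(_M_AMD64)   // MSVC targeting x64: SSE2 is always available
    return true;
#elif defined(_M_IX86_FP)                    // MSVC targeting 32-bit x86
    return _M_IX86_FP >= 2;                  // 2 means /arch:SSE2 or higher
#else
    return false;                            // non-x86 target, or SSE2 not enabled
#endif
}
```

Usage would be something like `if (compiled_with_sse2()) { /* SSE2 copy path */ } else { /* plain memcpy */ }`.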

I guess I’ll have to bite the bullet and learn to use Visual Studio.

Or just use Linux/GCC. It’s free.

That said, I find that the built-in memcpy on GCC 4.4.1 is even slightly faster than the gamedev SSE2 non-temporal memcpy on our app’s test data (batches streamed to VBOs) on Core i7 920, at least under -O2 (optimization level 2). They’re pretty close though.
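
To make "SSE2 non-temporal memcpy" concrete, here is a rough sketch of the technique. This is not the gamedev routine mentioned above, just the basic idea: streaming stores that bypass the cache. The name `stream_copy` and the alignment and size restrictions are assumptions made to keep it short. Whether it beats the built-in memcpy depends on the compiler, CPU and data, as noted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics
#define HAVE_SSE2_COPY 1
#else
#define HAVE_SSE2_COPY 0
#endif

// Copies `bytes` from src to dst using non-temporal (cache-bypassing) stores.
// Assumptions for this sketch: dst is 16-byte aligned, bytes is a multiple of 16.
inline void stream_copy(void* dst, const void* src, std::size_t bytes) {
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    assert(bytes % 16 == 0);
#if HAVE_SSE2_COPY
    auto* d = static_cast<__m128i*>(dst);
    const auto* s = static_cast<const __m128i*>(src);
    for (std::size_t i = 0, n = bytes / 16; i < n; ++i) {
        __m128i v = _mm_loadu_si128(s + i);  // unaligned load, so src may have any alignment
        _mm_stream_si128(d + i, v);          // non-temporal store to the aligned destination
    }
    _mm_sfence();  // order the streamed stores before anything that follows
#else
    std::memcpy(dst, src, bytes);  // fallback when SSE2 is not enabled for this build
#endif
}
```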

Hmm, Visual Studio is also giving me an error on __sse2_available. I suppose I need to include some header, but what? My Google-fu has failed me.

I managed to get Visual Studio to build me a DLL using the SSE2 memory copy function. I made sure it was compiled with optimizations and intrinsics, and verified that it was taking the SSE2 code path. But it still didn’t get significantly below 13 ms on the copy.

The whole point of PBOs is to let the CPU and GPU run without waiting on each other. If you need to stream video to the GPU, use this:

  • Create a PBO pool in which each PBO is large enough to hold a whole frame. Map them all and mark them as mapped.
  • From the decoder thread, when a frame has been decompressed, ask the pool for an unused, mapped PBO pointer. Copy the frame into it and mark it as filled with data.
    Depending on the decoder, you can even pass the PBO pointer straight to the decoder so it decodes the frame directly into the PBO buffer. This avoids one memcpy call. Be careful: if the decoder also reads data back from this buffer, it can slow things down.
  • From the rendering thread, once per frame, check the PBO pool status.
    • If a PBO is marked as uploading (I assume the upload from the PBO to the texture completes within one frame), map it again and set its status to mapped.
    • If a PBO holds data (status = filled with data), unmap that PBO and call glTexSubImage2D. Mark the PBO as uploading. Do not use that texture in the current frame, because glTexSubImage2D may not have finished yet and the GPU would have to wait until the texture object is ready to use.

Depending on the number of streams you want to play, use 4 or more PBOs in the pool.
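
The bookkeeping in the steps above could be sketched roughly as follows. The GL calls sit behind hypothetical hooks (`GlHooks`) so the sketch compiles without a GL context; in real code they would be glBindBuffer/glMapBuffer, glUnmapBuffer and glTexSubImage2D on the render thread. The extra `Filling` state is an addition not in the description above, so the same PBO is not handed out twice while a frame is being copied in. All names here are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

enum class PboState { Mapped, Filling, Filled, Uploading };

struct PboSlot {
    unsigned id = 0;        // would be the GL buffer name in real code
    void* ptr = nullptr;    // valid while Mapped, Filling or Filled
    PboState state = PboState::Mapped;
};

// Hypothetical stand-ins for the GL calls; all are invoked on the render thread.
struct GlHooks {
    std::function<void*(unsigned)> map;    // glBindBuffer(GL_PIXEL_UNPACK_BUFFER, id) + glMapBuffer
    std::function<void(unsigned)> unmap;   // glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER)
    std::function<void(unsigned)> upload;  // glTexSubImage2D(..., nullptr) sourcing the bound PBO
};

class PboPool {
public:
    // Render thread: map every PBO up front.
    PboPool(std::size_t count, GlHooks hooks) : hooks_(std::move(hooks)), slots_(count) {
        for (std::size_t i = 0; i < count; ++i) {
            slots_[i].id = static_cast<unsigned>(i + 1);
            slots_[i].ptr = hooks_.map(slots_[i].id);
        }
    }

    // Decoder thread: returns an unused, mapped PBO, or nullptr if none is free.
    PboSlot* acquire() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (PboSlot& s : slots_)
            if (s.state == PboState::Mapped) { s.state = PboState::Filling; return &s; }
        return nullptr;
    }

    // Decoder thread: the frame has been written to slot->ptr.
    void markFilled(PboSlot* slot) {
        std::lock_guard<std::mutex> lock(mutex_);
        assert(slot->state == PboState::Filling);
        slot->state = PboState::Filled;
    }

    // Render thread: call once per frame.
    void tick() {
        std::lock_guard<std::mutex> lock(mutex_);
        for (PboSlot& s : slots_) {
            if (s.state == PboState::Uploading) {      // upload was issued on an earlier frame
                s.ptr = hooks_.map(s.id);
                s.state = PboState::Mapped;
            } else if (s.state == PboState::Filled) {  // a new frame is waiting
                hooks_.unmap(s.id);
                s.ptr = nullptr;
                hooks_.upload(s.id);
                s.state = PboState::Uploading;         // do not sample the texture this frame
            }
        }
    }

private:
    GlHooks hooks_;
    std::vector<PboSlot> slots_;
    std::mutex mutex_;
};
```

The decoder thread would call `acquire()`, copy or decode into `slot->ptr`, then call `markFilled(slot)`; the render thread calls `tick()` once per frame. Like the description, this assumes an upload finishes within one frame.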

To read data back you need two PBOs. Issue glReadPixels into PBO1, map PBO2 and copy its data to system memory or to an output video card, unmap PBO2, and then swap the PBO buffer names.
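
A rough sketch of that two-PBO readback, again with the GL calls behind hypothetical hooks (`ReadbackHooks`) so it compiles without a context. In real code `readInto` would bind the PBO to GL_PIXEL_PACK_BUFFER and call glReadPixels with a null offset, and `map`/`unmap` would be glMapBuffer/glUnmapBuffer. The class and member names are made up. Note the data you copy out is always one frame behind.

```cpp
#include <cstddef>
#include <cstring>
#include <functional>
#include <utility>

// Hypothetical stand-ins for the GL calls.
struct ReadbackHooks {
    std::function<void(unsigned)> readInto;     // bind to GL_PIXEL_PACK_BUFFER + glReadPixels(..., nullptr)
    std::function<const void*(unsigned)> map;   // glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY)
    std::function<void(unsigned)> unmap;        // glUnmapBuffer(GL_PIXEL_PACK_BUFFER)
};

class PingPongReadback {
public:
    PingPongReadback(unsigned pboA, unsigned pboB, std::size_t frameBytes, ReadbackHooks hooks)
        : readPbo_(pboA), copyPbo_(pboB), bytes_(frameBytes), hooks_(std::move(hooks)) {}

    // Call once per frame. Returns false on the first call, when no earlier frame exists yet.
    bool frame(void* out) {
        hooks_.readInto(readPbo_);                 // start the asynchronous read of this frame
        const bool haveData = primed_;
        if (haveData) {
            const void* p = hooks_.map(copyPbo_);  // pixels from the previous frame
            std::memcpy(out, p, bytes_);
            hooks_.unmap(copyPbo_);
        }
        std::swap(readPbo_, copyPbo_);             // swap the PBO buffer names
        primed_ = true;
        return haveData;
    }

private:
    unsigned readPbo_;
    unsigned copyPbo_;
    std::size_t bytes_;
    ReadbackHooks hooks_;
    bool primed_ = false;
};
```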

yooyo, thanks, but there are a couple of things that still confuse me.

First, when I started this topic, I referred to an example in the PBO specification, and that example did not use threads. Was it a poor example?

Second, if you’re going to use threads, I’m not sure I see why PBOs are needed. Couldn’t you just have one thread that does texture uploads directly with glTexSubImage2D, and another thread that renders with the textures?

OK, maybe I can answer my own question about why use PBO if you’re going to use threads. I guess the simpler approach would not work well if you have only one processor, because while glTexSubImage2D was uploading synchronously, nothing else would be getting done. Right?

This is just the easiest possible example. It is not designed for real-world usage.

  • If a PBO holds data (status = filled with data), unmap that PBO and call glTexSubImage2D. Mark the PBO as uploading. Do not use that texture in the current frame, because glTexSubImage2D may not have finished yet and the GPU would have to wait until the texture object is ready to use.

Hello,
How do I know when the unmapping (glTexSubImage2D) is finished?

You can insert a fence with a sync object after glTexSubImage2D and query its status when you want to reuse your PBO.
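
A minimal sketch of that fence approach. Sync objects need OpenGL 3.2 or ARB_sync. To keep the sketch compilable without a GL context, the GL calls are behind hypothetical hooks (`SyncHooks`) and `Sync` stands in for GLsync; the comments name the real calls each hook would wrap. `UploadFence` is a made-up name.

```cpp
#include <functional>
#include <utility>

using Sync = void*;   // stand-in for GLsync in this sketch

// Hypothetical stand-ins for the GL sync calls.
struct SyncHooks {
    std::function<Sync()> fence;         // glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
    std::function<bool(Sync)> signaled;  // glClientWaitSync(s, 0, 0) returned GL_ALREADY_SIGNALED
                                         // or GL_CONDITION_SATISFIED
    std::function<void(Sync)> destroy;   // glDeleteSync(s)
};

class UploadFence {
public:
    explicit UploadFence(SyncHooks hooks) : hooks_(std::move(hooks)) {}

    // Call right after glTexSubImage2D for this PBO.
    void arm() {
        if (sync_) hooks_.destroy(sync_);
        sync_ = hooks_.fence();
    }

    // Non-blocking poll: true once the GPU has passed the fence, so the PBO can be reused.
    bool ready() {
        if (!sync_) return true;                    // nothing pending
        if (!hooks_.signaled(sync_)) return false;  // upload still in flight
        hooks_.destroy(sync_);
        sync_ = nullptr;
        return true;
    }

private:
    SyncHooks hooks_;
    Sync sync_ = nullptr;
};
```

With one fence per PBO, you would call `arm()` after the upload and only remap the PBO once `ready()` returns true, instead of assuming the upload takes exactly one frame.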