Performance of texture upload with PBO

Dark_Photon · July 20, 2010, 12:15pm

Yeah, use whatever your compiler sets to tell you SSE2 is available for this compile.

Or just for testing, replace this with “true” if you know your dev box supports SSE2. See this link:

http://en.wikipedia.org/wiki/SSE2#CPUs_supporting_SSE2

All 64-bit x86 (x86-64) boxes have it.
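For what it's worth, here is a minimal sketch of a compile-time check that works on both toolchains discussed here. It assumes the usual predefined macros: `__SSE2__` on GCC and Clang, `_M_X64` on 64-bit MSVC builds, and `_M_IX86_FP >= 2` on 32-bit MSVC builds compiled with /arch:SSE2. The names `HAVE_SSE2_AT_COMPILE_TIME` and `sse2_enabled` are made up for illustration and are not from the gamedev code.

```cpp
// True when the compiler is allowed to emit SSE2 instructions for this build.
// This is a compile-time property, not a runtime CPU check.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#define HAVE_SSE2_AT_COMPILE_TIME 1   // hypothetical macro name
#else
#define HAVE_SSE2_AT_COMPILE_TIME 0
#endif

// Hypothetical helper that can stand in for a hard-coded "true" while testing.
constexpr bool sse2_enabled() { return HAVE_SSE2_AT_COMPILE_TIME != 0; }
```

This only tells you what the compiler was asked to target. A build that has to run on older 32-bit CPUs would still need a runtime CPUID check.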

I guess I’ll have to bite the bullet and learn to use Visual Studio.

Or just use Linux/GCC. It’s free.

That said, I find that the built-in memcpy on GCC 4.4.1 is even slightly faster than the gamedev SSE2 non-temporal memcpy on our app’s test data (batches streamed to VBOs) on Core i7 920, at least under -O2 (optimization level 2). They’re pretty close though.
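For readers who have not seen one, a non-temporal copy looks roughly like the sketch below. This is not the gamedev routine, just a minimal illustration of the idea, and `stream_copy` is a hypothetical name. It assumes the destination is 16-byte aligned and falls back to plain memcpy when SSE2 is not compiled in.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>   // SSE2 intrinsics
#define USE_SSE2_STREAM 1
#else
#define USE_SSE2_STREAM 0
#endif

// Hypothetical helper: copies n bytes to a 16-byte-aligned destination using
// non-temporal (cache-bypassing) stores when SSE2 is available at compile time.
void stream_copy(void* dst, const void* src, std::size_t n) {
#if USE_SSE2_STREAM
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    const std::size_t blocks = n / 16;
    for (std::size_t i = 0; i < blocks; ++i) {
        // Unaligned load from the source, streaming store to the destination.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + 16 * i));
        _mm_stream_si128(reinterpret_cast<__m128i*>(d + 16 * i), v);
    }
    _mm_sfence();  // order the streaming stores before any later stores
    std::memcpy(d + 16 * blocks, s + 16 * blocks, n % 16);  // tail bytes
#else
    std::memcpy(dst, src, n);
#endif
}
```

Whether this beats the library memcpy depends on the compiler, the CPU and the destination memory, so it is worth timing both on your own data.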

James_W_Walker · July 20, 2010, 1:45pm

Hmm, Visual Studio is also giving me an error on __sse2_available. I suppose I need to include some header, but what? My Google-fu has failed me.

James_W_Walker · July 20, 2010, 3:54pm

I managed to get Visual Studio to build me a DLL using the SSE2 memory copy function. I made sure it was compiled with optimizations and intrinsics, and verified that it was taking the SSE2 code path. But it still didn’t get significantly below 13 ms on the copy.

yooyo · July 21, 2010, 4:15am

The whole point of PBOs is to allow the CPU and GPU to run without waiting on each other. If you need to stream video to the GPU, use this:

Create a PBO pool; each PBO should be large enough to hold a whole frame. Map them all and mark them as mapped.
From the decoder thread, when a frame has been decompressed, ask the PBO pool for an unused, mapped PBO pointer. Copy the frame to it and mark it as filled with data.
Depending on the decoder, you can even pass the PBO pointer directly to the decoder so that it decodes the frame straight into the PBO buffer. This avoids one memcpy call. Be careful: if the decoder also reads from this buffer, it can be slow, because mapped buffer memory is often uncached or write-combined.
From the rendering thread, once per frame, check the PBO pool status:
- If a PBO is marked as uploading (I assume that the upload from the PBO to the texture completes within one frame), map it again and set its status to mapped.
- If a PBO has data (status = filled with data), unmap that PBO and call glTexSubImage2D. Mark the PBO as uploading. Do not use that texture in the current frame, because glTexSubImage2D may not have finished yet, so the GPU would stall until the texture is ready to use.

Depending on the number of streams you want to play, use 4 or more PBOs in the pool.
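The bookkeeping for such a pool can be sketched as a small state machine. This is only an illustration under some assumptions: the GL calls are replaced by two hypothetical hooks (`map` and `upload`) so the logic can run without a context, and a `Filling` state is added to represent "handed to the decoder", which is not spelled out in the steps above. The names `PboPool`, `Slot` and `SlotState` are made up.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

enum class SlotState { Mapped, Filling, Filled, Uploading };

struct Slot {
    SlotState state = SlotState::Mapped;
    void* ptr = nullptr;   // valid only while the PBO is mapped
};

// Hypothetical hooks standing in for GL calls made on the rendering thread:
//   map(i)    -> bind pbo[i] to GL_PIXEL_UNPACK_BUFFER, then glMapBuffer(..., GL_WRITE_ONLY)
//   upload(i) -> glUnmapBuffer(GL_PIXEL_UNPACK_BUFFER), then glTexSubImage2D with offset 0
struct PboPool {
    std::vector<Slot> slots;
    std::function<void*(std::size_t)> map;
    std::function<void(std::size_t)> upload;
    std::mutex m;

    PboPool(std::size_t n, std::function<void*(std::size_t)> mapFn,
            std::function<void(std::size_t)> uploadFn)
        : slots(n), map(std::move(mapFn)), upload(std::move(uploadFn)) {
        for (std::size_t i = 0; i < n; ++i) slots[i].ptr = map(i);  // map them all up front
    }

    // Decoder thread: index of an unused mapped slot, or -1 if none is free.
    int acquire() {
        std::lock_guard<std::mutex> lock(m);
        for (std::size_t i = 0; i < slots.size(); ++i) {
            if (slots[i].state == SlotState::Mapped) {
                slots[i].state = SlotState::Filling;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Decoder thread: the frame has been written to slots[i].ptr.
    void markFilled(int i) {
        std::lock_guard<std::mutex> lock(m);
        Slot& s = slots[static_cast<std::size_t>(i)];
        assert(s.state == SlotState::Filling);
        s.state = SlotState::Filled;
    }

    // Rendering thread, once per frame.
    void tick() {
        std::lock_guard<std::mutex> lock(m);
        for (std::size_t i = 0; i < slots.size(); ++i) {
            Slot& s = slots[i];
            if (s.state == SlotState::Uploading) {     // upload was issued last frame
                s.ptr = map(i);
                s.state = SlotState::Mapped;
            } else if (s.state == SlotState::Filled) { // new data from the decoder
                s.ptr = nullptr;
                upload(i);
                s.state = SlotState::Uploading;
            }
        }
    }
};
```

The decoder thread only ever touches plain memory and the pool's lock; every GL call stays on the rendering thread, inside the constructor and `tick()`.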

To read back data you need two PBOs. Issue glReadPixels into PBO1, map PBO2 and copy its data to system memory or the output video card, then unmap PBO2 and swap the two PBO names.
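The swapping part of the readback scheme is easy to get backwards, so here is a minimal sketch of it. The GL calls appear only as comments, since they need a live context, and `ReadbackPair` and `consume` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <utility>

// Hypothetical holder for two pack-PBO names (real names come from glGenBuffers).
struct ReadbackPair {
    unsigned readInto;   // receives this frame's glReadPixels
    unsigned mapFrom;    // holds the pixels read on the previous frame

    // One frame of readback; consume(name) stands in for the copy out of the mapped PBO.
    template <class Consume>
    void frame(Consume consume) {
        assert(readInto != mapFrom);
        // glBindBuffer(GL_PIXEL_PACK_BUFFER, readInto);
        // glReadPixels(x, y, w, h, format, type, nullptr);   // offset 0 into the bound PBO
        // glBindBuffer(GL_PIXEL_PACK_BUFFER, mapFrom);
        // void* p = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
        consume(mapFrom);   // copy from p to system memory or the output card
        // glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
        std::swap(readInto, mapFrom);   // next frame reads into the PBO just drained
    }
};
```

Note that the data you copy out is always one frame old, and on the very first frame the mapped PBO holds nothing useful yet.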

James_W_Walker · July 21, 2010, 1:51pm

yooyo, thanks, but there are a couple of things that still confuse me.

First, when I started this topic, I referred to an example in the PBO specification, and that example did not use threads. Was it a poor example?

Second, if you’re going to use threads, I’m not sure I see why PBOs are needed. Couldn’t you just have one thread that does texture uploads directly with glTexSubImage2D, and another thread that renders with the textures?

James_W_Walker · July 21, 2010, 2:05pm

OK, maybe I can answer my own question about why use PBO if you’re going to use threads. I guess the simpler approach would not work well if you have only one processor, because while glTexSubImage2D was uploading synchronously, nothing else would be getting done. Right?

yooyo · July 21, 2010, 5:12pm

That is just the simplest possible example. It is not designed for real-world use.

retro009 · October 7, 2010, 12:01pm

If a PBO has data (status = filled with data), unmap that PBO and call glTexSubImage2D. Mark the PBO as uploading. Do not use that texture in the current frame, because glTexSubImage2D may not have finished yet, so the GPU would stall until the texture is ready to use.

Hello,
how do I know when the unmapping (glTexSubImage2D) is finished?

Pierre_Boudier · October 7, 2010, 12:07pm

You can insert a fence with a sync object (glFenceSync) after glTexSubImage2D, and query its status when you want to reuse your PBO.
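A rough sketch of that pattern follows. To keep it runnable without a GL context, the three sync calls are passed in as hypothetical hooks (`FenceOps`), with the real calls they are assumed to wrap named in the comments. `PboUpload` is also a made-up name, and `void*` stands in for the opaque `GLsync` handle.

```cpp
#include <cassert>
#include <functional>

// Hypothetical hooks standing in for the real sync-object calls:
//   insert()    -> glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0)
//   signaled(f) -> glClientWaitSync(f, 0, 0) returned GL_ALREADY_SIGNALED
//                  or GL_CONDITION_SATISFIED (zero timeout, so it never blocks)
//   destroy(f)  -> glDeleteSync(f)
struct FenceOps {
    std::function<void*()> insert;
    std::function<bool(void*)> signaled;
    std::function<void(void*)> destroy;
};

struct PboUpload {
    void* fence = nullptr;   // non-null while an upload may still be in flight

    // Call right after glTexSubImage2D has been issued from this PBO.
    void begin(const FenceOps& gl) {
        assert(fence == nullptr);
        fence = gl.insert();
    }

    // Non-blocking poll: true once the PBO can be mapped and refilled.
    bool reusable(const FenceOps& gl) {
        if (fence == nullptr) return true;      // nothing pending
        if (!gl.signaled(fence)) return false;  // GPU has not reached the fence yet
        gl.destroy(fence);
        fence = nullptr;
        return true;
    }
};
```

With real GL, the fence has to reach the GPU before it can signal, so either call glFlush after inserting it or pass GL_SYNC_FLUSH_COMMANDS_BIT on the first glClientWaitSync. Sync objects need OpenGL 3.2 or ARB_sync.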