Performance of texture upload with PBO

James_W_Walker · July 15, 2010, 1:09pm

In the ARB_pixel_buffer_object specification, Example 2 shows a way to upload texture data using a PBO. The basic outline is that you create a PBO of the right size, map the PBO into memory, copy pixel data into the PBO with memcpy, unmap the PBO, and then upload from the PBO to a texture with glTexSubImage2D. When I tried using this method to upload an image of size 1858 x 1045 into a nonrectangular texture, the memcpy took (on average) around 13 ms, and the glTexSubImage2D took around 7 ms. On the other hand, if I don’t use a PBO and just use glTexSubImage2D directly, it takes around 7 ms.

What am I missing? What’s the advantage of the PBO method?

yooyo · July 16, 2010, 8:47am

Texture streaming.
Without a PBO, the CPU is blocked and waits during the glTexSubImage2D call. With a PBO, the glTexSubImage2D call returns immediately.
Don’t use plain memcpy. Use some faster memory-copy code that uses MMX/SSE instructions. Google for it.
In video streaming, use two PBOs: decode into PBO1 while uploading to the texture from PBO2, then swap the PBOs.
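
The two-PBO swap can be hard to picture, so here is a minimal sketch of just the scheduling, not real OpenGL code. It assumes plain byte arrays standing in for PBO1/PBO2 and for the texture, and memcpy standing in for both the decoder write and the upload. The names Streamer, streamer_init and streamer_frame are hypothetical. The comments note where the real GL calls would go.

```c
#include <assert.h>
#include <string.h>

#define FRAME_BYTES 8  /* tiny stand-in for width * height * bytes per pixel */

/* Hypothetical model: two arrays play the role of the two PBOs,
   a third plays the role of the texture. */
typedef struct {
    unsigned char pbo[2][FRAME_BYTES];
    unsigned char texture[FRAME_BYTES];
    int decode_index;  /* which PBO the decoder writes into this frame */
} Streamer;

static void streamer_init(Streamer *s)
{
    memset(s, 0, sizeof *s);
}

/* One frame: decode into one PBO, upload from the other, then swap roles. */
static void streamer_frame(Streamer *s, const unsigned char *decoded, size_t n)
{
    int upload_index = 1 - s->decode_index;
    assert(n <= FRAME_BYTES);

    /* Real code: glMapBufferARB, write the decoded frame, glUnmapBufferARB. */
    memcpy(s->pbo[s->decode_index], decoded, n);

    /* Real code: bind the other PBO, then glTexSubImage2D with offset 0. */
    memcpy(s->texture, s->pbo[upload_index], n);

    s->decode_index = upload_index;  /* swap the PBOs for the next frame */
}
```

In this model the texture always shows the frame decoded one call earlier; that one-frame lag is what lets the upload of one buffer overlap with the decode into the other.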

ZbuffeR · July 16, 2010, 8:58am

“Without a PBO, the CPU is blocked and waits during the glTexSubImage2D call.”

From hearsay, this is not completely true. An optimized GL driver only has to copy data on its side, then return, and then send data asynchronously to the GPU.

James_W_Walker · July 16, 2010, 10:40am

That’s what I’ve read, but as stated above, that’s not what I measured. So even if I could do the memcpy in zero time, I still wouldn’t have a win.

James_W_Walker · July 16, 2010, 11:23am

Now I’m more confused… I did the timing again, with a somewhat bigger image size (1920 x 1080) and got very different results: about 13.5 ms for the memcpy, but only 0.1 ms for glTexSubImage. Maybe I did something wrong before.

As for the memory copy speed, this is on Windows, and I tried using the CopyMemory function provided by the OS, with basically the same results. I’d think that Microsoft would have optimized the hell out of CopyMemory, or am I being naïve?

Arkh · July 18, 2010, 8:07am

In my situation, I’m using a single PBO to upload data to the GPU. For 1920 x 1080 RGB / BYTE, it takes about 2 ms to send all the data (not far from the bandwidth limit of 3 GB/s); the time fluctuates between a minimum near 1.7 ms and a maximum around 2.3 ms.
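
As a sanity check on those numbers, here is a small sketch of the arithmetic. A tightly packed 1920 x 1080 RGB image with one byte per channel is 6,220,800 bytes, so 2 ms works out to roughly 3.1 GB/s. The helper names image_bytes and throughput_gb_per_s are hypothetical, and 1 GB is taken as 10^9 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Size in bytes of a tightly packed image (no row padding assumed). */
static size_t image_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    assert(bytes_per_pixel > 0);
    return width * height * bytes_per_pixel;
}

/* Transfer rate in GB/s, taking 1 GB as 1e9 bytes. */
static double throughput_gb_per_s(size_t bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return (double)bytes / (milliseconds * 1e-3) / 1e9;
}
```

For comparison, the same 6,220,800 bytes copied in 13.5 ms (the memcpy time reported earlier in the thread, assuming 3 bytes per pixel, which that post does not state) would be about 0.46 GB/s.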

Platforms: Win 7 x64 (both)
Drivers: 257.XX (for my laptop only; I don’t know for the desktop)
CPUs: i7 920 (desktop), i7 950 (laptop)
GPUs: FX 3800 (desktop), GTX 280M (laptop)

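The code I’m using:

// Bind the PBO as the pixel-unpack buffer and map it for writing.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, bufferID);
void* ptr = glMapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, GL_WRITE_ONLY_ARB);
memcpy(ptr, data, size);
// Then unmap the PBO.
glUnmapBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB);
glBindTexture(GL_TEXTURE_2D, textureID);
// With a PBO bound, the last argument is a byte offset into the PBO, not a pointer.
glTexSubImage2D(GL_TEXTURE_2D, 0, offsetX, offsetY, w, h, mode, depth, 0);
// Then unbind the PBO.
glBindBufferARB(GL_PIXEL_UNPACK_BUFFER_ARB, 0);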

Are you sure about the size you are copying?

yooyo · July 19, 2010, 7:28am

memcpy and CopyMemory are slow. Try with:

James_W_Walker · July 19, 2010, 10:50am

I did leave out some relevant details. My image data actually contains 2 side-by-side images, and I’m uploading each half to a separate texture. I realize now that I can be a little smarter and upload the data to a PBO just once, and use that PBO for both glTexSubImage calls. However, even taking into account the factor of 2, the time I’m getting seems a lot higher than what you report.

OS: Windows Vista Home Premium 64 bit
CPU: Intel Core 2 Duo
GPU: ATI Radeon HD 4670

James_W_Walker · July 19, 2010, 12:26pm

Thanks, but the thread at your first link refers to code at a dead link, and an Intel library that was then available for free but is no longer. The second link is designed for AMD processors, and it’s not clear to me exactly what assumptions it makes.

James_W_Walker · July 19, 2010, 12:38pm

D’oh! I now see that CopyMemory is a macro defined to be RtlCopyMemory, and RtlCopyMemory is a macro defined to be memcpy! So maybe Microsoft doesn’t have any memory-copy function in a DLL, just a memcpy in a C runtime library. And now another confession, I’ve been using CodeWarrior rather than Visual Studio, hence a really old C runtime library. That may explain the slowness.

James_W_Walker · July 19, 2010, 12:54pm

I tried the AMD assembly code, and it may be a tad faster, say 12.9 ms rather than 13.5.

mhagain · July 20, 2010, 7:42am

I’m also hitting a similar problem, where the supposed perf increase from using a PBO is just not happening. The code goes like:
- BindBuffer
- MapBuffer
- Update regions of the mapped area that need modifying, building a rect that describes the size of the total updated area.
- UnmapBuffer
- TexSubImage
- Unbind (BindBuffer, 0)
This is in a performance-critical path and I need to be able to do 30-40 of these per frame. Textures are 64x512. The entire texture rect is not, however, being updated; only a subrect is, so BufferData with a NULL data pointer is not an option.
The annoying thing is that I know the hardware (Intel 4 Series) is capable; I have equivalent D3D code that handles it smoothly and almost for free (and gives you 80,000 verts per frame in addition), but OpenGL just stutters and stalls.

Is accessing the PBO serially more efficient than hopping around in it? Would there be benefit to keeping a copy of the texture data in system memory, hopping around in that to update, then copy to PBO and TexSubImage it?

Maybe a driver problem (I did say Intel) but I want to ensure that I’m using the correct optimal path before bashing at that.
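
The "build a rect, then TexSubImage only that rect" step described above can be sketched as follows. This is an illustration under assumptions, not the poster's code: the PBO is assumed to hold the full texture tightly packed, and Rect, rect_union and pbo_offset are hypothetical names. In real code the row stride would be set with glPixelStorei(GL_UNPACK_ROW_LENGTH, tex_width) and the computed offset passed as the last argument of glTexSubImage2D.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open rectangle in texels: [x0, x1) by [y0, y1). */
typedef struct { int x0, y0, x1, y1; } Rect;

static int rect_is_empty(Rect r)
{
    return r.x1 <= r.x0 || r.y1 <= r.y0;
}

/* Smallest rectangle covering both inputs; an empty input is ignored. */
static Rect rect_union(Rect a, Rect b)
{
    Rect r;
    if (rect_is_empty(a)) return b;
    if (rect_is_empty(b)) return a;
    r.x0 = a.x0 < b.x0 ? a.x0 : b.x0;
    r.y0 = a.y0 < b.y0 ? a.y0 : b.y0;
    r.x1 = a.x1 > b.x1 ? a.x1 : b.x1;
    r.y1 = a.y1 > b.y1 ? a.y1 : b.y1;
    return r;
}

/* Byte offset of the rect's first texel inside a PBO that mirrors the
   whole texture, tightly packed, tex_width texels per row. */
static size_t pbo_offset(Rect r, int tex_width, int bytes_per_pixel)
{
    assert(r.x0 >= 0 && r.y0 >= 0 && tex_width > 0 && bytes_per_pixel > 0);
    return ((size_t)r.y0 * (size_t)tex_width + (size_t)r.x0)
           * (size_t)bytes_per_pixel;
}
```

Each modified region is merged into a running dirty rect with rect_union; after UnmapBuffer, a single TexSubImage call covers (x0, y0) with width x1 - x0 and height y1 - y0.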

mfort · July 20, 2010, 8:26am

@mhagain

Updating only part of the buffer is a very bad thing.
The driver copies the whole buffer.
Use smaller PBOs.

If you want better performance then do the memcpy in another thread.

Dark_Photon · July 20, 2010, 9:07am

“Thanks, but the thread at your first link refers to code at a dead link, and an Intel library that was then available for free but is no longer. The second link is designed for AMD processors, and it’s not clear to me exactly what assumptions it makes.”
The first link works fine for me here. Suspect net filtering on your end.

And the second link (both, actually) appears at first glance to be generic MMX. This formulation is just a little inconvenient to integrate into a C/C++ app since it’s raw asm. Also, it seems to be doing 64-bit MMX moves, whereas with SSE2 (supported by all 64-bit x86 CPUs and many 32-bit ones) you can do 128-bit moves.

So instead…

Here is C <emmintrin.h> code for that same concept – that is, a non-temporal (non-cache-polluting) memcpy, which AMD terms “Streaming Store”, but which uses SSE2:

gamedev.net source code

The concept behind all of these (but especially the previous link) is explained more fully (in English) here:

Performance Optimization of Windows Applications on AMD Processors, Part II (read the whole page, or search down for Streaming Store)

Works fine on Linux/GCC too though. If you’re compiling under GCC, this is how you can test whether this compilation supports SSE2:

#if !( defined(__GNUC__) && defined(__SSE2__) )

if not, you can fall back to system memcpy.
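
For readers without access to the linked source, here is a minimal sketch of the same idea, not the gamedev.net code itself: a copy that uses SSE2 non-temporal stores from <emmintrin.h> when the compiler reports SSE2, and falls back to the system memcpy otherwise. The name stream_copy and the HAVE_SSE2 macro are hypothetical. It assumes the destination is 16-byte aligned, which should be checked for a mapped PBO pointer rather than taken for granted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* __SSE2__ is set by GCC/Clang; _M_X64 and _M_IX86_FP by MSVC. */
#if defined(__SSE2__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0
#endif

/* Copy n bytes from src to dst. On the SSE2 path dst must be 16-byte
   aligned; src may be unaligned. */
static void stream_copy(void *dst, const void *src, size_t n)
{
#if HAVE_SSE2
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    size_t blocks = n / 16;
    size_t i;

    assert(((uintptr_t)dst & 15u) == 0);
    for (i = 0; i < blocks; ++i) {
        __m128i v = _mm_loadu_si128(s + i);  /* unaligned 128-bit load */
        _mm_stream_si128(d + i, v);          /* store that bypasses the cache */
    }
    _mm_sfence();  /* order the streamed stores before later writes */

    /* Copy the remaining 0..15 bytes the ordinary way. */
    memcpy((unsigned char *)dst + blocks * 16,
           (const unsigned char *)src + blocks * 16, n % 16);
#else
    memcpy(dst, src, n);  /* no SSE2 at compile time: system memcpy */
#endif
}
```

Whether this beats the runtime library's memcpy depends on the CPU, the compiler, and the buffer size, so it is worth timing both on the target machine.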

Maybe you could add that SSE2 gamedev source code to your bake-off and post some comparative times. Be sure not to operate on the same data in the same prog run without a cache flush, and flip the order of your tests a few times to ensure your timings are in-fact independent. Separate test prog runs with diff algs each time is probably safest.
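
One possible shape for such a bake-off is sketched below. It is an assumption-laden illustration: copy_fn, evict_cache and time_copy_ms are hypothetical names, touching a large scratch buffer is only a crude stand-in for a real cache flush, and clock() measures CPU time at a resolution that can be as coarse as a millisecond on some platforms, so large buffers and many runs are needed for meaningful numbers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Any copy routine under test must match this signature. */
typedef void (*copy_fn)(void *dst, const void *src, size_t n);

static void plain_memcpy(void *dst, const void *src, size_t n)
{
    memcpy(dst, src, n);
}

/* Crude cache eviction: touch one byte per 64 bytes of a scratch buffer
   that should be larger than the CPU's last-level cache. */
static void evict_cache(unsigned char *scratch, size_t scratch_bytes)
{
    size_t i;
    for (i = 0; i < scratch_bytes; i += 64)
        scratch[i]++;
}

/* Average time in milliseconds of one copy, over `runs` cold-cache runs. */
static double time_copy_ms(copy_fn fn, void *dst, const void *src, size_t n,
                           int runs, unsigned char *scratch,
                           size_t scratch_bytes)
{
    double total_ms = 0.0;
    int r;
    assert(runs > 0);
    for (r = 0; r < runs; ++r) {
        clock_t t0, t1;
        evict_cache(scratch, scratch_bytes);  /* not included in the timing */
        t0 = clock();
        fn(dst, src, n);
        t1 = clock();
        total_ms += (double)(t1 - t0) * 1000.0 / CLOCKS_PER_SEC;
    }
    return total_ms / runs;
}
```

Calling time_copy_ms once per candidate routine, and alternating which candidate goes first across program runs, follows the ordering advice above.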

James_W_Walker · July 20, 2010, 10:23am

I guess I wasn’t clear… when I said “the thread at your first link refers to code at a dead link”, I didn’t mean that your link was itself a dead link, I meant that the thread is talking about code from http://www.joryanick.com/memcpy.htm, which is a dead link.

What’s emmintrin.h?

Anyway, I’ll give that code a go.

mhagain · July 20, 2010, 10:27am

Sadly I don’t know in advance how large or small the update region is going to be, so I can’t do that. Secondly, I’ve now established that the culprit is definitely the call to glTexSubImage2D; comment out that call, even with the map/update/unmap left in, and it’s smooth as silk, up to about 5-6 times the performance. The PBO is not what’s causing the bottleneck here; it’s glTexSubImage2D for definite. To rule out a usual suspect, I have also checked for BGRA.

“Use smaller PBOs.”

Been there, done that, wear the t-shirt down the pub every friday night. Doesn’t help.

“If you want better performance then do the memcpy in another thread.”

Won’t help, the copy of data to PBO is not the bottleneck, it’s the copy from PBO to texture object.

mfort · July 20, 2010, 11:01am

check this out
http://www.opengl.org/registry/specs/ARB/copy_buffer.txt

In my experience glTexSubImage2D with PBO always takes zero time even with uploading BGRA 1920x540

James_W_Walker · July 20, 2010, 11:21am

Looks like my old CodeWarrior compiler can’t cope with this. It has never heard of __sse2_available, and gave a bunch of “register spilled” warnings that I didn’t know how to deal with. I commented out the __sse2_available part and tried it anyway, but it was slower than my older code. I guess I’ll have to bite the bullet and learn to use Visual Studio.

mhagain · July 20, 2010, 11:25am

It’s a nice extension but it’s not available on my hardware, and won’t be available on 75-90% of the users’ hardware either. It’s annoying because it’s not a problem in the D3D version of the app.

I think at this stage I need to make a standalone app that I can beat on with this.

Dark_Photon · July 20, 2010, 12:10pm

It’s a cross-platform (apparently) x86/x86_64 header file that provides compiler symbols (“intrinsics”) that compile to SIMD instructions such as MMX, SSE, SSE2, etc. For instance, see these links:

http://msdn.microsoft.com/en-us/library/ba08y07y.aspx (Microsoft on SSE2 integer ops)
http://www.codeproject.com/KB/recipes/mmxintro.aspx (ancient MMX docs)

I say apparently cross-platform because the SSE2 memcpy from gamedev allegedly compiles on MSWin with MSVS, and it compiles/runs fine for me on Linux with GCC. All I had to do was change this thing: “!__sse2_available” to:

#if !( defined(__GNUC__) && defined(__SSE2__) )

which has nothing to do with this header file.