PBO Performance (yet again)

Hi,
I am working on an out-of-core volume renderer that uses a texture atlas. The atlas is quite large, so double buffering it to get asynchronous atlas updates is out of the question. My approach is to use a smaller PBO that receives the data to be updated in the atlas. I can write asynchronously to the PBO while the atlas is being used to render the volume. I expected moderate throughput into the PBO and higher throughput from the PBO to the atlas, given the high on-device bandwidth.

The PBO has the DYNAMIC_DRAW usage flag. I measured the throughput into the PBO at around 2 GiB/s, which is really great. But glTexSubImage3D from the PBO to the atlas only gives me around 1 GiB/s, which I think is quite low. To find the maximum performance I can expect from a PBO with glTexSubImage3D, I tried different usage flags to indicate purely static usage of the PBO (as well as stream usage from GPU to GPU memory) without touching the PBO data. That means no glMapBuffer or glBufferData calls, just the initialization with NULL and the use with glTexSubImage3D. Even so, I was not able to get better than around 1 GiB/s from the PBO into the 3D texture.

I am running Windows XP x64 on a GTX 280 with ForceWare 180.43. Is there anything I can do to improve the unpack performance from the PBO? The voxel format is plain unsigned char GL_LUMINANCE.

Regards
-chris

How do you measure the performance of glTexSubImage3D?
Try GL_R8 instead of GL_LUMINANCE (GL_ARB_texture_rg).
Are you sure the texture is not still in use for rendering in the pipeline when you update it?
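
On the measuring question: GL calls can return before the GPU has done the work, so a timer around glTexSubImage3D alone may not measure the copy. Below is a minimal sketch of a timing helper, assuming you synchronise before and after the timed loop. All names here (upload_fn, sync_fn, time_uploads, gib_per_second, count_call) are made up for illustration. In a real test the upload callback would issue glTexSubImage3D with the PBO bound to GL_PIXEL_UNPACK_BUFFER and the sync callback would call glFinish(); the stub below only counts calls so the sketch runs without a GL context.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback types. In a real renderer `upload` would call
   glTexSubImage3D from the bound PBO and `sync` would call glFinish(). */
typedef void (*upload_fn)(void *user);
typedef void (*sync_fn)(void *user);

/* Convert a byte count and a duration in seconds into GiB/s. */
double gib_per_second(size_t bytes, double seconds)
{
    assert(seconds > 0.0);
    return ((double)bytes / (1024.0 * 1024.0 * 1024.0)) / seconds;
}

/* Wall-clock time in seconds (C11 timespec_get; not a monotonic clock). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Time `repeats` uploads, synchronising before and after the loop so that
   earlier queued work is not counted and the uploads have really finished. */
double time_uploads(upload_fn upload, sync_fn sync, void *user, int repeats)
{
    assert(repeats > 0);
    sync(user);                      /* drain work queued before the test */
    double start = now_seconds();
    for (int i = 0; i < repeats; ++i)
        upload(user);
    sync(user);                      /* wait for the uploads to complete */
    return now_seconds() - start;
}

/* Stand-in callback for running without a GL context: counts invocations. */
void count_call(void *user)
{
    ++*(int *)user;
}
```

With real callbacks you would then compute something like gib_per_second(repeats * bytes_per_upload, elapsed). Several repeats per measurement help to average out timer resolution.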

Use big textures. There is always some work in the driver for each update, no matter how big the texture is, so small updates are dominated by that overhead. Use something like 1024x1024 with luminance8 packing.
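
To illustrate why the size matters, here is a rough sketch of a simple cost model, assuming each update costs a fixed driver overhead plus bytes divided by peak bandwidth. The model and the function name effective_bandwidth are my own assumptions for illustration, not something measured on this driver, and any numbers you plug in are hypothetical.

```c
#include <assert.h>

/* Effective throughput in bytes/s of one update under the assumed model:
   time = overhead_s + bytes / peak_bytes_per_s. */
double effective_bandwidth(double bytes, double overhead_s,
                           double peak_bytes_per_s)
{
    assert(bytes > 0.0 && overhead_s >= 0.0 && peak_bytes_per_s > 0.0);
    return bytes / (overhead_s + bytes / peak_bytes_per_s);
}
```

For example, with a hypothetical 1 ms overhead and a 1e9 bytes/s peak, a 1e6 byte update reaches only about half the peak in this model, while a 1e8 byte update gets close to it.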

Test a 512x512 2D texture with an ABGR pixel format and compare the results.