PBO and texture loading/swapping speed

imported_Budric · June 2, 2009, 2:08pm

Hi,
this is another PBO question. PBOs are often suggested to improve texture loading speed. I’m having trouble understanding how exactly they help.

Currently my problem is two 3D textures that don’t both fit into video memory. Two texture objects are generated and loaded at startup, before rendering. The rendering function renders an image using one of the textures, then the next call to the function uses the other texture, so the driver has to swap them in and out. This is very slow and I’ve been looking for a way to speed it up.

I’ve read the following threads:
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=256344
and
http://www.opengl.org/discussion_boards/…true#Post257896
The suggestions are to use a PBO to map the driver’s kernel memory to user space and load the data in a separate thread. Would that help in my case? I think I’ve already loaded the texture with glTexImageXX() at startup…

Thanks for any clarification.

ZbuffeR · June 2, 2009, 3:23pm

In your case the 2 textures are static, so PBO will not help.
What is your VRAM size? And the size of each texture?

Unfortunately, with only 2 textures reordering will not help either, but with more, alternating the texture order means only the missing textures have to be transferred, instead of all of them.
Example:

VRAM can hold 2 textures, but 3 have to be used.
Naive rendering:

frame 1:
render using texture 1
render using texture 2
render using texture 3 <-- needs upload, it will replace the oldest texture, which is #1

frame 2:
render using texture 1 <-- needs upload, it will replace the oldest texture, which is #2
render using texture 2 <-- needs upload, it will replace the oldest texture, which is #3
render using texture 3 <-- needs upload, it will replace the oldest texture, which is #1

etc…

Optimized rendering:
frame 1:
render using texture 1
render using texture 2
render using texture 3 <-- needs upload, it will replace the oldest texture, which is #1

frame 2:
render using texture 3
render using texture 2
render using texture 1 <-- needs upload, it will replace the oldest texture, which is #3

etc… Steady state needs 3 times fewer texture uploads.
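
The two orderings above can be checked with a small simulation. This is only a sketch under stated assumptions: all textures are the same size, and the driver evicts the least recently used texture, which real drivers are not guaranteed to do. The names LruVram and steadyStateUploads are made up for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>

// Hypothetical model of VRAM: holds `capacity` equally sized textures and
// evicts the least recently used one when a new texture must be uploaded.
class LruVram {
public:
    explicit LruVram(std::size_t capacity) : capacity_(capacity) {
        assert(capacity >= 1);
    }

    // Marks `texture` as used. Returns true if it had to be uploaded first.
    bool use(int texture) {
        auto it = std::find(resident_.begin(), resident_.end(), texture);
        if (it != resident_.end()) {
            // Already resident: just move it to the most recently used slot.
            resident_.erase(it);
            resident_.push_back(texture);
            return false;
        }
        if (resident_.size() == capacity_) {
            resident_.pop_front();  // evict the least recently used texture
        }
        resident_.push_back(texture);
        return true;
    }

private:
    std::size_t capacity_;
    std::list<int> resident_;  // front = least recently used
};

// Returns the number of uploads in the last simulated frame (steady state).
// With alternate == true, every second frame renders in reverse order.
int steadyStateUploads(int textureCount, std::size_t capacity,
                       bool alternate, int frames = 10) {
    LruVram vram(capacity);
    int uploads = 0;
    for (int frame = 0; frame < frames; ++frame) {
        uploads = 0;
        const bool reversed = alternate && (frame % 2 == 1);
        for (int i = 0; i < textureCount; ++i) {
            const int texture = reversed ? textureCount - i : i + 1;
            if (vram.use(texture)) {
                ++uploads;
            }
        }
    }
    return uploads;
}
```

For 3 textures and room for 2, steadyStateUploads(3, 2, false) gives 3 uploads per frame and steadyStateUploads(3, 2, true) gives 1, matching the listing above.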

EDIT: one way to do this in your case would be to split the textures into several parts, if your application can accommodate multipass rendering.
Can you give more details on your use case?
Maybe some form of compression can help? Or lower mipmap levels?

imported_Budric · June 3, 2009, 7:15am

Thanks, your example makes sense. Basically it’s a volume renderer. You can load several volumes and turn on one or several at a time. The case of 2 being swapped is a simplified case I used to track down why I was going from ~20 fps to ~0.3 fps when switching volumes, even though I’ve already loaded both with glTexImage3D().

I can try to implement multipass rendering. Actually I have the Real-Time Volume Graphics book, and it does discuss bricking and using PBOs to speed up transfers via DMA. But as I said, I don’t see where the potential speed increase comes from. If I need the whole volume to render the final image, and the driver is currently transferring at, say, 40 MB/second when it swaps, why would breaking the volume into several chunks and doing several passes help?

Thanks.

ZbuffeR · June 3, 2009, 8:03am

The ‘Optimized rendering’ I proposed above only helps for low values of VRAM overload, say when you need 110% or 120% of the actual installed VRAM.
It will not really help if you need 3, 4, or more times the available texture memory.
PBOs can allow you to micro-manage the onboard texture data and, if done very cleverly, might provide improvements in case the driver was doing stupid things.

"the driver is transferring at say 40 MB/second when it’s doing the swap to memory currently"

Seems quite low.

You did not answer:
What is your VRAM size?
And the size of each texture?

Your hardware/target hardware, OS, etc.?

But don’t expect high performance for out-of-GPU-core rendering…

imported_Budric · June 3, 2009, 9:19am

I’m developing on an ATI Radeon 2400 Pro with 256 MB of VRAM. Each texture is 512x512x512 at 16 bits per voxel, so 256 MB (unsigned short in system memory). The internal format passed to glTexImage3D is GL_LUMINANCE12 (I was hoping to save some VRAM, because 12 bits is enough resolution for the data, but 8 is not). There is no target hardware yet; the performance measurements should help determine that.
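
The size arithmetic can be written out as a small sketch. The helper names below are hypothetical, and how the driver really stores GL_LUMINANCE12 is not known here, so both a 16-bit and a tightly packed 12-bit layout are considered as assumptions.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t kMiB = 1024ull * 1024ull;

// Bytes needed for a dense w x h x d volume at `bitsPerTexel` bits per texel.
std::uint64_t volumeBytes(std::uint64_t w, std::uint64_t h, std::uint64_t d,
                          std::uint64_t bitsPerTexel) {
    assert(bitsPerTexel > 0);
    return w * h * d * bitsPerTexel / 8;
}

// True if `count` volumes of `bytesEach` fit into `vramBytes` at the same time.
// Ignores the framebuffer and anything else that also lives in VRAM.
bool allResident(std::uint64_t count, std::uint64_t bytesEach,
                 std::uint64_t vramBytes) {
    return count * bytesEach <= vramBytes;
}
```

At 16 bits per texel one 512x512x512 volume is 256 MB, and even at a packed 12 bits it would be 192 MB, so under either assumption two volumes cannot both stay resident in 256 MB of VRAM.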

The code snippet that I’m timing is:


// Using my own GLSL shader
m_pSliceShader->useShaderProgram();
/*
Calls glActiveTextureARB() and glBindTexture() if the GL texture is initialized.
If not initialized, calls glGenTextures() and glTexImage3D().
*/
volumeData->glActivateTexture(GL_TEXTURE0_ARB);
m_pSliceShader->setUniform1i("volumeTexture", 0);

glDisable(GL_CULL_FACE);
glPolygonMode(GL_FRONT_AND_BACK, GL_FILL);
SW_RESTART();  // just a #define that uses a global timer variable
/**
This call uses some CPU, but it's insignificant compared to what happens when the glBegin(GL_TRIANGLE_FAN) call occurs.
*/
renderProxyGeometrySlices(viewVec, samplingRate, volumeData);
SW_ELAPSED_MSG("Render timer:");

The timed code takes anywhere from 3 to 4.8 seconds. For a 256 MB texture, that’s roughly 50 to 85 MB/s of transfer. I should note that this piece of code gets called with an alternating volumeData pointer. This is just a timing test I set up.
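
The throughput estimate is just the texture size divided by the measured time. A minimal sketch, assuming the whole 256 MB volume is re-uploaded on every timed call and that the timer really covers the transfer (mibPerSecond is a made-up helper name):

```cpp
#include <cassert>
#include <cstdint>

// Effective throughput in MiB/s for moving `bytes` in `seconds`.
double mibPerSecond(std::uint64_t bytes, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(bytes) / (1024.0 * 1024.0) / seconds;
}
```

With 268435456 bytes (256 MB), 4.8 seconds works out to about 53 MB/s and 3 seconds to about 85 MB/s.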

Bruce_Wheaton · June 5, 2009, 9:10pm

The speedup comes from minimizing the connection between the GPU and CPU. When a PBO texture transfer is started, the CPU can actually go about its job, and the GPU will get to it when it’s ready. With a normal texture upload, both the CPU and GPU need to stop what they’re doing and make the transfer. With your initial load, that would be fine, but if your images are too big for VRAM, and you need to dynamically upload smaller PBOs to re-stitch later, you get a nice (potential) speedup with almost no extra code.
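
As a rough sketch of the "smaller pieces" idea, the function below only computes the sub-regions (bricks) a volume could be cut into. The Brick struct and splitIntoBricks are hypothetical names, and the brick size is an arbitrary assumption. The actual transfer is left as a comment, since it needs a live GL context.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// One sub-region of the volume: texel offset and extent along each axis.
struct Brick {
    int x, y, z;
    int width, height, depth;
};

// Splits a width x height x depth volume into bricks of at most
// brickSize texels per side. Edge bricks are clipped to the volume.
std::vector<Brick> splitIntoBricks(int width, int height, int depth,
                                   int brickSize) {
    assert(width > 0 && height > 0 && depth > 0 && brickSize > 0);
    std::vector<Brick> bricks;
    for (int z = 0; z < depth; z += brickSize) {
        for (int y = 0; y < height; y += brickSize) {
            for (int x = 0; x < width; x += brickSize) {
                bricks.push_back({x, y, z,
                                  std::min(brickSize, width - x),
                                  std::min(brickSize, height - y),
                                  std::min(brickSize, depth - z)});
            }
        }
    }
    return bricks;
}

// Each brick b could then be copied into a buffer bound to
// GL_PIXEL_UNPACK_BUFFER and uploaded with
//   glTexSubImage3D(GL_TEXTURE_3D, 0, b.x, b.y, b.z,
//                   b.width, b.height, b.depth, format, type, offset);
// where `offset` is a byte offset into the bound buffer, not a CPU pointer.
```

With 128-texel bricks, a 512x512x512 volume splits into 64 bricks of 4 MB each at 16 bits per texel. Whether uploading them one at a time is actually faster depends on the driver.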

Bruce

Bruce_Wheaton · June 5, 2009, 9:14pm

Oh, I should also note that once you’re breaking things down, you shouldn’t pick the texture format based on memory. That really doesn’t help you anyway: the driver will convert to something else on the fly, then upload and store it however it likes. Focus on transfer speed instead. There’s an app called transfer bench that lets you run a suite of tests. It’s often best to tell the GPU you’re sending 8-bit BGRA or 16-bit RGBA and then work with it the way you want in a shader.

Bruce

Edit: why am I still a newbie? That’s lame.

imported_Budric · June 8, 2009, 7:03am

Thanks Bruce. I will try that application out.