depth buffer readout performance

Has anyone tried to maximize the readout performance of the z-buffer? What performance do you see, and on which hardware? What is a good combination of z-buffer format and destination buffer format?
E.g. I see very poor performance, less than 2 MB/s, with a 16-bit depth buffer and a short destination buffer on an X800 board :( Other combinations I tried don’t look any better. Is there any way to improve things? (A factor of 10 would be appreciated ;)


In a recent shadow mapping project, I achieved 200-300 FPS with a 6800GT on a 1024*1024 pbuffer with a 24-bit z-buffer. That’s about 600-900 MB/s, counting 3 bytes per depth value. I was using the ARB_depth_texture extension.
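For anyone who wants to check that figure, here is a rough sketch of the arithmetic. The function name is made up for illustration, and it assumes each 24-bit depth value is counted as 3 bytes and that 1 MB means 2**20 bytes; neither is stated in the post.

```python
def depth_rate_mb_per_s(width, height, bytes_per_pixel, fps):
    """Depth data touched per second, in MB (assuming 1 MB = 2**20 bytes)."""
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps / 2**20

# Assumption: a 24-bit depth value is counted as 3 bytes per pixel.
low = depth_rate_mb_per_s(1024, 1024, 3, 200)
high = depth_rate_mb_per_s(1024, 1024, 3, 300)
print(low, high)  # 600.0 900.0
```

Note that this only counts how much depth data is produced per second. Whether it is comparable to a readback into system memory depends on whether the data ever leaves the card, which the numbers alone don't tell.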

I’d rather not use a pbuffer for my app, as it creates a huge, unnecessary overhead. Also, on R3xx hardware at least, the pbuffer was slower than CopyTexSubImage.
In fact, I just tried glReadPixels on the z-buffer on a Quadro FX 1000 board, and the performance I get there seems sufficient. So what does it take to get it fast on ATI hardware? Meanwhile I also tried copying the z-buffer with CopyTexSubImage (using ARB_depth_texture), but even this is extremely slow on my X800 (again 2 MB/s), although it should not even require a transfer over the bus!
More ideas anyone?