Build a mipmap chain for the original texture, then use a pixel shader to find the maximum of each group of 4 adjacent pixels (a 2x2 block) in the finer mip level and write the result to the next coarser level. Repeat until you reach the last mip, which has only one pixel.
I did it this way, but I have no idea whether it is the most efficient approach.
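For reference, here is a minimal CPU-side sketch in C++ of what each pass in that mip-style reduction computes. It is not shader code and not necessarily how the poster implemented it. The `Level` struct and the function names are made up for illustration, and the coarse size is rounded up with clamped reads so odd-sized levels lose no texels (OpenGL mip sizes round down, so a real shader would need extra taps on odd-sized levels).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container for one single-channel "mip level", row-major.
struct Level {
    std::size_t width;
    std::size_t height;
    std::vector<float> texels;
};

// One reduction pass: every coarse texel becomes the max of the 2x2 block
// of fine texels it covers. On the GPU this would be one full-screen pass.
Level reduceMaxOnce(const Level& fine) {
    assert(fine.width > 0 && fine.height > 0);
    assert(fine.texels.size() == fine.width * fine.height);

    Level coarse;
    // Assumption: round up so odd-sized levels keep their last row/column.
    coarse.width = (fine.width + 1) / 2;
    coarse.height = (fine.height + 1) / 2;
    coarse.texels.resize(coarse.width * coarse.height);

    for (std::size_t y = 0; y < coarse.height; ++y) {
        for (std::size_t x = 0; x < coarse.width; ++x) {
            // Clamp the second tap to the edge of the fine level.
            const std::size_t x0 = 2 * x;
            const std::size_t y0 = 2 * y;
            const std::size_t x1 = std::min(x0 + 1, fine.width - 1);
            const std::size_t y1 = std::min(y0 + 1, fine.height - 1);

            const float a = fine.texels[y0 * fine.width + x0];
            const float b = fine.texels[y0 * fine.width + x1];
            const float c = fine.texels[y1 * fine.width + x0];
            const float d = fine.texels[y1 * fine.width + x1];
            coarse.texels[y * coarse.width + x] =
                std::max(std::max(a, b), std::max(c, d));
        }
    }
    return coarse;
}

// Repeat the pass until only a 1x1 level remains; that texel is the maximum.
float reduceMaxToOnePixel(Level level) {
    while (level.width > 1 || level.height > 1) {
        level = reduceMaxOnce(level);
    }
    return level.texels[0];
}
```

Calling `reduceMaxToOnePixel` on a level holding the full-resolution values returns the same number the 1x1 mip would contain, which makes it handy for checking a shader implementation against a known-good result.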
The “most” efficient? Well, that’s probably going to be with an OpenCL or CUDA kernel crunching the depth buffer (but this is not a “beginners” solution). I’ve done that using OpenCL to implement Sample Distribution Shadow Maps aka SDSM. With an OpenCL/CUDA solution, you have the control to pack work into the cores so they’re all always busy, have them cooperate and do thread synchronization, use local GPU core shared memory to accelerate the reduction, tune your memory accesses to avoid shared memory bank conflicts, unroll loops, delegate multiple blocks of work to GPU cores, etc.
It may sound hard, but it’s actually not, because the vendors have already coded this up. NVidia’s got source code out there to implement 1D reductions, which you can extend to 2D without much trouble.
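To give a rough idea of the pattern those 1D reductions follow, here is a sequential C++ emulation, not actual OpenCL/CUDA code and not the vendor sample itself. Each "work-group" loads a chunk into a local array and halves the active range every step, then the per-group results are reduced again. `kGroupSize` and the function names are hypothetical. A row-major 2D buffer can be fed to it as a plain 1D array, which is one simple way to get from 1D to 2D.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical work-group size; a real kernel would tune this per GPU.
constexpr std::size_t kGroupSize = 256;

// Emulates one work-group: copy up to kGroupSize values into "local memory",
// then fold the upper half onto the lower half until one value is left.
float reduceGroupMax(const float* data, std::size_t count) {
    assert(count > 0 && count <= kGroupSize);
    float local[kGroupSize];
    const float lowest = std::numeric_limits<float>::lowest();
    for (std::size_t i = 0; i < kGroupSize; ++i) {
        // Pad unused slots with the identity element for max.
        local[i] = (i < count) ? data[i] : lowest;
    }
    for (std::size_t stride = kGroupSize / 2; stride > 0; stride /= 2) {
        // On a GPU, all i < stride would run in parallel here,
        // followed by a work-group barrier before the next step.
        for (std::size_t i = 0; i < stride; ++i) {
            local[i] = std::max(local[i], local[i + stride]);
        }
    }
    return local[0];
}

// Multi-pass reduce: each pass shrinks the data by a factor of kGroupSize.
float reduceMax(const std::vector<float>& buffer) {
    assert(!buffer.empty());
    std::vector<float> current = buffer;
    while (current.size() > 1) {
        std::vector<float> partial;
        for (std::size_t base = 0; base < current.size(); base += kGroupSize) {
            const std::size_t count =
                std::min(kGroupSize, current.size() - base);
            partial.push_back(reduceGroupMax(current.data() + base, count));
        }
        current.swap(partial);
    }
    return current[0];
}
```

The real kernels add the tricks mentioned above (bank-conflict-free indexing, unrolling, several elements per thread), but the data flow is the same.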
Just a few notes on this approach: AFAIK neither CUDA nor OpenCL can read from an OpenGL depth buffer yet, so you end up rendering depth to an R32F color buffer as well, and doing the GPU reduce on that (clCreateFromGLTexture2D). Also, at the time I did this, ARB_cl_event / cl_khr_gl_event weren’t supported by the NVidia CL/GL drivers (don’t know if they are now) which essentially required a full glFinish before flipping to CL and clFinish before flipping back to GL (heavyweight sync). The solution was still amazingly fast, but these are two shortcomings that may reduce the advantage of an OpenCL/CUDA solution.
As a first cut (baseline performance case), just do a glReadPixels of the entire depth buffer and do the crunch on the CPU. I don’t know what GPU(s) you’re targeting, but on NVidia you’ll be amazed at how fast this can be on a pre-Fermi (e.g. GTX285) card with a decent CPU. On these cards, I just use this approach because it’s so fast and simple. However, they’ve crippled readback perf on Fermis (e.g. GTX480+), so this approach is a real loser there with a decent-size depth buffer, which is what got me looking at an OpenCL/CUDA solution (probably their goal when they crippled readback; grrr…). You can try reducing your depth buffer res, though.
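The CPU side of that baseline is tiny. A minimal sketch in C++, assuming the depth buffer has already been read back as floats into a `std::vector<float>` (the readback call is shown only as a comment so the snippet builds without a GL context). `DepthRange` and `depthRange` are names I made up, and it returns both min and max since that is what SDSM wants.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical result type: the depth bounds of the frame.
struct DepthRange {
    float minDepth;
    float maxDepth;
};

// Assumes `depth` was filled by a readback along the lines of:
//   std::vector<float> depth(width * height);
//   glReadPixels(0, 0, width, height,
//                GL_DEPTH_COMPONENT, GL_FLOAT, depth.data());
DepthRange depthRange(const std::vector<float>& depth) {
    assert(!depth.empty());
    DepthRange range{depth[0], depth[0]};
    // Single linear pass over the buffer.
    for (float d : depth) {
        range.minDepth = std::min(range.minDepth, d);
        range.maxDepth = std::max(range.maxDepth, d);
    }
    return range;
}
```

Note that pixels still holding the clear value (typically 1.0) will count toward the max, so you may want to skip those depending on what you need the range for.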
However, for a fast solution other than ReadPixels (which isn’t fast on Fermi+) that doesn’t require OpenCL or CUDA, using GLSL shaders to do the reduce in a MIPmap-like generation style as robotech_er suggested is simple and probably pretty fast, though it doesn’t make optimal use of the GPU cores and memory. Possibly good enough for your needs though. Look at NV_texture_barrier for one option to speed up the ping-pong reduction.
As a GLSL++ approach, you can probably use some of the new goodies in ARB_shader_image_load_store to accelerate things. This brings some of the cool features from OpenCL into GLSL. I need to spend some time soon getting my head around this extension.