Retrieving average 32F depth buffer value

remdul · February 21, 2017, 4:02pm

What’s the (generally) best/fastest way to retrieve the average depth buffer value?

My idea was to generate depth buffer texture mipmaps on the GPU, and read back the lowest (1x1) mipmap level (waste less bandwidth). However, I’m using a 32-bit floating-point depth buffer (i.e. GL_DEPTH_COMPONENT32F_NV). Does GL/hardware even support mipmap generation for this format? Would it be slow or well-supported?

Also, would reading back a 1x1 mipmap incur a pipeline stall?

It would be an acceptable solution if a few pixels are dropped (i.e. copy the depth buffer to a half-res 8-bit FBO for mipmap generation) and if the result is delayed by 1 frame (for the real-time use case).

Would CUDA offer anything in this respect (for exact, non-real-time use case)?

Are there alternative approaches? I suspect some games out there do something similar with HDR frame buffers to handle auto-exposure.

Does anyone have experience with this? Any hints, links to papers covering similar territory are welcome.

Dark_Photon · February 22, 2017, 5:44am

That’s going to depend on your GPU+driver+workload.

Also, would reading back a 1x1 mipmap incur a pipeline stall?

If you try to read back the value just rendered immediately in blocking fashion after submitting draw commands for that render, there’ll be some stall. How long depends on your GPU+driver+workload.

For instance, suppose you’re on a desktop/discrete GPU and you’re 4ms into your frame but it’s going to take the GPU another 10ms to actually finish processing all the draw commands you’ve queued. Even on a desktop, expect to wait 10ms + time for the downsample (if applicable) and readback to get your result.

If you’re on a mobile GPU, it’s much worse, as they tend to do fragment work a frame later (when driven properly). So in the above case, expect to wait 16-32ms for the readback result if you’re vsynced and double-buffering. Also, this readback, if done in blocking fashion, triggers a full pipeline flush, so it can cause stuttering and sometimes visual artifacts.

My idea was to generate depth buffer texture mipmaps on the GPU, and read back the lowest (1x1) mipmap level (waste less bandwidth).

If it’s efficient in the driver, this could definitely reduce the amount of data that needs to be read back. This is particularly important for driver implementations (e.g. NVidia GeForce) that cripple readback performance on consumer cards (though there are a few tricks you can use).

However, regardless, your readback is still going to need to wait until the draw work completes, which is part of the cause of the stall.
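To make the mipmap idea concrete, here is a minimal CPU-side reference sketch in Python of what the 1x1 level would hold. It assumes a 2x2 box filter and a square power-of-two image; the filter a driver actually uses for mipmap generation is up to the implementation, so treat this as a model rather than a description of any particular driver. The function names are made up for illustration.

```python
def downsample_2x2(level):
    """Return the next mip level: each texel is the mean of a 2x2 block."""
    h, w = len(level), len(level[0])
    return [
        [
            (level[2 * y][2 * x] + level[2 * y][2 * x + 1]
             + level[2 * y + 1][2 * x] + level[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(w // 2)
        ]
        for y in range(h // 2)
    ]


def mip_chain_average(depth):
    """Reduce a square power-of-two depth image down to its 1x1 mip level."""
    size = len(depth)
    if size == 0 or size & (size - 1) or any(len(row) != size for row in depth):
        raise ValueError("expected a square image with power-of-two size")
    level = depth
    while len(level) > 1:
        level = downsample_2x2(level)
    return level[0][0]


# Hypothetical 4x4 depth image, values chosen to be exact in floating point.
depth = [
    [0.25, 0.25, 0.50, 0.50],
    [0.25, 0.25, 0.50, 0.50],
    [0.75, 0.75, 1.00, 1.00],
    [0.75, 0.75, 1.00, 1.00],
]
average = mip_chain_average(depth)  # equals the plain mean of all 16 texels
```

With power-of-two sizes and a box filter, the 1x1 level equals the plain mean. For other sizes, each level rounds its dimensions down, so some texels are dropped or weighted unevenly and the result is only approximate.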

Other options for this N->1 data reduction: besides MIPmap gen, you could do the reduction with a simple GLSL fragment shader, compute shader, or CUDA/OpenCL kernel, where you have complete control. (I’d recommend against OpenCL on NVidia’s implementation for missing sync reasons, last time I checked.) This can be useful when you need min/max or other non-average statistics (sometimes useful when crunching the depth buffer).
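As a rough illustration of the kind of reduction a custom shader or kernel could perform, here is a CPU-side Python sketch of a pairwise tree reduction that carries sum, min, max and count together. The pass structure mirrors how a parallel reduction is usually organized, but the function names and data layout are assumptions for illustration, not any particular GPU API.

```python
def combine(a, b):
    """Merge two partial results of the form (sum, min, max, count)."""
    return (a[0] + b[0], min(a[1], b[1]), max(a[2], b[2]), a[3] + b[3])


def reduce_depth_stats(values):
    """Pairwise tree reduction over a flat list of depth values."""
    if not values:
        raise ValueError("need at least one depth value")
    # Each input starts as its own partial result.
    partials = [(v, v, v, 1) for v in values]
    while len(partials) > 1:
        # One "pass": neighbouring pairs are merged independently,
        # which is the part a GPU would run in parallel.
        merged = [
            combine(partials[i], partials[i + 1])
            for i in range(0, len(partials) - 1, 2)
        ]
        if len(partials) % 2:
            merged.append(partials[-1])  # odd element carries over unchanged
        partials = merged
    total, lo, hi, count = partials[0]
    return {"avg": total / count, "min": lo, "max": hi}


stats = reduce_depth_stats([0.5, 0.25, 1.0, 0.75, 0.0])
```

Because the element count is carried along, this handles sizes that are not a power of two without dropping any values, which a plain mip chain does not guarantee.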

Are there alternative approaches? I suspect some games out there do something similar with HDR frame buffers to handle auto-exposure.

Here are a few.

To avoid the stall caused by waiting for this frame’s draw work to complete, read back and use the value from the “last” frame instead. This works if there is reasonable temporal coherence in your scenes. To implement, after rendering a frame, start the readback of the result into a PBO, but don’t pull it back to the CPU yet. That shouldn’t cause a stall. Next frame, read that result back to the CPU and use it in your computations. If done properly, this can completely avoid the stall induced by waiting for the current frame’s draw work to complete before you get a useful value.
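The bookkeeping for this one-frame delay can be sketched as follows. This is a plain Python model with no GL calls: the two slots stand in for a pair of PBOs used in alternation, and the class and method names are hypothetical. In a real renderer, the write step would be an asynchronous readback into the bound pixel pack buffer and the read step would map the other buffer.

```python
class DelayedReadback:
    """Models two readback slots used in alternation (like a pair of PBOs)."""

    def __init__(self):
        self.slots = [None, None]
        self.frame = 0

    def end_frame(self, gpu_result):
        """Queue this frame's result and return the previous frame's result."""
        write = self.frame % 2
        read = (self.frame + 1) % 2
        self.slots[write] = gpu_result  # stands in for the async copy into a PBO
        previous = self.slots[read]     # stands in for mapping the other PBO
        self.frame += 1
        return previous


readback = DelayedReadback()
# Hypothetical per-frame average depth values produced by the GPU.
seen_by_cpu = [readback.end_frame(v) for v in [0.1, 0.2, 0.3]]
```

The first frame has nothing to return yet, so the caller needs a sensible default (for example, reusing an initial guess) until the second frame.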

Another option would be to just keep the generated image statistics (avg, min, max, etc.) on the GPU, and have the GPU pick them up in a shader directly from GPU memory. Then there’s no need for the CPU to be involved in this readback operation. The GPU is generating data for the GPU, so it should pipeline well without a stall (assuming efficient drivers).

remdul · February 26, 2017, 5:08am

Thanks for the advice, that gives me some directions to explore.