I believe there are two sources of gains in current Z implementations:
- Hierarchical Z approaches
- Compressed Z approaches
I believe they are conceptually orthogonal (though you can probably get extra efficiencies by combining them in clever ways).
Here’s my mental model of hierarchical Z:
A coarse grid (say, on an 8x8 or 16x16 basis) stores the highest and lowest Z values found within each block, possibly at some lower precision like 16 bits (with appropriate conservative rounding). At that point, Z testing can in many cases be done with a single simple comparison that throws away an entire block (64 or 256 pixels; you’d probably even get decent gains at 4x4).
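Here’s a minimal sketch of that block-level test, assuming a “less than” depth function and hypothetical names (`HiZBlock`, `hiz_test`); the real hardware layout is surely different, but the structure of the comparison is the point:

```c
#include <stdint.h>

/* Hypothetical coarse-grid entry: one per 8x8 block of the depth buffer.
   Min is rounded down and max rounded up so the 16-bit values remain
   conservative with respect to the full-precision Z in the block.        */
typedef struct {
    uint16_t min_z;   /* lowest Z in the block  */
    uint16_t max_z;   /* highest Z in the block */
} HiZBlock;

typedef enum {
    HIZ_REJECT,       /* every pixel in the block fails the Z test        */
    HIZ_ACCEPT,       /* every pixel in the block passes the Z test       */
    HIZ_AMBIGUOUS     /* must fall through to the per-pixel test          */
} HiZResult;

/* Block-level test for a "less than" depth function: compare the incoming
   triangle's Z range over this block against the coarse min/max. One
   comparison can accept or throw away 64 pixels at once.                 */
static HiZResult hiz_test(const HiZBlock *b,
                          uint16_t tri_min_z, uint16_t tri_max_z)
{
    if (tri_min_z >= b->max_z) return HIZ_REJECT;  /* behind everything here  */
    if (tri_max_z <  b->min_z) return HIZ_ACCEPT;  /* in front of everything  */
    return HIZ_AMBIGUOUS;
}
```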
Here’s my mental model of Z compression:
A block of Z values (say, 4x4 or 8x8) is compressed using some mechanism that can be lossless if Z is “well behaved”. If lossless compression cannot be achieved, the uncompressed Z is stored in memory instead. When the memory controller reads the data back in, it decompresses it on the fly if the block is compressed. You have to reserve memory for a full, uncompressed block across the entire framebuffer, because the compressibility of each block can change quickly. The win is that the memory controller needs to read much less data when the block is compressed, so you get a speed-up as long as the actual memory transfer is your bottleneck.
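A sketch of that storage and read path, with a deliberately trivial “codec” (a block compresses only if every Z value is identical, e.g. a freshly cleared block); the names and the 8x8 block size are assumptions, and a real scheme would be far smarter, but the lossless-or-fallback structure is the same:

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

#define BLOCK_PIXELS (8 * 8)              /* 8x8 block of 32-bit Z values  */
#define BLOCK_BYTES  (BLOCK_PIXELS * 4)

/* Hypothetical per-block state. A full-size slot is always reserved,
   because a block can fall back to uncompressed at any time; the win is
   only in how many bytes the memory controller actually has to move.     */
typedef struct {
    bool     compressed;
    uint32_t uniform_z;                   /* used when compressed          */
    uint32_t raw[BLOCK_PIXELS];           /* used when not compressed      */
} ZBlock;

/* Trivial placeholder codec: lossless only for fully uniform blocks.     */
static bool z_try_compress(const uint32_t *z, uint32_t *uniform_out)
{
    for (int i = 1; i < BLOCK_PIXELS; ++i)
        if (z[i] != z[0])
            return false;                 /* not "well behaved": store raw */
    *uniform_out = z[0];
    return true;
}

/* Write path: attempt lossless compression; on failure, store raw.       */
static void zblock_store(ZBlock *b, const uint32_t *z)
{
    b->compressed = z_try_compress(z, &b->uniform_z);
    if (!b->compressed)
        memcpy(b->raw, z, BLOCK_BYTES);
}

/* Read path: a compressed block costs 4 bytes of transfer instead of 256
   and is expanded on the fly; an uncompressed block is read in full.     */
static void zblock_load(const ZBlock *b, uint32_t *z_out)
{
    if (b->compressed)
        for (int i = 0; i < BLOCK_PIXELS; ++i)
            z_out[i] = b->uniform_z;
    else
        memcpy(z_out, b->raw, BLOCK_BYTES);
}
```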
Possible synergies: Use hierarchical Z values to drive the interpolation for compression, a la DXT5 compression. Use the hierarchical Z data to determine whether the block is compressed or not.
Another possible Z compression model would be to pick one value, store some number of derivatives off that value, and then store a per-pixel offset from the implied surface, very similar to ADPCM for audio.
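A sketch of that ADPCM-like idea, assuming an 8x8 block, a planar predictor built from the anchor’s two neighbors, and 8-bit residuals (all of these are illustrative choices, not a known hardware format); the encode fails and falls back to raw storage whenever a pixel’s offset from the implied plane doesn’t fit:

```c
#include <stdint.h>
#include <stdbool.h>

#define W 8
#define H 8

/* Hypothetical ADPCM-style block: an anchor Z, two derivatives describing
   the implied plane, and a small per-pixel residual off that plane.
   Size here: 12 bytes of plane + 64 bytes of residuals vs. 256 bytes raw. */
typedef struct {
    int32_t anchor;               /* Z at pixel (0,0)                       */
    int32_t dzdx, dzdy;           /* first derivatives of the implied plane */
    int8_t  residual[W * H];      /* per-pixel offset from the plane        */
} ZPlaneBlock;

/* Try to encode a block. The derivatives are taken from the anchor's
   neighbors; encoding is lossless only if every pixel's offset from the
   implied plane fits in the 8-bit residual, otherwise the caller should
   store the block uncompressed.                                           */
static bool zplane_encode(const int32_t z[W * H], ZPlaneBlock *out)
{
    out->anchor = z[0];
    out->dzdx   = z[1] - z[0];
    out->dzdy   = z[W] - z[0];

    for (int y = 0; y < H; ++y) {
        for (int x = 0; x < W; ++x) {
            int32_t predicted = out->anchor + x * out->dzdx + y * out->dzdy;
            int32_t offset    = z[y * W + x] - predicted;
            if (offset < -128 || offset > 127)
                return false;     /* surface not planar enough: fall back   */
            out->residual[y * W + x] = (int8_t)offset;
        }
    }
    return true;
}

/* Decoding just re-evaluates the plane and adds the residual back in.     */
static void zplane_decode(const ZPlaneBlock *in, int32_t z_out[W * H])
{
    for (int y = 0; y < H; ++y)
        for (int x = 0; x < W; ++x)
            z_out[y * W + x] = in->anchor + x * in->dzdx + y * in->dzdy
                             + in->residual[y * W + x];
}
```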
I’m pretty sure I don’t have all the details right here, but these models have, so far, served me well in predicting behavior, so I’ll stick to them.