Early Z rejection

Not necessarily; there are setup overheads if a triangle isn’t trivially rejected, and those have nothing to do with pixel fill limitations.

First, there’s nothing strange about 4 FPS in this test. I don’t know the screen resolution; let it be 1024x768 (just for estimating). A GeForce 6800 GT has 6 pipelines x 350 MHz, so with a fragment shader of about 676 instructions it can render 350000000*6/(1024*768)/676 ≈ 3.95 FPS. A software renderer might be 1000 times slower =).

Further, I think this test is generally incorrect.
I don’t know whether it’s a driver bug or someone else’s :wink: , but early Z culling should be exercised differently. Try the following: first render a full-screen quad at a nearer Z with depth writes on and color writes off, and then in a second pass render it with depth test GL_EQUAL, depth writes off and color writes on. glClearDepth isn’t even needed.
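In GL state terms that would look roughly like this (just a sketch; drawFullscreenQuad() is a made-up helper, and the second pass would be bound to whatever expensive fragment shader the test uses):

/* Pass 1: lay down depth only -- quad at a nearer z, depth writes on, color writes off. */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
drawFullscreenQuad(0.25f);   /* made-up helper: full-screen quad at z = 0.25 */

/* Pass 2: the expensive shading, only where depth matches exactly. */
glDepthFunc(GL_EQUAL);
glDepthMask(GL_FALSE);       /* depth writes off */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
drawFullscreenQuad(0.25f);   /* same quad, heavy fragment shader bound here */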

PS: sorry for my english.

Originally posted by cppguru:
First, there’s nothing strange about 4 FPS in this test. I don’t know the screen resolution; let it be 1024x768 (just for estimating). A GeForce 6800 GT has 6 pipelines x 350 MHz, so with a fragment shader of about 676 instructions it can render 350000000*6/(1024*768)/676 ≈ 3.95 FPS. A software renderer might be 1000 times slower =).
Maybe I misunderstood, but I thought the 4FPS was measured for the shader-disabled test.

Originally posted by Robert Osfield:
Back on topic… I’ve also struggled to see much improvement in tests with early z tests. From the card manufacturers I’d love to see a proper explanation of how the early z tests are implemented and how to utilize them.

I disagree with Robert on many things :slight_smile: , but this one I can only agree with. Some good examples of how to utilize early Z would be VERY much appreciated. I’m not asking about how they are implemented (all that secrecy around HW implementations will prevent that from happening, even though it would be interesting), but this is a potentially extremely beneficial feature that is apparently not as trivial to utilize as it seems.

For example, the comment above that you have to turn off depth writes took me very much by surprise. I haven’t verified it yet, but something like that definitely needs to be documented better.

Simon, Humus, how about writing a little demo for this? Please!

A related question: does the early z (if you can get it to work :wink: ) also accelerate the occlusion query or only actual rendering? Does anybody have any experience with this?

Simon, Humus, how about writing a little demo for this? Please!
There is a demo on Humus’ site.

A related question: does the early z (if you can get it to work :wink: ) also accelerate the occlusion query or only actual rendering? Does anybody have any experience with this?
This is mentioned in the occlusion query spec: first draw a depth pass with occlusion queries enabled, then query the results to see which meshes are visible.
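Roughly like this (a sketch using GL_ARB_occlusion_query; drawOccluders(), drawBoundingBox() and drawMesh() are made-up placeholders):

GLuint query, samples;
glGenQueriesARB(1, &query);

/* Depth-only pass: fill the depth buffer with the big occluders. */
glEnable(GL_DEPTH_TEST);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
glDepthMask(GL_TRUE);
drawOccluders();                       /* placeholder */

/* Query a cheap bounding box against that depth buffer. */
glDepthMask(GL_FALSE);
glBeginQueryARB(GL_SAMPLES_PASSED_ARB, query);
drawBoundingBox();                     /* placeholder: proxy geometry for one mesh */
glEndQueryARB(GL_SAMPLES_PASSED_ARB);

/* Fetch the result (this blocks until it is available). */
glGetQueryObjectuivARB(query, GL_QUERY_RESULT_ARB, &samples);
if (samples > 0) {
    /* At least one sample passed, so the mesh is (partly) visible. */
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);
    drawMesh();                        /* placeholder: the full mesh */
}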

Originally posted by zed:
There is a demo on Humus’ site.

Ah, great! I hadn’t seen that before. It took some hacking to make it work on my Linux/nVidia system (is there really no way to set the X visual for GDK???), but it seems to work fine now (at least the ftob numbers are much bigger than the btof ones :wink: ). Thanks for the hint!

This is mentioned in the occlusion query spec: first draw a depth pass with occlusion queries enabled, then query the results to see which meshes are visible.
Hm, I can’t find any mention of early or hierarchical Z in the spec. Given that it’s going to be system-specific I didn’t expect to find any, really.

I’ve implemented a ray casting algorithm on NV40. I use early depth testing and depth bounds testing for computation masking. Rendering is done on a PBuffer. It works, though with some strange restrictions:

-If multiple pbuffers are used, it only works for the first pbuffer created (strange; maybe it’s my fault, but I could not find a way to make it work for the second pbuffer)
-It breaks down after context switching. Avoid even making the same context active again; just calling glActiveContext() kills the optimization.
-And some other known things: the depth function must be LESS or LEQUAL…

When depth bounds testing is activated, early depth culling works faster.
Has anyone tested early depth testing with an FBO? I wonder if it works correctly.
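For what it’s worth, the masking setup is roughly the following (only a sketch; computeMaskPass() and rayCastPass() are placeholders for my actual passes, and the depth bounds values are arbitrary):

/* Pass 1: write depth values that encode which pixels need ray casting. */
glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LESS);
glDepthMask(GL_TRUE);
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
computeMaskPass();                 /* placeholder: cheap pass marking active pixels */

/* Pass 2: restrict the heavy pass to the masked depth range. */
glDepthMask(GL_FALSE);
glDepthFunc(GL_LEQUAL);            /* early depth test wants LESS or LEQUAL here */
glEnable(GL_DEPTH_BOUNDS_TEST_EXT);
glDepthBoundsEXT(0.0, 0.5);        /* skip fragments whose stored depth is outside [0.0, 0.5] */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
rayCastPass();                     /* placeholder: the expensive fragment program */
glDisable(GL_DEPTH_BOUNDS_TEST_EXT);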

These don’t seem that strange when you consider that a coarse z scheme requires chip-resident cache, estimation of the farthest (and possibly nearest) z, and additional hardware-based tile-level comparisons (source tile nearest vs. coarse destination farthest), and so may support only a limited subset of operations for early rejection. That fast chip-resident coarse z memory would be a scarce resource and may only be available for one buffer. The usage could be primitive, with no paging & management etc., and so only a single buffer may be supported. Context switching may not back it up & restore it (that may not even be possible given the hardware).

On top of all this it’s going to be optimized to hit the benchmark cases, and even where possible a driver may not offer the coverage you would like for exotic comparison, multi-buffer and context-switching modes.
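To illustrate the idea (purely illustrative C, not any particular chip’s implementation; the 8x8 tile size and the per-tile stored farthest z are assumptions):

/* One coarse z record per 8x8 pixel tile (tile size is an assumption). */
typedef struct {
    float farthest_z;   /* farthest depth currently stored anywhere in the tile */
} CoarseZTile;

/* Tile-level early rejection for a GL_LESS-style depth test: if the
 * nearest z the incoming triangle can produce inside this tile is still
 * behind the tile's stored farthest z, no fragment in the tile can pass,
 * so the whole tile is skipped without per-pixel work. */
int coarse_z_reject(const CoarseZTile *tile, float tri_nearest_z)
{
    return tri_nearest_z >= tile->farthest_z;
}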

Originally posted by Zulfiqar Malik:
Originally posted by Humus

At those triangle loads I would expect you to be limited by vertex shading, not fragment shading, so I’m not surprised if you don’t see much of an increase.

My vertex shader is not heavy at all and I shouldn’t theoretically be limited by vertex shader performance, considering the number of vertexes current GPUs can process (my GPU is not exactly new :slight_smile: but it CAN theoretically process more than 250 million vertexes per second).

Your 5700 Ultra? Frankly, no.
I can squeeze 142Mverts/s through my vanilla 5700 (425MHz core, 275MHz memory), but there’s a setup limitation at 70MTris/s. I.e. that’s the peak for strips/fans, and the absolute maximum triangle rate the chip can support under whatever circumstances, at that clock speed.

I’d conclude that you’re doing pretty well on vertex performance with little room for improvement, if at all.

Seems that Humus’ Early Z rejection test program also works on nVidia hardware :slight_smile:

Originally posted by zeckensack:
Your 5700Ultra? Frankly, no.
I can squeeze 142Mverts/s through my vanilla 5700 (425MHz core, 275MHz memory), but there’s a setup limitation at 70MTris/s. I.e. that’s the peak for strips/fans, and the absolute maximum triangle rate the chip can support under whatever circumstances, at that clock speed.

I’d conclude that you’re doing pretty well on vertex performance with little room for improvement, if at all.

Well, I did achieve a throughput of around 92 MTris/s (triangle LISTS), but a lot of those were getting rejected at the rasterization stage. Still, that gives a good estimate of the number of vertexes the hardware can process. I read somewhere that the peak theoretical vertex rate for the 5700 Ultra is around 240 MVerts/s.

Btw, I recently got a 6800 GT and I crossed the sweet 100 MTris/s mark using my algorithm. I was easily getting around 105 MTris/s. A few more optimizations and I can increase that even further!