Early Stencil Rejection

Hey Peoples!

Searching around this site and others, it seems that early stencil rejection (i.e. rejection before fragment shader processing) is non-existent. I do not like the sound of that at all!

I'm currently looking into the prospects of deferred shading, and feeling quite keen about the whole deal. I've performed some very trivial stencil tests with arbitrarily sized triangles over a very shader-intensive full-screen render, but performance does not increase at all (whereas a scissor test most definitely does!). I've tested that with forward rendering to the back buffer directly and with an FBO.

But I'm still confused. Take, for example, this NVIDIA presentation on deferred shading: http://download.nvidia.com/developer/pre…red_Shading.pdf

It clearly states that stencil culling is a possible optimization. But if the whole region to be rendered (say, the bounding sphere of a point light) that passes the depth test gets shaded and only then rejected by the stencil test, how is that an optimization at all?

There must be something I'm missing here.

For the record, I'm running an AGP 7800 GS with NVIDIA drivers 100.14.19 on amd64 Linux.

Cheers for any input guys!

Getting culling to work is rather tricky. I've never managed it with stencil; early-Z (depth culling) does work, but sometimes disables itself when I wish it wouldn't.

All you can really do is try different combinations until you notice the speedup. For instance, writing to the stencil buffer while also rejecting based on stencil doesn't work.

I’m sure there are valid reasons for these deficiencies, but I do wish more control over them was possible. Sometimes things which should obviously be fine in most use-cases aren’t implemented because there’s one case where some logic would be broken if they were. Let the programmer work around that, I say.

I seem to recall that on NVIDIA you don't get early stencil rejection if you are updating the stencil value at the same time (e.g. you can only test against the stencil value).

I think ATI also allows updates, but check the tech docs from both companies.
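To make the "test only, never update" condition concrete, here is a minimal sketch of the GL state it implies. The function name and the reference value 1 are my own assumptions; the point is that all three glStencilOp arguments are GL_KEEP and stencil writes are masked off during the heavy shading pass.

```c
#include <GL/gl.h>

/* Sketch: configure the stencil unit as "test only" so early stencil
 * rejection has a chance to stay enabled, per the NVIDIA behaviour
 * described above. Assumes a context with a stencil buffer and that
 * the mask was already written by an earlier pass. */
static void bind_stencil_test_only(void)
{
    glEnable(GL_STENCIL_TEST);

    /* Pass only where the mask pass left a 1 in the stencil buffer. */
    glStencilFunc(GL_EQUAL, 1, 0xFF);

    /* The crucial part: never modify the stencil value during the
     * expensive pass -- all three operations are GL_KEEP... */
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);

    /* ...and mask off stencil writes entirely for good measure. */
    glStencilMask(0);
}
```

Writing the mask (with stencil ops that do update the buffer) and shading against it would then be two separate passes, switching state between them.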

Look at my post:

I'm currently implementing the Penumbra Wedges algorithm and the performance doesn't satisfy me. I found that disabling/enabling the stencil test (which is an important optimization, maybe even a requirement, for a fast implementation of this algorithm) didn't change performance at all! So I wrote a simple OpenGL test with full-screen rendering and a heavy shader to discover for myself how the early tests behave.

The results: ATI supports all the early tests without any tricks (as long as your OpenGL code fits the architecture of the graphics pipeline). NV supports early depth and scissor rejection very well, but has some trouble with early stencil rejection. It works in the main windowed context but doesn't work in a p-buffer or FBO. This was very bad news for me, because my favourite platform is NV, but I need an FP16 framebuffer, which is impossible for a window.

Recently, playing with the p-buffer configuration, I found that creating a double-buffered p-buffer and calling wglSwapLayerBuffers() enables early stencil rejection on NV! If I comment out the wglSwap… call, early stencil rejection is disabled again. What's more, playing with my test I found that early stencil is a limited resource on NV4x, just like the HiZ buffer on R500! If I create other p-buffers before my special p-buffer for early stencil testing, the speedup from the early test drops from a noticeable amount to zero. I think the early stencil "resource" is limited in terms of memory, and you should create the framebuffers you need for stencil rejection as early as possible.

I understand this sounds very strange, but I made these assumptions by looking at the behaviour of my application.

Sorry for my English; I'm a Russian guy and it's 1:51 AM here, not the best time for a clear mind…

Today I found that using early tests in the windowed framebuffer before the FBO disables early stencil rejection in the FBO. And vice versa: using early tests in the FBO before tests in the windowed framebuffer enables early stencil in the FBO and disables it in the window. But this doesn't affect the behaviour of single- and double-buffered p-buffers at all.

I'll probably write an e-mail to NVIDIA later asking about this very strange behaviour…

Thanks for the replies, guys, and sorry for the late response!

I'm continuing with the deferred shading engine, but I'm not going to worry about the early stencil rejection optimization for now.

Depending on a particular light's relationship to the current scene, you can still avoid most of the unnecessary shader work; e.g. when drawing the bounding sphere of a point light that is close to the viewer, rendering with front-face culling and a reversed (depth-fail) depth test will reject many distant fragments anyway, thanks to early Z.
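The depth-fail light-volume setup I mean could look something like this (a sketch only; draw_light_sphere() is a hypothetical helper, and whether GL_GEQUAL or GL_GREATER is right depends on your depth setup):

```c
#include <GL/gl.h>

/* Hypothetical helper that submits the light's bounding-sphere mesh. */
extern void draw_light_sphere(void);

/* Sketch of the trick described above: draw the point light's bounding
 * sphere with front faces culled and the depth test reversed, so
 * fragments where scene geometry lies in front of the sphere's back
 * faces are rejected by early Z before the lighting shader runs. */
static void shade_point_light(void)
{
    glEnable(GL_CULL_FACE);
    glCullFace(GL_FRONT);     /* keep only the sphere's back faces */

    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_GEQUAL);   /* "depth fail": pass only where the scene
                                 is at or in front of the back faces   */
    glDepthMask(GL_FALSE);    /* don't disturb the depth buffer        */

    draw_light_sphere();

    /* restore the usual state */
    glCullFace(GL_BACK);
    glDepthFunc(GL_LESS);
    glDepthMask(GL_TRUE);
}
```

Because only the depth test is involved here, this path doesn't depend on the fragile early stencil behaviour discussed above.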

If there is a sure way of "enabling" early stencil rejection with framebuffer objects, then I'm very keen to hear about it, not only for this particular problem but for other stencil systems too (stencil shadows, for example).

Thanks again!

Edit: Oh, nearly forgot. JoeDoe, did you end up writing that email to NVIDIA, and did you receive a reply?