Hierarchical Occlusion Culling

Jose_Goruka · November 1, 2008, 11:07am

Hi! Recently I stumbled upon the Leadwerks engine (http://www.leadwerks.com/) and read the “paper” and feature list for their engine, which seemed pretty normal until I reached this section:

Leadwerks Engine uses per-pixel hierarchal occlusion culling performed on the GPU. There’s no need for complicated BSP systems or visibility compiling; If you can’t see it, it doesn’t get drawn.

So my question is… how is this possible? I know you can do conditional rendering and occlusion queries… but don’t occlusion queries force a flush if you want to read back the result from GL? (thus hurting performance) It doesn’t make much sense to me… how else could something like this be implemented?

EDIT: I’ll explain better. I know what occlusion queries and conditional rendering are, so there’s no need to send me to read about them; the question is about the “hierarchical” part of the statement. I know basing your engine solely on standard occlusion queries and declaring portals/cells obsolete is kind of ridiculous… BUT! My understanding from their statement is that they do something like this (otherwise I don’t know why they claim that the occlusion culling is hierarchical):

cull octree, whatever
sort results front to back
for every result:
a) test object occlusion (reading back from GL)
b) if occluded, go up the octree testing the parent octants and mark the occluded ones as non-visible, then go down marking their child octants and leaves as non-visible. Typical implementation of HZB+octree.
c) if not, draw the object and go back to a)

This in theory CAN make portals/cells a lot less needed… but as far as I know, even if it can be implemented, reading back the occlusion query value forces a flush and the cost is high… so my question is basically…

Can this method be implemented, given that it’s higher-level than what GPUs are used to?
Are the Leadwerks guys claiming something I don’t understand? Or is it just a full-of-it claim that conditional rendering makes portals, cells, PVS, etc. obsolete?

Simon_Arbon · November 1, 2008, 5:59pm

but don’t occlusion queries force a flush if you want to read back the result from GL?

When you call the commands to enable the occlusion test and render the bounding boxes, these commands are placed into the GPU command queue and the call returns to your program so it can queue the next command.
If you then immediately read back the query result (glGetOcclusionQueryuivNV, or glGetQueryObjectuiv with GL_QUERY_RESULT in the core API), your CPU thread will block and wait until the GPU has finished processing its command queue and the result is available.
The command queue is now empty, so the GPU stops and waits for you to send it something else to do.
This would be bad, very very bad.
The solution is simply not to query the result immediately but instead to keep adding rendering commands to the GPU command queue until the occlusion query has been processed and the result is ready for you to read.

There are several ways to do this; the simplest is to:

Render the occlusion test
Render something else
Get the occlusion result

This will stop the GPU from stalling, but your CPU thread may still block if the result isn’t ready.
To stop the CPU from blocking you can use the GL_PIXEL_COUNT_AVAILABLE_NV parameter (GL_QUERY_RESULT_AVAILABLE in the core occlusion query API) to find out if the result is ready yet.
Often you will want to test occlusion for several bounding boxes at the same time.
You should create a separate occlusion query object for each one and render all of them before you try to get the first result back.
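The polling pattern described above can be sketched without a GL context. This is only a toy model: `FakeGpu`, its methods, and the pixel counts passed in are all made up for illustration, and `result_available` merely plays the role of the "result available" check. It is not a real OpenGL binding.

```python
from collections import deque


class FakeGpu:
    """Toy stand-in for a GPU command queue (hypothetical, not real OpenGL)."""

    def __init__(self):
        self.queue = deque()   # commands submitted but not yet executed
        self.results = {}      # query id -> visible pixel count, once executed

    def submit_query(self, query_id, visible_pixels):
        # Queue an occlusion test; returns immediately, like a GL command.
        self.queue.append(("query", query_id, visible_pixels))

    def submit_draw(self, name):
        # Queue ordinary rendering work that keeps the GPU busy.
        self.queue.append(("draw", name, None))

    def tick(self, commands=1):
        # Pretend the GPU executes up to `commands` queued commands.
        for _ in range(min(commands, len(self.queue))):
            kind, ident, pixels = self.queue.popleft()
            if kind == "query":
                self.results[ident] = pixels

    def result_available(self, query_id):
        # Non-blocking check: has this query been executed yet?
        return query_id in self.results

    def get_result(self, query_id):
        # A real driver would block here; the toy insists on polling first.
        if query_id not in self.results:
            raise RuntimeError("result not ready; a real driver would stall")
        return self.results[query_id]


def classify_boxes(gpu, boxes, other_work):
    """Issue one query per box, then poll while feeding the GPU other work."""
    for box_id, pixels in boxes.items():
        gpu.submit_query(box_id, pixels)      # all queries go in up front

    pending = set(boxes)
    visible = set()
    work = deque(other_work)
    while pending:
        if work:
            gpu.submit_draw(work.popleft())   # keep the queue fed
        gpu.tick()
        for box_id in [b for b in pending if gpu.result_available(b)]:
            if gpu.get_result(box_id) > 0:
                visible.add(box_id)
            pending.discard(box_id)
    return visible
```

For example, `classify_boxes(FakeGpu(), {"a": 120, "b": 0, "c": 7}, ["hud", "sky"])` reads each result only after the availability check has passed, so the CPU side never hits the blocking path.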

Note that the GPU command queue can actually buffer several FRAMES worth of commands.
You may be sending OpenGL rendering commands that won’t actually be drawn until the frame after next, which means that there can be a very long delay until the results of an occlusion query are ready.
To avoid the occlusion query becoming obsolete by the time you get the result, you should send no more than a single frame of rendering between queuing the occlusion query and waiting for the result.

Jose_Goruka · November 2, 2008, 7:57am

The solution is simply not to query the result immediately but instead to keep adding rendering commands to the GPU command queue until the occlusion query has been processed and the result is ready for you to read.

Yeah… the problem with this is, I guess, that an occlusion query for the method I described becomes obsolete by the time the next object is drawn… It’s too bad.

So, if I understand what you explain properly… maybe a solution would be to, once per frame, do something like this? (given I am understanding occlusion properly)

first do a Z-only pass
cull the octree again and send some of the visible octants for occlusion queries, down to some level of course, not necessarily the leaves, since an octree can be huge
wait for the GPU, fetch the results, and then go and mark the non-visible nodes
draw again (using GL_EQUAL and not drawing the discarded nodes).

Maybe a system like this could be good enough as a general-purpose replacement for portals? Forcing one sync per frame doesn’t look like something bad at all, and games have been doing it for ages, since the Voodoo1, for stuff such as lens flares…

Any ideas?

Nicolas_Lelong · November 2, 2008, 2:51pm

The hierarchical occlusion culling system you originally mentioned may be based on a principle similar to this:

http://http.developer.nvidia.com/GPUGems2/gpugems2_chapter06.html

HTH

Simon_Arbon · November 2, 2008, 4:18pm

No, it becomes obsolete the next FRAME (and possibly not for several frames if nothing moves very fast).
The query is valid until the camera or the occlusion bounding box moves far enough to change which objects are occluded.

Render the main occluding objects (buildings, tunnel walls, and other objects close to the camera) in a Z-only pass.
Render a separate occlusion query (a bounding box) for each region of the world that would otherwise have been separated by portals.
Call glFlush
Wait for any one of these queries to become available.
If that bounding box was invisible then everything in that region can be ignored (cull the octree branch).
If the bounding box had visible pixels then render occlusion queries for the next level in the octree (bounding boxes of large objects that are inside of that region).
Repeat as the results for each outstanding query become available.
When a branch of the octree reaches actual objects (the leaf nodes) then render those objects with a Z-only pass.
When the last occlusion query has returned its result and the Z-only passes are done, render to the back buffer all of the leaf nodes that were not culled.
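The traversal in the steps above can be sketched on the CPU side as follows. This is one possible reading, not a definitive implementation: `Node`, `hierarchical_cull`, and the `visible_pixels` field (standing in for whatever a real query on that node's bounding box would report) are hypothetical names, and results are assumed to arrive in the order the queries were issued.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    visible_pixels: int    # assumed result of a query on this node's box
    children: list = field(default_factory=list)

    @property
    def is_leaf(self):
        return not self.children


def hierarchical_cull(regions):
    """Return (leaf names that survive culling, number of queries issued)."""
    in_flight = deque(regions)   # queries issued, results still outstanding
    issued = len(in_flight)
    to_render = []
    while in_flight:
        node = in_flight.popleft()           # a result has become available
        if node.visible_pixels == 0:
            continue                          # box hidden: cull whole branch
        if node.is_leaf:
            # Real code would render this object in the Z-only pass here.
            to_render.append(node.name)
        else:
            in_flight.extend(node.children)   # query the next octree level
            issued += len(node.children)
    return to_render, issued
```

Note that the children of a hidden region are never queried at all, which is where the saving over testing every object comes from.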

The glFlush does not cause a synchronisation, it just ensures that all of the commands you previously placed in the command queue are sent to the GPU for processing.
The GPU will not stall as long as you have several queries being processed at the same time.

To reduce the total number of queries you could also use the results from the previous frame as the starting point in the octree for the current frame, because the occlusion will not change very much in a fraction of a second.
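One way to picture this reuse is to remember where last frame's traversal stopped and start the next frame's queries from there. The sketch below assumes a hypothetical `cull_from` helper, a `children` dict describing the octree, and a `pixels` dict standing in for this frame's query results; none of these come from a real API.

```python
from collections import deque


def cull_from(start_nodes, children, pixels):
    """Query from `start_nodes` downward.

    Returns (visible leaves, frontier to start from next frame, query count).
    `pixels[name]` is the assumed query result for that node this frame.
    """
    in_flight = deque(start_nodes)
    visible_leaves, frontier, queries = [], [], 0
    while in_flight:
        node = in_flight.popleft()
        queries += 1
        kids = children.get(node, [])
        if pixels[node] == 0:
            frontier.append(node)         # hidden: traversal stops here
        elif not kids:
            visible_leaves.append(node)   # visible leaf: render it
            frontier.append(node)
        else:
            in_flight.extend(kids)        # visible interior node: go deeper
    return visible_leaves, frontier, queries
```

Feeding the returned frontier back in as `start_nodes` on the next frame skips the interior nodes that were already known to be visible. This simple version never moves the frontier back up the tree, so a real system would occasionally restart from the root or merge hidden siblings.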

Occlusion queries can provide a good performance boost for large worlds with many thousands of objects, or if you are running on old hardware.
On the latest hardware, however, the GPU is so fast that it may empty its command queue faster than the CPU can process the results and send the next commands.
Hence, if the above is not fast enough and you need every drop of performance, conditional rendering should be used instead of having the CPU query the results and make the decisions.
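The difference with conditional rendering is that the draw/skip decision stays on the GPU. In OpenGL this is done with glBeginConditionalRender / glEndConditionalRender (or the NV_conditional_render extension). The toy model below only mimics the idea; `build_frame`, `run_gpu`, and the command tuples are invented for illustration.

```python
from collections import deque


def build_frame(objects):
    """CPU side: queue a box query and a conditional draw per object.

    `objects` is a list of (name, box_pixels) pairs, where box_pixels is the
    assumed result of the bounding-box query. The CPU never reads a result.
    """
    commands = deque()
    for query_id, (name, box_pixels) in enumerate(objects):
        commands.append(("query", query_id, box_pixels))
        commands.append(("draw_if", query_id, name))
    return commands


def run_gpu(commands):
    """Toy GPU: execute the queue in order and return the names drawn."""
    query_results = {}
    drawn = []
    for kind, query_id, payload in commands:
        if kind == "query":
            query_results[query_id] = payload
        elif kind == "draw_if":
            # Decision made on the "GPU" side, with no CPU readback.
            if query_results[query_id] > 0:
                drawn.append(payload)
    return drawn
```

Because the CPU only ever appends commands, there is no point in the frame where it has to wait on a query, at the cost of losing the hierarchical early-out that CPU-side decisions allow.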