Tree selection for object culling

nimelord · March 5, 2019, 3:33am

Hi, guys!

Which of trees has better performance for object culling open in world scene: octree or r-tree?

Have a good day!

nimelord · March 5, 2019, 6:24am

r-tree can’t be used because of intersections.

Dark_Photon · March 6, 2019, 4:54am

What? That doesn’t make any sense.

You didn’t really state what kind of culling you want to do, anything about the dataset you are culling (e.g. does it even fit within RAM, for example), whether your dataset is static or dynamic, or the cost of the work you’re trying to prevent.

For 3D realtime graphics, the traditional answer when choosing a Spatial Acceleration Data Structure is between two main classes of techniques: Spatial Subdivision (SS) and Bounding Volume Hierarchies (BVHs). If your scene is static and you precompute time is a non-issue, you may be able to get a tighter fit and more efficient culling with SS. However, if your scene is dynamic, you care about the cost of adding/removing nodes on-the-fly and the resulting performance of the data structure after modification. There, BVHs can often yield better overall performance.

Also, you don’t have to just pick one. Use whatever method makes the most sense for the type of data you’re culling and the cost of the work you’re trying to prevent. Read up on k-D trees and bounding sphere trees, and consider those alongside the other data structures in your toolkit.

nimelord · March 19, 2019, 12:13pm

Photon, thank you very much for the basic info

nimelord · March 25, 2019, 11:05am

I found good article about octree with clear description.

So, a node can contains 8 or less subnodes and unlimited number of renderable objects.

I need to provide octree on the GPU side (SSBO) because of some reasons.
One of them - (occlusion) culling implemented on the GPU side.
For subnodes I can use array with length 8.
But huw can I provide unlimited number of renderable objects on the node in SSBO?

Obviously I can’t produce many different SSBOs, it should be just one for all nodes.

Thank you for any help.

GClements · March 26, 2019, 12:50am

Use the same structures as you would for client-side code, but use indices instead of pointers. The main options all boil down to storing an index into an array of integers. At that index is either

A count, followed by that many object IDs.
A list of object IDs terminated by a sentinel value (e.g. 0 or 2^32-1).
An object ID and either a sentinel value to indicate that it’s the last element in the list, or an index to the next object ID. IOW, a linked list.

nimelord · March 29, 2019, 10:00am

It is looks ok as a solution.
Thank you!

But new problem appeared with octree on the GPU side.

My prev. implementation was culling all objects of scene in order.

On the rendering step (from transform feedback result) I had object ids sorted by meshes and I had occlusion query result for every mesh.
Thus I was known what mesh I should bind for rendering.

In new way (using octree) I will receive unsorted ids of visible objects.
How to know which mesh I should bind?

nimelord · April 2, 2019, 7:46am

Do exists a possibility to do management of octree on the GPU side?

I can serialize whole octree to the SSBO, it is not problem for me anymore.

But two points still are problems:

1 - result of occlusion culling via transform feedback (TF) is a list of visible object indices. How can I provide info which VBO I should bind for render current object after TF?

2 - I can’t realize how can I cull nodes and objects? Node can contains some subnodes also an list of own objects.
Should I make two pass algorithm?:

build list of all visible nodes.
build list of all visible objects related with visible nodes.

Or it can be implemented as one pass algorithm?

Do modern applications use GPU side implementation of octree or they prefer use to CPU side management of it?

nimelord · April 8, 2019, 7:44am

I found solution. I can be GS with several output streams and GS instancing:
In OpenGL 4.0, GS’s gained two new features:

One was the ability to write to multiple output streams. This is used exclusively with transform feedback, such that different feedback buffer sets can get different transform feedback data.
The other feature was GS instancing, which allows multiple invocations to operate over the same input primitive. This makes layered rendering easier to implement and possibly faster performing, as each layer’s primitive(s) can be computed by a separate GS instance.

But I still don’t know how to provide mesh info on the rendering stage.
For example:
I have only one visible node.
That node contains 11 objects.
Culling rejects 4 of them.
Query result returns 7.
How I know which meshes I should bind for those 7 objects on the rendering step?

Thanks for any help.

Dark_Photon · April 8, 2019, 12:11pm

It’s unclear why you think you’ve found a solution when you’re not sure whether this solves your problem.

Also, given that you’re culling and haven’t mentioned LOD binning, it’s unclear what you use you plan to put the Multiple output streams capability of geometry shaders to, much less Geometry Shader Instancing. Could you clarify?

By what I infer from your posts, if you apply geometry shaders here for culling, all you need is a non-instanced geometry shader writing to a single output stream accepting/writing points primitives and utilizing selective emission based on whether a primitive culls in or not, as Rastergrid (@aqnuep) describes here: Instance culling using geometry shaders.

To render with the correct mesh afterwards, you’ve got options:

You can do separate geom shader “culling” runs per mesh (i.e. only cull-in the instances referencing that mesh; the rest cull out). Then your output data is pre-binned by mesh. N geom shader culling batches (draw calls) yields N lists of culled instances, 1 per mesh.
If you have a small number of meshes, you could use the Multiple output streams capability of geometry shaders to bin instances for different meshes into different output streams (see Rastergrid’s post on that). However, the max num streams is very limited (4), so this doesn’t scale to large numbers of meshes (except using num_passes = num_meshes / 4, which is awkward and only a constant factor). Consider a compute shader here.
If your meshes differ only in which texture is used to render them, you can either 1) output the bindless texture address for the instance from your geometry shader or, 2) output the instance ID from the geom shader and in the later shading run of that instance do an instance ID → texture address lookup to get the texture. However, I’m assuming when you say different meshes, you really mean it (different vertex data and/or connections between those vertices, so this is probably not applicable).
You could always do a post-culling pass on the GPU to bin your culled “mixed mesh” instance list into separate buffers based on the mesh ID for each instance ID. Consider a compute shader here.
As a fallback to this, in the shading pass you could throw all of the instances at all of the meshes, and have their vertex shader throw away instances which don’t refer to the current mesh ID. This wastes vertex shader threads however. Since you’re focusing on culling efficiency, that’s probably not what you want.

Just keep in mind when using geometry shaders that they can be fine for 100s to 1000s of prims. But try to scale them up to something like all the primitives on the screen, and you are unlikely to be happy with the results. The way they map to the hardware and the graphics pipeline makes it difficult for the graphics driver to implement them efficiently.

nimelord · April 9, 2019, 7:15am

First of all, thanks for deteiled answer!

Actually I have two basic problems:

one pass culling (I mean I don’t want to cull nodes one by one, because such process looks as not efficiency at all. Maybe I’m wrong here.)
provide meshes for result ssbo with instances. (Relating that point, You exact understood what I meant.)

I plane to use GS instansing and multiple output streams for first my problem.

This is example of GS (not finished yet):

#version 430

layout(points) in;
layout(invocations = 10) in;
layout(points, max_vertices = 1) out;

struct Bb {
  vec4 nBb;
  vec4 pBb;
};

struct TreeNode {
  int[8] children;
  Bb bb;
  int parentId;
  int itemArrayStartId;
};

layout (std430, binding = 20) buffer TreeBlock {
  TreeNode node[];
} tree;

in int nodeIndex[1];
flat in int objectVisible[1];

layout(stream = 0) out stream0 {
  out int visibleCurrentNodeIndex;
};

layout(stream = 1) out stream1 {
  out int visibleNodeIndex;
};

layout(stream = 2) out stream2 {
  out int visibleItemIndex;
};

void main() {

  if ( objectVisible[0] == 1) {

    TreeNode node = tree.node[nodeIndex[0]];
    if (gl_InvocationID < 8 && node.children[gl_InvocationID] != -1) {
      // emit child if it present (index is not -1)
      visibleNodeIndex = node.children[gl_InvocationID]; 
      EmitStreamVertex(1);
      EndStreamPrimitive(1);
      return;
    } else if (gl_InvocationID == 8) {
      // emit current node if it visible and has any scene object
      visibleCurrentNodeIndex = nodeIndex[0];
      EmitStreamVertex(0);
      EndStreamPrimitive(0);
      return;
    } else if (gl_InvocationID == 9) {
      // emit scene object
      // it needs to be finished:
      // I want to emit all objects of this node if it visible
      // But GS instansing is limited.
      // So I need to run shader several times for current node if number of own objects more than GS limit.
      // Also I need to find possibility to provide last processed object index: It seems I can't bind SSBO for reading and writing at the same time. 
      visibleItemIndex = node.itemArrayStartId;
      EmitStreamVertex(2);
      EndStreamPrimitive(2);
      return;
    }
  }
}

It is possible to bind SSBO for reading and writing at the same time?
Relating to writing I mean something like that: tree.node[nodeIndex[0]].lastProcessedObjectIndex = calculatedValue;

I thought about that, but I have thousands of meshes, thus ssbo limits don’t let me do that.

Dark_Photon · April 11, 2019, 1:14am

Yes. But accesses to it are incoherent, so you have to synchronize yourself.