Frustum culling doesn't increase FPS

I have an application with a terrain, an instanced object with about 1000 instances, and several 3D objects in the scene. I have a low frame rate (about 30 FPS), so I thought I'd do frustum culling. I implemented frustum culling in the geometry shader for the instanced object and the terrain, but the FPS didn’t change; I still get 30 FPS. Was it wrong to implement it in the geometry shader?

Not necessarily.

What it sounds like is that you have not optimized your largest performance bottleneck.

Establish that first. Then seek to optimize it.

Thank you for the reply. How can I find what my bottleneck is? I know I render all my objects to buffers 6 times per frame. Is that a lot? Should I use Nsight?

There are various ways to do that. Probably the first thing to establish is whether your bottleneck is primarily CPU-side or GPU-side (the latter including the back-end driver).

Run a simple CPU-side profiler on your app, like Very Sleepy. It’ll tell you where your app seems to be spending most of its time. You can use that to spot obvious CPU-side bottlenecks that you didn’t know about.

If it’s spending most of its time down in GL calls or SwapBuffers, then it’s definitely worth running Nsight Systems and/or Nsight Graphics to see what exactly your app is doing and do some basic frame profiling and analysis.

Separate from these profiling tools, you can help nail down performance bottlenecks by changing one factor in your frame rendering which could be a bottleneck while keeping the others relatively constant. For instance, reducing/increasing the number of pixels in your render target. Or reducing/increasing the number of vertices per instance. If you vary that one factor and the total frame time is the same, then that isn’t the bottleneck.

Also for these test timings, you may find it useful to structure your frames like this:

  • Submit GL commands, SwapBuffers(), glFinish(), Capture time
  • Submit GL commands, SwapBuffers(), glFinish(), Capture time

The total frame time to render a frame is then the delta between the captured time for the end of the previous frame and the end of that frame. This total frame time will include all CPU-side and GPU-side time needed to render that frame. That way, you have a good basis to compare against as you tweak things in your app to look for the bottleneck.

NOTE: When you’re shooting for max rendering parallelism (max perf), you would skip the glFinish(). But when profiling, adding the glFinish() as shown above can help to avoid chasing ghosts related to driver queue-ahead behavior.
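A minimal sketch of that structure, assuming a GLFW window (adapt the swap/timing calls to whatever windowing layer you use); renderFrame() is just a placeholder for whatever GL commands your app submits each frame:

  #include <GLFW/glfw3.h>   // also pulls in the GL header by default
  #include <cstdio>

  void renderFrame();       // placeholder: your app's GL command submission

  void runTimedLoop(GLFWwindow* window)
  {
      glfwSwapInterval(0);                    // VSync off while profiling
      double prev = glfwGetTime();
      while (!glfwWindowShouldClose(window)) {
          renderFrame();                      // submit GL commands
          glfwSwapBuffers(window);            // SwapBuffers()
          glFinish();                         // wait for the GPU (profiling only)
          double now = glfwGetTime();         // capture time
          // Delta from end-of-previous-frame to end-of-this-frame:
          // total CPU-side + GPU-side time for this frame.
          printf("frame time: %.3f ms\n", (now - prev) * 1000.0);
          prev = now;
          glfwPollEvents();
      }
  }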

Also, note that the geometry shader stage happens after the vertex shader stage, so every vertex is still processed by the vertex shader even if the GS later discards its primitive.

Good to know. Thanks.

It looks like I’m not improving at all with the geometry shader. OpenGL probably does frustum culling after the vertex shader by default. I found some bottlenecks by enabling and disabling features: anti-aliasing was costing me about 20 fps at 8 samples, so I dropped it to 4, and my shadows were using a framebuffer with a 3072×3072 texture, which I reduced to 1024. With those two changes I can get 50 to 60 fps with 7 off-screen buffers. Are 7 off-screen buffers a lot?

There is the clipping stage.

You really shouldn’t be doing performance measurements by “fps”. Look at the actual time it takes to render a frame. Something being “20 fps” slower is meaningless without knowing the time it used to be. If you’re at 400fps, being “20 fps slower” is trivial.
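To put illustrative numbers on that: 400 fps is 2.5 ms per frame and 380 fps is about 2.6 ms, a difference of roughly 0.1 ms; but 40 fps is 25 ms per frame and 20 fps is 50 ms, a difference of 25 ms, even though both are “20 fps slower”.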

It does clipping. Any primitives which are entirely outside the frustum will be empty after clipping.

Application-based frustum culling is something you perform on larger chunks of data (e.g. objects, or distinct sections of a mesh) so that you can discard many primitives with a single test. Culling individual primitives using a geometry shader is likely to be a net loss.

Thank you all for your replies. Is it worth applying frustum culling on the CPU, so that the mesh sent to the GPU contains only the triangles that fall inside the view? That way the vertex shader wouldn’t run for every vertex of the mesh, which would probably help on the GPU side, but it would put a burden on the CPU, which would have to test every vertex of the mesh against the view and then upload the filtered mesh to the GPU as dynamic data every frame… ?

The vertex shader would have to be ridiculously expensive to make it worthwhile to cull individual primitives. Frustum culling is used when you can reject hundreds of triangles with a single bounding sphere/box test, i.e. you perform the test on each object or each chunk of the terrain.
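For concreteness, a minimal sketch of that kind of per-object test, assuming you already have the six frustum planes extracted from the view-projection matrix with their normals pointing inward (the exact data layout is up to you):

  struct Plane  { float a, b, c, d; };          // a*x + b*y + c*z + d = 0
  struct Sphere { float x, y, z, radius; };     // object's bounding sphere, world space

  // Returns false only if the sphere is entirely outside one of the planes,
  // in which case the whole object (hundreds of triangles) is skipped with
  // a handful of multiply-adds.
  bool sphereInFrustum(const Sphere& s, const Plane planes[6])
  {
      for (int i = 0; i < 6; ++i) {
          float dist = planes[i].a * s.x + planes[i].b * s.y
                     + planes[i].c * s.z + planes[i].d;   // signed distance to plane
          if (dist < -s.radius)
              return false;                               // fully outside: cull
      }
      return true;                                        // inside or intersecting: draw
  }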

What GClements said. Cull whole objects or groups of objects (e.g. instances), not individual primitives. Re GPU culling, same thing.

But do your own tests to convince yourself. Run your draw loop as-is with:

  1. No objects drawn
  2. All objects drawn, but positioned out-of-frustum
  3. All objects drawn, but positioned in-frustum.

Capture the frame times for each (with VSync off, of course), and compare them. What do you see?

(with 4-sample anti-aliasing enabled, cascaded shadows (3 draws), and water (2 draws: reflection and refraction))

  1. No objects drawn: 450 fps, delta time: 0.0017 sec
  2. All objects drawn, out of frustum: 65 fps, delta time: 0.016 sec
  3. All objects drawn, in frustum: 37 fps, delta time: 0.035 sec

What do you make of this?

If your “in frustum” test is using GPU-based frustum culling via a GS, odds are good that one of the following things is going on:

  1. You are not using indirect rendering to render the results of the GS’s frustum culling. That is, the data written by the GS is being read by the CPU, which requires a full CPU/GPU sync. That’s bad.

  2. Your GS is doing per-triangle culling, not per-object culling. The way GS culling is supposed to work is that each “primitive” you give the GS represents an entire object. The “primitive” the GS writes is the indirect rendering command that will be used to render it. The written data is delivered to a buffer via transform feedback, which is then used as the source for the indirect rendering operation (see the sketch after this list).

  3. You aren’t rendering enough stuff to be worth the overhead of GPU frustum culling. Nothing is free; doing GPU frustum culling is worthwhile only to the extent that it either frees up CPU that you could put to better uses or uses GPU resources that would otherwise go unused. This depends on both your implementation and your workload.
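As a rough CPU-side sketch of the flow described in #2 (cullingProgram, renderProgram, cmdBuf, and objectCount are placeholders; the GS in cullingProgram is assumed to test one object per input point and emit a draw command only for visible ones):

  // Command layout expected by glDrawArraysIndirect.
  struct DrawArraysIndirectCommand {
      GLuint count, instanceCount, first, baseInstance;
  };

  // Pass 1: culling. One point per object; the GS tests the object's bounds
  // and either writes an indirect command (captured into cmdBuf via
  // transform feedback) or emits nothing. The CPU never reads the results.
  glUseProgram(cullingProgram);
  glEnable(GL_RASTERIZER_DISCARD);                  // no rasterization needed here
  glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, cmdBuf);
  glBeginTransformFeedback(GL_POINTS);
  glDrawArrays(GL_POINTS, 0, objectCount);
  glEndTransformFeedback();
  glDisable(GL_RASTERIZER_DISCARD);

  // Pass 2: rendering, sourcing the command written by the GS.
  glUseProgram(renderProgram);
  glBindBuffer(GL_DRAW_INDIRECT_BUFFER, cmdBuf);
  glDrawArraysIndirect(GL_TRIANGLES, nullptr);      // simplest single-command case
  // With many objects you'd use glMultiDrawArraysIndirect, getting the draw
  // count from a transform-feedback primitive query (or ARB_indirect_parameters).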

@Alfonse_Reinheart

  1. No, I am not using indirect rendering.
  2. Yes, the GS is doing per-triangle culling.
  3. I am rendering a terrain of 170,000 vertices, 1000 instanced 3D objects, water, and several 3D objects. It’s not that much, it’s true.

I implemented per-object culling and the performance is much better now. Thank you all for your suggestions. One thing that still bothers me is that even without any objects on screen I don’t get thousands of fps… Is that something worth looking into?

Just out of curiosity, how do things change if you disable your GS?

What’s your current graphics card? What layer are you running your program on (i.e. GLUT, SFML, other graphics libs…)?

I have object culling on the CPU. I disabled the GS because it had no effect at all. I can post the code for you to check if you want.

I am trying everything on a laptop with a 9th-gen Core i7 and a GeForce GTX 1650. A mid-range graphics card, but it should still perform better. I am using GLFW as the windowing system, if that helps…

With 6 of the 7 framebuffers disabled I managed to get 1200 fps, with only the one buffer that draws to the texture shown on screen. But that’s still not the 4000 fps I’ve seen on other systems.

What are the timings of your culling?

Same systems? At this rate, you’re down to hunting for fractions of a millisecond that don’t matter.
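(For scale: 1200 fps is about 0.83 ms per frame and 4000 fps is about 0.25 ms, so the entire gap being chased is roughly half a millisecond per frame.)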

Timing for culling the 1000 instanced objects: 0.0053 sec