GPU Cull and LOD Performance

I implemented GPU culling and LOD selection using transform feedback. I use query buffer objects to write the instance counts into a GPU-side buffer, then issue indirect draw calls to draw the instances. Even when no instances are written to the instance buffers, I still have to issue the indirect draw calls, because the CPU code doesn’t know the instance counts. I’ve noticed that when all the instance counts are 0, the draw calls still seem to suck up a lot of frame time.
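For readers following along, here is a minimal sketch of the command layout an indirect draw reads from the buffer bound to GL_DRAW_INDIRECT_BUFFER. It assumes the indexed variant (glDrawElementsIndirect); the post does not say whether the elements or arrays form is used. The helper name instanceCountOffset is hypothetical and only shows where the GPU-written count would have to land.

```cpp
#include <cstddef>
#include <cstdint>

// Layout read by glDrawElementsIndirect / glMultiDrawElementsIndirect:
// five tightly packed 32-bit values per command.
struct DrawElementsIndirectCommand {
    std::uint32_t count;          // indices per instance (known on the CPU)
    std::uint32_t instanceCount;  // filled in on the GPU by the cull/LOD pass
    std::uint32_t firstIndex;     // first index within the index buffer
    std::int32_t  baseVertex;     // signed offset added to each index
    std::uint32_t baseInstance;   // first instance for instanced attributes
};

static_assert(sizeof(DrawElementsIndirectCommand) == 20,
              "indirect commands must be tightly packed");

// Hypothetical helper: byte offset of instanceCount for the command at
// commandIndex, i.e. where the GPU-side count must be written.
constexpr std::size_t instanceCountOffset(std::size_t commandIndex) {
    return commandIndex * sizeof(DrawElementsIndirectCommand)
         + offsetof(DrawElementsIndirectCommand, instanceCount);
}
```

Because the CPU never reads instanceCount back, it cannot skip a draw whose count turns out to be 0, which is why every indirect draw is still issued.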

Has anyone else seen this behavior? Is this a driver issue? I’m using a GeForce 780.

Thanks.

How many instance groups and LODs per group are we talking about rendering here max? And how much time are we talking about per instance group (time for both passes vs. just rendering the first pass, when all of the instances cull-out)?

Seeing a code snippet would make it easier to offer more concrete (and relevant) suggestions. You might post a GL call trace dump of the cull/LOD prepass and the final draw calls for at least 2 groups of instances (e.g. instance group 1 (all LODs), instance group 2 (all LODs)).

I think the key here is determining what your biggest bottleneck is w.r.t. this slow-down, and then determining what you can do about it. Since nearly all of the driver work is deferred until the draw call, it’s not surprising that the bottleneck appears to be there.

For instance, are the state changes needed to prepare for the instanced indirect draw calls (e.g. program changes, texture binds, other uniform setup) the chief cause of the slow-down?

Is it vertex attribute setup for those draw calls (buffer binds, etc.)?

Is some explicit or implicit synchronization being triggered here that could be the culprit (e.g. barriers, CPU blocks, or ghosting caused by modifying buffer objects and/or textures that are already in flight)? (You’re doing something like this, right?)

Is there some GPU uploading going on in the steady-state case?

Depending on what you find, different ideas come to mind.

Thanks for your response. Yes, I’m doing exactly what you linked. After the initial upload of instance data, everything should be generated on the GPU, with no round trips to the CPU. You make a good point about draw call setup. I’ll do some more investigating and see what I find. However, the added frame time to draw 0 instances is quite high, around 5-10 ms. Only about a thousand instances are submitted to the transform feedback shader.
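To put the 5-10 ms figure in per-draw terms, here is a rough sketch of the arithmetic. The thread never states how many instance groups or LODs per group are drawn, so the counts used in the note below are purely hypothetical, and perDrawCostUs is an illustrative name, not part of any API.

```cpp
// Average cost of one indirect draw in microseconds, given the extra frame
// time (in milliseconds) spent issuing draws whose instance counts are all 0.
// Assumes one indirect draw per (group, LOD) pair.
constexpr double perDrawCostUs(double addedFrameMs, int groups, int lodsPerGroup) {
    const int draws = groups * lodsPerGroup;
    if (draws <= 0) {
        return 0.0;  // no draws issued, nothing to attribute
    }
    return addedFrameMs * 1000.0 / draws;
}
```

With a hypothetical 100 groups and 3 LODs each (300 draws), 5-10 ms would work out to roughly 17-33 microseconds per empty draw. Plugging in the real counts would show whether the cost looks like ordinary per-draw setup overhead or something larger, such as a stall.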