GL_EXT_transform_feedback + lod selection + frustum culling = slow

I’m using a vertex and geometry shader to do lod selection and frustum culling into a buffer bound for transform feedback (GL_EXT_transform_feedback). I’m subsequently using that buffer as the instance source for a glDrawElementsInstanced() call, but I’ve currently got that bit disabled so I can specifically profile the lod selection and frustum culling stage.
The input attributes to the vertex shader are a 4x3 matrix (4*vec4) and a bounding sphere (1*vec4).
The input uniforms to the vertex shader are camera position (vec3) and frustum planes (6*vec4).
The vertex shader does the cull/lod tests, and outputs the 4x3 matrix and a ‘visible’ flag to the geometry shader.
The geometry shader only emits ‘points’ if the vertex shader ‘visible’ output is 1.
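
For reference, here is a minimal CPU-side sketch of the kind of test the vertex shader stage described above would perform per instance. It is not the poster's shader: the plane convention (unit normals pointing into the frustum), the three-level LOD scheme with its distance thresholds, and all names (`sphereVisible`, `selectLod`, `cullInstances`, `makeBoxFrustum`) are assumptions made for illustration only.

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec3 { float x, y, z; };

// Plane a*x + b*y + c*z + d = 0. Assumption: (a, b, c) is unit length and
// points towards the inside of the frustum.
struct Plane { float a, b, c, d; };

// Bounding sphere, laid out like the vec4 attribute (xyz = centre, w = radius).
struct Sphere { Vec3 center; float radius; };

// True if the sphere is at least partly on the inner side of all six planes.
bool sphereVisible(const Sphere& s, const std::array<Plane, 6>& planes) {
    for (const Plane& p : planes) {
        const float dist = p.a * s.center.x + p.b * s.center.y +
                           p.c * s.center.z + p.d;
        if (dist < -s.radius) return false;  // entirely behind this plane
    }
    return true;
}

// Distance-based LOD pick. The two thresholds are hypothetical parameters.
int selectLod(const Sphere& s, const Vec3& camera,
              float lod1Distance, float lod2Distance) {
    const float dx = s.center.x - camera.x;
    const float dy = s.center.y - camera.y;
    const float dz = s.center.z - camera.z;
    const float dist = std::sqrt(dx * dx + dy * dy + dz * dz);
    if (dist < lod1Distance) return 0;
    if (dist < lod2Distance) return 1;
    return 2;
}

// Stream compaction: keep only the indices of visible instances, which is
// what the geometry shader achieves by not emitting culled points.
std::vector<std::size_t> cullInstances(const std::vector<Sphere>& spheres,
                                       const std::array<Plane, 6>& planes) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < spheres.size(); ++i) {
        if (sphereVisible(spheres[i], planes)) visible.push_back(i);
    }
    return visible;
}

// Hypothetical stand-in "frustum" for trying things out: an axis-aligned
// box spanning [-h, h] on every axis, expressed as six inward-facing planes.
std::array<Plane, 6> makeBoxFrustum(float h) {
    return {{
        { 1.0f,  0.0f,  0.0f, h}, {-1.0f,  0.0f,  0.0f, h},
        { 0.0f,  1.0f,  0.0f, h}, { 0.0f, -1.0f,  0.0f, h},
        { 0.0f,  0.0f,  1.0f, h}, { 0.0f,  0.0f, -1.0f, h},
    }};
}
```

A CPU version like this is also handy as a baseline: timing `cullInstances` over the same instance set gives a number to compare against the transform feedback path.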

Problem is, I’m only getting approx 57 million points processed per second on a Quadro 4000.

Question is, is there some performance trick/caveat I should be aware of when doing this sort of thing?
Note that the code is not doing the GL_QUERY_RESULT (so not a stall problem) or the glDrawElementsInstanced() (so not related to the instancing API).

Thanks for any advice offered.

Geometry shaders do usually introduce some performance penalty, especially with transform feedback on NVIDIA cards, as far as I can tell.

Make sure rasterizer discard (GL_RASTERIZER_DISCARD) is enabled during the transform-feedback.

Your GPU isn’t old, so it can’t be plagued by GF8800-type geometry-shader slowness.

Yes, maybe I’m wrong, I don’t really know what generation the Quadro 4000 is. I should look it up.

Thanks guys. I see no other way of accelerating instancing without geometry shaders. I do have discard enabled, sorry, I should have said. I thought maybe some buffer flag was set incorrectly, but I’ve tried stream_draw, static_draw etc. and it makes no difference. I have to say, this whole transform feedback extension saga (the spec repo is awash with them!) looks like a bad joke.

Looks like GTX480 era (GF100), so a couple generations after GeForce 8.

The Quadro 4000 isn’t a very fast card, though. It’s a GF100 with only half the shaders enabled (256) running at a lowly 475MHz (the GeForce GTX 480 is 480 shaders @ 700MHz). 57 million points/sec doesn’t sound overly unreasonable to me, having used a 4000 for quite a while.

Could you test agnuep’s demo, that does exactly the same:

On startup (without navigating with mouse/keyboard), it gets 46fps while culling and drawing millions of instances on my GTX 550 Ti, which is of similar architecture and power.
P.S. With the 4000’s specs, I’d expect it to be able to process at least 500 million tris/s.

Thanks, yes, I’ve tried that nature demo before. I just ran it again and did the same calculation, with the same results as my renderer: approx 50 million instances per second culled (10,000 tree instances + 250,000 grass instances = 260,000 total instances at 180fps = 46.8 million instances per second). That’s slightly slower than my results because that demo is doing the feedback count query and actually drawing the instances.
Ah well, I guess I have to just swallow the fact that it’s only slightly faster than doing the culling/lodding on the cpu, mapping a VBO using the buffer orphaning technique, and pushing the instances up every frame. Which is crazy if you think about it.

Actually, forget what I just said. If, in the nature demo, I fly down under the grass so nothing passes the cull test, I get 590fps, which means it’s culling on the GPU at a rate of 153 million instances per second, so roughly 3 times what I’m getting. Right, now to compare the code…
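
The throughput figures in this thread all come from the same arithmetic: instances per frame times frames per second. A small sketch, using only the numbers quoted above (the function name `instancesPerSecond` is just an illustrative helper):

```cpp
#include <cstdint>

// Instances processed per second = instances per frame * frames per second.
constexpr double instancesPerSecond(std::uint64_t instancesPerFrame, double fps) {
    return static_cast<double>(instancesPerFrame) * fps;
}

// Figures quoted in the thread for the nature demo (260,000 instances).
constexpr double demoWhileDrawing = instancesPerSecond(260000, 180.0);  // 46.8 million/s
constexpr double demoCullOnly     = instancesPerSecond(260000, 590.0);  // 153.4 million/s

// Rate reported for the original renderer, taken as-is from the first post.
constexpr double ownRenderer = 57.0e6;

// How much faster the demo culls when nothing is drawn (about 2.7x).
constexpr double speedup = demoCullOnly / ownRenderer;
```

Note that the 590fps case includes no drawing at all, so it measures the cull pass plus per-frame overhead, which makes it the fairer number to compare against a renderer with the draw call disabled.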

It should be pretty fast. Also, since the introduction of GL_AMD_query_buffer_object you no longer have to stall to get the query result: the feedback operation can be fully pipelined, so you could get even higher frame rates.