transform feedback + glDrawElementsInstanced

peterfilm · July 11, 2012, 6:10am

To avoid the query object stall when combining EXT_transform_feedback with glDrawElementsInstanced, the recommendation seems to be to use the ARB_draw_indirect extension. But for the life of me I can’t find any information on how to get transform feedback to populate the GL_DRAW_INDIRECT_BUFFER that the extension’s new functions need.
I’ve seen people talk about OpenCL, but how do I get OpenGL’s transform feedback mechanism to do it?
thanks.

(I’ve deliberately littered this post with the keyword breadcrumbs I’ve been searching with for people with the same question!)

aqnuep · July 11, 2012, 6:37am

What do you mean by query object stall with transform feedback and DrawElementsInstanced exactly? What’s your use case? Do you feed back vertex array data or instance data using transform feedback?

If you feed back vertex array data then you should use DrawTransformFeedback to do a non-indexed rendering of the fed back vertex array data.

If you feed back instance data then you would need atomic counters in the vertex shader or geometry shader, though I’m not aware of any driver supporting non-fragment shader atomic counters currently.
However, on AMD hardware you can use the new GL_AMD_query_buffer_object extension to feed back the result of a primitive query to a draw indirect buffer in a non-blocking manner. Example #4 in the spec might be just what you are looking for.

peterfilm · July 11, 2012, 6:46am

yes i’d just been reading the AMD_query_buffer_object extension just now! spooky. Frustratingly this extension is not supported on the nvidia quadro 4000 even though it’s exactly what i need (example #4 could have been written with me in mind).
yes i’m trying to do frustum culling and lod selection on the gpu, just as you have done in your demos and just as I talk about in my other forum thread (where the question was performance).
now I’ve got everything writing to multiple streams, one stream for each lod, and the culling/lod selection is very fast indeed (still approx 50 million tests per second, but with multiple streams i don’t have to do multiple passes over the same instance data!) - but i’ve now identified the GL_PRIMITIVES_GENERATED query as a pretty significant bottleneck. This is why I’m looking for ways of getting the primitives generated count to the draw command without the CPU readback.

peterfilm · July 11, 2012, 7:17am

btw, when i say a significant bottleneck i mean it takes the overall framerate down below doing the culling/lod on the CPU and using glMapBufferRange() to upload the results. So unless I can sort this out, I’ll be abandoning the GPU approach.

aqnuep · July 11, 2012, 7:56am

Well, you have at least two options:

1. Use AMD_query_buffer_object if you can limit your target audience to AMD hardware (however, I hope that NVIDIA will implement it soon too).
2. Use the visibility results of the previous frame to avoid the stall (you can even have a 2 frame delay). Obviously, this might result in popping artifacts. However, if your camera is not moving super fast and you have decent frame rates, that one or two frame delay should not have any visible effect on your rendering.
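A minimal sketch of the bookkeeping behind option 2, assuming the instance counts are kept in a small ring so that a draw can use a count from one or two frames ago. `CountRing` and its functions are hypothetical helpers, not part of OpenGL. In a real renderer the pushed value would come from a query object issued in an earlier frame, and the matching per-LOD instance buffers would have to be kept alive for the same number of frames.

```c
#include <stdint.h>

/* Hypothetical helper, not an OpenGL API: remembers the instance counts of
   the last few frames so a draw can use one that is already available. */
#define COUNT_RING_SIZE 3  /* newest frame plus up to 2 frames of delay */

typedef struct {
    uint32_t counts[COUNT_RING_SIZE];
    uint64_t frames_written;  /* number of frames stored so far */
} CountRing;

void count_ring_init(CountRing *r) {
    for (int i = 0; i < COUNT_RING_SIZE; ++i) r->counts[i] = 0;
    r->frames_written = 0;
}

/* Store the count produced by the culling pass of the newest frame. */
void count_ring_push(CountRing *r, uint32_t count) {
    r->counts[r->frames_written % COUNT_RING_SIZE] = count;
    r->frames_written++;
}

/* Return the count stored `delay` frames before the newest one
   (delay 0 = newest). Returns 0 if that frame does not exist yet. */
uint32_t count_ring_get(const CountRing *r, unsigned delay) {
    if (delay >= COUNT_RING_SIZE || r->frames_written <= delay) return 0;
    uint64_t frame = r->frames_written - 1 - delay;
    return r->counts[frame % COUNT_RING_SIZE];
}
```

Each cull/render phase (viewport, stereo eye, shadow cascade) would need its own ring and its own set of delayed buffers, which is where the memory and complexity cost of this option comes from.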

peterfilm · July 11, 2012, 8:10am

well that’s where it gets complicated (option 2 i mean). You see the instance renderer is used in a number of cull/renders - multiple viewports, quad buffered stereo, cascaded shadow maps… it’s just not practical to have a vbo for each lod for each cull/render phase. Apart from the memory wastage, there’s also the code complexity.
Ah well, life eh.

peterfilm · July 11, 2012, 10:08am

i really love the simplicity of that AMD extension. The idea of the GL writing the query result into a buffer so we can then bind that buffer to the GL_DRAW_INDIRECT_BUFFER target is just gorgeous.

It’s bizarre that it seems to be so difficult to do frustum culling (and waaay more importantly, lod selection) on the GPU - I mean, OpenGL is supposed to be primarily for graphics and this is one of the oldest requirements for any graphics application. I don’t see the reason why I should have to use CUDA/OpenCL combined with some fudge buffer sharing mechanism between the two APIs to do such a simple thing.

NVidia, just implement the extension already, for the love of god.

Alfonse_Reinheart · July 11, 2012, 11:30am

[QUOTE=peterfilm]It’s bizarre that it seems to be so difficult to do frustum culling (and waaay more importantly, lod selection) on the GPU[/QUOTE]

Um, why? Frustum culling is, at its core, a very different operation. GPUs are for drawing triangles. Culling is about doing arbitrary computations to determine a binary value.

Also, I’m curious as to exactly how writing the query result (which is either the number of fragments that pass or a true/false value) allows you to do LOD selection. Frustum culling I can kind of understand, sort-of. You can write a 0 value when the object is not visible. But how exactly does LOD selection work?

[QUOTE=peterfilm]I don’t see the reason why I should have to use CUDA/OpenCL combined with some fudge buffer sharing mechanism between the two API’s to do such a simple thing.[/QUOTE]

Because OpenGL is for rendering and GPGPU APIs are for generic computations. Frustum culling and LOD selection are generic computations that are used to feed rendering.

I’m not saying it’s a bad extension. But personally, I’d say that LOD selection is something that the CPU should be doing, considering how dirt simple it is (distance fed into a table).

[QUOTE=peterfilm]NVidia, just implement the extension already, for the love of god.[/QUOTE]

Personally, if NVIDIA’s going to implement any of AMD’s recent extensions, I’d rather see multi_draw_indirect, sample_positions, or depth_clamp_separate.

aqnuep · July 11, 2012, 1:02pm

You don’t use an occlusion query, but a primitive query. You perform view frustum culling in the geometry shader and perform LOD selection and output the instance data (if the object is visible) to the transform feedback stream corresponding to the selected LOD.
By using a primitive query for each transform feedback stream and by writing the result of the queries to the primCount fields of an indirect draw buffer you can perform the whole rendering without any CPU-GPU roundtrip.
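As a rough illustration of where the query result has to land, here is the indexed indirect command layout from ARB_draw_indirect, plus a helper that computes the byte offset of the primCount field for a given LOD. The helper `prim_count_offset` and the assumption of one tightly packed command per LOD are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* One indexed indirect draw command, laid out as in ARB_draw_indirect. */
typedef struct {
    uint32_t count;               /* number of indices per instance */
    uint32_t primCount;           /* number of instances to draw */
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t reservedMustBeZero;
} DrawElementsIndirectCommand;

/* Byte offset of primCount for LOD `lod`, assuming the buffer holds one
   tightly packed command per LOD (an assumption of this sketch). */
size_t prim_count_offset(unsigned lod) {
    return (size_t)lod * sizeof(DrawElementsIndirectCommand)
         + offsetof(DrawElementsIndirectCommand, primCount);
}
```

With AMD_query_buffer_object, the idea would be to bind the buffer to the query buffer target, ask for each stream's query result at `prim_count_offset(lod)` instead of into client memory, and then bind the same buffer to GL_DRAW_INDIRECT_BUFFER for the draw. The other fields (count, firstIndex, baseVertex) are static per LOD mesh and can be filled in once.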

NVIDIA already implemented AMD_multi_draw_indirect a while ago. Btw, the query buffer and multi draw indirect extensions can be used together to further reduce the number of draw calls needed for what peterfilm wants to implement.

Alfonse_Reinheart · July 11, 2012, 3:10pm

[QUOTE=aqnuep]You perform view frustum culling in the geometry shader and perform LOD selection and output the instance data (if the object is visible) to the transform feedback stream corresponding to the selected LOD.
By using a primitive query for each transform feedback stream and by writing the result of the queries to the primCount fields of an indirect draw buffer you can perform the whole rendering without any CPU-GPU roundtrip.[/QUOTE]

And… this is supposed to be fast? Using a geometry shader and performing per-triangle frustum culling/LOD selection, while using transform feedback? How is this faster than just rendering the models using traditional CPU-based methods of whole-object culling and LOD? You have this whole read/write/read loop going on in the shader. That requires an additional buffer just to write this intermediate data that you then render.

Also in general, when I think performance, I don’t think geometry shaders.

Also, why not just use glDrawTransformFeedback or its stream version to render it?

aqnuep · July 11, 2012, 4:43pm

No, nobody said that. You perform per-instance or per-object frustum culling/LOD selection using a geometry shader. That’s orders of magnitude less work than the actual rendering.

While using a geometry shader does have its cost, it’s not evil in itself.

Alfonse_Reinheart · July 11, 2012, 6:55pm

How exactly? It is actually rendering. In order for the output primitive count to match the input primitive count, you have to be outputting the primitives you want to render. Which means that this pass is drawing all of the triangles for every LOD for every object that exists in the scene.

It may not be scan converting and rasterizing them. But it is passing them through the vertex and geometry shaders. Which means the GPU reads them from the buffers and has to do transformation at least. You have to do vertex processing for each visible object twice (though the second time is just pass-through). That’s a lot of redundant reading of memory. You read each object, write it to another location, then read it from there to render it.

Again: how is this faster than just regular rendering via a deferred renderer?

peterfilm · July 12, 2012, 2:37am

the thing you’re missing alfonse is that the transform feedback pass is just drawing a long list of GL_POINTS (with rasterization disabled). Each point carries vertex attributes, and those attributes are the entire object’s transform and bounding volume (so in my case that’s a mat4x3 for the transform and a vec4 for the sphere). The output of this transform feedback pass is a list of vertex attributes for each lod (I just output the mat4x3, the sphere has done its job), intended to be used in a glDrawElementsInstanced as the per-instance data, not the mesh data.
You might think this is a CPU job, but when you’re talking about tens of thousands of instanced objects being passed over the bus each frame (more if you take into account the shadow passes), then you can start to see the saving of doing this simple bounds/lod test on the GPU itself and then telling it to draw from the list it’s just generated. To be honest I’m not that bothered about the frustum culling, I have a quad tree to cull the majority on the CPU anyway. It’s the lod selection that’s the real gain - that realistically has to be done per-instance, whereas frustum culling can be batched like I do in my quad tree.
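For readers following the data flow, the per-instance test can be written out as a CPU reference. This is only a sketch of the logic, not the actual shader: the plane convention (normalized, normals pointing inward), the squared-distance LOD thresholds, the three LOD levels and all names are assumptions. The per-LOD output lists stand in for the transform feedback streams, and the per-LOD counts stand in for the primitive query results.

```c
#include <stddef.h>

/* Plane: (nx, ny, nz, d). Sphere: (cx, cy, cz, radius). */
typedef struct { float x, y, z, w; } Vec4;

#define NUM_LODS 3  /* assumed number of LOD levels */

/* Returns 1 if the sphere is at least partly inside all six planes.
   Planes are assumed normalized, with normals pointing into the frustum. */
int sphere_in_frustum(const Vec4 planes[6], Vec4 s) {
    for (int i = 0; i < 6; ++i) {
        float dist = planes[i].x * s.x + planes[i].y * s.y
                   + planes[i].z * s.z + planes[i].w;
        if (dist < -s.w) return 0;  /* completely behind this plane */
    }
    return 1;
}

/* Picks a LOD by comparing squared distance against a threshold table. */
int select_lod(float dist_sq, const float max_dist_sq[NUM_LODS - 1]) {
    int lod = 0;
    while (lod < NUM_LODS - 1 && dist_sq > max_dist_sq[lod]) ++lod;
    return lod;
}

/* Writes the index of every visible instance into the list of its LOD.
   `out` must hold NUM_LODS * n entries; the list for LOD l starts at
   out + l * n. counts[l] receives the number of instances in LOD l. */
void cull_and_bin(const Vec4 planes[6], Vec4 eye,
                  const Vec4 *spheres, unsigned n,
                  const float max_dist_sq[NUM_LODS - 1],
                  unsigned *out, unsigned counts[NUM_LODS]) {
    for (int l = 0; l < NUM_LODS; ++l) counts[l] = 0;
    for (unsigned i = 0; i < n; ++i) {
        if (!sphere_in_frustum(planes, spheres[i])) continue;
        float dx = spheres[i].x - eye.x;
        float dy = spheres[i].y - eye.y;
        float dz = spheres[i].z - eye.z;
        int lod = select_lod(dx * dx + dy * dy + dz * dz, max_dist_sq);
        out[(size_t)lod * n + counts[lod]++] = i;
    }
}
```

In the GPU version each instance is one GL_POINTS vertex, the output would be the instance transform rather than an index, and `counts[l]` is exactly the value that currently has to be read back before the draw can be issued.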

peterfilm · July 12, 2012, 2:53am

here’s some numbers:-

instances:-
26781

CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
590fps

GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
1995fps

NOTE: this is just the culling/lod selection. I’ve commented out the drawing code.

So as you can see, it’s definitely worth doing the culling on the GPU!
It’s just that pesky readback that spoils the party and drags the fps down significantly (by readback I mean that the drawing code has to get the value of the GL_PRIMITIVES_GENERATED query in order to feed it into the primCount parameter of glDrawElementsInstanced to actually draw the mesh instances themselves).

thokra · July 12, 2012, 3:43am

Looking at the numbers I find the discrepancy quite astonishing but I don’t quite follow the data flow. Do you mind laying out your GPU approach as a list of sequential operations for dumb people like me?

Edit: If possible add the CPU path as well, to enable people to compare the approaches.

Edit 2: By no means do I intend to be judgmental here! It simply looks quite intriguing and I’d like to see how it works.

peterfilm · July 12, 2012, 4:49am

I’d gladly do that, but aqnuep has already done a splendid job of writing this stuff up on his blog.
it’s got diagrams and everything! ignore the hi-z business for now.

disclosure: i’d already got this stuff working before i found his blog (looking for optimisations), so please don’t think i’m a copy cat (not that there’d be anything wrong with that, I just want to retain some kudos for the idea…god knows i get little enough of them).

thokra · July 12, 2012, 4:53am

Thank you (and aqnuep of course)! I thought I read that but it was actually the earlier instance culling post.

[QUOTE=peterfilm]ignore the hi-z business for now.[/QUOTE]

No I will not!

Dark_Photon · July 12, 2012, 5:31am

[QUOTE=peterfilm;1240037]CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
590fps

GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
1995fps[/QUOTE]
So 1.69ms/frame for CPU, and 0.501ms/frame for GPU. Net savings: 1.19ms across 26781 instances (aka 0.44ms/10,000 instances).

(FPS really is a horrible way to bench. Non-linear. Interesting thread though!)
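The conversion behind those figures is just 1000 / fps. A small sketch (function names are made up for illustration):

```c
/* Frame time in milliseconds for a given frame rate. */
double fps_to_ms(double fps) {
    return 1000.0 / fps;
}

/* Milliseconds saved per frame when going from fps_before to fps_after. */
double ms_saved(double fps_before, double fps_after) {
    return fps_to_ms(fps_before) - fps_to_ms(fps_after);
}
```

`ms_saved(590.0, 1995.0)` gives roughly 1.19 ms, matching the figure above. The non-linearity shows up when the same fps difference is applied at different base rates: going from 30 to 60 fps saves about 16.7 ms per frame, while going from 590 to 620 fps saves only about 0.08 ms.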

thokra · July 12, 2012, 5:56am

Dark Photon: What do you make of that ~1.2 ms gain? If you’re tight on budget it seems reasonable. Otherwise … I don’t know.

BTW, shame on me for being blinded by those sneaky FPS.

Dark_Photon · July 12, 2012, 6:15am

Well, if you’ve got really loose framerate requirements it might not be so important. But for those that have 16.66ms to do everything or they’re dead, 1.2ms is a lot of time and worth reclaiming.

It’d be good to have data on which specific GPU and CPU this test was done on to ground these benchmarks. Peter?

I like the spirit of AMD_query_buffer_object. I’m all for nuking GPU pipeline bubbles and keeping the work blasting as fast as possible on the GPU. The author list on that extension is interesting too.

Maybe AMD and NVidia can work out a deal here: AMD implements NV_vertex_buffer_unified_memory (batch buffers bindless only; no shader pointers) in exchange for NVidia implementing AMD_query_buffer_object. Result: Everybody gets improved perf from their GPUs.