transform feedback + glDrawElementsInstanced

Intel Xeon quad-core 2.66 GHz, 8 GB RAM, Windows 7 64-bit. Quadro 4000 2 GB RAM, driver 296.88.

Forgive me for the fps metric; I was in a hurry.

1.2ms is for a relatively small number of instances compared to the number I’m actually going to be required to render. Also, consider that this is just a single pass, whereas I also need to render into the second eye of a stereo pair, and into 4 CSM (cascaded shadow map) splits. There’s also a picture-in-picture second view, albeit without shadow maps.

OK, but that doesn’t explain how it does LOD selection. LOD selection would have to mean changing the model being rendered, yes? Which would require writing values to an indirect buffer, which would then be used with an indirect rendering command.

I don’t see what you need query_buffer_object for in this case, because the number of objects that pass (i.e. the number of indirect rendering commands written) needs to come back to the CPU to be used with multi-draw-indirect, or to loop over the indirect rendering commands.

Also, I don’t see how this constitutes instanced rendering, since each instance has its own indirect drawing command.

Or, to put it simply, can you fully describe the algorithm, top to bottom? Because there seem to be some inconsistencies between the descriptions you’ve given thus far.

[QUOTE=peterfilm;1240037]here’s some numbers:-

instances:-
26781

CPU culling/lod selection, with glMapBufferRange to pass results to GPU:-
590fps

GPU culling/lod selection, with vertex/geometry shader and transform feedback:-
1995fps

NOTE: this is just the culling/lod selection. I’ve commented out the drawing code.[/quote]

Since you’re using instancing, what’s the performance of not doing frustum culling at all and simply drawing all of the instances?

No, I still issue a glDrawElementsInstanced() call for each LOD once the queries return the primCount for each LOD.
I’m not using the indirect extension, which is what the original question in this thread was about - I see no way of writing to the indirect buffer from transform feedback.
I gave a link to rastergrid’s blog, which explains the algorithm more clearly than I have obviously done so far.

The problem I’m trying to solve is not specifically the frustum culling, as I said in an earlier post (keep up man!); it’s the LOD selection. I’m attempting to mask the simplification of the vegetation geometry by sticking to the LOD distances carefully set by the artists - batching them together makes too sudden a pop. I’m trying to stop the pop without an explosion in triangle count.

i see no way of writing to the indirect buffer from transform feedback.

Sure you can. You just need to employ atomic increments.

Each LOD’s per-instance data is being written to a separate stream. Every time you write an instance to one of the LOD streams, you atomically increment that LOD’s atomic counter.

Now, atomic counters are backed by buffer object storage. But you can use glBindBufferRange, as well as the offset field of the atomic counter’s layout specifier, to put them anywhere in a buffer object’s storage. Like, say, the primCount value of an indirect rendering command.

Each counter can be set to write to the primCount field of a different indirect rendering command, one for each LOD. Thus, when you’re finished, you have three indirect rendering commands, all ready to go.

The only thing you need to do is issue a glMemoryBarrier(GL_ATOMIC_COUNTER_BARRIER_BIT) after building the LOD instance data, but before trying to render them. And of course, reset these values to zero each frame before specifying the LODs.

I have no idea if this will be faster than what you’re doing. But there won’t be any GPU->CPU->GPU antics.

Yes, that’s what I was afraid of. The whole atomic counter stuff scared me - possible sync issues etc. And then aqnuep mentioned that you can only use atomic counters at fragment level…
But thanks for the clear explanation of how I’d use them if it came to it. I can but try I suppose, with a heavy heart.

The whole atomic counter stuff scared me, possible sync issues etc.

So, you’re frightened by atomic counters, even though the use in this case is fairly obvious and requires exactly one sync point. But you’re perfectly fine with rendering something that’s not rendering anything, using multiple output streams and geometry shaders that aren’t shading any geometry, all to write stuff to a buffer object that you’ll use to render instances of geometry.

If you’re going to yoke the GPU to do cool stuff, then yoke it. You’re already forced to use GL 4.x hardware by your use of multiple streams. Best to use all of it.

aqnuep mentioned that you can only use atomic counters at fragment level

Then he’s wrong. There is nothing in GLSL or OpenGL about where atomic counters can be used.

Thanks for that!

[QUOTE=Alfonse Reinheart;1240082]Sure you can. You just need to employ atomic increments.

Each LOD’s per-instance data is being written to a separate stream. Every time you write an instance to one of the LOD streams, you atomically increment that LODs atomic counter.

Now, atomic counters are backed by buffer object storage. But you can use glBindBufferRange, as well as the offset field of the atomic counter’s layout specifier, to put them anywhere in a buffer object’s storage. Like, say, the primCount value of an indirect rendering command.

Each counter can be set to write to the primCount field of a different indirect rendering command, one for each LOD. Thus, when you’re finished, you have three indirect rendering commands, all ready to go.[/QUOTE]
Yes, actually that should work, and if you think about it, if you use a load/store image and multi-draw-indirect, you can even do non-instanced object culling in the same way. If I have time to implement something like that, I’ll post about it on my blog :slight_smile:

No, you’re wrong. You need glMemoryBarrier(GL_COMMAND_BARRIER_BIT). Everybody seems to misunderstand how glMemoryBarrier works. It does not specify “what source” you are trying to sync, but rather “what destination”. In all cases glMemoryBarrier is meant to ensure that all shaders that performed image loads/stores or used atomic counters finish before the commands after the barrier start. What the barrier bits specify is how you plan to use the written data. This ensures that all the appropriate input caches get flushed before commencing the next draw command.

Quote from spec:

COMMAND_BARRIER_BIT: Command data sourced from buffer objects by Draw*Indirect commands after the barrier will reflect data written by shaders prior to the barrier. The buffer objects affected by this bit are derived from the DRAW_INDIRECT_BUFFER binding.

Then he’s wrong. There is nothing in GLSL or OpenGL about where atomic counters can be used.

There is nothing, that’s true. But if you check the extension specs (or the core spec) you can see that they require a minimum of 8 load/store images and atomic counters only for fragment shaders (MAX_FRAGMENT_IMAGE_UNIFORMS and MAX_FRAGMENT_ATOMIC_COUNTERS), while the required minimum is 0 for all other stages. It’s not a coincidence that there are some GL 4.2 capable GPUs that don’t support them in all shader stages (at least currently).

Not to derail the thread, but this is a perfect example of why many folks (not just peterfilm), including me, are hesitant to wade into the GLSL “side-effect” waters. For folks that have cooked OpenCL or CUDA kernels, this opens up the same issues you have to deal with there … definitely not a pool to dive into lightly (watch out for the sharks!).

I need to see more complete GLSL side-effect example code before I go hacking down that road.

(Maybe some year there’ll be a Expert OpenGL Techniques class at SIGGRAPH that’ll cover this in detail… (hint hint). Anyway, we now resume your current program already in progress…)

I need to see more complete GLSL side-effect example code before I go hacking down that road.

It was once suggested to me that down-sampling a texture is best done with image load/store instead of the convenient glGenerateMipmap() - I haven’t tried it yet, but it was suggested by an AMD driver developer (not aqnuep however :slight_smile: ). Also, you can apply filters without ping-pong rendering, as in the case of applying multiple iterations of a blur filter, since incorporating already-altered pixels when determining the value of the next one is acceptable. To cope with instruction limits, one could tile the full-screen quad and have the GPU perform filtering on the tiled regions - I’m not sure exactly whether that’s permissible mathematically, thinking of applying kernels in a nondeterministic order across multiple tiles.

Well, I asked for the limits on the Quadro 4000, and got:-
GL_MAX_VERTEX_ATOMIC_COUNTERS: 16384
GL_MAX_GEOMETRY_ATOMIC_COUNTERS: 16384
GL_MAX_FRAGMENT_ATOMIC_COUNTERS: 16384

So I tried it - using atomic counters, I mean, backed by a buffer.

results (sorry, fps again):-

instances: 25798
triangles: 186696
GPU: 705fps
CPU: 410fps

pretty damn good!
I know this isn’t a real stress test, but I’m having trouble with the tool that generates the instances… can’t get enough of em to produce a realistic load.


#version 420 core

// Note: GL_VERTEX_SHADER / GL_GEOMETRY_SHADER are not predefined by GLSL;
// the application is assumed to #define the appropriate one per stage.

#ifdef GL_VERTEX_SHADER

in vec4 attrib_row1;		// xyz=axisX, w=translationX
in vec4 attrib_row2;		// xyz=axisY, w=translationY
in vec4 attrib_row3;		// xyz=axisZ, w=translationZ
in vec4 attrib_bsphere;		// bounding sphere xyz=centre, w=radius

out vec4 vsRow1;
out vec4 vsRow2;
out vec4 vsRow3;
flat out int vsVisible;

uniform vec4 uni_frustum[6];	// the 6 world space frustum planes

void main() {
	vsRow1 = attrib_row1;
	vsRow2 = attrib_row2;
	vsRow3 = attrib_row3;

	vsVisible = 1;

	// is instance in frustum?
	for (int i=0; i<6; ++i) {
		float d = dot(uni_frustum[i], vec4(attrib_bsphere.xyz, 1.0));
		if (d <= -attrib_bsphere.w) {
			vsVisible = 0;
			break;
		}
	}
}

#endif

#ifdef GL_GEOMETRY_SHADER

layout(points) in;
layout(points, max_vertices = 1) out;

uniform vec3 uni_camPos;		// xyz=world space camera position
uniform vec4 uni_lodDist;		// lod distances for x=lod0, y=lod1, z=lod2, w=lod3

in vec4 vsRow1[1];
in vec4 vsRow2[1];
in vec4 vsRow3[1];
flat in int vsVisible[1];

layout(stream=0) out vec4 gsOut0Row1;
layout(stream=0) out vec4 gsOut0Row2;
layout(stream=0) out vec4 gsOut0Row3;
layout(stream=1) out vec4 gsOut1Row1;
layout(stream=1) out vec4 gsOut1Row2;
layout(stream=1) out vec4 gsOut1Row3;
layout(stream=2) out vec4 gsOut2Row1;
layout(stream=2) out vec4 gsOut2Row2;
layout(stream=2) out vec4 gsOut2Row3;
layout(stream=3) out vec4 gsOut3Row1;
layout(stream=3) out vec4 gsOut3Row2;
layout(stream=3) out vec4 gsOut3Row3;

// Each counter aliases the primCount field of one 20-byte indirect
// draw command in the bound buffer (offset = 20*lod + 4).
layout(binding = 0, offset = 4) uniform atomic_uint LodCount0;
layout(binding = 0, offset = 24) uniform atomic_uint LodCount1;
layout(binding = 0, offset = 44) uniform atomic_uint LodCount2;
layout(binding = 0, offset = 64) uniform atomic_uint LodCount3;

void main() {
	if (vsVisible[0]==1) {
		float dist = distance(vec3(vsRow1[0].w, vsRow2[0].w, vsRow3[0].w), uni_camPos);
		if (dist < uni_lodDist.x) {
			gsOut0Row1 = vsRow1[0];
			gsOut0Row2 = vsRow2[0];
			gsOut0Row3 = vsRow3[0];
			atomicCounterIncrement(LodCount0);
			EmitStreamVertex(0);
		}
		else if (dist < uni_lodDist.y) {
			gsOut1Row1 = vsRow1[0];
			gsOut1Row2 = vsRow2[0];
			gsOut1Row3 = vsRow3[0];
			atomicCounterIncrement(LodCount1);
			EmitStreamVertex(1);
		}
		else if (dist < uni_lodDist.z) {
			gsOut2Row1 = vsRow1[0];
			gsOut2Row2 = vsRow2[0];
			gsOut2Row3 = vsRow3[0];
			atomicCounterIncrement(LodCount2);
			EmitStreamVertex(2);
		}
		else if (dist < uni_lodDist.w) {
			gsOut3Row1 = vsRow1[0];
			gsOut3Row2 = vsRow2[0];
			gsOut3Row3 = vsRow3[0];
			atomicCounterIncrement(LodCount3);
			EmitStreamVertex(3);
		}
	}
}

#endif



Just a minor observation:

float dist = distance(vec3(vsRow1[0].w, vsRow2[0].w, vsRow3[0].w), uni_camPos);

I can’t tell if it will have a significant impact in your case but if the range of values permits you could use square distance to get rid of the sqrt here:

vec3 distVec = vec3(vsRow1[0].w, vsRow2[0].w, vsRow3[0].w) - uni_camPos;
float sqrDist = dot(distVec, distVec);

Of course you’ll have to account for that during LOD selection as well, i.e. store squared distances in uni_lodDist.

Yup, I know - this is a simple test. I found early on that it made no real difference to performance on the GPU but did on the CPU, so I decided to leave it with true length in both implementations to make it fair. :wink: