Instance Shader proposal

I posted this some time ago in a rather hidden place here, so this is a bump.
It uses the strengths of ZCULL/HiZ, and is probably easy to add in future silicon:

// this instance-shader is called once per instance
// all of these uniforms below are user-specified, not expected by GL
// added tokens: gl_OcclusionBBMin and gl_OcclusionBBMax

uniform mat4  uniMVP; // matrix projection-view, or projection-view-world (in case of portals, clustering)
uniform vec4  uniFrustumPlanes[6];
uniform float uniBoundingSphereRadius;

bindable uniform vec3 buniInstancePosition[]; // element at index gl_InstanceID is used here
bindable uniform mat3 buniInstanceRotation[]; // element at index gl_InstanceID is used here

uniform vec3 uniBoundingVolumeVerts[3*12]; // a convex box in this case. Could be something more obscure. Could be dependent on gl_InstanceID. 

void main(){
	vec4 pos = vec4(buniInstancePosition[gl_InstanceID], 1.0); // untransformed position; uniMVP is applied once, below
	mat3 rot = buniInstanceRotation[gl_InstanceID];
	mat4 nodeTransform = uniMVP * m_Make4x4FromPosAndRot(pos,rot);
	vec4 minXYZW = vec4(1.e+5,1.e+5,1.e+5,1.e+5);
	vec4 maxXYZW = vec4(-1.e+5,-1.e+5,-1.e+5,-1.e+5);
	//------[ secondary rough occlusion test via a lowest-poly mesh ]--------[
	// a box, consisting of 12 triangles is used here, and 12 can be the 
	// imposed maximum count of triangles to test occlusion with.
	// Uses ZCULL and optionally EarlyZ
	// (ZCULL being roughest, fastest z-culling test,
	//  EarlyZ being fast but less rough z-culling test)
	for(int tri=0;tri<12;tri++){
		for(int v=0;v<3;v++){
			vec4 vpos = nodeTransform * vec4(uniBoundingVolumeVerts[tri*3+v], 1.0);
			gl_Position = vpos;
			minXYZW = min(minXYZW,vpos);
			maxXYZW = max(maxXYZW,vpos);
		}
	}
	//----[ primary, roughest occlusion test via a screen-aligned quad ]---------[
	// uses only ZCULL. If it doesn't pass ZCULL, the secondary test is skipped. 
	gl_OcclusionBBMin = minXYZW;
	gl_OcclusionBBMax = maxXYZW;
}

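bool m_ClipSphereByFrustumPlanes(in vec4 pos){
	// preliminary frustum culling via uniFrustumPlanes and uniBoundingSphereRadius;
	// returns true if the sphere is clipped (assumes normalized, inward-facing planes)
	for(int i=0;i<6;i++)
		if(dot(uniFrustumPlanes[i].xyz,pos.xyz)+uniFrustumPlanes[i].w < -uniBoundingSphereRadius) return true;
	return false;
}

mat4 m_Make4x4FromPosAndRot(in vec4 pos,in mat3 rot){
	return mat4(vec4(rot[0],0.0),vec4(rot[1],0.0),vec4(rot[2],0.0),vec4(pos.xyz,1.0));
}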
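
For readers who want to check the bounding-box part of the shader on the CPU, here is a minimal C sketch of the same min/max accumulation. It is only an illustration: the names Vec4, transform_point and occlusion_bounds are hypothetical, the matrix is assumed to be column-major as in GL, and the result merely mirrors what the proposed gl_OcclusionBBMin / gl_OcclusionBBMax tokens would receive.

```c
#include <assert.h>
#include <float.h>

typedef struct { float x, y, z, w; } Vec4;

/* Column-major 4x4 matrix times the point (p[0], p[1], p[2], 1),
   i.e. what mat4 * vec4(p, 1.0) does in GLSL. */
Vec4 transform_point(const float m[16], const float p[3]) {
    Vec4 r;
    r.x = m[0]*p[0] + m[4]*p[1] + m[8]*p[2]  + m[12];
    r.y = m[1]*p[0] + m[5]*p[1] + m[9]*p[2]  + m[13];
    r.z = m[2]*p[0] + m[6]*p[1] + m[10]*p[2] + m[14];
    r.w = m[3]*p[0] + m[7]*p[1] + m[11]*p[2] + m[15];
    return r;
}

/* Component-wise min/max over all transformed vertices.
   verts holds 3 floats per vertex (hypothetical layout). */
void occlusion_bounds(const float m[16], const float *verts, int count,
                      Vec4 *out_min, Vec4 *out_max) {
    assert(verts != 0 && count > 0);
    Vec4 lo = {  FLT_MAX,  FLT_MAX,  FLT_MAX,  FLT_MAX };
    Vec4 hi = { -FLT_MAX, -FLT_MAX, -FLT_MAX, -FLT_MAX };
    for (int i = 0; i < count; i++) {
        Vec4 v = transform_point(m, &verts[3 * i]);
        if (v.x < lo.x) lo.x = v.x;
        if (v.y < lo.y) lo.y = v.y;
        if (v.z < lo.z) lo.z = v.z;
        if (v.w < lo.w) lo.w = v.w;
        if (v.x > hi.x) hi.x = v.x;
        if (v.y > hi.y) hi.y = v.y;
        if (v.z > hi.z) hi.z = v.z;
        if (v.w > hi.w) hi.w = v.w;
    }
    *out_min = lo;
    *out_max = hi;
}
```

Passing the 36 box vertices and the combined transform of one instance gives the extents that the primary (screen-aligned quad) test would be built from.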

The triangles from the secondary rough-occlusion test do not modify the z-buffer, color-buffers or stencil-buffer. Those triangles are not further transformed by the currently-bound vertex shader, and do not use the currently-bound fragment shader. The result from the shader is a single bool (stored internally as a bit, byte or int). The shader is executed before drawing the mesh-instance, or preemptively executed for several mesh-instances. The latter version improves usage of parallelism, but can give false positives (e.g. instance 4 occludes instance 7, but #7 is still regarded as visible, because the visibility of instances 0…10 was batch-computed).

Further improvement:
Addition of “int gl_IBO_FirstIndex=0”, “gl_IBO_Length” and “int gl_VBO_FirstIndex=0”, to specify what range of the VBO and IBO (index buffer) this mesh-instance should use.
This can be used to let the shader select a LOD version of a model, or use a different model altogether (but still with the same bound shaders, render-states and render-targets).
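
As a rough illustration of the LOD idea, here is a C sketch of the selection the instance-shader could make before filling in the proposed range tokens. The LodRange struct, its max_distance field and select_lod are hypothetical names invented for this example; only the three range fields correspond to the tokens proposed above.

```c
#include <assert.h>

typedef struct {
    int   ibo_first_index;  /* would be written to the proposed gl_IBO_FirstIndex */
    int   ibo_length;       /* would be written to the proposed gl_IBO_Length */
    int   vbo_first_index;  /* would be written to the proposed gl_VBO_FirstIndex */
    float max_distance;     /* hypothetical: farthest distance this LOD is used at */
} LodRange;

/* Returns the first LOD whose max_distance covers `distance`.
   The last entry is the fallback (coarsest LOD); its max_distance is ignored. */
const LodRange *select_lod(const LodRange *lods, int count, float distance) {
    assert(lods != 0 && count > 0);
    for (int i = 0; i < count - 1; i++) {
        if (distance <= lods[i].max_distance) {
            return &lods[i];
        }
    }
    return &lods[count - 1];
}
```

All LOD versions live in the same VBO/IBO pair, so switching between them changes only the ranges, not the bound shaders, render-states or render-targets.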

Further optional improvement:
Have the gpu write results from the instance-shader to a byte-buffer-object, created by the user. That buffer is initially reset to “true” for all instanceIDs, and is required to hold at least NumInstances entries. If an object is occluded (as decided roughly by the instance-shader and its querying of ZCULL via those triangles), then gl_InstanceVisible[gl_InstanceID]=false;. The user can then use glMapBufferRange to retrieve the occlusion info.
This can be used as feedback on which instances were drawn, and to do cpu-side computation regarding the result.
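
The cpu-side half of that feedback could look roughly like the C sketch below. It works on a plain byte array; in a real program the pointer would presumably come from glMapBufferRange with read access on the proposed byte-buffer-object, which is not shown because that buffer type does not exist yet. gather_visible is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>

/* Collects the IDs of all instances whose visibility byte is non-zero.
   out_ids must have room for num_instances entries.
   Returns the number of IDs written. */
size_t gather_visible(const unsigned char *visible, size_t num_instances,
                      unsigned int *out_ids) {
    assert(visible != NULL && out_ids != NULL);
    size_t n = 0;
    for (size_t i = 0; i < num_instances; i++) {
        if (visible[i]) {
            out_ids[n++] = (unsigned int)i;
        }
    }
    return n;
}
```

The resulting ID list can drive whatever cpu-side work depends on visibility, for example skipping animation updates for instances that were not drawn.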

It looks more like a geometry shader doing the job of the object shader I have proposed many times; it doesn’t seem that efficient.

My version is pretty easy as it’s pretty similar to a vertex shader but instead of importing a single value you import an array.

in vec3 vertexBuffer[];

then you feed it to a uniform like this.
the values in the buffers are offsets

out vec3 pos[]=vertexBuffer[43];
out vec2 texcoord=texcoordBuffer[43]; // this would output the same variable to all generated vertices

the last command emits 16 vertices from the arrays to the output.

this would allow you to do any instancing, skinning and whatever you like

I think you misread my post. Culling of the whole current object instance is a critical part of it.

Yeah, I know, it’s just that I think there are better ways of doing it. Culling is still possible in my method, but it’s not a critical thing, as you should be doing macro-level culling.
The rasterizer will take care of the rest.

As far as I know, early Z is done after rasterization.

For whole-instance culling I think occlusion queries and conditional render already give some good results; the query latency can be hidden or handled nicely. It actually fits current rendering engine design well.

This method doesn’t seem perfect, but I’m not sure that your solution gives any advantage. And yes, early Z is done after rasterization, so you end up processing all the vertices and all the primitives, so a method based on geometry shader culling might be more efficient.

But ZCULL is done before rasterization.
Instancing cannot skip drawing of selected instances, and doing the frustum culling on the cpu makes you either rearrange instance data in the (huge) per-instance UBO or have your shader access UBO data via a second indirection:
int index = uniIndices[gl_InstanceID]; // instead of directly via gl_InstanceID
In both cases, computing arrays and uploading them to the gpu cannot be avoided.
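
To make the second option concrete, here is a C sketch of the cpu-side work it implies: frustum-cull every instance and build the index array that would be uploaded as uniIndices. The plane convention (a*x + b*y + c*z + d >= 0 on the inside, normalized normals), the flat array layouts and the function names are assumptions for this example only.

```c
#include <assert.h>

/* planes: 6 planes, 4 floats each (a, b, c, d); inside is a*x+b*y+c*z+d >= 0.
   Returns 1 if the sphere is inside or intersects the frustum, 0 otherwise. */
int sphere_in_frustum(const float *planes, const float center[3], float radius) {
    for (int p = 0; p < 6; p++) {
        const float *pl = &planes[4 * p];
        float dist = pl[0]*center[0] + pl[1]*center[1] + pl[2]*center[2] + pl[3];
        if (dist < -radius) {
            return 0;  /* completely behind this plane */
        }
    }
    return 1;
}

/* positions: 3 floats per instance. Writes surviving instance numbers to
   out_indices (room for num_instances entries) and returns their count. */
int build_visible_indices(const float *planes, const float *positions,
                          int num_instances, float radius, int *out_indices) {
    assert(planes != 0 && positions != 0 && out_indices != 0);
    int n = 0;
    for (int i = 0; i < num_instances; i++) {
        if (sphere_in_frustum(planes, &positions[3 * i], radius)) {
            out_indices[n++] = i;
        }
    }
    return n;
}
```

The returned count becomes the instance count of the draw call, and the array has to be recomputed and re-uploaded whenever the camera or the instances move, which is the cost being pointed out here.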

How do you expect early Z to work before rasterization? It’s a per-fragment operation like the usual z-test, but done earlier than the z-test.

Simply because ZCULL is not EarlyZ, and it’s not per-fragment but per-primitive.
EarlyZ is done per fragment, before the fragment is shaded.
Alpha-test (clip/discard) disables ZCULL until next glClear(depth) and decreases performance of EarlyZ.
Modifying gl_FragDepth ignores EarlyZ.

I would love to read something about this ZCULL and how it works. Never read anything before about it.

ZCULL is not per-primitive: it is per-group of fragments.
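
A software sketch may help picture "per-group of fragments". The C code below assumes a coarse buffer that stores, for each screen tile, the farthest depth currently in that tile (smaller values are closer), and rejects a screen rectangle only if its nearest depth is behind every tile it overlaps. This is a generic hierarchical-Z idea written for illustration; real ZCULL hardware is vendor-specific, and all names here are hypothetical.

```c
#include <assert.h>

/* tile_max_depth: tiles_x * tiles_y entries, row by row; each entry is the
   farthest depth stored in that tile (smaller = closer to the viewer).
   The rectangle covers tiles [tx0..tx1] x [ty0..ty1], inclusive.
   Returns 1 if the rectangle may be visible, 0 if every tile rejects it. */
int coarse_depth_test(const float *tile_max_depth, int tiles_x, int tiles_y,
                      int tx0, int ty0, int tx1, int ty1, float rect_min_depth) {
    assert(tile_max_depth != 0);
    assert(0 <= tx0 && tx0 <= tx1 && tx1 < tiles_x);
    assert(0 <= ty0 && ty0 <= ty1 && ty1 < tiles_y);
    for (int ty = ty0; ty <= ty1; ty++) {
        for (int tx = tx0; tx <= tx1; tx++) {
            /* The rectangle's nearest point is not behind everything in
               this tile, so some fragment here could still pass. */
            if (rect_min_depth <= tile_max_depth[ty * tiles_x + tx]) {
                return 1;
            }
        }
    }
    return 0;
}
```

The test is conservative: it can report "maybe visible" for something that is fully hidden, but it does not reject anything that has a visible fragment, as long as the per-tile depths are kept up to date.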

These shaders will have to run very fast to be worth the effort. Standard fragment culling will already cull all of the fragments of off-screen geometry, which means the length of the fragment shader is not relevant. The only savings this could provide are in executing vertex shaders.

Instance lists are meant to be generated per-frame. Buffer Objects have properties (such as the STREAM_DRAW usage hint) that make this generation easy and fast.