Grass shader

I want to optimize rendering of a grass field. Currently I am using instanced rendering to draw it, with a VBO of transform matrices, rather than duplicating geometry in a geometry shader. The idea is to cull grass by radius from the camera (this can be optimized with a BVH), do frustum culling on the CPU, and send an array of ints marking whether each instance is visible as another VBO. Can I access each mesh instance in the geometry shader and simply not emit it? Is this a good idea?

If you pass your instance list down to the geometry shader, you can cull your instances and assign them to LOD bins in a pre-pass prior to your instanced geometry renders. This can be useful when you have a few things (100s-1000s) that are each expensive to render (especially at higher LODs) and you don’t want to cull and LOD them on the CPU.

However, generally speaking the geometry shader is slow. If you only have 1 LOD and the cost of transforming a single instance is small (vertex shader work), it makes little sense to try and protect against that work using a per-instance geometry shader.

Have you tried culling batches of grass on the CPU and just fading out the clumps before they are culled?
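As a rough illustration of that suggestion, here is a minimal C++ sketch of a per-clump fade factor. The types `Vec3` and `GrassClump`, the function `clumpFade`, and the parameters `cullDistance` and `fadeRange` are all hypothetical names made up for this example, and it assumes each clump is bounded by a sphere.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical types for illustration; they are not from the posts above.
struct Vec3 { float x, y, z; };

struct GrassClump {
    Vec3  center;  // world-space center of the clump's bounding sphere
    float radius;  // bounding-sphere radius
};

// Returns 0 when the clump can be culled, 1 when it is fully visible,
// and a value in between while it is inside the fade band that ends
// at cullDistance. fadeRange must be greater than zero.
float clumpFade(const GrassClump& c, const Vec3& camPos,
                float cullDistance, float fadeRange) {
    float dx = c.center.x - camPos.x;
    float dy = c.center.y - camPos.y;
    float dz = c.center.z - camPos.z;
    // Distance from the camera to the nearest point of the sphere.
    float dist = std::sqrt(dx * dx + dy * dy + dz * dz) - c.radius;
    float t = (cullDistance - dist) / fadeRange;
    return std::max(0.0f, std::min(1.0f, t));
}
```

A clump whose factor is 0 would be skipped entirely on the CPU; otherwise the factor could be passed to the shader (for example as a per-clump uniform) to scale or alpha-fade the blades, so they shrink away instead of popping.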

There are no LODs; the grass model is already simple and the culling is done on the CPU. However, there is little improvement (5 fps).
LOD could be done by drawing another layer of grass over this one at lower density, as separate draw calls (from 0 to 20, draw high-density grass; from 20 to 40, draw low-density grass).
Here is what I have so far.
Vertex shader:

#version 330

//attributes
in vec3 position;
in vec2 uv;
in vec3 offset;   //array of offsets, instanced rendering
in int visibility;  //array of visibility, 0 not visible, 1 visible

//directional light
struct DirectionalLight{
    vec4 color;
    vec4 specular;
    vec4 direction;
    mat4 matrixA;
    mat4 matrixB;
    mat4 matrixC;
    mat4 matrixD;
    float intensity;
    int useShadows;
};

layout (std140) uniform perCamera{
    DirectionalLight dirLight;
    vec4 cameraPos;
    mat4 cameraMat;
    mat4 cameraMatInverse;
    mat4 projectionMat;
    mat4 projectionMatInverse;
    mat4 projectionCameraMatInverse;
    vec4 ambientLight;              

    float minimumAmbient;
    float zNear;
    float zFar;
    int fps;
} pc;

/////////////////

//OUT
out vec2 uv0;
flat out int vis0;

//uniforms
uniform mat4 entityMat;     //matrix of entity transform
uniform float time;

uniform vec3 obstacles[50];
uniform float obstaclesRadius[50];
uniform int numObstacles;

void calculateObstacle(inout vec4 worldPos, in float radius, in vec3 obs){
    float dist = distance(obs, worldPos.xyz);
    float circle = 1.0 - clamp(dist / radius, 0.0, 1.0);

    vec3 sphereDisp = worldPos.xyz - obs;
    sphereDisp *= circle;

    //push the vertex away from the obstacle in the horizontal plane
    worldPos.xz += sphereDisp.xz * 2.2;
}

void main(){
	//vertex transform
	vec4 worldPos = entityMat * vec4(position + offset,1);

    //interactive grass
    if(position.y > 0.5){
        for(int i = 0; i < numObstacles; i++)
            calculateObstacle(worldPos, obstaclesRadius[i], obstacles[i]);

        worldPos.x += sin(time * 1.2) * 0.08;
        worldPos.z += cos(time * 1.2) * 0.08;
    }

    gl_Position = pc.projectionMat * pc.cameraMat * worldPos;

	//out
	uv0 = uv;
    vis0 = visibility;
}

Fragment shader:

#version 330

uniform sampler2D samplers[4];

in vec2 uv0;
flat in int vis0;

layout(location = 0) out vec4 out_g_worldNormalSpecPower;
layout(location = 1) out vec4 out_g_albedoSpecIntesity;
layout(location = 2) out vec4 out_g_unusedShadeless;

void main(){
    //if(vis0 == 0) discard;  //the optimization

	vec4 difColor = vec4(1,0,0,1);
	vec4 specColor = vec4(1,1,1,1);
	float specularIntensity = 0.3;

    difColor = texture(samplers[0], uv0);  //texture2D is not available in the GLSL 3.30 core profile
	if(difColor.a < 0.2) discard;

	//out to g buffer
	out_g_worldNormalSpecPower = vec4(0,1,0, 10.0);
	out_g_albedoSpecIntesity = vec4(difColor.xyz, specularIntensity);
	out_g_unusedShadeless = vec4(0,0,0,0);
}

Simple CPU culling for testing:

void GrassRenderer::updateVisibility(const glm::vec3& camPos){
	auto trs = thisEntity()->transform()->getTransformMatrix();
	for (unsigned i = 0; i < _offsets.size(); ++i) {
		glm::vec3 point = trs * glm::vec4(_offsets[i], 1.0f);

		if (glm::distance(camPos, point) < 20.0f)
			_visibility[i] = 1;    //will change this to float and add some interpolation so instances on edge scale and fade instead of popping.
		else
			_visibility[i] = 0;
	}
}

The idea was to skip rendering of instances that are not visible; I thought the geometry shader could do this.
This is what I wanted: "opengl - Draw selected instances of VAO (glDrawArraysInstanced)" on Game Development Stack Exchange.
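One way to skip invisible instances without a geometry shader is to compact the instance data on the CPU, so that only visible offsets are uploaded and drawn. The sketch below is a minimal version of that idea, not a tested drop-in for the renderer above: `Vec3` and `compactVisible` are hypothetical names, and it assumes the offsets are already in world space (the original code applies the entity transform first).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for glm::vec3 so the sketch is self-contained.
struct Vec3 { float x, y, z; };

// Fills `visible` with the offsets that lie within maxDistance of the
// camera, preserving their order, and returns how many were kept.
std::size_t compactVisible(const std::vector<Vec3>& offsets,
                           const Vec3& camPos, float maxDistance,
                           std::vector<Vec3>& visible) {
    visible.clear();
    // Compare squared distances to avoid a sqrt per instance.
    const float maxDistSq = maxDistance * maxDistance;
    for (const Vec3& o : offsets) {
        float dx = o.x - camPos.x;
        float dy = o.y - camPos.y;
        float dz = o.z - camPos.z;
        if (dx * dx + dy * dy + dz * dz < maxDistSq)
            visible.push_back(o);
    }
    return visible.size();
}
```

The compacted array would then be uploaded to the instance VBO (for example with glBufferSubData) and the returned count passed as the instance count to glDrawArraysInstanced or glDrawElementsInstanced, so culled instances never reach the vertex shader and the per-instance visibility attribute is no longer needed.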

Here are some results from rendering 5000 instances on a GTX 850M.
[screenshots: "no opt" and "opt"]

Is there any better way to optimize this?

Thanks:).

Each instance has overhead. So if you only draw very few triangles per instance the end result will have bad performance compared to the amount of triangles you draw.

I was trying to find some older benchmarks that I think the tera nova engine did on desktop PCs for this kind of thing, but was unable to find them. From memory, I think you have to get into the range of a few thousand triangles per instance, with AMD cards preferring a higher triangle count than Nvidia.

Found it, it was the Outerra engine: Outerra: OpenGL rendering performance test #1 - Procedural grass
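One possible way to raise the triangle count per instance is to merge several blades into a single clump mesh and instance the clump instead of the blade. The following is only a sketch under that assumption: `Vec3` and `buildClumpMesh` are hypothetical names, and the blade is assumed to be a non-indexed triangle list.

```cpp
#include <vector>

// Hypothetical stand-in for glm::vec3 so the sketch is self-contained.
struct Vec3 { float x, y, z; };

// Builds one clump mesh by copying the blade's vertices once per local
// offset. Drawing the result as a single instance replaces
// localOffsets.size() single-blade instances.
std::vector<Vec3> buildClumpMesh(const std::vector<Vec3>& bladeVerts,
                                 const std::vector<Vec3>& localOffsets) {
    std::vector<Vec3> clump;
    clump.reserve(bladeVerts.size() * localOffsets.size());
    for (const Vec3& off : localOffsets)
        for (const Vec3& v : bladeVerts)
            clump.push_back({v.x + off.x, v.y + off.y, v.z + off.z});
    return clump;
}
```

With, say, 64 blades per clump, 5000 blades would become roughly 80 instances, each carrying 64 times as many triangles. Whether that lands in the range the benchmark favors would need to be measured on the target hardware.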