Culling with a compute shader

Previously I used a geometry shader and transform feedback to perform frustum culling on a large number of vegetation instances (1,000,000+).

In the new revision I am attempting to replace this with a compute shader. In the code below, I think I am writing the instance IDs out correctly, but I do not feel good about the way the draw commands structure is being written to.

Any suggestions?

These are inconsistent. Atomic operations are read/modify/write operations.

Also, your process seems to lack parallelism. Presumably, you want each CS invocation to conditionally add an instance, but that’s not what this code is doing. Indeed, nothing in your code is using the CS invocation index, so all CS invocations will do the same work.
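
To illustrate the point about parallelism, here is a minimal sketch of a culling pass in which each invocation tests exactly one instance and appends its ID only if it is visible. Everything not in the posts above is an assumption made for the example: the binding point of the instance ID buffer, the uniforms totalInstances, frustumPlanes and boundingRadius, the helper names, the work group size of 64, and the flat grid layout (terrain height is ignored in the test).

```glsl
#version 460

layout(local_size_x = 64) in; // assumed work group size

struct DrawElementsIndirectCommand
{
    uint count;
    uint instanceCount;
    uint firstIndex;
    int baseVertex;
    uint baseInstance;
};

// Binding 0 for the instance IDs is an assumption; binding 1 matches the post.
layout(std430, binding = 0) writeonly buffer InstanceIdBlock { uint instanceids[]; };
layout(std430, binding = 1) buffer IndirectDrawBlock { DrawElementsIndirectCommand drawcommands[]; };

// Hypothetical uniforms for this sketch.
uniform uint totalInstances;    // number of candidate instances
uniform vec4 frustumPlanes[6];  // world-space planes, normals pointing inward
uniform float boundingRadius;   // bounding sphere radius of one instance
uniform ivec2 resolution;       // grid size, as in the vertex shader
uniform float spacing;          // grid spacing, as in the vertex shader

// Hypothetical helper: center of an instance on the flat grid.
vec3 instanceCenter(uint id)
{
    int y = int(id) / resolution.x;
    int x = int(id) - resolution.x * y;
    return vec3(float(x - resolution.x / 2) * spacing, 0.0, float(y - resolution.y / 2) * spacing);
}

// Sphere-versus-frustum test: reject if fully behind any plane.
bool sphereInFrustum(vec3 center, float radius)
{
    for (int i = 0; i < 6; ++i)
    {
        if (dot(frustumPlanes[i].xyz, center) + frustumPlanes[i].w < -radius)
            return false;
    }
    return true;
}

void main()
{
    // One invocation per instance; the invocation index selects the instance.
    uint id = gl_GlobalInvocationID.x;
    if (id >= totalInstances)
        return;

    if (sphereInFrustum(instanceCenter(id), boundingRadius))
    {
        // atomicAdd returns the previous value, which is this instance's output slot.
        uint slot = atomicAdd(drawcommands[0].instanceCount, 1u);
        instanceids[slot] = id;
    }
}
```

On the host side this would be dispatched with enough groups to cover totalInstances, followed by glMemoryBarrier(GL_COMMAND_BARRIER_BIT | GL_SHADER_STORAGE_BARRIER_BIT) before the indirect draw, so that both the draw command and the instance IDs are visible to it.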

And every instance of your compute shader is overwriting drawcommands’s other members. Only one instance should do that.

Thanks for pointing those things out:

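// something like this...
// Flattened invocation index (assumes x is gl_LocalInvocationID.x and local_size_y = 1)
uint id = gl_WorkGroupID.y * gl_WorkGroupSize.x * gl_NumWorkGroups.x + gl_WorkGroupID.x * gl_WorkGroupSize.x + gl_LocalInvocationID.x;
// Only the very first invocation writes the fixed members of the draw command
if (gl_GlobalInvocationID == uvec3(0))
{
    drawcommands[0].count = IndiceCount;
    drawcommands[0].firstIndex = 0;
    drawcommands[0].baseVertex = FirstVertex;
    drawcommands[0].baseInstance = 0;
}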

What is the best way to reset drawcommands[0].instanceCount to zero at the start of each frame? I could clear it from the host with glClearNamedBufferSubData (glInvalidateBufferData on its own would leave the contents undefined rather than zeroed), or use a compute shader that simply sets the value to zero with atomicExchange():

#version 460

layout(local_size_x = 1) in;

struct DrawElementsIndirectCommand
{
    uint count;
    uint instanceCount;
    uint firstIndex;
    int baseVertex;
    uint baseInstance;
};

layout(std430, binding = 1) buffer IndirectDrawBlock { DrawElementsIndirectCommand drawcommands[]; };

void main()
{
    // instanceCount is a plain uint in a storage buffer, so this is atomicExchange, not atomicCounterExchange
    atomicExchange(drawcommands[0].instanceCount, 0u);
}

I have a basic implementation working now:

My vertex shader looks like this:

void main()
{
    vec4 p; // p.xyz = ... (initializer elided in the post)
    p.w = 1.0f;

    // Look up which grid cell this visible instance occupies
    int id = int(instanceids[gl_InstanceID]);

    int y = id / resolution.x;
    int x = id - resolution.x * y;

    vec2 texcoord = vec2(x, y) / vec2(resolution.x, resolution.y);

    // Center the grid on the origin
    x -= resolution.x / 2;
    y -= resolution.y / 2;

    p.xz += vec2(x, y) * spacing;
    p.y += texture(heightmap, texcoord).r * 100.0f;

    mat4 cameraProjectionMatrix = ExtractCameraProjectionMatrix(CameraID, 0);
    gl_Position = cameraProjectionMatrix * p;
}

I am confused about how to handle LODs with this design. It’s not difficult to store an array of draw commands in the storage buffer and increment the instance count of whichever one matches the LOD mesh selected by distance. But how can I write the instance IDs into the instances buffer? I can only think of two ways, neither of which is very appealing:

  1. Reserve space in the instance ID buffer for the maximum number of instances that might be visible, for every single mesh. This would lead to a huge increase in memory usage.

  2. Run the culling compute shader multiple times in sequence, evaluating just one mesh each run. This would be much slower.
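
For reference, option 1 could be sketched roughly as follows. This is only an illustration under stated assumptions, not a recommendation: the LOD count of 3, the uniforms totalInstances, cameraPosition, lodDistances, resolution and spacing, the helper instanceCenter, and binding 0 for the instance ID buffer are all hypothetical, and the frustum test is left out for brevity. Each LOD gets its own region of totalInstances slots, so the region can never overflow, at the memory cost described above (3 LODs x 1,000,000 instances x 4 bytes is about 12 MB of IDs).

```glsl
#version 460

layout(local_size_x = 64) in; // assumed work group size

const int LOD_COUNT = 3; // hypothetical number of LOD meshes

struct DrawElementsIndirectCommand
{
    uint count;
    uint instanceCount;
    uint firstIndex;
    int baseVertex;
    uint baseInstance;
};

// Binding 0 for the instance IDs is an assumption; binding 1 matches the post.
layout(std430, binding = 0) writeonly buffer InstanceIdBlock { uint instanceids[]; };
layout(std430, binding = 1) buffer IndirectDrawBlock { DrawElementsIndirectCommand drawcommands[]; };

// Hypothetical uniforms for this sketch.
uniform uint totalInstances;               // candidates, and also the size of each LOD region
uniform vec3 cameraPosition;
uniform float lodDistances[LOD_COUNT - 1]; // ascending switch distances
uniform ivec2 resolution;
uniform float spacing;

// Hypothetical helper: center of an instance on the flat grid (height ignored).
vec3 instanceCenter(uint id)
{
    int y = int(id) / resolution.x;
    int x = int(id) - resolution.x * y;
    return vec3(float(x - resolution.x / 2) * spacing, 0.0, float(y - resolution.y / 2) * spacing);
}

void main()
{
    uint id = gl_GlobalInvocationID.x;
    if (id >= totalInstances)
        return;

    // Pick the LOD from the distance to the camera.
    float dist = distance(instanceCenter(id), cameraPosition);
    uint lod = 0u;
    for (int i = 0; i < LOD_COUNT - 1; ++i)
    {
        if (dist > lodDistances[i])
            lod = uint(i + 1);
    }

    // Append into the region reserved for this LOD.
    uint slot = atomicAdd(drawcommands[lod].instanceCount, 1u);
    instanceids[lod * totalInstances + slot] = id;
}
```

With this layout, drawcommands[lod].baseInstance would be set to lod * totalInstances, and the vertex shader would read instanceids[gl_BaseInstance + gl_InstanceID], because gl_InstanceID itself does not include the base instance.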