How to render a mesh of N rectangles multiple times

Hello, I’m trying to find the most efficient way to render a mesh made up of 1000 rectangles, and then draw that same mesh multiple times at different world positions using instancing.

Each rectangle has:

  • A local position in the range [0, 255] for x, y, z
  • A size (width and height) in the range [0, 255]
  • An RGB color in [0, 255]

So far, I’ve set up:

  • VBO containing the four corner positions of a unit rectangle
  • EBO with indices {0, 1, 2, 1, 2, 3}
  • Instance VBO with per-rectangle data packed into 8 bytes each:
    Position: x, y, z (3 bytes)
    Size: width, height (2 bytes)
    Color: r, g, b (3 bytes)
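
The 8-byte layout described above could be written down on the CPU side roughly as follows. This is only a sketch: the names RectInstance and make_rect are hypothetical and do not come from the original post. The stride and offsets it checks are the values one would typically pass to glVertexAttribIPointer (or glVertexAttribPointer with normalization for the color) when describing the instance VBO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one rectangle in the instance VBO. */
typedef struct RectInstance {
    uint8_t x, y, z;        /* local position, each in [0, 255] */
    uint8_t width, height;  /* size, each in [0, 255] */
    uint8_t r, g, b;        /* color, each in [0, 255] */
} RectInstance;

/* Every member is a single byte, so no padding is expected. Check it
   at compile time, because sizeof(RectInstance) is used as the stride. */
static_assert(sizeof(RectInstance) == 8, "RectInstance must be 8 bytes");
static_assert(offsetof(RectInstance, width) == 3, "size starts at byte 3");
static_assert(offsetof(RectInstance, r) == 5, "color starts at byte 5");

/* Build one record; the caller is assumed to pass values in [0, 255]. */
static RectInstance make_rect(uint8_t x, uint8_t y, uint8_t z,
                              uint8_t width, uint8_t height,
                              uint8_t r, uint8_t g, uint8_t b)
{
    RectInstance rect = { x, y, z, width, height, r, g, b };
    return rect;
}
```

An array of 1000 such records then occupies 8000 bytes and can be uploaded as-is with glBufferData.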

I render the mesh once with:

glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0, rectCount);

My questions are:

  1. Is this a good, performant approach for rendering 1000 rectangles at once?
  2. If I want to draw the entire mesh M times at different world locations, should I simply call:
glDrawElementsInstanced(GL_TRIANGLES, 6, GL_UNSIGNED_INT, 0, rectCount * M);

and bind a second VBO with per-mesh-instance data (e.g. world position)? If so, how would I set up my vertex shader (and VAO) to consume both the per-rectangle attributes and the per-mesh-copy attributes in one draw call?

Is there another way to handle this? I’m not sure about the performance of it.

The idea is to move to glMultiDrawElementsIndirect in the end, but I want to go step by step.

Thanks!

Older implementations didn’t coalesce multiple instances into a single workgroup, so if each instance has only 4 vertices, each workgroup uses just 4 “cores” and leaves the rest idle. I’m told that this isn’t the case for newer implementations, but I can’t personally confirm it.

So you might get better performance if each instance contains multiple rectangles.
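
To put rough numbers on that, here is a back-of-the-envelope model in C. It assumes the worst case described above (every instance starts a fresh workgroup) and a workgroup size of 32, which is purely an illustrative figure; real sizes depend on the hardware. The function name lane_utilization is hypothetical.

```c
/* Fraction of workgroup lanes doing useful work, assuming each instance
   starts a new workgroup and instances are never merged.
   Both arguments are assumed to be greater than zero. */
static double lane_utilization(unsigned verts_per_instance,
                               unsigned workgroup_size)
{
    /* Number of workgroups needed for one instance, rounded up. */
    unsigned groups =
        (verts_per_instance + workgroup_size - 1) / workgroup_size;
    return (double)verts_per_instance /
           (double)(groups * workgroup_size);
}
```

Under this model, one rectangle per instance (4 vertices) fills 4 of 32 lanes, i.e. 12.5%, while 8 rectangles per instance (32 vertices) would fill the workgroup completely. Actual behavior can differ, so this only shows why batching several rectangles per instance might help.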

Ultimately, performance questions can only accurately be answered by benchmarking on the target platform(s).

For point 2, that approach will work; the attribute divisors would be 0, 1 and rectCount for the corner, per-rectangle and per-copy attributes respectively. The attribute arrays would need to have 4, rectCount * M and M elements (i.e. the per-rectangle data would have to be repeated for each copy).
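
The effect of those divisors can be checked on the CPU with a small model of how the array element is chosen for each attribute. This is a sketch, not driver code: attrib_element and required_sizes are hypothetical helpers, and a base instance of 0 is assumed. In the real VAO the divisors would be set with glVertexAttribDivisor.

```c
#include <stddef.h>

/* Which array element an attribute reads for a given vertex index
   (taken from the EBO) and instance number.
   divisor == 0: the attribute advances per vertex.
   divisor == d: the attribute advances once every d instances. */
static size_t attrib_element(size_t vertex, size_t instance, size_t divisor)
{
    return divisor == 0 ? vertex : instance / divisor;
}

/* Element counts needed by the three attribute arrays. */
typedef struct ArraySizes {
    size_t corners;  /* divisor 0: unit-rectangle corners */
    size_t rects;    /* divisor 1: per-rectangle data, repeated per copy */
    size_t copies;   /* divisor rectCount: per-copy world position */
} ArraySizes;

static ArraySizes required_sizes(size_t rect_count, size_t copies)
{
    ArraySizes sizes = { 4, rect_count * copies, copies };
    return sizes;
}
```

For example, with rectCount = 1000 and M = 3, instance 2500 reads element 2500 of the per-rectangle array (the 501st rectangle of the third copy) and element 2 of the per-copy array.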

Alternatively, you could store the per-rectangle data in a UBO/SSBO indexed using gl_InstanceID % rectCount instead of an attribute, but this will have a performance cost. You’d have to profile it to see how much.
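
The index arithmetic for that alternative, and the memory it saves, can be sketched in C as below. The shader would do the same split on gl_InstanceID. The names split_instance and rect_data_bytes are hypothetical, and the 8 bytes per rectangle comes from the packing in the question.

```c
#include <stddef.h>

/* Rectangle index and mesh-copy index derived from one instance ID. */
typedef struct InstanceSplit {
    unsigned rect;  /* index into the per-rectangle buffer */
    unsigned copy;  /* index into the per-copy data */
} InstanceSplit;

/* rect_count is assumed to be greater than zero. */
static InstanceSplit split_instance(unsigned instance_id,
                                    unsigned rect_count)
{
    InstanceSplit split;
    split.rect = instance_id % rect_count;  /* gl_InstanceID % rectCount */
    split.copy = instance_id / rect_count;  /* matches divisor rectCount */
    return split;
}

/* Bytes of per-rectangle data at 8 bytes per rectangle.
   repeated != 0: attribute approach, data duplicated for every copy.
   repeated == 0: buffer approach, data stored once. */
static size_t rect_data_bytes(size_t rect_count, size_t copies,
                              int repeated)
{
    return 8 * rect_count * (repeated ? copies : 1);
}
```

With 1000 rectangles and 3 copies, the attribute approach stores 24000 bytes of rectangle data and the buffer approach stores 8000. Whether the saved memory outweighs the cost of the buffer lookup is something only profiling can answer.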

Thank you, GClements, but when you say “older implementations”, what does this mean?
Pre-GL 3.3?

It’s not tied to a specific version of OpenGL; it’s down to the hardware and driver version.

You’ll just need to test the performance on the hardware you’re interested in. If you’re planning on supporting a wide range of OpenGL 3+ hardware, you should assume that at least some of it will deal inefficiently with small instances (“small” in the sense that the number of vertices is much less than the workgroup size).