Improve performance: Render 100000+ objects

We prepared a little test project (VS 2010, C++, OpenGL, FreeGLUT, GLEW) to publish our solution to the following main requirements:

  • Rendering 100000+ objects at the same time, each of them independently accessible/selectable so that its properties can be changed.

  • The objects are of a certain type that defines their general shape and properties, which is also changeable during execution.

  • Use of VBOs, shaders, OpenGL 3.3

  • Should work on low-end graphics cards.

This is a very simplified version of our actual project. We are aware of some general methods to improve overall performance, such as rendering only the visible objects and skipping those that are temporarily outside the view frustum.

For this example we took those techniques and other features out to make it smaller and easier to understand.

But please don’t hold back with anything that comes to your mind and works for you; we may well have missed something obvious.

Any constructive criticism, feedback, information or tips to improve the performance are welcome.

On our computers (Intel i7-2600 CPU @ 3.40 GHz, GeForce GT 220) we render 100000 objects at about 6-8 frames/sec and 1000000 objects at 3 frames/sec.

It would be nice to improve the performance to get close to 20-25 frames/sec, although it might simply not be possible.

General Info:

Use your left mouse button to rotate the camera and A, D, W, S to move it to the left, right, up or down.

EDIT: What we are looking for is a way to improve the performance. Is our approach a proper way to render 100000 objects? We tested a lot of different ways, and we chose this solution because it gave us the best overall performance. But as I said, maybe we missed the obvious and did not use the “standard” OpenGL way of solving this problem. We could not really find a lot of information about projects where the requirement was to render this number of objects.

Hmm, perhaps you could point out what the gist of your solution is, because from taking a quick look over the code you seem to just brute force render all objects? So I guess I’m missing something.

One optimization that comes to mind is to avoid activating the same VAO over and over again if you draw stretches of objects of the same type.
Then the general advice to minimize state changes suggests that you should sort objects by type (unless there is other state that is more expensive to change and is common to many objects).
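To make the effect of sorting concrete, here is a minimal CPU-side sketch. It is not taken from the test project: `Shape`, `typeId`, `countVaoBinds` and `sortByType` are hypothetical names, and the actual GL calls are only indicated in comments.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a scene object; only the type matters here.
struct Shape {
    int typeId;   // e.g. 0 = cube, 1 = pyramid; assumed >= 0, one VAO per type
    int objectId; // stable ID so the object stays individually addressable
};

// Counts how many VAO binds one frame needs if we only rebind when the
// type differs from that of the previously drawn object.
std::size_t countVaoBinds(const std::vector<Shape>& shapes) {
    std::size_t binds = 0;
    int boundType = -1; // nothing bound yet
    for (const Shape& s : shapes) {
        assert(s.typeId >= 0);
        if (s.typeId != boundType) {
            // glBindVertexArray(<VAO of this type>) would go here.
            boundType = s.typeId;
            ++binds;
        }
        // The draw call for s would go here.
    }
    return binds;
}

// Sorts by type; stable_sort keeps the original order within each type.
void sortByType(std::vector<Shape>& shapes) {
    std::stable_sort(shapes.begin(), shapes.end(),
                     [](const Shape& a, const Shape& b) {
                         return a.typeId < b.typeId;
                     });
}
```

With cubes and pyramids alternating, every object triggers a bind; after sorting, the number of binds equals the number of types.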

If you want to cut down the number of draw calls you could upload the model matrices and color info into buffers and use instancing to render all objects of a type in one call.
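As a rough sketch of the data preparation this implies (all names here are hypothetical, and the layout is an assumption rather than the project's own), each type gets one contiguous array of per-instance data. Each array would then be uploaded to a buffer and drawn with a single glDrawElementsInstanced call.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Per-instance attributes, laid out the way they would sit in the buffer.
struct InstanceData {
    std::array<float, 16> model; // column-major 4x4 model matrix
    std::array<float, 4> color;  // RGBA
};

struct SceneObject {
    int typeId;        // which mesh (cube, pyramid, ...) this object uses
    InstanceData data; // its own transform and color
};

// Builds one instance array per type. The position of an object inside
// its array is the value gl_InstanceID takes for it in the vertex shader.
std::map<int, std::vector<InstanceData>>
buildInstanceBuffers(const std::vector<SceneObject>& objects) {
    std::map<int, std::vector<InstanceData>> perType;
    for (const SceneObject& o : objects) {
        perType[o.typeId].push_back(o.data);
    }
    return perType;
}

// Helper: column-major translation matrix.
std::array<float, 16> translation(float x, float y, float z) {
    return {{1.0f, 0.0f, 0.0f, 0.0f,
             0.0f, 1.0f, 0.0f, 0.0f,
             0.0f, 0.0f, 1.0f, 0.0f,
             x,    y,    z,    1.0f}};
}
```

The number of draw calls per frame then depends on the number of types, not on the number of objects.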

I have no idea what kind of objects you are trying to render, but you cannot display 100000 highly detailed objects on the screen all at once.
Any of the usual performance-boosting techniques will raise the frame rate:

  • view-frustum culling,
  • occlusion culling,
  • LOD,
  • early-Z/pre-Z/fast-Z (rendering order),
  • bindless (NV specific) or display lists,
  • instancing,
  • grouping by state,
  • etc.
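As an illustration of the first item, view-frustum culling, here is a minimal sketch that tests bounding spheres against six planes. The `Plane`/`Sphere` types and the convention that plane normals are unit length and point into the frustum are assumptions made for this example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Plane in the form n.p + d = 0; points with n.p + d >= 0 count as inside.
// The normal (nx, ny, nz) is assumed to be unit length.
struct Plane { float nx, ny, nz, d; };

// Bounding sphere of one object.
struct Sphere { float x, y, z, radius; };

// Conservative test: returns false only if the sphere lies completely
// behind at least one of the six frustum planes.
bool isVisible(const Sphere& s, const std::array<Plane, 6>& frustum) {
    assert(s.radius >= 0.0f);
    for (const Plane& p : frustum) {
        const float dist = p.nx * s.x + p.ny * s.y + p.nz * s.z + p.d;
        if (dist < -s.radius) {
            return false; // fully outside this plane
        }
    }
    return true; // inside or intersecting the frustum
}

// Returns the indices of all objects that survive the test.
std::vector<std::size_t> cull(const std::vector<Sphere>& bounds,
                              const std::array<Plane, 6>& frustum) {
    std::vector<std::size_t> visible;
    for (std::size_t i = 0; i < bounds.size(); ++i) {
        if (isVisible(bounds[i], frustum)) {
            visible.push_back(i);
        }
    }
    return visible;
}
```

Only the returned indices would be submitted for drawing; testing groups of objects instead of single objects reduces the per-frame CPU cost further.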

The simplest improvement I have found is to reduce the number of draw calls, which means grouping a number of objects in a single VBO. There is no hard and fast rule, but I try to keep the number of render calls below 2000. It may not give you all you want but will go a long way.
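One possible way to do that grouping on the CPU is sketched below. The `Mesh` and `Batch` types are made up for this example, and it assumes the positions are already in world space so that one draw call can cover the whole batch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One object's geometry; positions are xyz triples, assumed to be
// already transformed into world space.
struct Mesh {
    std::vector<float> positions;
    std::vector<unsigned int> indices;
};

// Several meshes merged into one vertex array and one index array.
struct Batch {
    std::vector<float> positions;
    std::vector<unsigned int> indices;
    std::vector<std::size_t> firstIndex; // where each object starts in 'indices'
};

Batch mergeMeshes(const std::vector<Mesh>& meshes) {
    Batch batch;
    for (const Mesh& m : meshes) {
        assert(m.positions.size() % 3 == 0);
        // Indices of this mesh must be shifted past the vertices already stored.
        const unsigned int baseVertex =
            static_cast<unsigned int>(batch.positions.size() / 3);
        batch.firstIndex.push_back(batch.indices.size());
        batch.positions.insert(batch.positions.end(),
                               m.positions.begin(), m.positions.end());
        for (unsigned int i : m.indices) {
            batch.indices.push_back(baseVertex + i);
        }
    }
    return batch;
}
```

The merged arrays go into one VBO and one index buffer and are drawn with a single glDrawElements call. `firstIndex` keeps track of where each object lives, so an individual object can still be found and updated later.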

As Aleksandar says there are lots of things you can do - but only do one at a time or you will go insane.

If you have that number of objects, and it should be possible to select and manipulate each of them, then I take it that it can hardly be possible to see all at the same time. That is, they have to be big enough to make it possible to see them, so that you can manipulate them.

If that is the case, then it means quite a lot of objects in the list are occluded most of the time. So the main focus should be on algorithms to determine what is occluded, and can be ignored.

One possibility is to use occlusion queries. These can tell you, after the drawing is complete, which objects were actually visible. Objects that were not visible can then be ignored until the view changes. It might not be feasible to define one query for every object, so some kind of grouping is needed. This solution doesn’t work well if you change the view or camera position a lot, but we don’t know what requirements you have.

I have a similar problem, with more than a million objects, but with a camera that moves a lot. In my case, I could group the objects into cubes representing their outer boundary. I then sorted the cubes by distance to the camera. The number of cubes was now down to a couple of hundred. The drawing algorithm takes 20 cubes at a time, draws all objects in each of these cubes, runs query tests for the next list of 20 cubes (drawing only the cubes themselves, invisibly, to speed it up), and then uses the query results to determine which of these cubes (and so their internal objects) shall be drawn in the next iteration. I could have done it with one cube at a time, but the query usually takes some time to finish.
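The grouping and sorting part of this scheme can be sketched on the CPU as follows. This is only an illustration under assumptions: it uses a uniform grid for the cubes, all names are hypothetical, and the occlusion queries themselves (glBeginQuery/glEndQuery around drawing each cube) are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

struct Vec3 { float x, y, z; };

// One cube of the grid together with the objects that fall inside it.
struct Cell {
    Vec3 center;
    std::vector<std::size_t> objects; // indices into the position array
};

// Assigns every object to the grid cube that contains its position.
std::vector<Cell> buildCells(const std::vector<Vec3>& positions, float cellSize) {
    assert(cellSize > 0.0f);
    std::map<std::tuple<int, int, int>, Cell> grid;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        const Vec3& p = positions[i];
        const int cx = static_cast<int>(std::floor(p.x / cellSize));
        const int cy = static_cast<int>(std::floor(p.y / cellSize));
        const int cz = static_cast<int>(std::floor(p.z / cellSize));
        Cell& c = grid[std::make_tuple(cx, cy, cz)];
        c.center = Vec3{(cx + 0.5f) * cellSize,
                        (cy + 0.5f) * cellSize,
                        (cz + 0.5f) * cellSize};
        c.objects.push_back(i);
    }
    std::vector<Cell> cells;
    for (auto& entry : grid) {
        cells.push_back(std::move(entry.second));
    }
    return cells;
}

// Orders the cubes front to back, so near cubes are drawn (and can act
// as occluders) before the queries for farther cubes are issued.
void sortFrontToBack(std::vector<Cell>& cells, const Vec3& camera) {
    auto dist2 = [&camera](const Cell& c) {
        const float dx = c.center.x - camera.x;
        const float dy = c.center.y - camera.y;
        const float dz = c.center.z - camera.z;
        return dx * dx + dy * dy + dz * dz;
    };
    std::sort(cells.begin(), cells.end(),
              [&dist2](const Cell& a, const Cell& b) {
                  return dist2(a) < dist2(b);
              });
}
```

The sorted list would then be processed in chunks (20 cubes at a time in the description above), issuing one query per cube rather than one per object.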

100000 objects

My first question would be how complex these objects are. Are we talking 100, 1k or 10k+ vertices per object? I always bring this up as a comparison: I tested a mobile GF 8600M GS, which is arguably incapable of doing concurrent, sophisticated stuff. Still, that thing pushed through 3 million vertices at roughly 30 FPS. However, the vertices came from very few, very detailed objects, so the draw call overhead was low.

My second question then would be: how exactly do you store your vertex attributes, and how exactly do you draw?

Edit: @tonyo_au: you will go insane doing that stuff anyway. the question is if you’ll be insanely happy or insanely frustrated. :wink:

First, thank you for your answers. I would have liked to edit my post, but I seem to have lost the button somewhere,
so I reply here to everyone to sum it up and keep it all in 1 post.

@ Carsten

Imagine looking at a huge parking lot from above, viewing more than 100000 cars.

So all the cars are visible at the same time, and they have to be selectable so that you can change their properties (like their CarType: make them all Ferraris if you like, or change their colors). And when you zoom in, you see them in full detail.

@ Aleksandars and tonyo_au

Considering the performance techniques mentioned, we have tried and also implemented most of them:

  • view-frustum culling

  • occlusion culling
    (works, but taken out for this example because our problem really exists whenever we have to render all 100000 objects at the same time)

  • LOD (we do render only lines and might actually try to render only points from a certain distance)

  • early-Z/pre-Z/fast-Z
    (there is no real Z in our example as the cars are all lined up on the ground)

  • bindless (NV specific) or display lists
    (In my understanding, OpenGL 3.3 does not work with display lists anymore)

  • instancing
    We will look into that, but after a short read 3 questions popped up in my mind:

    • Are the objects still accessible individually then?
    • Is this a relatively new technique that would not work on low-end graphics cards?
    • Would that be the same thing as particles in OpenGL ES?

  • Grouping by state
    This is a good idea, but in the tests we have made we didn’t see any relevant improvements.
    For instance if you create only Cube objects (take the if… %2 for the Pyramids out), delete the glBindVertexArray calls in the render method of the cube and put the glBindVertexArray call in the DrawScene method, like this:

 void DrawScene(void)
 {
    // The single glBindVertexArray call for the shared cube VAO goes here,
    // instead of one call per object inside Render().
    for (int id = 0; id < arrShapesSize; id++)
    {
        arrShapes[id]->Render(globalShapeRenderingCounter, ModelMatrixUniformLocation, ColorUniformLocation, ChangeColorUniformLocation);
    }

    if (globalShapeRenderingCounter > arrShapesSize) { globalShapeRenderingCounter = 0; }
 }


we couldn’t see any essential improvements.

@ Kopelrativ

As I mentioned above, whenever we zoom in and only see some of the objects (even 40000), it works fine.
But when we see them all at the same time we have a problem. Your way of rendering sounds really good; maybe we can take some overhead off by grouping the objects whenever we check whether they are in the frustum or not.

@ Thokra - The objects are not very complex; it is purely the number of objects that gives us the problem. Also, if the complexity rises, we could always take details out until the user zooms in and is actually able to recognize those details.

it is purely the number of objects that gives us the problem

So, how many draw calls at max? How many vertex array binds? How many shader program changes? How is lighting done?

  • Draw calls: 1 glDrawElements(GL_TRIANGLES, …) and 1 glDrawElements(GL_LINE_LOOP, …) for each object each frame -> 200000 calls each frame.
  • Vertex array binds: 1 for each object each frame -> 100000 calls.
    BUT we tried to reduce this to 1 call each frame (bind the VAO once and render all objects), which gave us no significant improvement (see above).
  • Shader program: We have only 1 shader -> 1 call per frame; we could also just bind it once and never change it, I suppose.
  • Lighting: No special lighting required.

200000 draw calls and 100000 VAO binds are way too many! It’s pretty safe to assume that your app is entirely CPU bound and even your low-budget GPU will not be fully utilized.

I’m sure you can identify groups of objects which can be drawn in a single batch. This is the case if the same shader program can be active and if the objects are stored in the same buffer. You can bind the buffer one time, or very few times depending on the total number of buffers, activate the shader program one time and then draw.

You may also check whether geometry instancing is an option, if your design permits that.

Look into

and other corresponding calls.

Thanks for the link, I will look into that right away. Geometry instancing was mentioned before too, so maybe you can also answer these questions that came to my mind after a short read:

  • Are the objects still accessible individually then? Or is it rather a grouping of objects into 1 object?
    We need to be able to select them individually and, let’s say, change the color of all of them at once or only of the first 5 rows.
  • Is this a relatively new technique that would not work on low-end graphics cards?
  • Would that be the same thing as particles in OpenGL ES?

  • Are the objects still accessible individually then? Or is it rather a grouping of objects into 1 object?

Yes. Each instance is identified in the vertex shader by the built-in integer gl_InstanceID, which you can use to look up that instance’s own data.
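To illustrate how individual access can work with instancing, here is a small CPU-side sketch. It assumes (hypothetically) that the per-instance colors live in one array, that objects are laid out row by row, and that the instance index equals row * columns + column. Changing the first 5 rows then touches one contiguous range of that array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

using Color = std::array<float, 4>; // RGBA, one entry per instance

// Byte range of the color buffer that changed and has to be re-uploaded,
// for example with glBufferSubData.
struct DirtyRange {
    std::size_t offsetBytes;
    std::size_t sizeBytes;
};

// Recolors 'rowCount' rows starting at 'firstRow'. The index into 'colors'
// is the same number the shader sees as gl_InstanceID.
DirtyRange recolorRows(std::vector<Color>& colors, std::size_t columns,
                       std::size_t firstRow, std::size_t rowCount,
                       const Color& newColor) {
    const std::size_t first = firstRow * columns;
    const std::size_t count = rowCount * columns;
    assert(first + count <= colors.size());
    for (std::size_t i = 0; i < count; ++i) {
        colors[first + i] = newColor;
    }
    return DirtyRange{first * sizeof(Color), count * sizeof(Color)};
}
```

Only the returned byte range needs to be sent to the GPU, so changing a few objects does not require rebuilding the whole buffer.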

  • Is this a relatively new technique that would not work on low-end graphics cards?

It’s GL 3.1 stuff, so it’ll work on any GF8+ card.

  • Would that be the same thing as particles in OpenGL ES?

No idea.

I render 100000 polylines with an average size of 5 vertices in < 2 ms on an ATI 5870.

I create the polylines roughly like the code below, using a single vertex buffer and a single index buffer:

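 // mappedVertex points into the mapped vertex buffer (its declaration is not shown in this post)
 unsigned int* mappedIndex = c_VBO_Pkt.IndexMap;
 unsigned int idx = c_VBO_Pkt.VertexOffset;
 for (unsigned int i = 0; i < c_LineSegmentCount; i++)
 {
    mappedVertex->x = p_V[i].x;
    mappedVertex->y = p_V[i].y;
    mappedVertex->z = p_V[i].z;
    mappedVertex->rgba = rgba;
    mappedVertex++;          // advance to the next vertex slot
    *mappedIndex++ = idx++;  // one index per vertex
 }
 // put in an object terminator (presumably used as the primitive restart index)
 *mappedIndex = 0xFFFFFFF7;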

I render with a single call
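The call itself is not shown here. One common way to draw many polylines with a single call is primitive restart, and the sketch below assumes that is what is meant. It is a self-contained CPU-side version of the buffer filling with hypothetical names (`PolylineBatch`, `appendPolyline`); the GL calls are only named in comments.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Vertex {
    float x, y, z;
    unsigned int rgba; // packed color
};

// Same terminator value as in the post above; it must never be a real index.
const unsigned int kRestartIndex = 0xFFFFFFF7u;

struct PolylineBatch {
    std::vector<Vertex> vertices;      // contents of the single vertex buffer
    std::vector<unsigned int> indices; // contents of the single index buffer
};

// Appends one polyline and terminates it with the restart index.
void appendPolyline(PolylineBatch& batch, const std::vector<Vertex>& line) {
    assert(batch.vertices.size() + line.size() <= kRestartIndex);
    unsigned int idx = static_cast<unsigned int>(batch.vertices.size());
    for (const Vertex& v : line) {
        batch.vertices.push_back(v);
        batch.indices.push_back(idx++);
    }
    batch.indices.push_back(kRestartIndex);
}

// Drawing would then be, once per frame (GL 3.1+):
//   glEnable(GL_PRIMITIVE_RESTART);
//   glPrimitiveRestartIndex(kRestartIndex);
//   glDrawElements(GL_LINE_STRIP, <number of indices>, GL_UNSIGNED_INT, 0);
```

Every restart index ends the current line strip and starts a new one, so all polylines in the batch are drawn by one glDrawElements call.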



I don’t know if this works on OpenGL ES, but I don’t believe it is a new feature in standard OpenGL.