I’ve been playing around with OpenGL and am struggling with a ‘best coding practices’ type question regarding VBOs. I’ll try to describe the problem I’m having. I don’t currently have any code worth sharing, since I’m still trying to make some long-term design decisions. If you need code, I can write some up quickly and post it.
Say I have 100 moving cubes I’m rendering in a scene all of which share the same mesh. Which of the following makes more sense?
Option A:
1. Create one large VBO (large enough to store 100 instances of the cube mesh).
2. For each cube, apply a transform to the mesh to move it to the correct position in world space (CPU-side) and add its vertex data to the VBO.
3. Make one glDrawElements call to draw the entire scene.
4. Clear the VBO (or ping-pong between a second VBO) and repeat from #2 for each subsequent frame.

Option B:
1. Create one small VBO (just large enough to store a single instance of the cube mesh).
2. Buffer the VBO with the untransformed (model-space) cube mesh.
3. For each of the cubes, set uniforms in the shader to transform the cube into the correct position in world space (GPU-side) and call glDrawElements.
4. Repeat from #3 for each subsequent frame.
So: Option A vastly reduces the number of glDrawElements calls, but the VBO must be rebuilt every frame and all transformations are performed on the CPU. Option B never needs to update its VBO and all transformations are done in the shader, but it requires many more API calls.
Which would you suggest is the better option? Or are there other options I’m completely missing?
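To make Option A’s step 2 concrete, here’s a minimal sketch of the CPU-side transform, assuming translation-only transforms and a flat float array of model-space positions (the function and variable names are mine, purely illustrative):

```c
#include <stddef.h>

/* Translate one cube's model-space vertices into world space and
   append them to a big client-side array that later becomes the
   VBO contents (one slot of vert_count vertices per cube). */
static void append_cube(float *dst, size_t cube_index,
                        const float *cube_verts, size_t vert_count,
                        float x, float y, float z)
{
    float *out = dst + cube_index * vert_count * 3;
    for (size_t i = 0; i < vert_count; ++i) {
        out[i * 3 + 0] = cube_verts[i * 3 + 0] + x;
        out[i * 3 + 1] = cube_verts[i * 3 + 1] + y;
        out[i * 3 + 2] = cube_verts[i * 3 + 2] + z;
    }
}

/* After filling `dst` for all 100 cubes, you would upload it once
   per frame, e.g.:
     glBufferData(GL_ARRAY_BUFFER, size, dst, GL_STREAM_DRAW);
   and then issue a single glDrawElements for the whole scene. */
```

Rotation/scale would replace the additions with a full matrix multiply per vertex, but the buffer layout stays the same.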
I’d do it like this:
store only the vertices, normals and so on for a single cube in the VBO.
use a loop over the cubes: glTranslate, glRotate, glScale, then call glDrawElements for each (you can use shaders to do the transforms instead).
Unless you specifically need the transformed vertices and normals on the CPU side (for local calculations like collisions and such), there’s no need to duplicate them on the GPU.
Look at geometry instancing too; it would be suitable for your use case.
Thank you both. If I were to pursue geometry instancing would it be difficult to roll my own rather than rely on GL_ARB_draw_instanced?
I’d be inclined to use a for loop to index into a uniform array for the appropriate instance’s position, normals, etc. but if I understand from some looking around, loops with a non-fixed number of iterations don’t work in shaders. Is that correct? If so, are there better ways of working around that or should I just define a maximum number of elements the shader can handle and use a fixed number of iterations?
I’d be inclined to use a for loop to index into a uniform array for the appropriate instance’s position, normals, etc.
Shaders don’t work that way. A vertex shader takes one set of vertex attributes and outputs the values for a single vertex. One vertex in, one vertex out. You can loop over whatever you want, but a vertex shader can only generate a single vertex.
Geometry shaders have a bit more freedom, but instancing is an optimization. And geometry shaders aren’t… optimal. Particularly when you’re generating a bunch of geometry out of whole cloth like this.
It’s best to work with what the two instancing extensions provide. You have your pick of getting your per-instance data from attribute arrays (ARB_instanced_arrays), or getting your per-instance data as a numerical input value that you use to possibly index a uniform array or whatever (ARB_draw_instanced). Both are core, and you can even combine them where appropriate.
Using a uniform array is certainly possible. There’s a D3D demo incorporating it as an example way of doing instancing, and I’m certain that the concepts would translate to OpenGL almost one-for-one. More info is available here (Technique 2: Shader Instancing (with Draw Call Batching) is the one).
This won’t perform as well as true hardware instancing, of course, and has limits such as how many uniforms you can have at any one time. You’re better off just using the extensions which will be supported on more or less any non-paleolithic hardware.
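To make that uniform-count limit concrete: with shader instancing you replicate the cube mesh N times in one VBO and draw the 100 cubes in ceil(100/N) batches, where N is capped by your uniform budget. A rough sketch of the arithmetic (the 256-vec4 budget in the note below is an assumed figure, not a queried value; real code would query something like GL_MAX_VERTEX_UNIFORM_COMPONENTS):

```c
/* Hypothetical batching arithmetic for uniform-array ("shader")
   instancing. vec4_per_instance is 4 if you upload a full mat4 per
   cube, or 1 if a single vec4 position offset is enough. */
static int instances_per_batch(int max_vec4_uniforms, int vec4_per_instance)
{
    return max_vec4_uniforms / vec4_per_instance;
}

static int batches_needed(int instance_count, int per_batch)
{
    /* ceiling division */
    return (instance_count + per_batch - 1) / per_batch;
}
```

With, say, a 256-vec4 budget and one mat4 per cube, that’s 64 instances per batch and two glDrawElements calls for 100 cubes; with a single vec4 offset per cube, all 100 fit in one batch.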
Sorry to burst your bubble, but instancing is not going to help you in the slightest here. 100 cube instances is nowhere near enough to give you any payback; you need closer to 10,000 instances!
If you are using the compatibility profile, I’d just stick to a glTranslatef loop over the 100 ‘instances’ (you could optimise slightly by using glLoadMatrix instead of glPushMatrix/glTranslatef/glPopMatrix).
If, on the other hand, you are using a shader to render the cubes, the alternative is to supply the model matrix for each ‘instance’ using a glUniform* call.
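A sketch of that per-instance upload, assuming a mat4 uniform in the vertex shader (uModel, uModelLoc and the counts below are illustrative names, not from the thread). The matrix is built column-major, which is the layout glLoadMatrixf expects and glUniformMatrix4fv expects when transpose is GL_FALSE:

```c
/* Build a column-major 4x4 translation matrix: the translation
   lands in elements 12..14 (the fourth column). */
static void translation_matrix(float m[16], float x, float y, float z)
{
    for (int i = 0; i < 16; ++i)
        m[i] = 0.0f;
    m[0] = m[5] = m[10] = m[15] = 1.0f;
    m[12] = x;
    m[13] = y;
    m[14] = z;
}

/* Per-cube draw loop (GL calls shown for context only):
   float m[16];
   for (int i = 0; i < 100; ++i) {
       translation_matrix(m, pos[i].x, pos[i].y, pos[i].z);
       glUniformMatrix4fv(uModelLoc, 1, GL_FALSE, m);
       glDrawElements(GL_TRIANGLES, index_count, GL_UNSIGNED_SHORT, 0);
   }
   In the compatibility-profile path you'd call glLoadMatrixf(m)
   instead of setting a uniform. */
```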
@mhagain: Paleolithic (maybe Neolithic?) hardware is actually a concern of mine. I’m trying to target everything back to integrated Intel media accelerators (machines supporting at least OpenGL 1.4 + ARB_vertex_program).
@BionicBytes: If I’m not anticipating that order of magnitude of instances (and I’m interested in supporting old hardware) I suppose I should scrap geometry instancing and just go back to Option B from the original post.
Yes, option B is a reasonable approach given the limited instance count and target hardware.
It depends on where your bottleneck is, really. On some hardware A might be faster; on other hardware B might be preferable. Since you’re going back to OpenGL 1.4 level you can’t assume the presence of VBOs either, so you’re going to need to factor in use of client-side vertex arrays too. Based on that, option A certainly starts to look more attractive: there’s no need for dynamic VBOs or ping-ponging since you’re in client-side memory anyway, the Intel kit you’re talking about targeting (at 1.4 level, e.g. the 945) emulates the vertex pipeline in software so you lose nothing by doing your own transforms in software, and you get to reduce draw call overhead.
Longer term, I’d actually code both options, run some dummy frames at startup to profile them (skip the first few frames here, as things may still be settling down a little) and select the best depending on the result.
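That startup profiling step might be sketched like this, with the frame timing itself abstracted away (collect per-frame times however you like, e.g. around your buffer swap); the names are mine, purely illustrative:

```c
/* Average the recorded frame times, discarding the first `skip`
   warm-up frames where drivers may still be compiling/settling. */
static double avg_after_warmup(const double *frame_ms, int n, int skip)
{
    double sum = 0.0;
    int count = 0;
    for (int i = skip; i < n; ++i) {
        sum += frame_ms[i];
        ++count;
    }
    return count ? sum / count : 0.0;
}

/* Return 0 to select option A, 1 to select option B, based on
   which rendered its dummy frames faster on average. */
static int pick_option(const double *a_ms, const double *b_ms,
                       int n, int skip)
{
    return avg_after_warmup(a_ms, n, skip) <= avg_after_warmup(b_ms, n, skip)
               ? 0 : 1;
}
```

Keeping the decision data-driven like this also means new hardware picks the right path automatically, with no per-vendor special cases.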