I’m getting round to adding instancing support to my engine, primarily because we read about it so much in the DX world and it seems like a good idea in principle. However, I’m starting to have doubts about its real-world performance.
I’ll try and explain my thinking below…
I’m adding instancing support to render relatively simple OBJ type models (think trees on a terrain). Ignoring the problem of LODding the models with camera distance, I want a simple technique to plumb into the engine, and as I see it there are three techniques to choose from:
- Uniform Buffer Objects
- Texture Buffer Objects
- Instanced Arrays (ARB_instanced_arrays)
The per-instance data I’m trying to plumb into the engine is a modelview matrix per instance and all three techniques could be used in principle to solve this.
So which technique to use?
1. Uniform Buffer Objects
This is actually harder to plumb into the engine than I first thought. I’ve modified my underlying shader library to support uniform blocks, but I have to track which shader is accessing which UBO, because if a shader is recompiled I have to issue a glUniformBlockBinding call to reset the uniform block binding point for that shader.
Additionally, the memory layouts are a pain, and the application needs to track offsets to packed uniforms within the block. Finally, there is a limit on its size anyway – which may or may not be an issue.
I’m having difficulty coding up a suitable generic solution for Uniform Buffer Objects, so I’ll have to defer on this for now.
2. Texture Buffer Objects
These are a dream to work with; accessed just like a texture and simpler to create than vertex buffer objects. Dead easy to plumb into an abstract library which my engine is built upon.
Two TBOs are created: one to hold the entire set of modelview matrices for all 400 instances; the other to hold an index list of which [modelview] instances to render this frame.
The index TBO is updated each frame to include the index [into the modelview TBO] of the models which have been determined to be visible.
During rendering, glDrawElementsInstanced is called and the vertex shader performs a texelFetch on the renderlistbuffer sampler uniform to fetch the model index. Using this model index, 4 more texelFetches are performed to read the complete modelview matrix.
Here’s a snippet from the vertex shader:
[b]//uniform mat4 modelmatrix; //replaced with texture buffer objects - instanced rendering
uniform samplerBuffer modelmatrixbuffer;  //RGBA32F, 4 texels per matrix
uniform usamplerBuffer renderlistbuffer;  //R32UI, one index per visible instance

//get the real batch instance from the render list (supplied as an integer texture buffer)
int offset = 4 * int(texelFetch( renderlistbuffer, gl_InstanceID).r);
// int offset = gl_InstanceID * 4; //matrices are indexed as blocks of 4 RGBA floats

//reassemble the modelview matrix from its four column texels
mat4 modelmatrix = mat4(texelFetch( modelmatrixbuffer, offset),
                        texelFetch( modelmatrixbuffer, offset+1),
                        texelFetch( modelmatrixbuffer, offset+2),
                        texelFetch( modelmatrixbuffer, offset+3));[/b]
3. Instanced Arrays
Part of core OpenGL 3.3, but ARB_instanced_arrays has been around on ATI drivers for a while. This method allows us to upload per-instance attribute streams to OpenGL; these streams are advanced only once per instance instead of once per vertex.
My assumption is that this technique is more efficient than the other two – they have to look up 5 texels/uniforms per vertex, which hurts performance. With this technique we are sending more per-instance attribute data (4 RGBA floats per instance), but there is less work per vertex – so this should result in faster rendering.
[b]//uniform mat4 modelmatrix; //replaced with 4 per-instance attribute streams - instanced rendering
attribute vec4 modelview1;
attribute vec4 modelview2;
attribute vec4 modelview3;
attribute vec4 modelview4;

//reassemble the modelview matrix from its four column streams
mat4 modelmatrix = mat4(modelview1, modelview2, modelview3, modelview4);[/b]
Compared to drawing 400 instances individually:
The TBO technique is roughly 33% slower (ATI Radeon 4850, quad-core processor @ 2.6GHz, OpenGL 3.3/4.0 beta drivers, and also on my nVidia GT8600m laptop).
Instanced Arrays are significantly slower – roughly 75% slower (I can only test on ATI, since the nVidia mobile drivers don’t yet support the ARB extension or GL 3.3).
I can’t put this down to beta drivers, since the ARB extension has been around on ATI drivers for some time. The only performance caveat I can make is that I don’t cull any models before drawing with technique #3; in other words, I just draw all 400 models whether they are in camera view or not. I intend to perform more tests whereby I cull away non-visible models and then upload only the model matrices of visible objects to the VBO. The downside to this is the extra time spent copying memory.
My conclusion so far: instanced rendering is not worth the effort and provides no real-world benefit in my tests.
I guess for specific cases where many thousands of objects are drawn (e.g. asteroids in a space simulator), there may be some benefit.
Anyone else had similar experiences they wish to share?