Instancing sucks?

Your vertex attribute is offset by 1 byte into the currently bound VBO. Is that your intention?
Are you drawing with a shader? If not, you should be aware that generic vertex attributes do not work with the fixed-function pipeline.

In the real world we can’t just draw 1000 tree models on the terrain and be done with it - the GPU just can’t cope with all those vertices and complex pixel-shader calculations/lighting.

Then don’t use instancing. Instancing is a simple tool for a simple problem. If your problem does not fit the problem that instancing is meant to solve, then I’m guessing instancing won’t solve the problem it’s not meant to solve.

If you can finesse or coerce your data to actually fit the conditions that instancing works under, good. But instancing is not a magical panacea that will solve every problem associated with drawing lots of things.

If my data were not instanceable, I’d next look to glDrawElementsBaseVertex to avoid extra buffer binds and format changes. State changes are something you’ll have to live with.
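For the curious, the call looks roughly like this - several sub-meshes packed into one VBO/IBO pair, drawn without rebinding buffers or changing the vertex format (the mesh struct and VAO name here are just placeholders):

/* Sketch: meshes share one VBO/IBO; basevertex is added to every index value. */
glBindVertexArray(sharedVao);                    /* one VAO describes the shared layout */
for (int i = 0; i < numMeshes; ++i) {
    glDrawElementsBaseVertex(GL_TRIANGLES,
                             mesh[i].indexCount,
                             GL_UNSIGNED_INT,
                             (void*)(mesh[i].firstIndex * sizeof(GLuint)),
                             mesh[i].baseVertex);
}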

Alternatively, you could accept the CPU limitations and increase your mesh and shader complexity to the point where you’re GPU limited again.

Also, according to the spec (and apparently the AMD implementation, but not NVIDIA’s), that code should fail in an OpenGL 3.2+ core profile, since the default vertex array object (the name zero) is no longer available there.

To function, it would need this at the start:


GLuint vao;
glGenVertexArrays(1, &vao);   /* create a vertex array object                          */
glBindVertexArray(vao);       /* bind it so the attribute state has somewhere to live  */

Yes-- assume all of the relevant shader etc state has been prepared, and that the positions array contains real data.

The intention was that the pointer is offset one byte into the VBO, and the attribute is 3 floats.
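In other words, something along these lines (the buffer name and attribute index are placeholders):

/* Assumed setup under discussion: attribute 0 is 3 floats, but the
 * pointer is deliberately offset 1 byte into the VBO.                */
glBindBuffer(GL_ARRAY_BUFFER, positionVbo);
glEnableVertexAttribArray(0);
glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, 0, (void*)1);  /* note the 1-byte offset */
glDrawArrays(GL_TRIANGLES, 0, vertexCount);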

So-- what’s going to happen when this is drawn?

What does the spec say should happen?

What do you think should happen?

I’m asking this, re: glfreak’s hypothetical “glVBLayout()”.

Random floats, producing spiky triangle-hedgehog soup. Try it :slight_smile: . The number of primitives will be decreased by 1 if the buffer wasn’t created big enough.

It’s just a contiguous chunk of memory; the GL implementation decides where it’s kept and when. Nothing is magically optimized for you - it’s your task as a programmer to align and preprocess the data nicely, just like CPU-side data in any app.

@BionicBytes:
You never shared your scene specs. “33% faster” and “400 instances” by themselves tell us nothing - and 400 instances simply isn’t many. Try 140k per frame :slight_smile: (I’ve had code do this, @60fps).

Try this: store N instances’ unique data simply in uniforms - vec4 or mat4 arrays. Keep N*sizeof(InstanceData)<8192. Access them in a way that makes element offsets easy to calculate.
You must always bear in mind that accessing the per-instance data via uniforms vs UBOs vs TBOs vs instanced arrays has its pros and cons - and a per-vertex shader overhead. None is universal, but each can be important in certain situations.

Instancing does not suck; it’s just that GL’s draw calls are already quite fast enough for most stuff.

Is everyone assuming people are coding to the 3.2+ core profiles?
Btw, is there some currently maintained resource that shows which profiles are supported on which hardware, and which extensions?
There used to be the delphi3d.net repository, but that’s down.
And the GLview database doesn’t seem to be updated much (maybe because it fails to launch a webmail client to send the report!).
http://www.realtech-vr.com/glview/

I didn’t see the point in giving actual fps figures, as they mean nothing. I was comparing two different techniques, so a relative figure is all that is needed.
I can tell you that I render 400 instances in 3 different ways: normally, reflected and then shadowed - all part of the scene graph. I have tested again, upping the instance count to 4096 (so that’s 12,000+ instances in total across 3 draw calls). In this scene there are over 130 million visible triangles! Yes, that’s right - although the fps is only 14-20 whether instance arrays are used or not. The difference when not instancing via instance arrays is more CPU time spent performing the draw calls (as measured by the engine), but strangely the app feels more responsive to mouse input, resulting in smoother frames during camera panning and motion - despite similar draw calls.
I also tested 1600 instances (same scene engine - so 1600 instances for normal rendering, again during reflection, and again when shadowing for the sun light). Here, no instancing gave 50 fps for 80+ million tris, and instancing (ARB instanced arrays) gave 33 fps.

So I disagree - instancing does suck!
Common sense suggests drawing everything in a single call should be quicker than the same thing with multiple draw calls - but it’s not.
I also don’t agree that instancing isn’t suitable for my needs. What could be more appropriate or simple than supplying x models with x modelview matrices? Surely the perfect instancing case?

Someone asked about core profile - no, just the 3.3 compatibility profile on an ATI Radeon 4850.

Try uploading the instance data with uniforms, as has already been suggested.

This gives me a significant improvement in a (game) scene with only a few dozen instances on average across a few hundred (instanced) draw calls.

Now that’s some nice data :slight_smile: . The lack of mouse-input smoothness is imho related to the draw-call complexity; I’ve seen it in my scenes when I bake everything into several big meshes.

With the simple uniform-arrays trick, my instanced objects have, at minimum, exactly the same performance as plain draw calls. And it saves a lot of CPU :). 10 mil poly/frame, 48 fps, 30k instances total of (only) ~150 base meshes (encapsulated in VAOs, actually). GTX275, MSAA 4x, deferred, 720p. (The 1 triangle/cycle limit is nigh.) Non-synthetic scene/benchmark.
I recalculate/re-upload the instances’ data via glUniform4fv. If I needed the instance data to be larger than 100 bytes, I’d try UBOs/TBOs/IAs again (they were slower, but I last tried them when they were announced); for now I’m happy with this tiny, almost-universal solution :slight_smile: . Plus, for those bigger chunks I can pass gl_InstanceID to the frag shader, which often fetches instance data fewer times than the vertex count in bigger scenes (after a rough depth prepass).

Anyway, what I meant is that maybe IAs and TBOs have too high a shader and/or driver overhead around uploading and fetching.

I was just going to test this and noticed that glVertexAttribDivisor and GL_VERTEX_ATTRIB_ARRAY_DIVISOR aren’t included in gl3.h at the moment, even though they have been core since OpenGL 3.3.

I added this report at http://www.khronos.org/bugzilla/show_bug.cgi?id=299

Common sense suggests drawing everything in a single call should be quicker than the same thing with multiple draw calls - but it’s not.

It is faster; that is, calling the function is faster. However, by implementing instancing, you have made your shader/rendering system do more work. So what may have been CPU bound now becomes GPU bound.

I also don’t agree that instancing isn’t suitable for my needs. What could be more appropriate or simple than supplying x models with x modelview matrices? Surely the perfect instancing case?

Um, no.

Instancing is intended to remove state change and draw call overhead when drawing large numbers of objects. That is, if your rendering is CPU-bound, it should provide a speed-up if you’re drawing a lot of things.

So before you can expect performance improvements, you need your rendering loop to be CPU-bound on state change and draw call overhead. This is why measuring FPS is not really the best way to test this kind of performance.

Once you’ve ensured that you are CPU-bound, you should then ensure that you are rendering enough instances for the draw call overhead gain to offset the loss from using less efficient means.
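If you want to see where the time actually goes, rather than watching FPS, one rough approach is to time the CPU submission and the GPU execution separately - a sketch, assuming GL 3.3 / ARB_timer_query is available (the timer and draw functions are placeholders):

/* Sketch: if CPU submit time dominates GPU time, the frame is
 * draw-call/state-change bound and instancing can help.          */
GLuint query;
glGenQueries(1, &query);

double cpuStart = GetTimeSeconds();                 /* hypothetical high-resolution timer */
glBeginQuery(GL_TIME_ELAPSED, query);
DrawAllInstances();                                 /* your draw calls go here            */
glEndQuery(GL_TIME_ELAPSED);
double cpuSubmitMs = (GetTimeSeconds() - cpuStart) * 1000.0;

GLuint64 gpuNs = 0;
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &gpuNs);  /* blocks until the GPU has finished */
printf("CPU submit: %.2f ms, GPU: %.2f ms\n", cpuSubmitMs, gpuNs / 1.0e6);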

Ilian’s uniform array trick might work, though you don’t get very many uniforms to play with.

It is faster; that is, calling the function is faster. However, by implementing instancing, you have made your shader/rendering system do more work. So what may have been CPU bound now becomes GPU bound.

Not sure about this statement. Whether instancing or not, the system still has to draw 4000+ instances - the only difference is whether the CPU is locked in a loop whilst doing so. There is no extra work for the rendering loop to do - 12 million triangles is 12 million triangles, whether rendered one at a time in a CPU loop or as an instanced batch. On top of that, my observations show that the 4000+ extra draw calls for per-instance CPU rendering are a lower overall overhead than instancing via ARB_instanced_arrays or a TBO. This surprised me - as I said, I’d assumed this to be a perfect case for instancing.

So before you can expect performance improvements, you need your rendering loop to be CPU-bound on state change and draw call overhead.

Instancing is intended to remove state change and draw call overhead when drawing large numbers of objects.

How does instancing improve the situation if the app is CPU state-change limited? Maybe I’ve missed something here, but I thought the point of instancing was to avoid state changes by rendering the same object over and over again - which by definition does not usually involve any state changes. Per instance you only want to vary something for the object as a whole, such as position, colour, a 3rd texture coordinate or similar, and avoid the draw-call overhead. What I can see as a possible limitation is some architectural limit on the number of vertex attribute streams being passed in simultaneously - I currently use 7, with 4 of them passing the per-instance modelview matrix.
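For reference, this is roughly how I pass the per-instance matrix as an instanced array - a mat4 attribute occupies four consecutive attribute locations, each with a divisor of 1 (the attribute indices and buffer name are placeholders):

/* Sketch: per-instance modelview matrix as an instanced vertex attribute.
 * A mat4 takes four vec4 attribute slots (locations 3..6 in this example). */
glBindBuffer(GL_ARRAY_BUFFER, instanceMatrixVbo);
for (int col = 0; col < 4; ++col) {
    glEnableVertexAttribArray(3 + col);
    glVertexAttribPointer(3 + col, 4, GL_FLOAT, GL_FALSE,
                          sizeof(float) * 16,                 /* stride: one mat4 per instance */
                          (void*)(sizeof(float) * 4 * col));  /* offset: column within the mat4 */
    glVertexAttribDivisor(3 + col, 1);                        /* advance once per instance      */
}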

Ilian’s uniform array trick might work, though you don’t get very many uniforms to play with.

I must have missed this whole uniform arrays idea. Can someone enlighten me? I know about UBOs, which came as part of ARB_uniform_buffer_object. Did they arrive via EXT_bindable_uniform? Are they part of core (and which version)? What limitations are there?

…if they are easy to integrate then I can test straight away…

10 mil poly/frame, 48 fps, 30k instances total of (only) ~150 base meshes (encapsulated in VAOs, actually). GTX275, MSAA 4x, deferred, 720p.

OK, that sounds like a lot of instances! But if I do the maths, that means you have ~150 batches totalling 30k instances - that’s actually only 200 instances per batch, so quite a small number per batch. I found that with IAs and TBOs instancing is not worth the effort - so are you using uniform arrays? Those are something I’ve overlooked.

“Uniform arrays” is not some special new object. It’s just this:


// Per-instance data lives in an ordinary uniform array.
uniform vec4 data[512];

void main(){
    // gl_InstanceID indexes straight into the array - no buffer fetch needed.
    vec4 v1 = data[gl_InstanceID];
}

Access to good ol’ uniforms is immediate in the shader (unlike with TBOs etc.), uploading them is the fastest RAM->VRAM transfer available, and the total-size limitation forces you to keep the data in L1 cache (which the driver then copies quickly, in an aligned fashion [without memory read-back], to the first FIFO and internal buffers).

You do have to limit the number of instances per glDrawElementsInstanced() call - e.g. to 150 for a mat4x3 plus a float (52 bytes per instance, so roughly 8192/52 ≈ 157 fit) - but it’s a non-problem.
Meanwhile, if you’re doing transform-feedback visibility culling, or the per-instance data is big and constant, then uniform arrays alone won’t be enough.

Btw, re “30k instances of 150 base meshes”: some of the meshes were 60k tris with few instances; others were 200-500 tris with 5k instances.

Oh I see.

uniform vec4 data[512];

So how do you allow the uniform array to be any size?
In other words, I won’t know the size of the array until I load the scene data - so how does that fit with the shader having a fixed array size at compile time?
Is there some sort of 8K limit on array sizes? Is that why you said to keep the DrawInstanced batch count to ~150?
Is that 8K per uniform or per shader?

“30k instances of 150 base meshes”: some of the meshes were 60k tris with few instances; others were 200-500 tris with 5k instances.

Good stuff! Did you ever benchmark the instancing benefit against drawing LODed models instead (no instancing) - thus reducing the vertex count and/or pixel-shader instructions (with a LODed material shader to match the model LOD)?

I pick a number, e.g. MaxInstances=128, for each shader. I currently optimize only for nVidia G80 and GTXxxx, so I try to keep the 16 kB register file full, but empty enough for enough warps to fit (this depends on the size of the per-warp registers, which is often around 20-80 floats). So 8 kB happens to be a good middle ground (for GTX; 4 kB for G80) if I do only 1-3 tex lookups in the frag shader.

Let’s say I picked MI=128. If I have only 7 instances, I upload just those 7 instances’ data via glUniform4fv. If I have 300 instances, I upload the first 128, call glDrawElementsInstanced, upload the next 128, call glDrawxx, then upload the remaining 300-128-128=44 instances and call glDrawxx again. Simple :slight_smile: . 3 calls instead of 300. Staying in L1 instead of going overboard. No buffers to resize.
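A rough sketch of that batching loop, assuming the per-instance data is packed as vec4s in a flat float array (the uniform location, MAX_INSTANCES and VEC4S_PER_INSTANCE values are placeholders):

/* Sketch: draw `total` instances in chunks that fit the uniform array. */
#define MAX_INSTANCES       128
#define VEC4S_PER_INSTANCE  4     /* e.g. a mat4x3 packed as vec4s */

void DrawInstancedBatches(GLint dataLoc, const float* instanceData,
                          int total, GLsizei indexCount)
{
    for (int first = 0; first < total; first += MAX_INSTANCES) {
        int count = total - first;
        if (count > MAX_INSTANCES) count = MAX_INSTANCES;

        /* upload this chunk's per-instance data into the uniform array */
        glUniform4fv(dataLoc, count * VEC4S_PER_INSTANCE,
                     instanceData + first * VEC4S_PER_INSTANCE * 4);

        /* draw `count` instances; gl_InstanceID restarts at 0 each call,
         * so the shader can index the uniform array directly            */
        glDrawElementsInstanced(GL_TRIANGLES, indexCount,
                                GL_UNSIGNED_INT, 0, count);
    }
}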

[edit]
Unfortunately, I do not do regular LOD yet, just A2C dissolve for foliage; I barely have enough time to model the LOD 0 meshes currently ^^". I had some geomorphing-LOD objects, but I haven’t figured out a way of mending the discontinuous UVs yet, so I’ll tackle that later. Anyway, when those types of meshes are enumerated in the scene graph (right after frustum and occlusion culling), they calculate their intLOD and fracLOD and get grouped by intLOD. Instanced meshes need to have the same triangle count and indices, so each LOD group needs its own series of glDrawxx calls.

Thanks for clearing up what you do - that helps.
I think you’ve raised more questions now though!

I currently optimize only for nVidia G80 and GTXxxx, so I try to keep the 16 kB register file full, but empty enough for enough warps to fit (this depends on the size of the per-warp registers, which is often around 20-80 floats). So 8 kB happens to be a good middle ground (for GTX; 4 kB for G80) if I do only 1-3 tex lookups in the frag shader.

  1. What 16 kB register file? How do you know it’s 16 kB - have you read this somewhere, or is it queryable through OpenGL?
  2. Per-warp registers? Do you just mean the number of inputs (max uniforms/max attributes) the hardware supports?
  3. I enumerate hardware capabilities from the GL context - see the snippet below. Are these what you refer to as 4K for G80? (My GFX card here is an nVidia 8600GT.)

OpenGL 3.0 Detected
EXT_texture_array:
MAX_ARRAY_TEXTURE_LAYERS: 512
ARB_Framebuffer_object:
MAX_COLOR_ATTACHMENTS: 8
MAX_RENDERBUFFER_SIZE: 8192
MAX_SAMPLES: 16
ARB_Texture_Buffer_Object:
MAX_TEXTURE_BUFFER_SIZE: 134217728
ARB_Uniform_Buffer_Object:
MAX_UNIFORM_BLOCK_SIZE: 65536
MAX_VERTEX_UNIFORM_BLOCKS: 12
MAX_GEOMETRY_UNIFORM_BLOCKS: 12
MAX_FRAGMENT_UNIFORM_BLOCKS: 12
MAX_COMBINED_UNIFORM_BLOCKS: 36
MAX_UNIFORM_BUFFER_BINDINGS: 36
MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS: 200704
MAX_COMBINED_GEOMETRY_UNIFORM_COMPONENTS: 198656
MAX_COMBINED_FRAGMENT_UNIFORM_COMPONENTS: 198656

  4. I don’t quite understand what the 1-3 tex lookups have to do with anything. Can you expand on this?

It appears the G80 and GTX2xx store uniforms in the 8k/16k register file instead of the L1-cached constants memory. I’m not sure at all, though. GL_MAX_VERTEX_UNIFORM_COMPONENTS=4096 on this GTX275, i.e. 16 kB, which matches the register-file size. (But there are frag uniforms etc. which can make a program use more than 16 kB of constants+registers, so that makes me doubt the previous logic.)
GL_MAX_COMBINED_VERTEX_UNIFORM_COMPONENTS=200704 is for UBOs - those probably go through the L1, but could be a few cycles slower. (Btw, you really should try UBOs with your scene.)
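If you do try them, the setup is small - a minimal sketch, assuming the shader declares a block like layout(std140) uniform InstanceBlock { vec4 data[1024]; }; (the block name, array size and binding point are placeholders):

/* Sketch: the same per-instance data in a uniform block instead of plain uniforms. */
GLuint ubo;
glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(float) * 4 * 1024, NULL, GL_STREAM_DRAW);

GLuint blockIndex = glGetUniformBlockIndex(program, "InstanceBlock");
glUniformBlockBinding(program, blockIndex, 0);    /* block  -> binding point 0 */
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);      /* buffer -> binding point 0 */

/* per batch: refill the visible part of the block */
glBufferSubData(GL_UNIFORM_BUFFER, 0,
                numInstances * sizeof(float) * 4, instanceData);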

In GeForces, warps are something like groups of threads; the more threads you can have in flight at once, the better the latency hiding. Texture fetches have high latency, so having more threads is necessary - but the number of possible threads decreases as the number of registers you use goes up.

Anyway, test and tune :slight_smile: . Even if the maximum number of instances that fit in uniform arrays were only 2, it would still be a CPU saver. Having it be 128 is already much more than hoped for :).

many thanks for the update - helpful as ever!

Looks like I’ve got my work cut out, then. First I’ll try uniform arrays, as they’re easy to code in. If I get good results then we’ll see about UBOs (I have to alter the engine to support them properly).

According to my enumeration of the Radeon 4850, it only has MAX_VERTEX_UNIFORM_COMPONENTS=1024, so I may have to use an #ifdef in the shader to accommodate the hardware being used and supply a suitable max array length in either case.
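One way to do that without maintaining separate shader variants is to query the limit at runtime and prepend a #define before compiling - a sketch, assuming the shader declares uniform vec4 data[MAX_INSTANCES]; and doesn’t contain its own #version line (the version string, reserve and names are placeholders):

/* Sketch: size the uniform array to what the hardware actually supports. */
GLint maxComponents = 0;
glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS, &maxComponents);

/* leave room for the other uniforms; each vec4 is 4 components */
int maxInstances = (maxComponents / 4) - 64;    /* reserving 64 vec4s is a guess */

char header[128];
sprintf(header, "#version 140\n#define MAX_INSTANCES %d\n", maxInstances);

const char* sources[2] = { header, shaderBody };  /* shaderBody: the rest of the GLSL */
GLuint vs = glCreateShader(GL_VERTEX_SHADER);
glShaderSource(vs, 2, sources, NULL);
glCompileShader(vs);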