is there any chance for true instancing?

first, i must say that i have read all the threads here about instancing, and also nVidia’s GLSL pseudo-instancing article.

i tried to implement pseudo-instancing, but in my app i don’t use the modelview matrix (it is always the identity matrix) and in shaders i just multiply by the projection matrix (there is no difference in performance when multiplying by the mvp matrix).

i also implemented VBOs, which gave no performance boost. in nVidia’s pseudo-instancing sdk example the VBO path is slower than vertex arrays on my pc! (a64 3000+, 512mb ram, gf6600gt 128mb pci-e, 78.01, xp sp2)

my problem is that in some situations i call glDrawElements up to 200k times per frame. i am rendering line-strips (2-16 vertices, rarely fewer than 4 or more than 16), vertices are always the same, indices are always the same, the only things that change are one uniform (now it is an attribute, which gives a very small performance boost) and a vertex attribute pointer. the code is like this:

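 for (unsigned int i = 0; i < mNumOfRecords; ++i) {
    // each record occupies mDimension + 1 floats in mData
    index = i * (mDimension + 1);

    // constant per-strip attribute 5: the record's first float
    glVertexAttrib1fARB(5, mData[index]);
    // per-vertex attribute 1: the record's remaining floats
    glVertexAttribPointerARB(1, 1, GL_FLOAT, GL_FALSE, 0, BUFFER_OFFSET(sizeof(float) * (index + 1)));

    glDrawElements(GL_LINE_STRIP, count, GL_UNSIGNED_INT, 0);
 }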

glVertexAttrib1fARB(…) and glVertexAttribPointerARB(…) can be packed into one call to glVertexAttribPointerARB(…), but that will double the memory and that is not good and i think it will give no boost to performance.
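To put a number on the "double the memory" estimate, here is a small sketch. It assumes the record layout implied by the loop above (one leading float followed by mDimension data values) and a hypothetical interleaved layout in which that leading float is repeated next to every data value so one 2-component pointer can feed both. The function names are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Floats stored per record in the layout the loop reads:
// one leading per-strip value followed by `dimension` data values.
std::size_t floatsPerRecordSeparate(std::size_t dimension) {
    return dimension + 1;
}

// Floats per record if the per-strip value is repeated beside every
// data value (assumed interleaved layout, not taken from the post).
std::size_t floatsPerRecordInterleaved(std::size_t dimension) {
    return 2 * dimension;
}

// Growth factor of the interleaved layout over the current one.
double interleavedGrowth(std::size_t dimension) {
    assert(dimension > 0);
    return static_cast<double>(floatsPerRecordInterleaved(dimension)) /
           static_cast<double>(floatsPerRecordSeparate(dimension));
}
```

Under these assumptions the factor is 2 * mDimension / (mDimension + 1), so it approaches 2 as the number of dimensions grows and is somewhat lower for short strips.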

i think that this could be rendered with instancing in one function call and the app would be “flying”. so, will there ever be true instancing in OpenGL? it would help me (and not only me) a lot. or do you have any suggestions for how to make this faster?

thank you

ps: one last thing. there is no difference in performance in my app between immediate mode, vertex arrays and VBOs. when using VAs and VBOs i use shaders, in IM everything is on CPU. and also there is no difference between asm and glsl shaders.

I think the performance problem is that 200k calls add just too much overhead in the driver… you should try to make the batches bigger.
Maybe it would help if you grouped, let’s say, 8 strips together, made 8 attribs and added a per-vertex element that selects the right attribute.
That way you might be able to lower the setup overhead…

A simple question a bit related to this topic:

Are vertex attributes indexed the same way as the other arrays (vertex, normals…)?

If so, I think using indexed vertex attribute pointers with bigger arrays will help. You’ll have a little more memory consumption due to the indices, but as you’ll have far fewer calls to glDrawElements, that should help rendering go faster.

There’s an article on this in GPG5.

The trick is to think of it as a skinning problem; You duplicate the mesh many times into one buffer, but each duplicated instance gets a bone index into your constant registers. Then you can load up different matrices by simply setting different matrices to the different constant registers.

You’re gonna run out of constant registers - there’s a trick in the GPG to compress the matrices… Even without it, you can still reduce your draw calls by x50

Hi shelll

there is no difference in performance in my app between immediate mode, vertex arrays and VBOs.
Well, it’s true that 200k batching calls is an impressive number, but are you sure you’re trying to optimize for the right bottleneck? To me it sounds like, if the way you submit primitives doesn’t change the fps, then your application is not really cpu limited. There is no real instancing in OpenGL because it’s not really needed, as calls are very lightweight compared to d3d ones. If it really is a problem, then you have to group your commands into bigger batches. Have you tried GL_EXT_multi_draw_arrays?

just some thoughts, regards,

  • julien

i don’t understand, can you post link to that paper please?

i can pack that one glVertexAttrib1fARB() call into glVertexAttribPointerARB() at the cost of doubling the amount of data, but i don’t know how to effectively reduce the number of draw calls. those vertex attributes are the only thing that changes for every line-strip. there is a constant vertex pointer and one constant vertex attrib pointer, and the indices are always the same. i think that preparing the data for rendering in larger batches, e.g. packing some things together, will be expensive and there will be no performance gain… i will try it if there is no other way. and the last resort will be implementing it in d3d with instancing (i have never seen d3d)…

i implemented GL_EXT_multi_draw_arrays in half an hour, but i am unable to change the vertex attrib pointer per primitive, so it just renders many line-strips one over the other and there is just one visible line. this approach is around 20% faster (i think that it will be even faster when there is no overdraw).
i know that the reason for no instancing (for now) in GL is that GL draw calls are much much cheaper than the d3d ones, but in my situation instancing would help a lot.

Originally posted by shelll:
i don’t understand, can you post link to that paper please?

It’s not a paper. It’s a book. You buy it :wink:

Full title is Game Programming Gems 5.

Here is the deal:
Let’s assume your data is just quads - no indices, no nothing. The verts would look like this:
(1,0,0) (1,1,0) (0,1,0) (0,0,0)

this will draw a quad. Now we’ll add one more float, which is the
instance number:
(1,0,0, 0) (1,1,0, 0) (0,1,0, 0) (0,0,0, 0)

We duplicate the verts, bumping up the instance number for each version - this is essentially a matrix (or bone) index:
(1,0,0, 0) (1,1,0, 0) (0,1,0, 0) (0,0,0, 0)
(1,0,0, 1) (1,1,0, 1) (0,1,0, 1) (0,0,0, 1)
(1,0,0, 2) (1,1,0, 2) (0,1,0, 2) (0,0,0, 2)

now, you can load up a lot of matrices (like in skinning), and draw 3 objects in one call. You just submit all 12 verts at once, and the matrix indexer will move each instance around.
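The duplication step described above could be sketched roughly like this on the CPU side. The struct and function names are hypothetical and the layout (three position floats plus one float instance index) simply mirrors the quad example; it is not code from the book.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Position-only vertex of the base mesh.
struct Position {
    float x, y, z;
};

// Vertex with an extra float naming the matrix (bone) that moves it.
struct InstancedVertex {
    float x, y, z;
    float instance;  // kept as float so it travels as an ordinary attribute
};

// Copy the base mesh `copies` times, stamping each copy with its index.
std::vector<InstancedVertex> replicateMesh(const std::vector<Position>& base,
                                           std::size_t copies) {
    assert(!base.empty());
    std::vector<InstancedVertex> out;
    out.reserve(base.size() * copies);
    for (std::size_t c = 0; c < copies; ++c) {
        for (const Position& p : base) {
            out.push_back(InstancedVertex{p.x, p.y, p.z, static_cast<float>(c)});
        }
    }
    return out;
}
```

Replicating the 4-vertex quad 3 times gives the 12 vertices listed above, with instance values 0, 1 and 2.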

The reason this is clever is that it keeps draw calls down. To move a quad, you can just change the matrix. Given the verts above, if you want to render 8 quads, your operation is:
Load Matrix 0 into arb_matrix 0
Load Matrix 1 into arb_matrix 1
Load Matrix 2 into arb_matrix 2
Draw All 12 verts
Load Matrix 3 into arb_matrix 0
Load Matrix 4 into arb_matrix 1
Load Matrix 5 into arb_matrix 2
Draw All 12 verts
Load Matrix 6 into arb_matrix 0
Load Matrix 7 into arb_matrix 1
Draw Only first 8 verts.

Sure, it’s a bit of work to duplicate the verts, but that is a oneshot process. Do it at load.
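The load/draw sequence for the 8 quads can be generated mechanically. This is only a sketch of the bookkeeping, with hypothetical names; the actual matrix uploads and the draw call are left out because they depend on how the shader constants are set up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// One draw call's worth of work.
struct DrawBatch {
    std::size_t firstInstance;  // first object whose matrix goes into slot 0
    std::size_t instanceCount;  // matrices to load into slots 0..count-1
    std::size_t vertexCount;    // leading verts of the replicated buffer to draw
};

// Split `totalInstances` objects into batches of at most `slots` matrices.
std::vector<DrawBatch> planBatches(std::size_t totalInstances,
                                   std::size_t slots,
                                   std::size_t vertsPerInstance) {
    assert(slots > 0);
    std::vector<DrawBatch> batches;
    for (std::size_t first = 0; first < totalInstances; first += slots) {
        const std::size_t count = std::min(slots, totalInstances - first);
        batches.push_back(DrawBatch{first, count, count * vertsPerInstance});
    }
    return batches;
}
```

With 8 objects, 3 matrix slots and 4 verts per quad this yields three batches drawing 12, 12 and 8 verts, which matches the sequence listed above.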

can u post a screenshot of what youre doing
u may get some suggestions on how to minimize the 200,000 draw calls

also (im not sure about this ) but vertex texture shaders can create vertices i believe?, useful for displacement mapping etc
if thats the case then u can just chuck all your instancing stuff in a texture 1024x1024 size can hold lots of info
only available on >= nv4x

i know what gpg 5 is :slight_smile: i just thought that there could be a separate paper about it. i was thinking about a different packing, but both will double my data size because of that one little attribute. maybe i will give it a try.

i am rendering parallel coordinates (google will find tons of images, for example par-coords ). i don’t think that a vertex shader can create vertices. this night (i just woke up, it is 8:15am in europe) i was dreaming :slight_smile: about packing this into a texture (as i said before, i have a gf6600gt), but it was only a dream :slight_smile: but when i return from work i will think about that more deeply. i still think it will double or triple my data, but i see that it is the only way to go in ogl…

im not 100% sure what u want to draw (a graph?)
have u tried stitching together various draw calls with degenerates?

i am rendering those poly lines.

parallel coordinates is one way of visualizing n-dimensional data. each axis represents one dimension (e.g. cars have weight, horsepower, acceleration, cylinders, MPG…) and each line represents one n-dimensional record (in our example it is one car :slight_smile: ). each record has its own degree of interest (DOI) (a float from 0.0 to 1.0), which sets the color of the line (that is the one attribute passed for every line strip).

axes can have arbitrary order and can also be removed.

every dimension has its range, mostly floating point numbers.

i am rendering it like this: as vertices i am sending the x-coordinates 0, 1, 2… according to the order of the dimensions (i can pack the DOI into them without doubling my data :smiley: , but there is some preprocessing for every batch at render time). then i am sending the dimension ranges as vertex attributes, and also as vertex attributes i am sending the record’s data (weight, acceleration, HP, MPG…), and from that data i calculate the y-coordinate in the shader. i am always using the same indices for ranges, data and vertices, for every line strip.
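For readers who have not seen parallel coordinates, the per-vertex work described here can be mirrored on the CPU. This sketch assumes a plain linear mapping of each value into its dimension's range, giving y in 0..1; the post does not say exactly what the shader computes, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Point2 {
    float x, y;
};

// Map a value linearly into [0, 1] within its dimension's range (assumed).
float normalizedY(float value, float rangeMin, float rangeMax) {
    assert(rangeMax > rangeMin);
    return (value - rangeMin) / (rangeMax - rangeMin);
}

// Build one record's polyline: x is the axis position 0, 1, 2...,
// y comes from the record's value on that axis.
std::vector<Point2> recordToPolyline(const std::vector<float>& values,
                                     const std::vector<float>& mins,
                                     const std::vector<float>& maxs) {
    assert(values.size() == mins.size() && values.size() == maxs.size());
    std::vector<Point2> line;
    line.reserve(values.size());
    for (std::size_t i = 0; i < values.size(); ++i) {
        line.push_back(Point2{static_cast<float>(i),
                              normalizedY(values[i], mins[i], maxs[i])});
    }
    return line;
}
```

Reordering or removing axes would then amount to permuting or dropping entries before the loop, while the x positions stay 0, 1, 2…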

i am thinking of packing 4-8 (16?) line strips into one call at the price of some preprocessing at render time per batch.
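The post does not say how the strips get packed, so the following is only one possible approach, named plainly: expand each strip into separate segments and draw the whole batch as GL_LINES. A line strip cannot simply run on into the next one without drawing a visible connecting segment, which is why this sketch swaps the primitive type. It assumes the per-vertex data of the packed strips is laid out back to back, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build a GL_LINES index list covering `stripCount` strips of
// `vertsPerStrip` vertices each, stored consecutively in the vertex arrays.
std::vector<unsigned int> buildLineListIndices(std::size_t stripCount,
                                               std::size_t vertsPerStrip) {
    assert(vertsPerStrip >= 2);
    std::vector<unsigned int> indices;
    indices.reserve(stripCount * (vertsPerStrip - 1) * 2);
    for (std::size_t s = 0; s < stripCount; ++s) {
        const std::size_t base = s * vertsPerStrip;
        for (std::size_t i = 0; i + 1 < vertsPerStrip; ++i) {
            // One segment per pair of neighbouring vertices in the strip.
            indices.push_back(static_cast<unsigned int>(base + i));
            indices.push_back(static_cast<unsigned int>(base + i + 1));
        }
    }
    return indices;
}
```

The index list only depends on the batch size and the strip length, so it could be built once and reused for every batch; the price is roughly twice as many indices as the strip version.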

i implemented the packing and can change it dynamically on the fly. i tried packing 8, 16 and 32 strips into one call. the preprocessing on 150k records takes about 5-10ms, so it is “free”, but there is no overall performance gain in the rendering.

i think i am rasterization bound, so only a card with better line acceleration will help. in 2-3 days i will implement VBOs for this packing (i do it in my free time and don’t know when i will find an hour for that), but i don’t think that it will bring any performance gain. we will see :slight_smile: