Request: Index indirection.

I made this request before, but I muddied the idea then, and since a new spec has come out, here is a fresh round of begging:

Basic idea is to have one extra layer of indexing.

New functions:
glEnableIndexAttribute(GLuint attribute_index);

glDisableIndexAttribute(GLuint attribute_index);

glIndexAttributePointer(GLuint attribute_index, GLenum type,
GLsizei stride, const GLvoid *pointer);


When an index attribute is enabled, the call

glDrawElements(primitive_type, count, index_type, index_ptr)

is equivalent to:

for(int i=0;i<count;++i)
  for(each attribute j with sourcing from array active)
     K=index_ptr[i];
     if(index attribute active for attribute j)
       K=index_attribute_pointer[j][ index_ptr[i] ];
     glVertexAttrib(j, vertex_pointer[j][ K ])
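
To make the lookup above concrete, here is a minimal CPU-side sketch in C of how one attribute would resolve its element under this proposal. The `Attribute` struct and `fetch_attribute` are hypothetical names made up for illustration; nothing here is real GL API, and it assumes 16-bit indices and float data.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU-side model of one vertex attribute array. */
typedef struct {
    const float          *values;     /* attribute data, `components` floats per element */
    int                   components; /* floats per element */
    const unsigned short *index_attr; /* per-attribute index list, NULL when disabled */
} Attribute;

/* Returns the element that attribute `a` sources for master index `master`
   (the value read from the glDrawElements index list). */
static const float *fetch_attribute(const Attribute *a, unsigned short master)
{
    assert(a != NULL && a->values != NULL);
    unsigned k = master;            /* default: use the master index directly */
    if (a->index_attr != NULL)
        k = a->index_attr[master];  /* extra layer of indexing */
    return a->values + (size_t)k * (size_t)a->components;
}
```

With a position attribute left un-indexed and a transformation attribute given an index list, many vertices can share one transformation element, and moving a node only touches that one element.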


The above extends naturally to all the instancing calls too. glIndexAttributePointer would source from the buffer object bound to GL_ELEMENT_ARRAY_BUFFER.
This state would be part of vertex array objects too.

One can simulate this behavior with GL_NV_shader_buffer_load, but I would imagine that asking for that to be core is not a good idea (I can dream though).

The use case is not so much to save memory on repeated values (for example, rendering a cube) but for hierarchy UI’s; in theory it can also help instancing quite a bit. We also avoid using up lots of uniform room to hold the values [there are texture buffer objects, but that forces the format of the data, just as doing this via GL_NV_shader_buffer_load does as well].

So if I infer correctly, rather than have:

  • N attribute lists and 1 master index list,
    you have:
  • N attribute lists, N attribute “index” lists, and 1 master index list.

The idea being to (for instance) use vertex position #0 with vertex normal #5 with vertex texcoord #32 all in the same vertex shader execution.

Seems like we should expect this extra “pointer chasing” to incur some overhead.

Don’t think it would totally defeat the vertex cache – could still use the “master index list index” to collapse executions, but since you’re “mixing-and-matching” vertex attribs, there’s likely to be fewer cases where duplicate indices are actually used, so more vertex shader executions (?)

Besides a little space, what is the gain of this approach?

Bet you could simulate this with a texture buffer. I mean, seems you’re already kind of bouncing around the vtx attrib lists more with this, to some degree probably defeating some of the streaming. So texture lookups with the GPU hiding the latency might not perform all that differently.

The use case is not so much to save memory in repeated values … but for hierarchy UI’s…

What do you mean?

Seems like we should expect this extra “pointer chasing” to incur some overhead.

Don’t think it would totally defeat the vertex cache – could still use the “master index list index” to collapse executions, but since you’re “mixing-and-matching” vertex attribs, there’s likely to be fewer cases where duplicate indices are actually used, so more vertex shader executions (?)

Definitely more overhead on this. As for the cache business, the post-vertex cache is “safe”: all that matters there is the index that was passed in glDrawElements. If one uses texture buffer objects, one can implement the idea like this:

uniform samplerBuffer vertex_attribute;
in int index_attribute;

vec4 v = texelFetch(vertex_attribute, index_attribute);

But I’d imagine that is likely one of the worst ways to implement it, and it is awkward in the source code, as one needs a texture for each attribute. What I really don’t like is using a texture unit for this; by comparison, using GL_NV_shader_buffer_load is quite direct. The thing is, what I am asking for is much more limited than either texture buffer objects or GL_NV_shader_buffer_load.

The use case I have in mind is this: you have got a scene graph with lots of nodes (i.e. transformations) but each node does not have a huge number of triangles… there are three approaches to handle this that I see:

  1. Just say the heck with batching and make a draw call per node.

  2. Do the typical uniform business, but then one needs to break into more calls, with the additional pain of doing that.

  3. Put the transformation data on each vertex of each triangle.

(1) and (2) are great in that one only needs to update the transformation, but are less than ideal for batching; (3) is horrible when updating the transformations.

The use case is a “widget” heavy UI: lots of sub-elements, but each element relatively small and simple to draw.

With the extra layer of indexing, we can view the transformation as an attribute(s) but to move a node, we only affect the transformation.

In tests on embedded devices (like the Nokia N900) I saw a pretty big raw performance improvement from doing (3) [on this device there is very limited uniform room and the PowerVR SGX has a fair number of peculiarities to it], but naturally I got hurt when updating transformation data. Doing (2) was great until I had too many UI elements to draw, and due to the limited uniform room, “too many” was really not that many.

The transformation in 2D UI’s can be packed into 4 floats 99% of the time:
2 floats for one column of a 2x2 matrix whose columns are orthogonal and have the same norm (i.e. rotate and scale) and 2 floats for a translation.
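
As a rough sketch of that packing in C: if the stored column is (a, b), the other column of a rotate-and-scale matrix is (-b, a), so applying the transformation is a handful of multiply-adds. The `Transform2D` layout and the function names are assumptions for illustration, and this assumes no reflection.

```c
#include <assert.h>

typedef struct { float x, y; } Vec2;

/* Hypothetical 4-float packing: (a, b) is the first column of the 2x2
   rotate-and-scale matrix [a -b; b a], (tx, ty) is the translation. */
typedef struct { float a, b, tx, ty; } Transform2D;

/* Applies rotate/scale, then translation, to v. */
static Vec2 apply_transformation(Transform2D t, Vec2 v)
{
    Vec2 r;
    r.x = t.a * v.x - t.b * v.y + t.tx;
    r.y = t.b * v.x + t.a * v.y + t.ty;
    return r;
}

/* Returns the single transformation equivalent to applying `inner` first,
   then `outer`; the result is still 4 floats. */
static Transform2D compose(Transform2D outer, Transform2D inner)
{
    Vec2 inner_t = { inner.tx, inner.ty };
    Vec2 t = apply_transformation(outer, inner_t);
    Transform2D r;
    r.a  = outer.a * inner.a - outer.b * inner.b;
    r.b  = outer.b * inner.a + outer.a * inner.b;
    r.tx = t.x;
    r.ty = t.y;
    return r;
}
```

Because the composition stays inside the same 4-float form, a parent node's transformation can be folded into its children on the CPU before upload.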

Great thing about using a 3D API for drawing UI is that outside of blending, one can use the depth buffer to draw in any order. There are issues still but you get the idea.

Additionally, if text is drawn as triangles (i.e. not as textures) [for example for extreme zoom, but this needs to be done with care to get it to look halfway decent], the geometry for each glyph can be stored globally once, and to draw the glyphs the shader has 2 transformations: local to screen and “where the glyph is in local”. As one can see, each attribute has a different count: local to screen is per node, “where the glyph is” is per glyph (but each glyph is made up of a different number of triangles). Going the uniform route, coupled with relatively limited uniform room, means that batching likely gets murdered. One can view this case as a whacked-out form of instancing. I also need to freely admit that storing SVG glyphs in one buffer object needs to be done with a fair amount of care: 65,536 vertices may not be enough to hold a detailed SVG alphabet [that works out to 512 vertices apiece on average for a 128-character alphabet], and it might be better to just store the glyph vertices as GLubyte, which would mean there is no point to my suggestion in this case (but GLubyte is only good to one part in 255, which might not be so great in some cases).

One last bit: this also works into the fixed function pipeline as well… I have to admit I should also post this in the OpenGL ES suggestion place too.

Ah, I see. Sounds a lot like ARB_instanced_arrays, where you’ve got some per-instance data that’s applied to each instance rendered in the draw call. But I’m guessing what you need beyond this is for the per-vertex data for each instance to vary (and even the number of verts per instance to vary as well), so that it’s not specifically geometry instancing.

You just want some attribute that, while still streamed, gets stepped along only at specific key vertex indices (those which start a new “node”) that you specify.

The approach you suggest is one possibility for this. Should it prove “hard” or undesirable from a driver implementation perspective, I wonder if something like ARB_instanced_arrays but for non-instanced draw calls might work. That is, where you can specify specific vertex attributes that increment only every N vertices. The difference here is that you probably want N to be dynamic within a batch, so we need something in the batch data to “kick” the attribute to increment (like a primitive restart index: a “break the subbatch” directive).

If we had something like that, seems like it would stream better, and handle your use case, without having to have double-level indirection on the vertex attributes.

Thinking about this a bit more, for mere TRIANGLES batches, to implement what I was suggesting, we could just overload the existing restart index for this “break the instance” use case. Whenever encountered, it would just bump the index on the instanced attributes up by one (much like happens automatically at the end of each instance with instanced attributes using ARB_instanced_arrays). For example:

glVertexAttribDivisor( 8, 1 );     // Make 8 a "per-instance" attribute, stepped only when we break an instance (aka subbatch, or "node" in your example)
glPrimitiveRestartIndex( 0xFFFF ); // What we could break a subbatch with
glDrawElements( ... )

So with a TRIANGLES index list of:

0,1,2, 0xFFFF, 3,4,5, 0xFFFF, …

we would render the first TRIANGLE with attr8[ 0 ], and for the second TRIANGLE, we would render it with attr8[ 1 ].

However, this doesn’t really make sense for strip/fan primitives, where we might want to use the primitive restart index for what it was intended for, that is, to break the primitive. This suggests “not” overloading the primitive restart index, but rather creating another magic index value to be used for bumping the per-instance attribute index … say “InstanceBreakIndex”. So for instance:

glVertexAttribDivisor( 8, 1 );     // Make 8 a "per-instance" attribute, stepped only when we break an instance/subbatch
glInstanceBreakIndexEXT( 0xFFFE ); // What we break an instance/subbatch with
glPrimitiveRestartIndex( 0xFFFF ); // What we break a primitive with
glDrawElements( ... )

So with a TRIANGLE_STRIP index list of:

0,1,2,3,4, 0xFFFF, 5,6,7,8,9, 0xFFFF, 0xFFFE, 10,11,12,…

So here we have two strips that are part of the same instance/node, which we break with the primitive restart index. They’re all rendered with attr8[0] (e.g. your transform matrix). Then we see an “instance break” index (0xFFFE), which causes us to bump the index on the per-instance attributes up, so the next strip renders with attr8[1].
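
Here is a small CPU-side sketch in C of how that index list would be walked. `assign_instances` is a made-up helper, not GL API, and `glInstanceBreakIndexEXT` above is itself only a proposed function; the sketch just records which per-instance element each real vertex index would pick up, and ignores the strip-restart effect on primitive assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Walks `indices` and, for each real vertex index, records the vertex and
   the per-instance element it would be drawn with. Both output arrays must
   hold at least `count` entries. Returns the number of vertices emitted. */
static size_t assign_instances(const unsigned short *indices, size_t count,
                               unsigned short restart_index,
                               unsigned short instance_break_index,
                               unsigned short *vertex_out,
                               unsigned *instance_out)
{
    assert(indices != NULL && vertex_out != NULL && instance_out != NULL);
    size_t emitted = 0;
    unsigned instance = 0;
    for (size_t i = 0; i < count; ++i) {
        if (indices[i] == instance_break_index) {
            ++instance;   /* bump the per-instance attributes */
            continue;
        }
        if (indices[i] == restart_index)
            continue;     /* breaks the strip only; emits no vertex */
        vertex_out[emitted] = indices[i];
        instance_out[emitted] = instance;
        ++emitted;
    }
    return emitted;
}
```

Run on the list above, vertices 0 through 9 land on instance 0 and vertices 10, 11, 12 on instance 1.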

Would something like this cover your intended uses?

That should work with the case I have… let’s take a gander at the SVG example. Draw the text “oooga-toota-AAGA” in one GL call.

So we’ve got several instances of several different letters, and each letter has a different number of triangles. Let’s say the shader looks like this:

in vec2 glyph_vertex;
in vec4 screen_transformation_widget;
in vec4 widget_transformation_glyph;

vec2 apply_transformation(in vec4 transformation, in vec2 v);

void main()
{
  vec2 widget_p=apply_transformation(widget_transformation_glyph, glyph_vertex);

  vec2 screen_p=apply_transformation(screen_transformation_widget, widget_p);

  gl_Position=vec4(screen_p.xy, 0. , 1.);
}

To get this to “fit”, the attribute for widget_transformation_glyph needs to “advance” on each glyph and the attribute for screen_transformation_widget needs to advance for each widget.

So: suppose each attribute has a special index value that says to “advance” it (by 1 in this case), plus something to say that a named attribute does NOT use the index specified in glDrawElements at all. Each attribute would then need state saying whether it is advanced, so the spec would look like:

glEnableAdvanceAttribute(GLuint attribute) enables attribute advancing, whose parameters are determined by the functions below (call this boolean ATTRIBUTE_ADVANCE_ENABLED[attribute])

glAttributeAdvanceSize(GLuint attribute, GLsizei count) determines the offset to apply to the named attribute when its AttributeAdvanceIndex is encountered (call this value ATTRIBUTE_ADVANCE_SIZE[attribute])

glAttributeAdvanceIndex(GLuint attribute, GLuint v) sets the index value that triggers the advance (call this value ATTRIBUTE_ADVANCE_INDEX_VALUE[attribute])

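glDrawElements(type, count, ptr) then becomes:

for( each attribute v)
  //ATTRIBUTE_OFFSET[v] are NOT GL state values, only
  //have meaning within the draw call
  ATTRIBUTE_OFFSET[v]=0;

for(int j=0;j<count; ++j)
   bool emit=true;

   for( each attribute v)
     if(ATTRIBUTE_ADVANCE_ENABLED[v] and ptr[j]==ATTRIBUTE_ADVANCE_INDEX_VALUE[v])
        ATTRIBUTE_OFFSET[v]+=ATTRIBUTE_ADVANCE_SIZE[v];
        emit=false;

   if(emit)
     for(each attribute v)
        int I;
        if(ATTRIBUTE_ADVANCE_ENABLED[v]) I=ATTRIBUTE_OFFSET[v];
        else I=ptr[j];
        glVertexAttrib**(v, vertex_pointer[v][I]);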


This would also work with all other draw calls (some care is needed to specify the instancing jazz), but it is kind of messy and heck-a-confusing to look at :o . I have to admit that I am not crazy about this at all. It does do what I am after and what I need, so I do not have strong technical reasons to say no, but it looks horribly confusing, and I’d think the source code using it would look quite fragile. Using this spec would save more memory than my original request, but looking at the above still makes me quite hesitant.

No idea if this is harder or easier for a driver, though it might be easier, since when that index comes along it just becomes a signal to increment a pointer.
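
For what it’s worth, here is one reading of the per-attribute advance semantics as a small CPU-side simulation in C. Everything in it is an assumption for illustration: `AdvanceState`, `draw_elements_with_advance` and the fixed `MAX_ATTRIBS` are made-up names, and it assumes an advanced attribute reads only its running offset while every other attribute reads the index from the list.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTRIBS 4

/* Hypothetical per-attribute state mirroring the proposed GL state. */
typedef struct {
    int      advance_enabled;     /* ATTRIBUTE_ADVANCE_ENABLED */
    unsigned advance_size;        /* ATTRIBUTE_ADVANCE_SIZE */
    unsigned advance_index_value; /* ATTRIBUTE_ADVANCE_INDEX_VALUE */
} AdvanceState;

/* For each emitted vertex e, element_out[e * MAX_ATTRIBS + v] receives the
   element of attribute v that vertex would source. element_out must hold
   count * MAX_ATTRIBS entries. Returns the number of vertices emitted. */
static size_t draw_elements_with_advance(const unsigned *indices, size_t count,
                                         const AdvanceState *state,
                                         unsigned *element_out)
{
    assert(indices != NULL && state != NULL && element_out != NULL);
    unsigned offset[MAX_ATTRIBS] = {0}; /* lives only for this draw call */
    size_t emitted = 0;
    for (size_t j = 0; j < count; ++j) {
        int emit = 1;
        for (int v = 0; v < MAX_ATTRIBS; ++v) {
            if (state[v].advance_enabled &&
                indices[j] == state[v].advance_index_value) {
                offset[v] += state[v].advance_size;
                emit = 0; /* an advance index emits no vertex */
            }
        }
        if (!emit)
            continue;
        for (int v = 0; v < MAX_ATTRIBS; ++v)
            element_out[emitted * MAX_ATTRIBS + v] =
                state[v].advance_enabled ? offset[v] : indices[j];
        ++emitted;
    }
    return emitted;
}
```

In the glyph example, the glyph vertex attribute would be left un-advanced, the glyph placement attribute would advance on one magic index, and the widget transformation on another, so each steps at its own rate within a single draw call.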