Request: multiple index streams.

I have to freely admit that this may not be a good idea from an implementor’s point of view, but for using GL to create UIs it would be heck-a-nice:

Rather than using a common index stream for all vertex attributes, allow a separate index stream for each attribute, with something like this kind of API:

glVertexAttribPointer --> sets vertex attribute pointer as before

glVertexIndexPointer --> sets the index array

glDrawPerAttribIndexElement(GLenum type [e.g. GL_TRIANGLES, etc.], first, count);

for example:


for(int i=0; i<number_attributes_to_use; ++i)
{
  glVertexAttribPointer(i, number_components[i], attribute_values_type[i], attribute_values_normalized[i], attribute_values_stride[i], attribute_values_pointer[i]);

  glVertexIndexPointer(i, attribute_index_type[i], attribute_index_stride[i], attribute_index_pointer[i]);
}

glDrawPerAttribIndexElement(GL_TRIANGLES, 0, number_points);

is equivalent to:


glBegin(GL_TRIANGLES);
for(int v=0; v<number_points; ++v)
{
  for(int i=number_attributes_to_use-1; i>=0; --i)
  {
     int I;

     /* look up this attribute's own index, then fetch its value */
     I=attribute_index_pointer[i][v];
     glVertexAttrib[??](i, attribute_values_pointer[i][I]);
  }
}
glEnd();


where [??] corresponds to the correct suffix for the type and number of components.

Edit: emulating this behavior is possible by using texture buffer objects: have the attribute be the index and have the vertex shader fetch from the texture buffer objects. One can also do this via GL_NV_shader_buffer_load, but I’d imagine both carry unnecessary overhead compared to doing it more directly.
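For concreteness, a minimal sketch of that texture-buffer-object emulation; the buffer names (positions, position_indices) and surrounding GL state are hypothetical:

/* Emulation sketch: attribute values live in a buffer texture; the real
   vertex attribute is just an integer index, and the vertex shader does
   texelFetch(positions_sampler, position_index) to get the value. */
GLuint position_tex, position_tbo, index_vbo;

glGenBuffers(1, &position_tbo);
glBindBuffer(GL_TEXTURE_BUFFER, position_tbo);
glBufferData(GL_TEXTURE_BUFFER, sizeof(positions), positions, GL_STATIC_DRAW);

glGenTextures(1, &position_tex);
glBindTexture(GL_TEXTURE_BUFFER, position_tex);
glTexBuffer(GL_TEXTURE_BUFFER, GL_RGBA32F, position_tbo); /* one vec4 per index */

/* the per-vertex "attribute" is now just an integer index */
glGenBuffers(1, &index_vbo);
glBindBuffer(GL_ARRAY_BUFFER, index_vbo);
glBufferData(GL_ARRAY_BUFFER, sizeof(position_indices), position_indices, GL_STATIC_DRAW);
glVertexAttribIPointer(0, 1, GL_UNSIGNED_INT, 0, 0);
glEnableVertexAttribArray(0);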

2nd Edit: the suggestion naturally would also need a way to source the index streams from buffer objects [likely glVertexIndexPointer would use a buffer object if something is bound to GL_INDEX_ARRAY_BUFFER]. Additionally, it would logically work with the instanced glDraw calls too.

What does this have to do with creating UIs?

Interesting, it would seem to fit pretty well into current OpenGL. So you would have something like this:


   glBindVertexArray(vao);
   glBindBuffer(GL_ARRAY_BUFFER, posBuffer);
   glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 0, 0);
   glBindBuffer(GL_ARRAY_BUFFER, normalBuffer);
   glVertexAttribPointer(1, 4, GL_FLOAT, GL_FALSE, 0, 0);

   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, posIndexBuffer);
   glVertexIndexPointer(0, GL_UNSIGNED_BYTE, 0, 0);
   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, normalIndexBuffer);
   glVertexIndexPointer(1, GL_UNSIGNED_BYTE, 0, 0);
   // can bind another index buffer if we intend to also draw
   // using normal calls
   glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, someOtherIndexBuffer);

   // better name than glDrawPerAttribIndexElement?
   glDrawIndexedArrays(GL_TRIANGLES, 0, number_points);

This would mean in the case of a cube, using triangles + floats, you could have:

positions: 8 x 3 x 4 bytes
normals: 6 x 3 x 4 bytes
position indices: 36 bytes
normal indices: 36 bytes
total = 240 bytes

instead of the current:
positions: 24 x 3 x 4 bytes
normals: 24 x 3 x 4 bytes
indices: 36 bytes
total = 612 bytes

If done right, it could also mean much less recalculation of repeated values (or no need to pre-compute whether values are repeated, if that is an optimization drivers currently use).

You could also have glDrawIndexedArraysInstanced, glDrawIndexedArraysIndirect with the same arguments as
glDrawArraysInstanced, glDrawArraysIndirect.

Implementing indexed versions of the glDrawElements* draw calls might be more complex though; you may need to provide a list of vertex attribute indices as well as the index arrays:


void glDrawIndexedElements( enum mode, sizei count, enum type, uint attribCount, const uint *vertexAttribIndices, const void **indices );
void glDrawIndexedElementsInstanced( enum mode, sizei count, enum type, uint attribCount, const uint *vertexAttribIndices, const void **indices, sizei primcount );


attribIndices = {0, 1};
positionIndices = {0, 1, 2, 1, 2, 3, 4, 5, 6};
normalIndices = {0, 0, 0, 0, 0, 0, 1, 1, 1};
Indices = {positionIndices, normalIndices};

glDrawIndexedElements(GL_TRIANGLES, 9, GL_UNSIGNED_BYTE, 2, attribIndices, Indices);
glDrawIndexedElementsInstanced( GL_TRIANGLES, 9, GL_UNSIGNED_BYTE, 2, attribIndices, Indices, 10);

The only bad part of my suggestion is that it kills the post-vertex-shader cache. Ah, sighs.

This would mean in the case of a cube

How many cubes do you render in any serious rendering scene?

That’s generally the primary argument against this sort of thing. For demos and other light rendering tasks (i.e. things that don’t care about performance), this would help. For a serious scene, there is a lot less vertex duplication, to the point where additional index arrays only serve to increase the overall memory footprint rather than decrease it.

The motivation for this is sort of to avoid duplication.

Here is where it is useful for a UI. You’ve got a collection of UI elements in a scene-graph hierarchy. Let’s say 10-20 thousand triangles from UI elements (text, SVG, etc.). Now there might be as few as, say, 50-200 or so transformations floating around. So the first go for our vertex shader is this:



uniform mat4 transformation[LOTS];

in vec4 position;
in int index;

void main(void)
{
  gl_Position=transformation[index]*position;
}

Naturally, LOTS can’t be that big. You can likely assume that the transformations are pure 2D, so the mat4 can actually be stored as a pair of vec2’s [the first column of an orthonormal 2x2 matrix and a translation], but it still limits the number of transformations badly (and really badly in the GLES hardware world). So now we have a few choices:

(1) break into chunks handling elements from only so many distinct transformations at a time

OR

(2) store the transformation as part of the attributes

But we see that (1) kills batching, and (2) just sucks.
By having the index stream separate we can minimize the draw calls, and to the vertex shader it looks like the transformation is stored per vertex. Now, one can get similar functionality via texture buffer objects and having one of the attributes be an integer, or use GL_NV_shader_buffer_load and have an attribute be an integer. When we get into rendering SVG stuff [for example, when zoomed in so much that relying on a texture looks icky or takes too much memory], then this idea is even more beneficial… we can choose to store the triangles of the glyphs just once and have the position stream take from those values, etc…

I freely admit that my use case is not really desktop.

but it still limits the number of transformations badly

GL 3.x hardware can provide 4096 uniforms via UBO. I’m pretty sure you don’t have 4096 separate transforms in your rendering. And if you do, there’s always buffer textures.
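As a sketch, the UBO route being referred to; the block name (Transforms) and host variables are hypothetical:

/* One uniform block holds the whole transform array; the vertex shader
   indexes it with a per-vertex integer attribute:
     layout(std140) uniform Transforms { vec4 transform[4096]; };   */
GLuint ubo;
GLuint block_index;

glGenBuffers(1, &ubo);
glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, 4096 * 4 * sizeof(GLfloat), transforms, GL_DYNAMIC_DRAW);

block_index = glGetUniformBlockIndex(program, "Transforms");
glUniformBlockBinding(program, block_index, 0);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);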

By having the index stream separate we can minimize the draw calls, and to the vertex shader it looks like the transformation is stored per vertex.

It’s not like OpenGL is preventing you from accessing something that you would otherwise be able to do. The cold hard reality is that hardware currently simply cannot do that (or at least, no faster than you could if you did things manually). There is no hardware facility for adding multiple element arrays and indexing each attribute array with a separate element array. It does not exist.

So the best you could hope for would be for future hardware. Of course, future hardware will also be faster hardware, thus lessening the need for this.

Sighs, only Alfonse.

At any rate, I guess you should read it again, Alfonse: GL_NV_shader_buffer_load does provide a way to do this, as do texture buffer objects. In both cases, one loses the stride thing. The point is, since some hardware can do it already, then perhaps there is a way to do it better.

Also 4096 uniforms --> 2048 vec2’s, which means 1024 transformations, and guess what, you can easily have much more than this! What happens if text is represented via SVG [i.e. paths]? Unless you replicate the vertex data for each letter instance, you can easily go over 1024 transformations.

Additionally, elements in a useful GUI will likely need 2 transformations on them: one from screen to “widget” and one between “widget” and element. This way, moving a widget is a very cheap operation in terms of memory touches. Composing 2 orthogonal rotations is the same as multiplying 2 complex numbers, so that is reasonable to do in a vertex shader (see the sketch below).
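A sketch of that composition in C, with each transformation stored as {cos, sin, tx, ty} per the encoding above (names hypothetical):

/* Compose "a following b": the rotation parts multiply like complex
   numbers, and b's translation is rotated by a's rotation before
   adding a's own translation. */
static void compose(const float a[4], const float b[4], float out[4])
{
    out[0] = a[0]*b[0] - a[1]*b[1];        /* cos of the combined rotation */
    out[1] = a[0]*b[1] + a[1]*b[0];        /* sin of the combined rotation */
    out[2] = a[0]*b[2] - a[1]*b[3] + a[2]; /* translation x */
    out[3] = a[1]*b[2] + a[0]*b[3] + a[3]; /* translation y */
}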

Lastly, GLES hardware typically works differently than desktop hardware and is a heck of a lot slower; again, this is mentioned in the post too.

In both cases, one loses the stride thing.

Unless you pass the stride to the shader.

The point is, since some hardware can do it already, then perhaps there is a way to do it better.

Is there some reason to expect this to be possible?

Don’t make the mistake of thinking that you’re the first person to ask for this. People have been asking for multiple indexed rendering for more than a decade. Since before there was OpenGL ES, FBOs, shaders, or buffer objects, people have wanted this. Over a decade of hardware has come and gone.

The arguments are always the same: it makes data smaller, it lets me pass uniforms as attributes, etc. And yet, not once in that decade+ has any hardware maker looked at these arguments and said that one of them was a legitimate enough feature to make hardware changes for.

Also 4096 uniforms --> 2048 vec2’s, which means 1024 transformations

4096 uniforms is 4096 vec4s, each of which can be interpreted as 2 vec2’s, which means 4096 transformations. Or 2048 with the new accounting method you created.

I would also point out that you get up to 8 of these 4096 element arrays.

What happens if text is represented via SVG [i.e. paths]? Unless you replicate the vertex data for each letter instance, you can easily go over 1024 transformations.

I’m confused. If each letter is defined by one or more paths, and you don’t want to copy these paths when you use them, then these paths clearly must be in their own section of the buffer object. A section that is not the same as the one where other vertex data comes from. So how would you be able to render them in the same draw call regardless of how you get the transformation there? Wouldn’t you need to make some gl*Pointer calls?

Lastly, GLES hardware typically works differently than desktop hardware and is a heck of a lot slower; again, this is mentioned in the post too.

Slower hardware tends to be less likely to be CPU bound, since the GPU spends more of its time doing actual work. Also, GLES hardware is even less likely to be able to do this.

More importantly, OpenGL ES is not OpenGL. They may have similarities, but one does not control the other. OpenGL should not add a feature that is only useful for GLES; that’s why GLES exists: so that there can be differences where necessary between desktop and non-desktop rendering systems.

There’s also an issue you seem to be missing with UIs. Namely, that not everything is triangles. You often get lines as well. That’s a new kind of primitive to draw, so it’s a new batch. And since you will often need to draw things in a specific order (unless you plan on sending 3-vector positions with a Z coordinate), you will often need to draw some triangles, draw some lines, then draw other triangles, etc.

Sighs, only Alfonse.

Ok, let’s take a use case. You have a collection of SVG glyphs all stored in ONE buffer object. You have a hierarchy of widgets. You have widgets using these glyphs in different places within the hierarchy.


in vec2 raw_glyph_position;
in vec4 screen_transformation_widget; //.xy rotation, .zw translation

in vec4 widget_transformation_glyph; // .xy rotation, .zw translation;
in float z_ordering; //use depth testing to get front to back right

void
main(void)
{
  vec2 widget_position, screen_position;

  widget_position=apply_transformation( widget_transformation_glyph, raw_glyph_position);

  screen_position=apply_transformation( screen_transformation_widget, widget_position);

  gl_Position=vec4( to_normalized_coordinates(screen_position), z_ordering, 1.0);
}
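
(apply_transformation and to_normalized_coordinates are assumed helpers; for the .xy-rotation/.zw-translation encoding, apply_transformation amounts to the following, written here in C for illustration:)

/* t[0],t[1] = first column (cos, sin) of an orthonormal 2x2 rotation,
   t[2],t[3] = translation; out = R*p + t. */
static void apply_transformation(const float t[4], const float p[2], float out[2])
{
    out[0] = t[0]*p[0] - t[1]*p[1] + t[2];
    out[1] = t[1]*p[0] + t[0]*p[1] + t[3];
}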


Now we have several buffer objects: a global one for the transformations from widget to screen, a large buffer object holding the “raw_glyph_position” values for all bits across all widgets, and several index buffer objects for the separate index streams. With this, the drawing of all such SVG stuff of all widgets can be executed with ONE draw call, which is the point here. Otherwise one needs to break it down into many. Updating a widget’s position on screen just touches one transformation value. Now, current hardware can do this already if the shader looks like this:


uniform vec2 *raw_glyph_position;
uniform vec4 *screen_transformation_widget; //.xy rotation, .zw translation, think of this as the instance data for a fixed glyph.

uniform vec4 *widget_transformation_glyph; // .xy rotation, .zw translation

in int raw_glyph_position_index;
in int screen_transformation_widget_index;
in int widget_transformation_glyph_index;


uniform float *z_ordering; //use depth testing to get front to back right, z-ordering values are stored per-glyph instance as well.
in int z_ordering_index;

void
main(void)
{
  vec2 widget_position, screen_position;

  widget_position=apply_transformation( widget_transformation_glyph[widget_transformation_glyph_index], raw_glyph_position[raw_glyph_position_index]);

  screen_position=apply_transformation( screen_transformation_widget[screen_transformation_widget_index], widget_position);

  gl_Position=vec4( to_normalized_coordinates(screen_position), z_ordering[z_ordering_index], 1.0);
}
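
The host-side setup for those pointer uniforms under GL_NV_shader_buffer_load would look roughly like this; the buffer object name (glyph_position_buffer) and program are hypothetical:

/* Make the buffer resident and pass its GPU address to the shader,
   which then reads through it as a bindless pointer. */
GLuint64EXT addr;

glBindBuffer(GL_ARRAY_BUFFER, glyph_position_buffer);
glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &addr);
glUniformui64NV(glGetUniformLocation(program, "raw_glyph_position"), addr);
/* repeat for screen_transformation_widget, widget_transformation_glyph
   and z_ordering */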


Now, looking at the above more closely, one sees what I am really asking for: a layer of index-indirection between the indices passed in glDraw calls and the indices used in the shader. Also, the above clearly demonstrates that previous-generation hardware can do it. The point is that if there is direct support for it within the hardware, then the shader gets simpler, and the performance is likely a touch better.

Now some bile directed at Alfonse, because he makes me sigh so much:

Slower hardware tends to be less likely to be CPU bound, since the GPU spends more of its time doing actual work. Also, GLES hardware is even less likely to be able to do this.

If you gave this real thought, you would see the reason was to cut down on draw calls, which means cutting down on CPU usage. Exactly the point. As for whether or not GLES hardware can do this, do you even really know, Alfonse? Have you seen driver code? Shoot: the Imagination Technologies PowerVR SGX 545 is slated to have GL 3.2, and that GPU is targeted at low power consumption.

More importantly, OpenGL ES is not OpenGL. They may have similarities, but one does not control the other. OpenGL should not add a feature that is only useful for GLES; that’s why GLES exists: so that there can be differences where necessary between desktop and non-desktop rendering systems.

Pants, considering that it’s the same organization making the two specs, the GLES specs are mostly just stripped-down OpenGL specs, and their evolutions do influence each other.

There’s also an issue you seem to be missing with UIs. Namely, that not everything is triangles. You often get lines as well. That’s a new kind of primitive to draw, so it’s a new batch. And since you will often need to draw things in a specific order (unless you plan on sending 3-vector positions with a Z coordinate), you will often need to draw some triangles, draw some lines, then draw other triangles, etc.

Using the depth buffer to get back-to-front correct is a really, really BIG freaking point for drawing the UI; drawing front to back and using the depth buffer for its magic has the obvious benefits of reducing fill rate and reducing function calls. My personal opinion on GL_LINES for UI is to not use them; the rasterization rules for when the lines are not screen-aligned are loose, so the UI will look a little off as it moves from hardware platform to hardware platform. But even if you do use GL_LINES, then we have a draw call for each texture (atlas) (GL_TRIANGLES), and a draw call for each line width.
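(The depth-buffer usage here is just the standard setup, for example:)

glEnable(GL_DEPTH_TEST);
glDepthFunc(GL_LEQUAL);
/* give each UI element a z value from its stacking order and draw opaque
   elements front to back; the depth test rejects covered fragments */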

4096 uniforms is 4096 vec4s, each of which can be interpreted as 2 vec2’s, which means 4096 transformations. Or 2048 with the new accounting method you created.

My bad there. Using one transformation per widget, with each element having a transformation from it to the widget, leaves me with 2048 transformations, which is, drum roll please: 2048 elements… admittedly we could say the UBO is for storing just the transformation from screen to widget and have the elements themselves store the transformation from element to widget, but then we still need to replicate the data. If the elements are to store an index into an array, we then need that many indices… still sucks.

I’d rather make this an edit, but I guess one cannot edit older posts:

One last bit: with the thinking in mind that the proposal is just an extra index-indirection, one sees that it does not thrash the vertex cache if we change the proposal a touch:

add entry points:

glEnableVertexIndex(GLuint) --> enables index stream on named attribute
glDisableVertexIndex(GLuint) --> disables index stream on named attribute

glVertexIndexPointer(GLuint attribute, GLenum type, GLsizei stride, const void *pointer)

sets the vertex index pointer for the named attribute.

If the index stream is active for an attribute A, rather than fetching the I’th value of the vertex pointer for A, it uses the P’th value of the vertex pointer for A, where P is the I’th value of the index pointer for attribute A.
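In pseudo-C, the fetch rule for vertex v of such a draw call (fetch_attribute is a stand-in for the driver’s attribute fetch):

for(int v=first; v<first+count; ++v)
{
  for(int a=0; a<attribute_count; ++a)
  {
    /* P = the index-stream value if enabled, otherwise the plain vertex index */
    int P = index_stream_enabled[a] ? index_pointer[a][v] : v;
    fetch_attribute(a, vertex_pointer[a], P);
  }
}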

This request spec is cleaner and easier than my first request; only 3 API points are added, and it fits in easily.