feature request: indexing per attribute


Right now (from my understanding, correct me if I’m wrong), array indexing assumes all the attributes are defined per vertex. It is suitable for a mesh with no discontinuities (for example, at a given position, the normal is unique and the texture coordinate is unique).

As soon as there is a discontinuity in any attribute, all the attributes and positions have to be duplicated. For example, if a vertex is shared by two primitives with different normals but has a unique texture coordinate, the positions and texture coordinates have to be duplicated.

This is a really big issue with huge meshes.

Adding indexing per attribute would solve this memory footprint problem.

It could also allow fast switching between normal per vertex and normal per face visualization of a mesh without much memory footprint.

Let’s assume we have a mesh of triangles with vertex positions, texture coordinates per vertex, face normals and vertex normals.

The required arrays (of data) will be:

  1. an array for the vertex positions
  2. an array for the texture coordinates
  3. an array for the face normals
  4. an array for the vertex normals

Case 1: To visualize the mesh with normals per vertex, you will need:
the same index array for the vertex positions, texture coordinates and per-vertex normals. That’s what happens right now with the single index array.

Case 2: To visualize the mesh with normals per face, you will need:
the index array used above for the vertex positions and texture coordinates, plus an additional index array for the face normals.

For a mesh of three triangles, the index array on the face normals will look like:

0 0 0 1 1 1 2 2 2

It will be the only array with duplication. Moreover, the duplication only affects indices, not normal coordinates.

Admittedly I am not at all neutral in this, being a big fan of it, but there is an extension that you can use: GL_NV_shader_buffer_load, which allows you to do exactly what you are asking, and is more general. The horrible big catch is that it is nVidia-only, and there is not a lot of hope of it becoming an EXT extension, much less ARB, or of being adopted into the core spec.

(captain obvious)
1) It’s easily possible via geometry shaders and buffers. Just sub-optimal.
2) You can have per-face normals… by recalculating them inside the fragment shader:

vec3 ilTriangleNormal(in vec3 varPos){
	return normalize(cross(dFdy(varPos), dFdx(varPos)));
}

Without passing any normals, tangents/bitangents, one can have normal-mapping with such code ^^ .

“Without passing any normals, tangents/bitangents, one can have normal-mapping with such code”

You mean dFdx as the tangent and dFdy as the bitangent?

Does that really work properly? Is the precision any good? When you have a smoothed normal, can you use it to smooth tangent and bitangent in the shader too?

And what about performance? Does using the derivatives incur any overhead?

That would be a nifty trick that could save me lots of bandwidth.


There is one nasty bit about using dFdx and dFdy to calculate normals, tangents and bitangents (having done it myself): the result will be horribly faceted. Another downside is that you will be computing the normal, tangent and bitangent per fragment, so even if you wanted faceted values, you are recomputing them for every fragment. On the other hand, if you use a geometry shader you can do that calculation once per primitive; additionally, you can even get some smoothing (at the cost of extra calculation) if you use the neighbor-aware input of the geometry shader (the smoothing would be over the 3 neighbor triangles, which is typically not the same as how most smoothing is done).

Yep, with compute kicking into high gear I’d be tempted to run the vertex pile through a T&L(&Clip&Amplify) compute kernel, then send the whole shooting match through to the VS for bindless indexing, and relegate the incoming attribute stream to a channel for the various bindless attribute indices. Talk about your bandwidth savings days, and just in time for Christmas. :wink:

//=================[[ Create tangent and binormal ]]==========================================[
// gotten from http://www.pouet.net/topic.php?which=6266
// Use like this:
//		mat3 TBN = computeTangentFrame(gl_Normal, varPos, texCoord);
//		normal = normalize(TBN * normalFromTexture);

mat3 computeTangentFrame(vec3 normal, vec3 position, vec2 texCoord){
    vec3 dpx = dFdx(position);
    vec3 dpy = dFdy(position);
    vec2 dtx = dFdx(texCoord);
    vec2 dty = dFdy(texCoord);
    vec3 tangent  = normalize(dpx * dty.t - dpy * dtx.t + 0.001);
    vec3 binormal = normalize(-dpx * dty.s + dpy * dtx.s + 0.001);
    return mat3(tangent, binormal, normal);
}

I haven’t optimized it yet, but it is fast enough for me.
I don’t notice blockiness when the normal texture is not blocky (RGB8):
When it’s a DXT1, of course this is expected:
Both screenshots are at:

  • CSAA 16x4, GTX275;
  • 4 dynamic specular lights, a bunch of trig algebra in the FS, albedo disabled;
  • active unoptimized mirror + glGenerateMipmap;
  • generally stays above 700fps, but right now I’m running other heavy software in parallel;
  • ZCull passes.

Of course, having the TBN in the vtx-shader is better - this auto-TBN in the frag-shader is just nifty when you can’t have the TBN beforehand. I bet it’ll also save some bandwidth if geom in frustum is 1+ mil tris.

P.S. Removing the TBN calc makes the app run at 590fps instead of 580fps

To kRogue:

I will look at this new nVidia extension, thanks.

To Ilian Dinev:

I just picked normals for illustration purposes.

Another example would be visualizing a scalar value that is constant per primitive: you send the scalar as a 1D texture coordinate and use a 1D texture to map the scalar value to an RGB color (while still using normals per vertex).


But the blockiness comes from the blocky normal map; the calculations don’t change. When you said the result is “horribly faceted” in the previous post, I thought this might come from the GPU’s 4x4 pixel-block processing, but it makes sense that that should not introduce any problems.

I have been intending to put the TBN calculation into the GS for a long time, but so far I never found the time to try it.


There’s the “flat” input option :slight_smile: . It can still require some vertex unwelding to be done in offline preprocessing, but not as much as before.
Also, there’s gl_PrimitiveID. Anything else that isn’t geom-shader based needs explicit unwelding by disabling the vertex-cache - something not supported on half the GL3 cards, I guess.

I can see how the “flat” qualifier in the output of a vertex or geometry shader can allow combining normals per vertex and texture coordinate per primitive. (ref: OpenGL 3.2 spec page 96/GLSL 1.50.09 pages 29 and 42)

However, I really don’t get how it would save any memory footprint at all. Can you elaborate, please?

When the tri-winding is clockwise and the “flat” modifier selects the first vertex’s data, this is an example mesh:

9 verts, 7 tris ^^

Right-click ->View Image in Firefox, it’s 1024x1024

To Ilian Dinev:

Dealing with the flat qualifier requires adding the constraint that each vertex is attached to a unique face. Starting from a mesh without this constraint (for example, a mesh coming from a file) requires writing a non-trivial algorithm to re-order the indices. In addition, this algorithm has a computational overhead and may not have a trivial solution.

Imagine the mesh is just a tetrahedron (so, a closed mesh).

On the image, vertex 0 is on the left side, vertex 1 is in the front, vertex 2 is at the back and vertex 3 is on top.

Let’s start with the triangles described by a list of vertices in counter clockwise order and for which the first vertex is the one with lowest index.

face 0: 0 1 3 (front face)
face 1: 1 2 3 (right face)
face 2: 0 3 2 (back face)
face 3: 0 2 1 (bottom face)

Now, in order to use flat modifier with the provoking vertex being the first vertex, I have to rotate some of the lists.

Let’s assume the following algorithm: iterate over the faces; for each face, pick the lowest-id vertex not yet used by any previous face. At the beginning, all vertices are marked unused.

face 0: pick vertex 0. mark X on faces where vertex 0 cannot be used anymore as the first vertex:

face 0: 0 1 3
face 1: 1 2 3
face 2: X 3 2
face 3: X 2 1

face 1: pick vertex 1.

face 0: 0 1 3
face 1: 1 2 3
face 2: X 3 2
face 3: X 2 X

face 2: pick vertex 2, re-order the face list to be

face 2: 2 0 3

face 0: 0 1 3
face 1: 1 2 3
face 2: 2 0 3
face 3: X X X

face 3: Ooops, no vertex is available anymore to be a good candidate as the first vertex.

Yes, I could have picked vertex 3 for face 2, then vertex 2 would be available for face 3, with another kind of algorithm. But the point is, this processing is not trivial and not guaranteed to succeed.

Uhm, I did mention that some unwelding will still be necessary, but that it’s far less than NumVerts=NumTris*3.
The algorithm for a non-optimized version of synthesizing such an ordered+semi-unwelded mesh is not rocket science (mark the first available vertex; if none is available, unweld). A maximally-optimized version could be a complex greedy algo, but looking at real-case models, the optimization isn’t worth it at all.

I already deal with vertex-unwelding all the time in my offline preprocessing tools, so it doesn’t look like rocket science to me.

Ookay, as the thing is quite easy to make in basic form, I wrote the code to do the reordering+unwelding.
Have fun:

void* Flattenize(int NumVerts,const void* VertsData,int VtxSize,
				int NumTris,int* inoutTriVtxIndices,const void* TrisData,int TriSize,
				int &outNumVerts,int &outVtxSize){ // write-only, updated on success

	struct FVTX{
		const char* var_data;
		const char* flat_data;
	};

	if(NumVerts<=0 || NumTris<=0)return NULL;
	FVTX* verts = new FVTX[NumTris>NumVerts ?  NumTris*3 : NumVerts*3]; // worst-case size, including precaution if some verts are not being used originally
	int FNumNewVerts = NumVerts;
	const char* inVertsData2 = (const char*)VertsData; // utility pointering
	const char* inTrisData2  = (const char*)TrisData;  // utility
	int i;
	for(i=0;i<NumVerts;i++){
		verts[i].var_data = &inVertsData2[i*VtxSize];
		verts[i].flat_data = NULL; // NULL marks a vertex that is not yet provoking
	}

	int* pVtxIndex = inoutTriVtxIndices; // to allow easy change to per-primitive-numVerts
	for(i=0;i<NumTris;i++){
		int FirstOpenSlot = -1;
		int* pTriVtxIndex = pVtxIndex;		pVtxIndex+=3;
		const char* triFlatData = &inTrisData2[i*TriSize];

		for(int j=0;j<3;j++){
			if(verts[pTriVtxIndex[j]].flat_data==NULL){ FirstOpenSlot=j; break; }
		}
		if(FirstOpenSlot==-1){ // uhoh, unweld first vertex
			int oldIdx = pTriVtxIndex[0];
			int newIdx = FNumNewVerts++;
			pTriVtxIndex[0] = newIdx;
			verts[newIdx].var_data = verts[oldIdx].var_data;
			verts[newIdx].flat_data = triFlatData;
			//-----[ note: there is a possibly easy optimization: ]----------[
			//	instead of blindly picking the first vertex, smartly pick
			//	the most suitable one. The most suitable one is the one
			//	that is used in the least number of triangles.
			//	Thus, have the FVTX struct keep track of how many times
			//  it has been used.
		}else{
			while(FirstOpenSlot--){ // rotate indices so the open vertex becomes the provoking one
				int fstidx = pTriVtxIndex[0];
				pTriVtxIndex[0] = pTriVtxIndex[1];
				pTriVtxIndex[1] = pTriVtxIndex[2];
				pTriVtxIndex[2] = fstidx;
			}
			verts[pTriVtxIndex[0]].flat_data = triFlatData;
		}
	}

	//--------[ now compose result verts array ]------------------[
	int NewVtxSize = VtxSize + TriSize;
	char* pResult = new char[FNumNewVerts*NewVtxSize];

	char* edi=pResult;
	for(i=0;i<FNumNewVerts;i++){
		memcpy(edi,verts[i].var_data,VtxSize); edi+= VtxSize;
		if(verts[i].flat_data){ memcpy(edi,verts[i].flat_data,TriSize); }
		else { memset(edi,0,TriSize); } // vertex never became provoking: flat part zeroed
		edi+= TriSize;
	}
	delete[] verts;

	outNumVerts = FNumNewVerts;
	outVtxSize  = NewVtxSize;
	return pResult;
}

Now, your test-case:

// ************************* TESTING AREA. WEAR HELMETS ***************************************

struct VVV1{
	int test1;
	float pos[3];
	float norm[3];
	float data1[7];
};
struct TTT1{
	int test2;
	float clr[3];
};

VVV1 origVerts[4];
TTT1 origTris[4];

int origIndices[4*3] = {
	0, 1, 3,
	1, 2, 3,
	2, 0, 3, // ****** I had to reorder this to match your problem-scenario
	0, 2, 1
};

struct RES_VTX{ // resulting vertex
	VVV1 v; // varying data
	TTT1 t; // flat triangle data
};

void DO_FLATTENIZE(){ // main() for test
	int resultNumVerts,resultVtxSize;

	for(int i=0;i<4;i++){
		origVerts[i].test1 = i + 100;
		origTris[i].test2 = i + 10000;
	}

	RES_VTX* data1 = (RES_VTX*)Flattenize(4,origVerts,sizeof(VVV1),4,origIndices,origTris,sizeof(TTT1),resultNumVerts,resultVtxSize);

	trace("resultNumVerts = %d",resultNumVerts);
	trace("resultVtxSize = %d",resultVtxSize);
	for(int i=0;i<resultNumVerts;i++){
		trace("vtx[%d]= %d ,  %d",i,data1[i].v.test1,data1[i].t.test2);
	}
	for(int i=0;i<4;i++){
		trace("   tri %d:	%d, %d, %d",i,origIndices[i*3+0],origIndices[i*3+1],origIndices[i*3+2]);
	}
	delete[] (char*)data1;
}


And the debugging output for it:

resultNumVerts = 5
resultVtxSize = 72
vtx[0]= 100 ,  10000
vtx[1]= 101 ,  10001
vtx[2]= 102 ,  10002
vtx[3]= 103 ,  0  ***** NOTICE
vtx[4]= 100 ,  10003
   tri 0:	0, 1, 3
   tri 1:	1, 2, 3
   tri 2:	2, 0, 3
   tri 3:	4, 2, 1

Idea on optimization:
In your scenario, each vertex is used by the same number of tris. Do one pass that looks like the reordering+unwelding pass, but with no unwelding; instead, wherever unwelding would occur, mark a boolean in the vertex: “bool isProblematic” = true. Then clear all vertices’ pFlatData and start reordering again, avoiding problematic verts where possible.
That way, your problem-case will not unweld any vertex.

This is definitely a very interesting approach with the current capabilities of OpenGL. This thread is now in my bookmarks, thanks!

I still believe that “indexing per attribute” would be a nice feature to have in a future OpenGL version.

Thanks again!

Excellent post! This has been bugging me for some time. There is just one downside to using per-attribute indices as suggested: cache misses. However, this is all extremely flexible; actually, it’s flexible enough for me to want to implement it :)!

This is generally the new method with OpenGL now … MAKE IT YOURSELF !!! :slight_smile:


Get your facts straight before posting such rubbish.