glDrawElements with independent vertex, normal and texCoord indices

Originally posted by Cyranose:
[b] Ignoring the tangential issues here, the original question was whether it’s useful to have separate indices for the various vertex components when using vertex arrays.

Unfortunately, it is (and not just for the cube example), though I don’t think it would be easy to even simulate unless such an extension converts multi-indexed data to single-indexed data by duplicating on the fly (yuck, no go).

The desire for such a feature (for me) is not to avoid duplicating colors or texcoords in memory. It’s mainly to avoid duplicating vertices where two vertices at the same point in space have different attributes, such as different normals, different texcoords, or different colors.

The problem comes in when trying to optimize vertex caching. Two nearly identical vertices can’t be cached as one. This shows up wherever you need discontinuities in normals (e.g., hard edges), colors, or texcoords.

But under the hood, it’s likely that the vertex cache stores not only the post-transformed vertex (and normal) but the other post-transformed post-lit parameters too, all keyed by the original index value. So there’s a real question as to whether such an extension could reasonably be supported in HW without multiple caches and lots of extra complexity. And it doesn’t make much sense for vertex programs, where you’d want to cache the computed VP outputs, not the separate inputs.

Anyway, back to the original question. If you have a small number of colors, normals, and/or tex coords and want to re-use those with separate indices, you can simulate this “extension” with a simple vertex program. Load your arrays (colors, texcoords, normals) into VP constants (up to the limits, say 96 total vec4s for 1st-gen HW) and send down vertices with XYZ and color only. In the VP, use those color R,G,B values to do one or more table lookups for the real colors, texcoords, and normals stored in the registers and emit the combined result.

This, of course, does not solve the cache re-use issue either. But I assume (or at least hope) vertices are cached post VP execution.
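
For illustration only, the host-side setup for that trick might look roughly like this (the palette contents, the packing of the index into the color attribute, and the helper names are all made up; the VP itself would then use ARL and relative constant addressing to fetch the real values):

    #include <GL/gl.h>
    #include <GL/glext.h>

    // A tiny palette of shared attribute values, e.g. the six face normals of a cube.
    static const GLfloat normalPalette[6][4] = {
        {  1, 0, 0, 0 }, { -1, 0, 0, 0 },
        {  0, 1, 0, 0 }, {  0,-1, 0, 0 },
        {  0, 0, 1, 0 }, {  0, 0,-1, 0 },
    };

    // Load the palette into vertex program environment constants (ARB_vertex_program).
    // The entry point is passed in because it has to be fetched from the driver.
    void loadNormalPalette(PFNGLPROGRAMENVPARAMETER4FVARBPROC programEnvParameter4fv)
    {
        for (GLuint i = 0; i < 6; ++i)
            programEnvParameter4fv(GL_VERTEX_PROGRAM_ARB, i, normalPalette[i]);
    }

    // The per-vertex data then carries only a position plus small palette indices,
    // smuggled through the color attribute; the VP looks up the real values.
    struct PackedVertex
    {
        GLfloat position[3];
        GLubyte paletteIndex[4]; // normal index, color index, texcoord index, padding
    };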

Avi

[This message has been edited by Cyranose (edited 08-15-2003).][/b]

Hello. As far as complexity goes, I think it is very possible. In fact, the biggest problem is bus bandwidth and not the processor on the card; this could be very simple to solve on the card.

Currently there are vertex, texture coord, normal and color buffers on the card. Then of course there is an index list that goes through and says something along the lines of

i = index
vertexArray[i], normalArray[i] and so on.

Do you really think that this would become a lot more complicated if they had something like

v = vertexindex
n = normalindex

vertexArray[v], normalArray[n]

If the index list was something along the lines of a packed array

indexlist = vertex, normal, color, texture;

instead of just indexlist = vertex

I really find it hard to believe that this would add large amounts of complexity to a card.

Later, Ben

Hello again.

While thinking about this I realized that it would also have to address another problem: more than one set of texture coordinates. Anybody have any ideas about how to do this… Perhaps telling GL what format you will use.

Perhaps defining a type.
set( GL_VERTEX_4F | GL_NORMALS | GL_TEXTURE0 | GL_TEXTURE1)

That way you can define the way this list is packed into memory.

What does everybody think of that?

In fact the biggest problem is bus bandwidth and not the processor on the card.

No, the biggest problem is with the post-T&L cache.

This cache, currently, operates by storing the post-T&L values and the single index that was used to create this set of post-T&L data. What you are asking for, however, cannot be post-T&L cached at all.

Here’s why. If vertex data only has one index, then, if the vertex program is deterministic (same input yields same output), you know that if that index shows up again, you will get the same post-T&L data. So you may as well fetch it from a cache.

But, if the vertex data has multiple indices, you have to cache the data on the entire set of indexed data. You can only fetch from the cache if all of the indices match. The reason for this is that the caching mechanism, especially with vertex programs, cannot guarantee that only the data from the input TexCoord0 value was used to compute the output TexCoord0 interpolant. Maybe it used TexCoord0 in addition to the position. Maybe it’s using generic attributes, which could be anything.

Therefore, you get poor caching behavior.

Here’s some simple index-keyed T&L pseudo-code for an idea of why the multi-index is a problem. I’m not a HW designer or driver engineer, so I’m just guessing this organization – use it for discussion purposes only:

struct vertex
{ position, normal, color, tcoord };

For each index[i]
    if (!is_in_cache(index[i]))
    {
        vertex = fetch_agp(index[i]);    // pulls from separate or interleaved AGP areas into register memory
        vertex = transform(vertex);      // pos+norm; color, tcoord can have a matrix
        vertex.color  = light(vertex);
        vertex.tcoord = texgen(vertex);
        store_cache(index[i], vertex);
        emit(vertex);
    }
    else
    {
        vertex = fetch_cache(index[i]);
        emit(vertex);
    }

The idea is to make the fetch, transform, texgen, and light calls as infrequent as possible. In your scenario, I don’t believe there’s a way to have multiple sets of indices and still have the cache be as useful – the solution would most likely be to have multiple caches, one for each kind of index. However, there’s a combinatorial problem, in that you may find your position, normal, and color cached, but you haven’t cached that combination. Therefore, the light() function at least would have to happen for every (or every unique) combination of V,N and C.

If you’re interested, try rewriting the loop above to have multiple sets of indices and see how often you have to call the expensive functions. It’s not pretty. You can’t just rewrite the fetch() function to take multiple indices, since the is_in_cache() function won’t give the correct answer in that case.

The key to making it work, I imagine, is to make the cache keyed by either the concatenated indices (I1,I2,I3,…) or the complete untransformed vertex (P,N,C,T). That way you could expand your multiple indices first and still get some caching. But that cache key is 8-16 times bigger, and who knows if the lookup cost is linear or exponential with the number of key bits. As I said, I’m not a HW designer.
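
For instance, keyed on the full index tuple, the loop might become something like this (again, discussion-level sketch code only; the cache size and the helper functions are stand-ins):

    #include <array>
    #include <deque>
    #include <utility>
    #include <vector>

    struct Vertex { float position[4], normal[3], color[4], tcoord[2]; };

    using IndexTuple = std::array<unsigned, 4>;   // (Iposition, Inormal, Icolor, Itcoord)

    // Stand-ins for the expensive per-vertex work in the single-index loop above.
    Vertex fetch_agp(const IndexTuple&) { return Vertex{}; }  // gather from four separate arrays
    Vertex transform(const Vertex& v)   { return v; }
    void   light(Vertex&)               {}
    void   texgen(Vertex&)              {}
    void   emit(const Vertex&)          {}

    // Post-T&L cache keyed on the full index tuple: a hit now requires *all four*
    // indices to match, and the key is four times as wide as a single index.
    class PostTnLCache
    {
        std::deque<std::pair<IndexTuple, Vertex>> lines;
    public:
        const Vertex* find(const IndexTuple& key) const
        {
            for (const auto& line : lines)
                if (line.first == key) return &line.second;
            return nullptr;
        }
        void insert(const IndexTuple& key, const Vertex& v)
        {
            lines.emplace_back(key, v);
            if (lines.size() > 16u) lines.pop_front();   // small FIFO, like real HW caches
        }
    };

    void drawMultiIndexed(const std::vector<IndexTuple>& indices, PostTnLCache& cache)
    {
        for (const IndexTuple& key : indices)
        {
            if (const Vertex* hit = cache.find(key)) { emit(*hit); continue; }
            Vertex v = transform(fetch_agp(key));        // the expensive path otherwise
            light(v);
            texgen(v);
            cache.insert(key, v);
            emit(v);
        }
    }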

Avi

[This message has been edited by Cyranose (edited 08-16-2003).]

Originally posted by Korval:
[b] No, the biggest problem is with the post-T&L cache.

This cache, currently, operates by storing the post-T&L values and the single index that was used to create this set of post-T&L data. What you are asking for, however, cannot be post-T&L cached at all.

Here’s why. If vertex data only has one index, then, if the vertex program is deterministic (same input yields same output), you know that if that index shows up again, you will get the same post-T&L data. So you may as well fetch it from a cache.

But, if the vertex data has multiple indices, you have to cache the data on the entire set of indexed data. You can only fetch from the cache if all of the indices match. The reason for this is that the caching mechanism, especially with vertex programs, cannot guarantee that only the data from the input TexCoord0 value was used to compute the output TexCoord0 interpolant. Maybe it used TexCoord0 in addition to the position. Maybe it’s using generic attributes, which could be anything.

Therefore, you get poor caching behavior.[/b]

So why not transfer it as indexed lists and have the card convert it to a format that it likes? Perhaps a compiled mesh. It already packs images into a format that it likes; I don’t see why this is not possible for mesh data.

Ben

Originally posted by Cyranose:
[b]Here’s some simple index-keyed T&L pseudo-code for an idea of why the multi-index is a problem. I’m not a HW designer or driver engineer, so I’m just guessing this organization – use it for discussion purposes only:

[quote]

struct vertex
{ position, normal, color, tcoord };

For each index[i]
    if (!is_in_cache(index[i]))
    {
        vertex = fetch_agp(index[i]);    // pulls from separate or interleaved AGP areas into register memory
        vertex = transform(vertex);      // pos+norm; color, tcoord can have a matrix
        vertex.color  = light(vertex);
        vertex.tcoord = texgen(vertex);
        store_cache(index[i], vertex);
        emit(vertex);
    }
    else
    {
        vertex = fetch_cache(index[i]);
        emit(vertex);
    }

The idea is to make the fetch, transform, texgen, and light calls as infrequent as possible. In your scenario, I don’t believe there’s a way to have multiple sets of indices and still have the cache be as useful – the solution would most likely be to have multiple caches, one for each kind of index. However, there’s a combinatorial problem, in that you may find your position, normal, and color cached, but you haven’t cached that combination. Therefore, the light() function at least would have to happen for every (or every unique) combination of V,N and C.

If you’re interested, try rewriting the loop above to have multiple sets of indices and see how often you have to call the expensive functions. It’s not pretty. You can’t just rewrite the fetch() function to take multiple indices, since the is_in_cache() function won’t give the correct answer in that case.

The key to making it work, I imagine, is to make the cache keyed by either the concatenated indices (I1,I2,I3,…) or the complete untransformed vertex (P,N,C,T). That way you could expand your multiple indices first and still get some caching. But that cache key is 8-16 times bigger, and who knows if the lookup cost is linear or exponential with the number of key bits. As I said, I’m not a HW designer.

Avi

[This message has been edited by Cyranose (edited 08-16-2003).][/b][/QUOTE]

I see what you guys are getting at. The only thing that still catches my attention (and to be honest I don’t know how often this would help) is that caching the normals would give you the ability to calculate the lighting for that normal only once and have it cached for each subsequent use. But how often are there matching normals in a model, and is it worth the extra cache to run something like this?

The only thing I can really think of is that each triangle would have 3 matching normals, one at each vertex (assuming the point was to give each face attached to a vertex its own normal). Calculating them once would reduce some overhead.

Also, one thing that you may want to consider is that with multi-texture there are a lot of coordinates to transfer. Assuming for a minute that you want to use 4 textures (the current max, or it was the last time I checked), you have to pass a lot of info, but how much of that is going to be duplicated? If you combined the 4 lists into one, removed the duplicates, and then indexed that list, it would have to improve some of this.

Thanks everybody for the responses.
Ben

But, if the vertex data has multiple indices, you have to cache the data on the entire set of indexed data. You can only fetch from the cache if all of the indices match.

Or you can have multiple smaller caches, one per vertex attribute. Granted, this would be far less efficient / much larger than a cache that holds more data per cache line, as you’d need to store more indices and more control logic for less data.

Of course, this is all IMHO.

You could also do as zander76 suggests, but then you don’t get much in return for the added complexity; the hardware still has to do the added processing, needs to use more memory for the vertex data, there’s additional logic for the unpacking, etc.

On the kind of complex models where saving bus traffic is important, you usually have lots of shared vertices. For all those shared vertices you still need to send all the extra indices for tex coords and whatnot. This means that, on typical complex models, you probably get more bus traffic with separate indices than without. If you’re flat-shading all your models, then separate indices might be interesting.
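
As a rough back-of-the-envelope example (my own numbers: 16-bit indices and a closed mesh with roughly twice as many triangles as vertices): a single index stream costs about 12 bytes of index data per vertex, while four separate index streams (position, normal, color, texcoord) cost about 48 bytes per vertex. The duplicated attribute data you avoid sending has to be worth more than that extra ~36 bytes per vertex before multiple indices actually win on the bus.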

Originally posted by harsman:
On the kind of complex models where saving bus traffic is important, you usually have lots of shared vertices. For all those shared vertices you still need to send all the extra indices for tex coords and whatnot. This means that, on typical complex models, you probably get more bus traffic with separate indices than without. If you’re flat-shading all your models, then separate indices might be interesting.

But if future cards were able to upload the index lists too, then it would be faster and use less bandwidth.
You would just give the start of the index list and the length…

Or you can have multiple smaller caches, one per vertex attribute. Granted, this would be far less efficient / much larger than a cache that holds more data per cache line, as you’d need to store more indices and more control logic for less data.

That, and it doesn’t work.

As I pointed out, vertex programs can generate an interpolant from any of the data, not just the incoming interpolant value. TexCoord0 could be generated from TexCoord1, Color0, and the position. You can’t be sure, so any time one of them changes, you have to run the entire vertex program over again.

You can’t be sure, so any time one of them changes, you have to run the entire vertex program over again.

Yep. It’ll work as long as the data doesn’t change, which is a reasonable assumption considering the small window that the T&L cache covers.

At worst, the cache is just not in effect, which wouldn’t make you lose much anyway, as the indices are already optimized to eliminate redundancies.

That is - the cache is there in the first place because you can’t have multiple indices per vertex, and so data must be replicated.

Edit: I don’t know what I was smoking when I wrote that last paragraph. Just ignore it.

[This message has been edited by al_bob (edited 08-18-2003).]

Is this really a problem? Most models are smooth in most places, and where they aren’t you can use huge polygons (like for walls). For spheres or other smooth shapes (faces mostly have smooth corners) you end up with most of the vertices having all data shared, and then you would only be sending a couple of index lists too many each frame.

For a cube you can still use shared vertices and specify GL_FLAT as the shade model. If you lay out the indices carefully you can make sure that each vertex is used as the provoking vertex of only one triangle, and with GL_FLAT the lighting result of that provoking vertex is used for the whole face.

Originally posted by zander76:
So why not transfer it as indexed lists and have the card convert it to a format that it likes.
Implementations “like” single-indexed meshes because they are easy to support, easy to build caches for, and offer ample opportunity for burst memory transfers. That’s why most people convert multiple-index meshes created by those modelling packages at load time, before passing them to the renderer.
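
A minimal load-time conversion might look something like this (field names and types are made up for illustration); every unique combination of indices becomes one single-indexed vertex, and repeats cost only an extra index:

    #include <array>
    #include <cstdint>
    #include <map>
    #include <vector>

    struct Float3 { float x, y, z; };
    struct Float2 { float u, v; };

    // Multi-indexed input, as exported by many modelling packages (OBJ-style):
    // each triangle corner references its attributes through separate indices.
    struct Corner { std::uint32_t p, n, t; };

    struct MultiIndexedMesh
    {
        std::vector<Float3> positions, normals;
        std::vector<Float2> texcoords;
        std::vector<Corner> corners;          // 3 per triangle
    };

    // Single-indexed output, ready for glDrawElements: one attribute set per index.
    struct SingleIndexedMesh
    {
        std::vector<Float3> positions, normals;
        std::vector<Float2> texcoords;
        std::vector<std::uint32_t> indices;   // 3 per triangle
    };

    SingleIndexedMesh deindex(const MultiIndexedMesh& in)
    {
        SingleIndexedMesh out;
        std::map<std::array<std::uint32_t, 3>, std::uint32_t> remap;

        for (const Corner& c : in.corners)
        {
            const std::array<std::uint32_t, 3> key = { c.p, c.n, c.t };
            auto it = remap.find(key);
            if (it == remap.end())
            {
                // First time this exact combination appears: emit a new vertex.
                it = remap.emplace(key, static_cast<std::uint32_t>(out.positions.size())).first;
                out.positions.push_back(in.positions[c.p]);
                out.normals.push_back(in.normals[c.n]);
                out.texcoords.push_back(in.texcoords[c.t]);
            }
            out.indices.push_back(it->second);  // duplicates only cost an index
        }
        return out;
    }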

Can you show us any non-trivial mesh where the calculation would swing in favor of multiple indices per vertex?

[This message has been edited by zeckensack (edited 08-18-2003).]

Originally posted by zeckensack:
Can you show us any non-trivial mesh where the calculation would swing in favor of multiple indices per vertex?

I’m not entirely in favor of having multiple indices, as I think it complicates things, but here are some examples:

If post-transformed positions can be cached by a p-index and reused with new tcoords, the age-old texture-wrap seam problem goes away without duplicating verts. Not a huge win for most models, but worth considering.

If post-transformed normals can be cached and reused by a n-index, the savings could be large for many models. Consider a set of stairs, a ladder, or a sculpted brick wall (non-bump-mapped) – things with lots of repeated normals, whether or not the model is smooth or faceted.

In general, the re-use of normals, colors, and texcoords goes way up when considering the case of drawing many objects in a combined buffer with multi-index support, especially many instances of the same model, regardless of how smooth or faceted it is.

The big problem in the 2nd case is the need to relight re-used normals in local lighting situations, so the post-lit color result can’t be cached with the normal as you might want.

Just some examples. Food for thought.

Avi

[This message has been edited by Cyranose (edited 08-18-2003).]

The problem is that transforming a normal is not very computationally expensive compared to actually doing the lighting computations for each vertex. You’re going to have to relight each vertex even if they share normals but not position (at least with point and spot lights; you can get away with lighting based only on the normal with non-attenuated directional lights), so any savings on the transformation probably won’t even be noticeable next to the lighting cost.

j

If post-transformed positions can be cached by a p-index and reused with new tcoords, the age-old texture-wrap seam problem goes away without duplicating verts. Not a huge win for most models, but worth considering.

How do you know that the post-transformed position isn’t dependent on which texture coordinate is being sent? It’s possible; strange, but possible. And what happens when the user is using generic attributes rather than glVertex/Normal/TexCoordPointer? You have no idea which generic attributes correspond to position, normal, color, etc.

Originally posted by Korval:
How do you know that the post-transformed position isn’t dependent on which texture coordinate is being sent? It’s possible; strange, but possible. And what happens when the user is using generic attributes rather than glVertex/Normal/TexCoordPointer? You have no idea which generic attributes correspond to position, normal, color, etc.

Hi Korval. Most of those examples are for traditional T&L. Also, J, granted the relighting for local lighting is a problem with caching just the normal transformation (I think I mentioned that directly).

Okay. If you want to talk about vertex programs, then having multiple indices becomes even more useful.

Let’s assume the VP outputs are what’s cached, even for multiple indices. We know it can be done, albeit for unknown cost. And given invariance, I hope we agree that for a given combination of indices, cached or not, the VP outputs should be the same.

With the current approach, we may define extra generic attributes as a way of getting pseudo-per-vertex data into vertex programs. Alternatively, we could store commonly used data in VP constants, but we often choose not to because of space constraints, or the API cost of loading them, or the reality of having to start and stop big buffer draws to load new parameters mid-stream. Stuffing the data into per-vertex attribs, even if it’s not truly unique, gets around this, albeit by wasting a lot of space and bandwidth.

But using multi-indices gets around both limitations nicely.

Case in point: skinning. Currently, I might load a set of matrices into VP constants to do matrix blending based on a matrix index or indices that masquerade as vertex attribs. If I had a huge number of bones that didn’t fit in constants, I could chop up my skin, or alternatively use 4 generic attribs per vertex per matrix (not pretty if they’re changing a lot).

With multi-index, the matrices could be stored once in AGP, we can have many more of them, and we can blast larger groups of objects in single draw calls.

Even for the non-skinned case, multi-index here is a huge win, since now I can draw 10,000 objects with 10,000 different matrices (but otherwise shared state) all in one big batch instead of using 10,000 push/pop matrix calls around 10,000 draws.
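
(That is, instead of the usual per-instance loop, sketched here with made-up Mesh/Instance types:)

    #include <GL/gl.h>
    #include <vector>

    struct Mesh     { GLsizei indexCount; const GLushort* indices; };
    struct Instance { GLfloat modelMatrix[16]; };

    // The status quo: one matrix change and one draw call per instance.
    void drawInstances(const Mesh& mesh, const std::vector<Instance>& instances)
    {
        for (const Instance& inst : instances)
        {
            glPushMatrix();
            glMultMatrixf(inst.modelMatrix);                  // per-object transform
            glDrawElements(GL_TRIANGLES, mesh.indexCount,
                           GL_UNSIGNED_SHORT, mesh.indices);  // same shared mesh each time
            glPopMatrix();
        }
    }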

Another case might be a VP that does selective lighting based on per-vertex params. Those light constants could be stored per-vertex, or they could be indexed and shared. Similar issues.

Finally, on your point about using texcoords as data: I’ve found that when I use things like vertex texcoords or color as generic data inputs to VP algorithms, I’m often using them as indices into table data. So having that explicitly supported would obviate some or even many of those cases.

Avi

[This message has been edited by Cyranose (edited 08-19-2003).]

Case in point: skinning. Currently, I might load a set of matrices into VP constants to do matrix blending based on a matrix index or indices that masquerade as vertex attribs. If I had a huge number of bones that didn’t fit in constants, I could chop up my skin, or alternatively use 4 generic attribs per vertex per matrix (not pretty if they’re changing a lot).

With multi-index, the matrices could be stored once in AGP, we can have many more of them, and we can blast larger groups of objects in single draw calls.

You can get the same effect (plus others) just by having a bindable buffer of memory attached to vertex programs, which a vertex program can access like memory (almost certainly read-only). That’s much easier than having multiple vertex attribute indexing.

Also, with skinning done via passing indices, a set of 4 indices and weights only takes up 2 vertex attributes. A single matrix requires 3 + 1 for the weight. To get 4-bone skinning, you have to give up 10 attributes.

Originally posted by Korval:
[b] You can get the same effect (plus others) just by having a bindable buffer of memory attached to vertex programs, which a vertex program can access like memory (almost certainly read-only). That’s much easier than having multiple vertex attribute indexing.

Also, with skinning done via passing indices, a set of 4 indices and weights only takes up 2 vertex attributes. A single matrix requires 3 + 1 for the weight. To get 4-bone skinning, you have to give up 10 attributes.[/b]

I think it’s worse than 10, actually. But you could just barely fit 4-bone skinning along with base position and normal attribs and some UV.

Yes, it uses up your attribs indirectly (i.e., doesn’t waste memory per vertex other than those indices and weights), but it allows you to batch-render one or more characters, which is the whole point.

I certainly like the idea of bindable VP memory, but how much slower would it be? If there’s an early piece of pipeline that does nothing but de-index vertices (read various AGP arrays and output a complete vertex) for streaming to a VP block that needs only look at internal registers, that seems faster than having the VP fetch external memory at random points. If anything, I’d expect such bindable memory to get block-read into VP constants first, meaning they’re still space-limited. But IANAHWD.

But the real point is that with bindable AGP memory, you’d wind up using vertex attribs as indices into that memory, no? Well, then you’re doing exactly the multi-index thing, but without direct support at the caching level. In essence, this configuration is forcing the developer to turn their natural multi-index data into a single index for the sake of caching and the current API. That could be made automatic to improve the semantic sense of things.

Avi www.realityprime.com

[This message has been edited by Cyranose (edited 08-21-2003).]

But you could just barely fit 4-bone skinning along with base position and normal attribs and some UV.

Um, 4-bone skinning the normal way only requires 2 attributes: 1 for the weights and 1 for the indices. It requires a lot of uniforms, but not many attributes.
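
For example, a typical palette-skinned vertex layout (just an illustration, not any particular engine’s format) only adds those two attributes:

    #include <cstdint>

    // Bone matrices live in program constants (or in some bindable buffer, as
    // discussed above); each vertex carries only indices and weights, packed
    // into one extra attribute each.
    struct SkinnedVertex
    {
        float        position[3];
        float        normal[3];
        float        uv[2];
        std::uint8_t boneIndex[4];   // extra attribute 1: four palette indices
        float        boneWeight[4];  // extra attribute 2: four blend weights (sum to 1)
    };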

I certainly like the idea of bindable VP memory, but how much slower would it be?

Why would it necessarily be slower? Presumably, the memory is cached in the same cache as the vertex attributes. The only difference is when the memory is being accessed.

If there’s an early piece of pipeline that does nothing but de-index vertices (read various AGP arrays and output a complete vertex) for streaming to a VP block that needs only look at internal registers, that seems faster than having the VP fetch external memory at random points.

Why does it have to be faster? As I mentioned above, as long as the memory is cached into the same cache as the vertex data, there’s no difference in performance.

If anything, I’d expect such bindable memory to get block-read into VP constants first, meaning they’re still space-limited.

Why would you expect that? The entire point of having the bindable memory is so that you remove the space limitations that are inherent in current vertex programs.

But IANAHWD.

DHEUDJKE to you too.