Indexing per vertex attribute

Originally posted by HellKnight:
[quote]Originally posted by tfpsly:
Oh, wait! In Q1 and Q3 maps, you actually do not need the texture coordinates at all! You can generate them in the vertex program with one dot3 and one sub, as long as you set one vector and one float (which is just a vec4) VP constant each time you change the current texture!

Oh yes :slight_smile: , you mean every time you draw a face you’d set up some uniform vertex program parameters, change texture states, etc… In that case I’d just do all the computations on the CPU and use immediate mode hehe :slight_smile:
[/QUOTE]?? You really do not understand how the UVs are computed in the Quake games then…
I was speaking about setting up one vector constant each time you change the current texture, which means once per material. At most as many times as you call glDrawElements, but it can be less.

Well, could you please explain then how the texture coords are supposed to be generated - per texture? You have, say, 100 faces, each one with a different orientation etc., but with the same texture… How do you compute the tex coords then, with a SINGLE VP uniform parameter?

Your approach would work if all the faces with a given texture lay in the same plane and had no texture transformations applied (rotation/scaling/translation), and that’s rarely the case. Yeah, there’s this special case in Q1/2 where the BSP compiler splits up faces > 256x256 units, but that’s only because of lightmap issues and is not done in modern engines anymore, AFAIK…
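For reference, the Quake-style texgen being argued about boils down to one dot product plus an offset per texture axis, with the axes and offsets taken from the surface's texinfo. Here's a minimal CPU-side sketch (struct and field names are invented); note that the axes and offsets belong to the surface, not to the texture itself, which is exactly the objection above:

```cpp
// Minimal sketch of Quake-style UV generation on the CPU; names are invented.
// The axes/offsets vary per surface (texinfo), not per texture.
#include <cstdio>

struct Vec3 { float x, y, z; };

struct TexInfo {
    Vec3  sAxis, tAxis;       // texture-space axes in world units
    float sOffset, tOffset;   // translation along each axis
    float width, height;      // texture size in texels
};

static float dot3(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// One dot product and one add/sub per axis; in a vertex program this would be
// a DP3 plus ADD (or a single DP4 if the offset is packed into the constant's w).
static void computeUV(const Vec3& pos, const TexInfo& ti, float& u, float& v) {
    u = (dot3(pos, ti.sAxis) + ti.sOffset) / ti.width;
    v = (dot3(pos, ti.tAxis) + ti.tOffset) / ti.height;
}

int main() {
    TexInfo ti = { {1, 0, 0}, {0, 0, -1}, 16.0f, 32.0f, 128.0f, 128.0f };
    Vec3 p = { 64.0f, 0.0f, -32.0f };
    float u, v;
    computeUV(p, ti, u, v);
    std::printf("u = %f, v = %f\n", u, v);
    return 0;
}
```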

There was an old post on these forums (I can’t remember by whom, perhaps Jon Watte), where someone said they took a few game models and stored them both as index-per-attribute and as index-per-entire-vertex.

The index-per-vertex version was actually smaller, because you didn’t have to store duplicate index information. Except on hard edges (e.g. where you’d want two normals per position), the indices are the same.

I can still see how having an index per attribute would be quite useful in certain situations, but to convince me that it uses less RAM you will need to provide more evidence.
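For what it’s worth, the trade-off is easy to sketch with made-up numbers: the deduplicated attribute arrays shrink, but you pay for two extra index streams per corner, so which format wins depends entirely on the mesh. A toy calculation (every count below is invented purely for illustration):

```cpp
// Back-of-the-envelope size comparison: single index per vertex vs. one index
// per attribute.  Every count below is invented; real savings depend on the mesh.
#include <cstdio>

int main() {
    const unsigned tris      = 10000;
    const unsigned corners   = tris * 3;   // one index (or index tuple) per corner

    // Hypothetical counts after welding each attribute separately:
    const unsigned uniquePos = 5200;       // unique positions
    const unsigned uniqueNrm = 900;        // unique normals (faceted geometry reuses few)
    const unsigned uniqueUV  = 5800;       // unique texture coordinates

    // Unique (position, normal, uv) combinations for the single-index format:
    const unsigned fullVerts = 6400;

    const unsigned posBytes = 12, nrmBytes = 12, uvBytes = 8, idxBytes = 2;

    const unsigned singleIndex = fullVerts * (posBytes + nrmBytes + uvBytes)
                               + corners * idxBytes;        // one index per corner
    const unsigned multiIndex  = uniquePos * posBytes
                               + uniqueNrm * nrmBytes
                               + uniqueUV  * uvBytes
                               + corners * idxBytes * 3;    // three indices per corner

    std::printf("single index: %u bytes\n", singleIndex);
    std::printf("multi index:  %u bytes\n", multiIndex);
    return 0;
}
```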

Originally posted by Korval

If you use the same position index more than once, the second time you use it that position may still be in the pre-T&L cache. So you save memory bandwidth, compared to uploading the same duplicated vertex position data with the traditional method.

For that to work effectively, so that it actually gives performance benefits, the mesh would have to be extremely well behaved. Given a sufficiently large mesh, it would be non-trivial to produce such a well-behaved mesh. A tri-strip would probably be better behaved “out of the box”, with minimal effort.

Originally posted by HellKnight

Heh, if you’re talking about the days of Quake or maybe Quake 2, yes, you had 1000-2000 vertices PER FRAME. But if we’re talking about recent games, say Doom3, HL2 or whatever, I wouldn’t be surprised if the total vertex count per level were over a million or so. A simple example.

I highly doubt that, although I confess that I do not have first-hand information. But considering that most of the objects in games are dynamic these days (due to physics), they are not part of the actual BSP mesh, but are added as separate entities, which are rendered in another pass that does not require any vertex duplication. That leaves relatively few vertices in the BSP mesh itself. I think that even Doom3 and HL2 wouldn’t have more than 200,000 tris even for a large BSP mesh.

Originally posted by Dez

Some could argue that this functionality is already available on SM3.0 hardware, that is, the same hardware that supports vertex texturing, and that we can emulate the flexible data fetching offered by per-attribute indices using vertex texture sampling.

Instancing was a part of SM2.0b, which was supported by the R3xx series.

Originally posted by Dez

As for the cache coherence argument and the complexity of implementation, I would say that hardware vendors could use a very basic implementation, skipping caching where it is too difficult, and let developers decide whether or not to use this functionality based on the results they get in their particular case. Needless to say, implementing this “the right way” :wink: could lead to healthy competition between hardware vendors.

I disagree with that, because too many options can result in more confusion and hence bad code in many applications. Couple that with different implementations from different vendors, and you are adding more complexity (and confusion) to code optimization.
Keeping in mind the bloated index buffer that would result from using such a technique (as mentioned by bpoint), and the results of jwatte’s analysis, I think this technique would probably not provide many benefits.
But that’s just my opinion :slight_smile: .

Originally posted by Zulfiqar Malik

I highly doubt that, although I confess that I do not have first-hand information. But considering that most of the objects in games are dynamic these days (due to physics), they are not part of the actual BSP mesh, but are added as separate entities, which are rendered in another pass that does not require any vertex duplication. That leaves relatively few vertices in the BSP mesh itself. I think that even Doom3 and HL2 wouldn’t have more than 200,000 tris even for a large BSP mesh.

I don’t know about Half-Life, but Doom 3 certainly does not use BSP trees in that old-fashioned way anymore, at all.

There was some .plan file from John Carmack where he said that the D3 engine is based on a sector/portal system (well, we all know that anyway). He did say that BSP trees are still used for certain things (most likely for some sorting, or to find out which sector you are in), but that they are not used as in Q1-Q3.

I used BSP trees for quite a while, and everyone who did knows that they are pretty useless for the high-res levels you see in modern games. The problem is that, because of splitting, they generate a multiple of the triangle count of the original mesh.

Of course, you can simply not split triangles and use BSP trees only for rough front-to-back sorting (or back-to-front for glass, and so on), but then it is more efficient to simply do rough front-to-back sorting with your sectors, because the next problem with BSP trees is that they really don’t care about materials, and therefore about state changes.

So, in general, BSP trees are not very useful for modern games, at least not for rendering. They can, of course, still be useful in some other situations.

Therefore: why do you use BSP trees as an example to explain how several vertex indices could be useful?
And, yes, most BSP trees won’t hold that many faces, simply because it would be madness, not because that many faces aren’t needed.

Back on topic: actually, I am quite happy not even to have the option to use several indices. It makes my life a lot easier, because with that feature I would need to test and optimize for another, quite complicated, case. And since it is doubtful that I could really speed anything up using the feature, I am happy not even to have the option to waste my time on it.

Certainly, there are special cases where the feature would obviously be useful, but those are rare and not common in games. And that is why the hardware vendors won’t bother thinking about it.

Jan.

Originally posted by Jan

Therefore: why do you use BSP trees as an example to explain how several vertex indices could be useful?
And, yes, most BSP trees won’t hold that many faces, simply because it would be madness, not because that many faces aren’t needed.

First of all, not once did I imply that several vertex indices are useful. In fact, I implied quite the contrary :slight_smile: . Secondly, I quoted BSP trees because that’s the one example I know of that leads to horrible duplication, and the format, as we all know, is quite popular.

OK, here’s something I found on the net. It’s a FarCry benchmark summary:

[b]

The results of my run are dumped in a file called regulator.log in the directory Far Cry\Levels\Regulator.

As you run each benchmark multiple times, the results just keep appending to the same log file. They look something like this:

TimeDemo Play Started, (Total Frames: 3044, Recorded Time: 55.50s)
!TimeDemo Run 0 Finished. Play Time: 51.59s, Average FPS: 59.00
Min FPS: 39.39 at frame 2388, Max FPS: 88.46 at frame 866
Average Tri/Sec: 5386234, Tri/Frame: 91292
Recorded/Played Tris ratio: 1.11
!TimeDemo Run 1 Finished. Play Time: 49.08s, Average FPS: 62.03
Min FPS: 39.39 at frame 2388, Max FPS: 96.84 at frame 1450
Average Tri/Sec: 5420742, Tri/Frame: 87393
Recorded/Played Tris ratio: 1.16
TimeDemo Play Ended, (2 Runs Performed)

[/b]

As you can easily see, we’re talking about ~100,000 triangles PER FRAME. OK, there are these large outdoor scenes in FarCry, but it’s just an example. Now guess the total polygon count per map.

Oh yes, and FarCry isn’t the newest game after all… Judging from the screenshots of the next Unreal game, for example, I could imagine having such a high poly count even in indoor scenes…


I think that even Doom3 and HL2 wouldn’t have more than 200,000 tris even for a large BSP mesh.

200,000 * 3 = 600,000… not that far away from 1,000,000…

EDIT: You posted while I was typing. Well, I’m not pushing for that feature either, and if you read my first post carefully, you’ll see that IMO we can live without it. I just wanted to give an example, and I stated that the savings wouldn’t exceed a couple of megabytes, which is negligible nowadays…

Wow, I go away for a day and this thread gets lots of replies! A bit unexpected… I’ll try to check it more often from here on. I noticed that there are a few posts which are straying a bit off-topic. Please try to stay on-topic as best as possible. :slight_smile:

I’ll reiterate a few points, because it seems a few people might have missed them. First of all, I speak from experience. I’ve developed and shipped two games on the GameCube, and have worked on the DS a bit too. I’ve also done a few games on the PSX, N64, and PS2, but those are a different story.

For both of our GameCube titles, we utilized multiple indices for each vertex attribute because it saved us memory. Even with the extra indices, the models were overall smaller than if we had used a single index to represent a vertex. This was not with simple test model data, but the actual models used in the game. Overall, I’d say we probably saved an average of 25% per model.

It seems that no one has mentioned (or noticed?) that one of the big gains from using multiple indices is not just eliminating duplicated position coordinates, but also eliminating duplicated normals and duplicated color values.

Normals are, IMHO, probably the biggest gain from using multiple indices. You can design a big map with lots of vertices for lighting, yet need only a small set of distinct normals to represent it. Colors are often duplicated as well, but since they’re four bytes and an index to reference one requires two bytes, you certainly haven’t saved much there.

I believe the cache-coherency issue was summarized by Korval rather well:

If you use the same position index more than once, the second time you use it that position may still be in the pre-T&L cache. So you save memory bandwidth, compared to uploading the same duplicated vertex position data with the traditional method.
Remember that this applies to any attribute that you index, not just positional data.

If it takes some solid facts to get some heads turned, I’ll code up a tool which reads some kind of model data and splits it into two formats: one where each vertex is represented by a single index and data is duplicated, and one that uses multiple indices to reference unique vertex data. It might take me a few days to get it done, but hopefully it’ll show that this functionality is useful.
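(In the spirit of that tool, here is a rough sketch of the multi-index side of the conversion: weld each attribute stream independently so it gets its own value array and its own index list. This is only an illustration under my own assumptions (exact-match welding, invented names), not bpoint’s actual tool.)

```cpp
// Rough sketch: weld each attribute stream independently so each one gets its
// own value array and index list.  Names invented; welding is exact-match only.
#include <array>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

template <std::size_t N>
struct AttribStream {
    std::vector<std::array<float, N>> values;   // unique attribute values
    std::vector<std::uint32_t>        indices;  // one index per triangle corner
};

template <std::size_t N>
AttribStream<N> weld(const std::vector<std::array<float, N>>& perCorner) {
    AttribStream<N> out;
    std::map<std::array<float, N>, std::uint32_t> seen;
    for (const auto& v : perCorner) {
        auto it = seen.find(v);
        if (it == seen.end()) {
            it = seen.emplace(v, static_cast<std::uint32_t>(out.values.size())).first;
            out.values.push_back(v);
        }
        out.indices.push_back(it->second);
    }
    return out;
}

int main() {
    // Two triangles sharing an edge: positions repeat, normals would repeat even more.
    std::vector<std::array<float, 3>> positions = {
        {0, 0, 0}, {1, 0, 0}, {0, 1, 0},
        {1, 0, 0}, {1, 1, 0}, {0, 1, 0},
    };
    AttribStream<3> pos = weld(positions);
    std::printf("%zu corners -> %zu unique positions, %zu indices\n",
                positions.size(), pos.values.size(), pos.indices.size());
    return 0;
}
```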

Finally, I’d like to add that there are at least six different ways (glBegin/glEnd, glArrayElement, glDrawElements, etc.) to draw vertex data in OpenGL. Why not one more? :slight_smile:

Thanks for all of the replies, and keep the comments coming.

Edit: Another post was added while I was writing mine, and I’d like to comment on it.

EDIT: You posted while I was typing. Well, I’m not pushing for that feature either, and if you read my first post carefully, you’ll see that IMO we can live without it. I just wanted to give an example, and I stated that the savings wouldn’t exceed a couple of megabytes, which is negligible nowadays…
I can certainly live without this feature too. I just have to manage multiple sets of data for multiple platforms. If this feature did exist, it would make OpenGL more programmer-friendly. Not just for me, but I’m sure for others as well. Also, a “couple of megabytes” might not make much of a difference on Windows, but if you’re using an embedded system (OpenGL ES?), a megabyte is fairly large.

Even in the worst case, where rendering turned out slower than with a single index, the user would have a choice of which format better suits his needs. Finally, I also don’t think anyone can honestly say whether it would be slower or faster until it actually gets implemented in hardware. Even after it gets implemented, the GPU designers will probably be able to come up with some extra tricks to make it work better – who knows. The only facts we have now are that it does save memory, and that there are others who would like such a feature. IMHO, because of this alone it should be considered for inclusion in OpenGL.

Also, a “couple of megabytes” might not make much of a difference on Windows, but if you’re using an embedded system (OpenGL ES?), a megabyte is fairly large.

But you wouldn’t have maps with 1,000,000 vertices either, so the savings would at best be a couple of KILObytes - and then again, on recent embedded systems that’s quite acceptable.


The only facts we have now are that it does save memory, and that there are others who would like such a feature. IMHO, because of this alone it should be considered for inclusion in OpenGL.

Unfortunately, speaking from my experience, the industry doesn’t work that way. Just because you and two other guys want this implemented, it’s very unlikely that someone at nVidia or ATI would say: “Hey guys, let’s do that!” In recent years, it has pretty much gone the other way round. Now, if you could persuade Carmack to push the hardware vendors for that feature, chances are we’d have it on next-gen hardware :slight_smile:

EDIT: @ Zulfiqar Malik

I did some more tests, since I just wanted first-hand information. I opened some Quake3 levels with a hex editor and looked up the size of the vertex lump. An average-sized Q3 map had ~70,000-80,000 vertices, but I even found one used for benchmarking (chartres) that had ~175,000. That’s the vertex data BEFORE the curved surfaces get tessellated, so we can safely assume that most of them belong to the low-detail BSP model = high vertex duplication ratio.

Oh yes, and we’re talking about a game that’s 5 years old…
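(For anyone who wants to repeat that check without a hex editor, a sketch like the following prints the vertex lump size. It assumes the stock IBSP version 46 layout: 4-byte magic, 4-byte version, 17 directory entries, vertices in lump 10, 44 bytes per vertex. Treat those constants as assumptions and verify them against your own files.)

```cpp
// Sketch of the same check done in code instead of a hex editor.
// Layout constants below are assumptions about the common Quake 3 IBSP format.
#include <cstdint>
#include <cstdio>

struct Lump      { std::int32_t offset, length; };
struct BspHeader {
    char         magic[4];   // expected to be "IBSP"
    std::int32_t version;    // 46 for Quake 3
    Lump         lumps[17];
};

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s map.bsp\n", argv[0]); return 1; }

    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    BspHeader h;
    if (std::fread(&h, sizeof h, 1, f) != 1) { std::fclose(f); return 1; }
    std::fclose(f);

    const int vertexLump = 10;   // "Vertexes" in the usual lump numbering
    const int vertexSize = 44;   // position + 2 UV sets + normal + color
    std::printf("vertex lump: %d bytes, ~%d vertices (before patch tessellation)\n",
                h.lumps[vertexLump].length,
                h.lumps[vertexLump].length / vertexSize);
    return 0;
}
```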

Originally posted by HellKnight

As you can easily see, we’re talking about ~100,000 triangles PER FRAME. OK, there are these large outdoor scenes in FarCry, but it’s just an example. Now guess the total polygon count per map.

100,000 Tris/Frame is nothing for an outdoor scene (considering next-gen engines). The engine that I am currently writing is capable of rendering 2 MTris/frame at 60 FPS with texturing and interpolated per-vertex lighting (using one directional light)! That’s a good 110 MTris/s, and I didn’t have to duplicate even a single vertex! Indoor maps are a different story altogether. The whole idea of per-pixel lighting and other detail-preservation techniques was to give sufficiently good detail at lower polygon counts. If you are going with such high poly counts, you might as well do displacement mapping, but even then you can utilize the hardware for that!
After bpoint’s last post I assume that he is talking about something completely different, so there’s no point in speculating. Let him come up with a good enough example for review.

But:

Yes, several indices might save memory. However, I think the pre/post-T&L cache usage becomes much worse, because now the GPU has to check every single index to see whether it is in the cache, and only if ALL of them are in the cache can the complete result be reused. Now that a shader is nearly always in use, you can’t just say, “hey, the position index is the same, so the resulting position will be the same, too”.

So, speed-wise, I think this will only be another burden.

Zulfiqar Malik: Sorry, I didn’t want to address you specifically; with “you” I meant everybody who was talking about BSP trees, because I think that is a really bad example these days.

Jan.

I think that hardware will soon become so generic and flexible that there will be no real difference between texture data, vertex data, etc… At that point, you could just create a buffer storing indices to whatever you want, and then fetch the values in the vertex shader (you can already do this with SM3.0). Of course, if you told me to do this today, I’d say ahh, that’s gonna be too slow. But in the future? Who knows :slight_smile:

jm2c,

Andras

Doesn’t the DX10 specification that MS is trying to push on vendors for Vista/Longhorn allow for indexing separate attributes?

Oh I posted something about all this in

http://www.opengl.org/discussion_boards/cgi_directory/ultimatebb.cgi?ubb=get_topic;f=3;t=013930

some time ago.

Originally posted by knackered:
(Humus speaks for ATI)
To clarify, I do work for ATI, but not everything I say here necessarily represents an official opinion of the company. Especially if we’re talking about console platforms; I’m not involved in that at all. In this topic I’m only stating my own personal opinion, which may differ from what our hardware guys think.

Originally posted by bpoint:
For both of our GameCube titles, we utilized multiple indices for each vertex attribute because it saved us memory.
But you’re working on a console. Those platforms tend to be tight on memory, but on the other hand have extremely fast buses and/or very fast embedded memory. It may make sense there, if memory footprint is more important than shading speed. The typical console artwork also isn’t comparable to the typical PC artwork. The more high-poly your model is, the fewer saving opportunities you get from multiple indices, since vertex sharing will be greater and the relative ratio of edge vertices vs. smooth vertices gets smaller. I did a quick test in the little engine I’m working on, and sure, I got a decent saving (30%) on the map, but it’s mostly low-poly and per-face data anyway (BSP-style, à la UnrealEd). On the other hand, the gun I modelled in Maya turned out twice as big.

Originally posted by bpoint:
Finally, I’d like to add that there are at least six different ways (glBegin/glEnd, glArrayIndex, glDrawElements, etc) to draw vertex data in OpenGL. Why not one more? :slight_smile:
If anything, we need fewer ways to draw. glArrayElement should go away, and glBegin/glEnd should move into GLU or something. The only functions needed in the core API would be glDrawArrays and glDrawElements.

Originally posted by bpoint:
I just have to manage multiple sets of data for multiple platforms.
You only need one set of data. Generating a single-index stream from multiple index streams is a piece of cake with a kd-tree, and it can be done in O(n log n) time, which should be quick enough to do at load time. That’s what I do myself, actually, since multiple indices are more convenient for preprocessing model data.
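A minimal sketch of that remapping, using a std::map keyed on the index tuple rather than a kd-tree (my own simplification, not necessarily what Humus does); every distinct (position, normal, uv) index combination becomes one welded vertex:

```cpp
// Collapse several per-corner index streams into a single index stream.
// The map-based lookup is an assumption for illustration; it is O(n log n).
#include <array>
#include <cstdint>
#include <cstdio>
#include <map>
#include <vector>

int main() {
    // Invented per-corner index streams (one entry per triangle corner).
    std::vector<std::uint32_t> posIdx = { 0, 0, 1, 1, 2, 0 };
    std::vector<std::uint32_t> nrmIdx = { 0, 1, 1, 1, 0, 0 };
    std::vector<std::uint32_t> uvIdx  = { 2, 2, 3, 3, 4, 2 };

    std::map<std::array<std::uint32_t, 3>, std::uint32_t> weld; // tuple -> single index
    std::vector<std::array<std::uint32_t, 3>> gather;  // which source elements each
                                                       // welded vertex copies from
    std::vector<std::uint32_t> singleIdx;              // the resulting index stream

    for (std::size_t i = 0; i < posIdx.size(); ++i) {
        const std::array<std::uint32_t, 3> key = { posIdx[i], nrmIdx[i], uvIdx[i] };
        auto it = weld.find(key);
        if (it == weld.end()) {
            it = weld.emplace(key, static_cast<std::uint32_t>(gather.size())).first;
            gather.push_back(key);   // later: emit pos[key[0]], nrm[key[1]], uv[key[2]]
        }
        singleIdx.push_back(it->second);
    }

    std::printf("%zu corners -> %zu welded vertices\n", posIdx.size(), gather.size());
    return 0;
}
```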

Originally posted by Humus:
[quote]Originally posted by bpoint:
Finally, I’d like to add that there are at least six different ways (glBegin/glEnd, glArrayElement, glDrawElements, etc.) to draw vertex data in OpenGL. Why not one more? :slight_smile:
If anything, we need fewer ways to draw. glArrayElement should go away, and glBegin/glEnd should move into GLU or something. The only functions needed in the core API would be glDrawArrays and glDrawElements.
[/QUOTE]Generally, I agree with reducing the number of drawing functions in the API (there was supposed to be a “pure” version of GL2.0, without all the 1.x legacy, no?), but we still need immediate mode for pseudo instancing, so we can change certain attributes between draw calls.

Originally posted by andras:

[…] but we still need immediate mode for pseudo instancing, so we can change certain attributes between draw calls.

Oh yes, I wouldn’t like to rewrite my font rendering lib either (which, for simplicity, uses immediate mode). If I used VBOs instead, the whole procedure of rebinding the vertex buffer objects would probably take more time - just for the couple of characters I’m drawing every frame.

You only need one set of data. Generating a single-index stream from multiple index streams is a piece of cake with a kd-tree, and it can be done in O(n log n) time, which should be quick enough to do at load time. That’s what I do myself, actually, since multiple indices are more convenient for preprocessing model data.
Yes but, to be blunt, you don’t make games. You make demos.

Games, particularly cross-platform console games, don’t have the luxury of doing O(n log n) work even at load time. More and more games (and this will be far more true on next-gen platforms) will be based on streaming, so load time is gameplay time. As such, the closer you can get the on-disc data to the in-memory representation of that data, the better off you are.

Then again, it is better to have each platform have its own final version of the data anyway, just to make loading work faster.

Originally posted by Korval:
Yes but, to be blunt, you don’t make games. You make demos.
And I work with several of the world’s top game developers. Game requirements are not an unknown field for me.

Originally posted by Korval:
Games, particularly cross-platform console games, don’t have the luxury of doing O(n log n) work even at load time.
O(n log n) grows close to O(n) once you reach a fair amount of data, since log(n) barely grows. It’s quite realistic to do this work at load time. My implementation, which is still quite generic, indexes up a 1.4MB vertex array in 0.0026s. At that speed you can process in excess of 100MB in a fraction of a second.

Originally posted by Korval:
More and more games (and this will be far more true on next-gen platforms) will be based on streaming, so load time is gameplay time. As such, the closer you can get the on-disc data to the in-memory representation of that data, the better off you are.
That’s a false assumption in the general case. Far more commonly, the smaller the data set on disc, the faster the load time. Normally it’s not the CPU processing that’s the limitation, but the read speed from the HD or, even worse, from the CD. A simple compression scheme will likely improve your load time.