Memory and Speed issues

greggman · August 18, 2004, 8:52pm

continuing from a comment in the design thread…

Everything is peachy until you actually start making real data. Real data = millions of polygons. For example, Jak and Daxter’s levels are millions of polys, as are the levels of Unreal 3.

What that means in terms of Collada is that using an ASCII-based format for vertex positions, UVs, colors, weights and normals may not be a good idea. Maybe a well-defined CDATA format would be better in those cases.

It can also mean issues for XML parsers and the design of the format. Since most XML parsers store lots of extra info PER internal element, any schema which ends up with thousands or tens of thousands of elements should be avoided. A simple example would be if every vertex was a separate element. That’s not the case with Collada so far.

marcus · August 19, 2004, 11:24am

The COLLADA schema will certainly grow in the number of element types and attributes as we add features. I don’t foresee an explosion of elements though because the design is both generic and parametric at the level of containing data blocks.

I expect to see reuse of the <param>, <source> and <array> elements and not the introduction of domain specific elements to hold data.

greggman · August 19, 2004, 8:19pm

I wasn’t suggesting more types of elements. In general I was suggesting 2 things. One that for example

Be changed or at least optionally be allowed to be something like

Where “$%%$$#”%"#$!$"$%$%&"" is the binary representation of those 5 floats in some format specified by Collada. Most likely the standard PC float format in PC byte order since that’s the most likely place for this data to be used.

Because while parsing 5 ASCII representations of floats and converting them to real floats is not a big deal, parsing 5 million of them is. If the format allowed this binary option, and all the exporters supported it either by default or by option, then when parsing an array I would know the size “5” and I would know the binary representation of the type, because it would be specified in the Collada spec. Once I hit <![CDATA[ I would instantly know exactly how many bytes to read out of the file, and I could load them directly into memory and use them instantly.
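To make the difference concrete, here is a minimal Python sketch of the two loading paths for the "5 floats" case. The little-endian 32-bit float layout is the one proposed in the post; the sample values and variable names are made up for illustration and are not part of any Collada spec.

```python
import struct
import sys
from array import array

values = [0.0, 1.5, -2.25, 3.125, 4.0]  # illustrative data only

# Text path: every value has to be tokenised and converted one at a time.
text_payload = " ".join(repr(v) for v in values)
parsed_from_text = array("f", (float(tok) for tok in text_payload.split()))

# Binary path: the bytes are already little-endian 32-bit floats,
# so they can be copied straight into a float array.
binary_payload = struct.pack("<5f", *values)
parsed_from_binary = array("f")
parsed_from_binary.frombytes(binary_payload)
if sys.byteorder == "big":
    # array uses native byte order, so a big-endian host must swap.
    parsed_from_binary.byteswap()

print(len(text_payload), "bytes of text vs", len(binary_payload), "bytes of binary")
```

The text path does work per value, while the binary path is a single bulk copy whose size is known up front from the element count.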

Given that polygon counts are going to go up at least an order of magnitude for the next gen, I think it would be a good idea to optimize this format, while it’s still possible, so that it is better aimed at large data sets.

This wouldn’t sacrifice any usefulness or genericness as far as I can see, but it would make it possible to use the data faster and bring conversion times down.

The other thing I was suggesting is to try to avoid, where possible, massively repeating elements. The spec already does this as far as I can tell. I was just pointing it out because I wasn’t sure if that was intentional or just luck.

For example IF the vertex format was something like

or worse

That would end up being hugely expensive to parse, and most XML parsers would choke on it once the file sizes got really large, since every single vertex would be stored in a separate element structure in the internal parse tree.
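As a rough illustration of the per-element cost, the Python sketch below builds two toy documents and counts the element nodes a tree-building parser has to allocate for each. The element and attribute names (`mesh`, `v`, `array`, `count`) are hypothetical stand-ins chosen for this sketch; they are not the examples from the post and not actual Collada markup.

```python
import xml.etree.ElementTree as ET

N = 1000  # number of vertices in this toy example

# Hypothetical layout A: one element per vertex.
per_vertex = "<mesh>" + "".join(
    f'<v x="{i}.0" y="0.0" z="0.0"/>' for i in range(N)
) + "</mesh>"

# Hypothetical layout B: a single array element holding every value as text.
flat = '<mesh><array count="%d">%s</array></mesh>' % (
    3 * N,
    " ".join(f"{i}.0 0.0 0.0" for i in range(N)),
)

def count_elements(xml_text):
    """Count the element nodes created when the document is parsed into a tree."""
    return sum(1 for _ in ET.fromstring(xml_text).iter())

nodes_per_vertex = count_elements(per_vertex)  # root plus one node per vertex
nodes_flat = count_elements(flat)              # root plus one array node
print(nodes_per_vertex, nodes_flat)
```

The node count in layout A grows with the vertex count, while layout B stays constant no matter how large the array gets.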

gabor_nagy · August 20, 2004, 11:36am

As a reference:
on a fairly average PC, the EQUINOX-3D COLLADA importer takes less than 3 seconds to read a 526338-triangle terrain model.
The file has more than 5.5 million floats (vertex array + normals) and about 1.6 million ints.

gabor_nagy · August 20, 2004, 11:48am

Luck, huh?

It is very intentional. We actually had to fight some forces that wanted more verbose vertex representations…

remi · August 20, 2004, 5:13pm

What that means in terms of Collada is that using an ASCII-based format for vertex positions, UVs, colors, weights and normals may not be a good idea.

You can already store this information in the most compact binary form you want by using external references in COLLADA if you need to.

The main issue is with the way the data is segmented, and with having the capability of dynamic paging and/or partial update in the game engine and content tools. This provides several orders of magnitude of speed improvement, while speeding up floating point loading will only marginally help.

Panajev2001a · August 21, 2004, 6:44am

Which would mean that he would have to write again a partial exporter.

Am I right?

I fear that he would not be the only developer who would want to do that and then decide they do not need COLLADA.

Making everyone happy is not possible, but an industry standard should aim to please at least a large chunk of developers to assure acceptance.

greggman · August 21, 2004, 8:28am

The Unreal 3 site claims 100 million polys in their outdoor levels so at 3 seconds for 0.5 million polys that would take 10 minutes to load.
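The estimate can be checked with a few lines of Python. This assumes load time scales linearly with polygon count, which is a simplification, and uses the figures quoted in the thread (a 526338-triangle model in under 3 seconds).

```python
reference_triangles = 526_338   # terrain model quoted earlier in the thread
reference_seconds = 3.0         # "less than 3 seconds", so an upper bound
target_polys = 100_000_000      # figure attributed to the Unreal 3 site

# Assumption: load time grows linearly with the number of polygons.
seconds = target_polys / reference_triangles * reference_seconds
minutes = seconds / 60
print(f"about {minutes:.1f} minutes")
```

With the exact triangle count the result comes out slightly under the rounded 10 minutes.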

I’m not suggesting any less genericness. I’m only suggesting a simple optimization for fairly standard types. Arrays of bits, arrays of ints, arrays of floats and possibly arrays of 2 floats, 3 floats and 4 floats if just arrays of floats doesn’t cover that. I wouldn’t give up on collada if it wasn’t added but I guess it just seemed like a pretty simple thing to ask for, it didn’t seem to me like it would really break anything and it would speed up things to some small extent.

os1 · August 23, 2004, 5:36am

Reality check - if our current-gen PS2 Maya scenes loaded in 10 minutes, we’d be over the moon.

gabor_nagy · August 26, 2004, 6:51pm

That’s the source data, before the detail bump-map generation! That’s not what you see in the runtime.

Thanks OS. The same scene (if it’s the swamp) loads in ~5 seconds with the above mentioned setup (that’s the non-truncated Collada version, including reading all the textures from TIFFs!).
I’m sure that’s not all the data, but hey it’s XML and Maya has a binary format…

A binary format wouldn’t be orders of magnitude faster, unless it’s a direct representation of the internal format of a specific runtime (=non-interchangeable) and doesn’t deal with issues such as byte-order independence.
As a reference: my binary format is only about 2-3x faster than Collada.

Also, I doubt that they read the whole scene into Unreal all at once.
That would be 3.5GB with triangles that have only positions per vertex (no normals, texcoords etc.), less with triangle strips, but still…
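For reference, the arithmetic behind that figure, assuming 100 million independent triangles with three vertices each and three 4-byte floats per vertex:

```python
triangles = 100_000_000
vertices_per_triangle = 3
floats_per_vertex = 3      # x, y, z positions only
bytes_per_float = 4

bytes_total = triangles * vertices_per_triangle * floats_per_vertex * bytes_per_float
gib = bytes_total / 2**30
print(bytes_total, "bytes, about", round(gib, 2), "GiB")
```

That is 3.6 billion bytes, in the same ballpark as the 3.5GB quoted.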

greggman · August 26, 2004, 11:12pm

As a reference: my binary format is only about 2-3x faster than Collada.

Only 2 to 3 times faster? Hmmm, most teams would kill to have a 10% speed increase in any part of their system and you’re sneezing at 200-300%???

byte-order independence

Is a non-issue. Define what it is in the spec and the issue is solved.

I would be willing to make a bet that by the end of the cycle of the next generation you’ll find a text-based format for these large chunks of data to be a bad thing. If it’s such a good thing then I propose we make collada also store the textures. RGBA, one ascii float per channel per pixel. Clearly that’s the argument here. By your standards that would be “fast enough”.

All I’m arguing for is a little foresight. I’m not suggesting a single thing that would make collada less generic, less useful, less anything. I’m only suggesting a simple optimization.

os wrote:
greggman wrote:
The Unreal 3 site claims 100 million polys in their outdoor levels so at 3 seconds for 0.5 million polys that would take 10 minutes to load.

That’s the source data, before the detail bump-map generation! That’s not what you see in the runtime.

Yes, and if I needed to write my own bump-map generator I’d need to export the file to some format, all 100 million polys of it. The proposal on this board is that the format be collada. Why limit its use from day one? I haven’t heard a single actual argument against the optimization. Only that you’re not complaining about the speed today. That’s not an actual argument against the suggestion.

gabor_nagy · August 27, 2004, 5:58pm

Even if the file format is standardized on, say, big-endian, you still need to convert ints, floats, etc. if you are on a little-endian machine (and vice versa). So, I’m not sure what you mean here.
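A small Python sketch of the conversion being discussed, assuming a file format fixed to big-endian 32-bit floats (an assumed convention for this example, not something the Collada spec defines). It shows two ways to load the data: per-value conversion, and a bulk copy followed by an in-place swap when the host byte order differs.

```python
import struct
import sys
from array import array

values = [1.0, -2.5, 3.25]  # illustrative data only

# Bytes as they would sit in a file standardised on big-endian floats.
file_bytes = struct.pack(">3f", *values)

# Option 1: let struct convert each value while unpacking.
via_struct = list(struct.unpack(">3f", file_bytes))

# Option 2: copy the raw bytes in bulk, then swap only if the host is little-endian.
bulk = array("f")
bulk.frombytes(file_bytes)
if sys.byteorder == "little":
    bulk.byteswap()

print(via_struct, list(bulk))
```

On a host whose byte order matches the file, option 2 needs no swap at all; otherwise the swap is one extra pass over the array.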

There are already good standard 2D formats out there…

We’re not arguing with your suggestion (we like suggestions and we’re reviewing it), just trying to put things in perspective:
Collada will not be the bottleneck for a while.
The real bottleneck is the current commercial modelers. Some take as much as 60 (yes, six-zero) times longer to load a file from their own binary format than it takes the reference Collada implementation to load the same scene.
Of course that doesn’t mean that we are not trying to optimize Collada even further.

I’m not convinced that your binary-in-encoded-ASCII float representation would be any faster to read than our optimized float parser.
You’d need to do a lot of this: add (or OR), bit-shift, store on a 4-byte aligned slot, pointer-cast, read from memory etc., instead of just using a float register…
If you’re suggesting inlining the actual binary value of the float, that would of course break XML parsers…
I’d love to see a speed comparison if you’ve done one.
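For readers unfamiliar with what a binary-in-encoded-ASCII representation involves, here is a minimal Python sketch using hex text as the encoding. Hex is only one possible choice and is an assumption of this sketch; the thread does not say which encoding was meant. The hand-written decoder spells out the OR and bit-shift steps described above.

```python
import struct

values = [0.5, -1.25, 1024.0]  # illustrative data only

# Encode little-endian 32-bit floats as hex text (8 hex digits per float).
hex_text = struct.pack("<3f", *values).hex()

# Library route: hex text -> bytes -> floats.
decoded = list(struct.unpack("<3f", bytes.fromhex(hex_text)))

def decode_one(chunk):
    """Rebuild one float from 8 hex digits using shifts and ORs."""
    bits = 0
    for i in range(0, 8, 2):
        byte = int(chunk[i:i + 2], 16)
        bits |= byte << (8 * (i // 2))  # first byte is the lowest (little-endian)
    # Reinterpret the 32-bit integer pattern as a float.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

by_hand = [decode_one(hex_text[i:i + 8]) for i in range(0, len(hex_text), 8)]
print(decoded, by_hand)
```

Whether this beats an optimized decimal float parser would need measuring; the sketch only shows the steps, not their cost.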

greggman · August 29, 2004, 7:13pm

Yes, I’m suggesting inlining the actual binary value of the float, int, etc. It would not break any XML parsers. XML supports the CDATA format exactly for the purpose of storing binary data inside the XML file.

gabor_nagy · August 30, 2004, 11:26am

You can’t put arbitrary bytes in CDATA, because, for example a ‘<’ character will make a parser think that it’s the beginning of an XML tag.
We had this issue when inlining Cg shaders.
This line can’t be in a CDATA section (and it’s not even binary data!):
<code>
…
LDiffuse = LDiffuse < 0.0 ? 0.0 : LDiffuse;
…
</code>

These 5 characters must be “escaped” in XML content and here’s the encoding:

&lt; < less than
&gt; > greater than
&amp; & ampersand
&apos; ' apostrophe
&quot; " quotation mark

See for example:
http://www.fawcette.com/vsm/2002_11/onl … _11_05_02/

So the example would look like this in the file:
<code>
…
LDiffuse = LDiffuse &lt; 0.0 ? 0.0 : LDiffuse;
…
</code>

Fortunately XML libraries like LibXML do the encoding/decoding for you, but there is a slight overhead (in addition to the conversion overhead I mentioned earlier) and I’m not sure they could handle binary data there.
Not to mention that text/XML editors would go crazy if you tried to edit a partially binary file.
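The escaping step can be sketched with Python's standard library instead of LibXML. `xml.sax.saxutils.escape` handles `&`, `<` and `>` by default; the quote characters are passed as extra entities.

```python
from xml.sax.saxutils import escape, unescape

line = "LDiffuse = LDiffuse < 0.0 ? 0.0 : LDiffuse;"

# Escape the markup-significant characters for use in element content.
escaped = escape(line)

# Round-trip back to the original shader line.
restored = unescape(escaped)

# Quotes are not escaped by default; supply them as extra entities.
quoted = escape("a\"b'c", {'"': "&quot;", "'": "&apos;"})

print(escaped)
print(quoted)
```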

Regards,

Gabor

greggman · August 30, 2004, 8:32pm

You can’t put arbitrary bytes in CDATA, because, for example a ‘<’ character will make a parser think that it’s the beginning of an XML tag.

That’s a bug in your XML parser

From the XML spec

2.7 CDATA Sections
[Definition: CDATA sections MAY occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string “<![CDATA[” and end with the string “]]>”:]

CDATA Sections
[18] CDSect ::= CDStart CData CDEnd
[19] CDStart ::= ‘<![CDATA[’
[20] CData ::= (Char* - (Char* ‘]]>’ Char*))
[21] CDEnd ::= ‘]]>’

Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using “&lt;” and “&amp;”. CDATA sections cannot nest.

An example of a CDATA section, in which “<greeting>” and “</greeting>” are recognized as character data, not markup:

<![CDATA[<greeting>Hello, world!</greeting>]]>

If you wanted your shaders not to need to be escaped all you needed to do was change them to this


<code> 
<![CDATA[
LDiffuse = LDiffuse < 0.0 ? 0.0 : LDiffuse; 
]]>
</code>

Then no escaping is necessary. You’ll see lots of examples of this in RSS syndications. Example
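For the shader snippet above, a quick check with a standard parser (Python's `xml.etree.ElementTree` here, used as a stand-in for whatever XML library a tool might use) shows the unescaped `<` coming back intact from inside a CDATA section:

```python
import xml.etree.ElementTree as ET

doc = "<code><![CDATA[\nLDiffuse = LDiffuse < 0.0 ? 0.0 : LDiffuse;\n]]></code>"

# The parser hands back the CDATA content as ordinary element text.
shader = ET.fromstring(doc).text.strip()
print(shader)
```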

gabor_nagy · August 31, 2004, 1:25pm

Thank you for the reference, but I think it’s easier to “escape” 5 individual characters than keep checking that you don’t have the CDATA terminator in your string…

I still wouldn’t want to inline binary data in an XML file. Besides the fact that it just feels dirty, it would make the file un-editable by most text editors.
However unlikely, it is possible that arbitrary binary data would produce the CDATA terminating sequence.
When doing recursive "grep"s, I have many times found matching “strings” in random data segments of binary files…
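Both hazards are easy to reproduce. The Python sketch below first packs a 32-bit integer whose little-endian bytes happen to spell the CDATA terminator, then feeds a CDATA section containing raw float bytes to a standard parser (`xml.etree.ElementTree`, which is built on expat). The second part relies on the XML 1.0 `Char` production, which excludes most control characters such as 0x00, so the rejection is independent of the terminator question. The element name `array` is just a placeholder.

```python
import struct
import xml.etree.ElementTree as ET

# 0x5D is ']' and 0x3E is '>', so this integer packs to b"]]>\x00".
unlucky = struct.pack("<I", 0x003E5D5D)
terminator_inside = b"]]>" in unlucky

# Raw little-endian floats contain 0x00 bytes, which are not legal XML characters.
payload = struct.pack("<2f", 1.0, 2.0)
doc = b"<array><![CDATA[" + payload + b"]]></array>"
try:
    ET.fromstring(doc)
    parsed_ok = True
except ET.ParseError:
    parsed_ok = False

print(terminator_inside, parsed_ok)
```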

Even with strings, it’s a funny thing. Let’s say you want to include an “Essay about XML” in a CDATA section.
Of course the strings for starting and ending a CDATA section would be “taboo”, so you’d be referring to them as “you know, that string, I can’t say it here, but I’ll spell it…”.

The only bullet-proof way I know for storing arbitrary byte sequences directly is in a binary file and with the length of the sequence defined/stored…
Maybe you can enlighten me.

Regards,

Gabor

greggman · August 31, 2004, 8:13pm

Maybe you need to go back and read the original proposal.

I’m not suggesting you can’t use text. I’m suggesting the default be binary for speed reasons but that text would still work.

I’m also not suggesting you have to search for the end of the CDATA. I’m suggesting that you know the format, since it would be defined in the standard, and you know the number of elements in the piece in question, since that would also be defined in the standard. So if you know, for example, that you are reading an array marked as binary of 6000 ints, then once you see <![CDATA[ you can read exactly 24000 bytes directly into an array of ints. If you get another array marked as binary with 107237 floats then again, once you see <![CDATA[ you know you can read 428948 bytes directly into memory. At that point you can check if the next byte sequence is ]]>. If it’s not, your XML is malformed.
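A minimal Python sketch of that reading scheme, under stated assumptions: items are 4-byte little-endian values, the count is known before the payload is reached, and the wrapper markup (`array`, `count`, `encoding`) is invented for this example. It is a hand-rolled reader, not something an off-the-shelf XML library does.

```python
import struct
import sys
from array import array

MARKER = b"<![CDATA["
END = b"]]>"

def read_binary_array(buf, start, count, typecode="i"):
    """Read `count` 4-byte items following the first CDATA marker at or after `start`."""
    begin = buf.index(MARKER, start) + len(MARKER)
    nbytes = count * 4  # assumes 4-byte ints/floats, true on common platforms
    out = array(typecode)
    out.frombytes(buf[begin:begin + nbytes])
    if sys.byteorder == "big":
        out.byteswap()  # payload is assumed little-endian
    # No scanning for the terminator: just check it sits where it should.
    if buf[begin + nbytes:begin + nbytes + len(END)] != END:
        raise ValueError("malformed: expected ]]> after binary block")
    return out

# Hypothetical file fragment holding 6000 ints (24000 bytes of payload).
buf = (b'<array count="6000" encoding="binary">' + MARKER
       + struct.pack("<6000i", *range(6000)) + END + b"</array>")
ints = read_binary_array(buf, 0, 6000)
print(len(ints), "ints read from", 6000 * 4, "bytes")
```

The sizes in the post follow the same rule: 6000 × 4 = 24000 bytes and 107237 × 4 = 428948 bytes.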

If you want to edit your arrays in a text editor then you can run it through some tool that spits it back out as text or pick “export as text” in your exporter. That will let you have your text but still have the default be much faster to load the file.

The company that has the largest level sizes to date, Naughty Dog, refused to use a text format in the past precisely because they felt with their data being so big and always getting bigger that that would be an issue. Supporting a binary option would address similar and well grounded fears.

I feel like this issue is one of those things like using 2 digits for the year or 32 bits for the number of seconds since Jan 1st, 1970. It sounds good at the time since you assume someone will fix it later. Like you assume parsing will be fast enough or you assume CPUs will get faster. Of course that’s never been true in the past. We always manage to need more power than we have and we end up having to do stuff to compensate. Why work around this later? Make it fast now and the problem goes away.

As far as I can tell your only real objection is “ew, that sounds icky to me”.

The ability to edit these files in a text editor should be a non-issue. You shouldn’t be editing middle format files because your edits will get lost every time your artists re-export. Being able to edit as text is great for getting it all working or testing out simple ideas or looking for bugs but that shouldn’t be the deciding factor since 99.999% of the time that’s not the point.

gabor_nagy · September 1, 2004, 11:42am

Unfortunately an XML library would not know about this construct, so this would eliminate the major advantage of using a standard format (XML) and force people to implement the whole XML parser from scratch and hard wire it for Collada.

While it’s not rocket science (I sure have written my share of XML parsers), many people like to use libXML2, because it does a big chunk of the work.

This sounds an awful lot like Microsoft’s “embrace and extend” scheme which you can’t possibly be advocating (look what they did to Java…)…

If a compliant XML parser can’t parse the file with 100% certainty, in my mind it’s not an XML file.

For example, you would not be able to look at the file in Mozilla’s graphical XML viewer which we use quite frequently to examine the structure of a file (you can collapse elements etc.).

This is what I meant by “dirty”. No, it is not a “feeling”. It means that it is technically unacceptable, because it breaks things and it is unreliable.

So, do we agree that straight binary in XML is bad?

If so, we can go back to the ASCII encoding…

greggman · September 1, 2004, 9:54pm

so, do we agree that straight binary in XML is bad?

No, we don’t agree. CDATA was designed specifically to allow binary in XML so suggesting putting binary in XML is suggesting we use XML as it was designed to be used.

Maybe I’m missing something but so far, the way the Collada schema is designed, when I use any XML library I’ve used to date and I look up an array element from an XML file, all I’m going to get is a pointer to a large string which I then have to manually parse on my own. Under my suggestion, that string would be binary data which I could copy directly into an array of whatever type the array is.

Or does libxml2 actually convert an array to a contiguous array of floats for you? If not, there is no difference between a proprietary string of ascii data and a proprietary piece of binary data as far as an off-the-shelf XML lib is concerned.

As for your comment that binary would break looking at the files in Mozilla: don’t you think first and foremost the concentration should be on whether or not the format facilitates making games, not on whether or not it can be looked at in some non-game-related browser?
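To illustrate the point about what a generic XML library hands back, here is a sketch using Python's `xml.etree.ElementTree` rather than libxml2 (an assumption made so the example is self-contained). The `count` attribute and the sample values are made up for illustration.

```python
import xml.etree.ElementTree as ET
from array import array

doc = '<array count="5">1.0 2.0 3.5 -4.0 0.25</array>'
node = ET.fromstring(doc)

# The library's job ends here: the array content is one opaque string.
raw = node.text

# Turning it into numbers is up to the application.
floats = array("f", map(float, raw.split()))
print(type(raw).__name__, list(floats))
```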

gabor_nagy · September 2, 2004, 12:05pm

Well, designs are sometimes flawed…
Unfortunately it is not fully reliable, as many people in the field know:
http://builder.com.com/5100-6374-1050529.html
http://webservices.xml.com/pub/a/ws/200 … oints.html
etc. (just do a Web search… )

As I said, you need to know the length of a binary chunk to parse/skip it.
The terminating sequence scheme would only work if CDATA would let ME (or the XML library) specify it when I save the data. That way I could pick a sequence that my binary chunk does not contain for sure.
I guess the designers of XML did not think of this.

You could still save the float array in binary format in an external file. If that file only contains raw floats, we don’t need to invent a new binary sidekick to Collada.
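A minimal sketch of the external-file route in Python. The file name `positions.bin` and the float count are hypothetical; in practice the XML document would carry the reference and the count. The file holds nothing but raw 32-bit floats in the writer's native byte order, so a real pipeline would still have to agree on byte order.

```python
import os
import tempfile
from array import array

positions = array("f", [0.0, 1.0, 0.0, 1.0, 1.0, 0.0])  # illustrative data only

# Write the raw floats to a side file (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "positions.bin")
with open(path, "wb") as f:
    positions.tofile(f)

# Reader side: the count comes from elsewhere, so the read size is known up front.
count = 6
loaded = array("f")
with open(path, "rb") as f:
    loaded.fromfile(f, count)

print(os.path.getsize(path), "bytes on disk,", len(loaded), "floats loaded")
```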