Vertex Array Error

Zulfiqar Malik

During my own tests I’ve found that ubytes passed as COLOR are fast, about 30% faster than pure floats, and with 3 floats plus 4 ubytes the vertex ends up 16 bytes, which is a little better than a pure-float 28-byte vertex (for example, for rendering stars).

There is a weird issue with ubytes as normals, texcoords and so on, but as colors and vertex attributes (normalized ubytes) they are a miracle for saving memory and bandwidth.
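For illustration, the packed layout I mean looks roughly like this (just a sketch; the struct and function names are made up for this example):

#include <GL/gl.h>

// 3 floats position + 4 normalized ubytes color = 16 bytes per vertex
struct StarVertex
{
    GLfloat x, y, z;      // 12 bytes
    GLubyte r, g, b, a;   // 4 bytes, mapped to [0..1] by GL when used as a color
};

void setStarPointers(const StarVertex *v)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_COLOR_ARRAY);
    glVertexPointer(3, GL_FLOAT, sizeof(StarVertex), &v[0].x);
    glColorPointer(4, GL_UNSIGNED_BYTE, sizeof(StarVertex), &v[0].r);
}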

Thanks,
Jackis

Hmmmm … what hardware did you use? I have tested ubytes with vertex position, normal and texcoord. My bad for not realizing that Humus was indeed talking about color pointers. I have not actually tested them with color, but why would they be any better? Does the driver handle them differently? If yes, then why?
I remember that while I was developing a terrain rendering system, my data (vertex position and normal) were both small enough to fit in bytes and ubytes respectively. But doing so gave such pathetic performance that I had to switch to shorts for position and floats for normals. But I did eventually end up saving memory because I packed more data into the floats (in the mantissa and exponent) than the normal alone, and that allowed me to geomorph terrain vertices in the vertex shader.
It also depends on the magnitude of the test. The throughput of my terrain rendering algorithm was around 55 MTris/s (using triangle lists, that equates to around 165 MVerts/s) on a GeForce FX 5700 Ultra, with a single texture and per-vertex lighting from one directional light.

Ooops … I meant 165 MIndices/s. Vertices were reused and I can’t remember the exact count, but it was fairly large, enough to be called a valid stress test.


GLuint mycolor; // packed RGBA color

glColor4ubv((const GLubyte *)&mycolor); // need to cast from uint to GLubyte*

// then render the geometry with a VA or VBO (3 floats position + 2 floats texcoord)

Edit:
Also, if you are benchmarking, you need to align your data. I think it was multiples of 32 bytes per vertex for ATI. I think that’s also fine for NV, but I’m not sure what NV officially prefers.
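Something like this, for example (a sketch only; the padding floats are there purely to round the vertex up to 32 bytes):

#include <GL/gl.h>

// 3 floats position + 2 floats texcoord = 20 bytes, padded up to 32
struct AlignedVertex
{
    GLfloat x, y, z;      // 12 bytes
    GLfloat s, t;         //  8 bytes
    GLfloat pad[3];       // 12 bytes of padding -> 32 bytes per vertex
};

void setAlignedPointers(const AlignedVertex *v)
{
    glVertexPointer(3, GL_FLOAT, sizeof(AlignedVertex), &v[0].x);
    glTexCoordPointer(2, GL_FLOAT, sizeof(AlignedVertex), &v[0].s);
}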

Rodrix, before giving up on the arrays, you may want to try locking them before use (glLockArraysEXT). My own tests, though with a different access pattern and data, have in some cases shown over 50% speed improvement with compiled vertex arrays over plain vertex arrays.
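For reference, the pattern is roughly this (a sketch; it assumes EXT_compiled_vertex_array is present and the array pointers are already set up):

#include <GL/gl.h>
#include <GL/glext.h>

// function pointers fetched via wglGetProcAddress / glXGetProcAddressARB
PFNGLLOCKARRAYSEXTPROC   glLockArraysEXT_p;
PFNGLUNLOCKARRAYSEXTPROC glUnlockArraysEXT_p;

void drawLocked(const GLushort *indices, GLsizei indexCount, GLsizei vertexCount)
{
    glLockArraysEXT_p(0, vertexCount);   // hint: vertices won't change until unlock

    // multiple passes can now reuse the same compiled/transformed vertices
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);

    glUnlockArraysEXT_p();
}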

Originally posted by Jackis:
There is a weird issue with ubytes as normals, texcoords and so on, but as colors and vertex attributes (normalized ubytes) they are a miracle for saving memory and bandwidth.
What’s the issue? One issue I can see is that glTexCoordPointer() doesn’t even accept bytes or ubytes. glNormalPointer() should accept bytes though (but not ubytes).

Originally posted by Zulfiqar Malik:
Hmmmm … what hardware did you use? I have tested ubytes with vertex position, normal and texcoord. My bad for not realizing that Humus was indeed talking about color pointers. I have not actually tested them with color, but why would they be any better? Does the driver handle them differently? If yes, then why?
I remember that while I was developing a terrain rendering system, my data (vertex position and normal) were both small enough to fit in bytes and ubytes respectively. But doing so gave such pathetic performance that I had to switch to shorts for position and floats for normals.

For the programmable pipeline all these semantics like “color” and “normal” have little significance. You could just use glVertexAttribPointer() and pass all your data that way, which I would recommend, since this function takes all valid types, whereas the others have various restrictions on what types they accept, inherited from the fixed function pipeline.

If you’re getting really low performance with ubytes, make sure you’re using 4 ubytes and not 3.
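Something along these lines (a sketch using the GL 2.0 entry points; the attribute index, stride and pointer are whatever your vertex layout uses):

#include <GL/gl.h>

// 4 normalized ubytes bound through the generic attribute path;
// works the same no matter which "semantic" the shader gives them
void bindUbyteColor(GLuint attribIndex, GLsizei stride, const GLubyte *colorPtr)
{
    glEnableVertexAttribArray(attribIndex);
    glVertexAttribPointer(attribIndex,
                          4,                 // 4 ubytes, not 3
                          GL_UNSIGNED_BYTE,
                          GL_TRUE,           // normalized: 0..255 -> 0.0..1.0
                          stride,
                          colorPtr);
}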

Originally posted by V-man:
Really???
Isn’t color and secondary color preferred as ubytes?
I thought these were “native” and anything else was not. The only problem (as I thought) was that we had to pass RGBA instead of the MS way of BGRA.

In the past that might have been the case, but in this age of shaders, anything that’s native to one particular attribute can just as well be native to the others. So if you can do floats for normals, you can certainly do them for colors too. There’s no difference at the hardware level between loading a normal and loading a color into the shader. These semantics mean little for the programmable parts and are only relevant at the API level.

Originally posted by Humus

For the programmable pipeline all these semantics like “color” and “normal” have little significance.

That’s what I thought. As for 4 ubytes, I don’t quite remember whether I tried using 4, although alignment must have been on my mind :) back then. I will give it a shot soon.
But, keeping “personal tests” aside, can you tell me with certainty whether ubytes/bytes (aligned or non-aligned) are just as fast as, say, shorts and floats? I am not just talking about R5xx, but R4xx and R3xx (minimum).
Thanks, it’s always good to get “first hand” information :) .

This is becoming an interesting discussion.
I am now testing unsigned bytes for the color arrays. It took me more than an hour to change all my code (color fading, alpha fading, many features for my particles, etc.), but I want to know I am on the right track before I move on.

Humus said: If you’re getting really low performance with ubytes, make sure you’re using 4 ubytes and not 3.
Is this what you meant:
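glEnableClientState(GL_COLOR_ARRAY);
glColorPointer(4, GL_UNSIGNED_BYTE, 0, colors);  // 4 ubytes (RGBA) per color instead of 3; 'colors' is my tightly packed ubyte array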

Rodrix, before giving up on the arrays, you may want to try locking them before use (glLockArraysEXT). My own tests, though with a different access pattern and data, have in some cases shown over 50% speed improvement with compiled vertex arrays over plain vertex arrays.
Tamlin, I didn’t know that. I am checking on that too! Will give feedback as soon as I get it running. :) Thanks!

Originally posted by Zulfiqar Malik:
But, keeping “personal tests” aside, can you tell me with certainty whether ubytes/bytes (aligned or non-aligned) are just as fast as, say, shorts and floats? I am not just talking about R5xx, but R4xx and R3xx (minimum).
Thanks, it’s always good to get “first hand” information :) .

It should be at least as fast, as long as it’s properly aligned and you use 4 components. Not too long ago I changed the font drawing in the ATI SDK framework to use ubytes for position, and it’s just as fast on my laptop (Mobility 9700) as it was when I used floats.

Originally posted by Rodrix:
Is this what you meant:
Yes

Zulfiqar Malik:
Hmmmm … what hardware did you use? I have tested ubytes with vertex position, normal…

Table 2.4, p. 25 of the OpenGL 2.0 Specification lists which sizes and data types each of the classic gl*Pointer calls accepts.

Humus:
For the programmable pipeline all these semantics like “color” and “normal” have little significance.

In practice it can be of significance, though. I remember that I used to get terrible performance when using bytes/ubytes with generic attribute 0, and that was on some rather recent hardware, like NV40 or so. Also, I think I have read somewhere that some of the attributes can be interpolated with lower precision, or may not even have some of the components (like secondary color’s alpha). So I guess that in such cases the internal representation does matter, and specifying the attribute with a different data type can carry a potential performance hit.

Hello, sorry for coming back to this so late, and for my English ))

I’ve experimented with integer data types only on nVidia hardware.
As Humus said, the spec’s limitations don’t allow us to use conventional vertex arrays with integral types in a simple manner, but using vertex attributes removes this restriction.

Let me say a few words, not about unsigned bytes, because that case is simple, but about unsigned shorts, which are an alternative to the GLhalfNV data type on older nVidia hardware. This is my conversation on the nVidia dev forum; I hope posting it here is legal )))

=== Jackis
Hello!
Everybody knows that OpenGL allows us to bind integer per-vertex attribs, and it will treat them as floats in the vertex shader.
That is very useful for packing; for example, everybody stores per-vertex colors as unsigned bytes, and some people even pack normals into bytes.
There is a parameter in the description of glVertexAttribPointer() called ‘normalized’. It controls the treatment of integers: should they be mapped to the [0…1] interval, or left as is.
But when I bind short integers as an attribute, performance drops through the floor. So it is clear that the nVidia driver does this remapping by hand with CPU power, not on the GPU.
OK, I said, let’s look into this. And I actually found that GL_UNSIGNED_SHORT with normalization on or off is done in software, GL_SHORT with normalization on is also done in software, and ONLY GL_SHORT with normalization off is done on the GPU without any slow-downs!
So, does anybody have advice? Can I believe in a happy future? I don’t think it is that hard to implement, because UNSIGNED_BYTEs are mapped fast.
Thanks in advance!

=== Simon Green
I don’t think we support shorts as a vertex type natively in hardware. Use bytes, floats or half-floats.
Next generation hardware may be more flexible in this regard, but I wouldn’t count on it.

=== Jackis
Thanks, Simon!!!
Actually, you do support shorts )) But only signed and not normalized, so I have to normalize them in the shader. Ints are not supported at all, neither signed nor unsigned, neither normalized nor unnormalized )))
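To sum up, the only combination that stayed on the GPU for me looks roughly like this (a sketch; the names and the scale constant are just an example of my packing):

#include <GL/gl.h>

// GL_SHORT with normalized = GL_FALSE, scaling done in the vertex shader instead
void bindShortPosition(GLuint attribIndex, const GLshort *positions)
{
    glEnableVertexAttribArray(attribIndex);
    glVertexAttribPointer(attribIndex, 4, GL_SHORT, GL_FALSE, 0, positions);
}

// corresponding vertex shader fragment (GLSL), normalizing by hand:
const char *vsSnippet =
    "attribute vec4 packedPos;                                   \n"
    "void main() {                                               \n"
    "    vec4 pos = vec4(packedPos.xyz * (1.0 / 32767.0), 1.0);  \n"
    "    gl_Position = gl_ModelViewProjectionMatrix * pos;       \n"
    "}                                                           \n";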

Hey guys! Look what I found:
An article about the Quake3 engine that talks about exactly what we were discussing: Quake 3d Engine Optimization

Color arrays are passed as unsigned bytes!:

GL_VERTEX_ARRAY is always enabled, and each vertex will be four floats. The fourth float is just for padding purposes so that each vertex will exactly fill an aligned 16 byte block suitable for SIMD optimizations.
Is that true?! Do you recommend passing 4*sizeof(GLfloat) instead of 3*sizeof(GLfloat) when passing 3-component vertices?
Why should this speed things up!? I really don’t understand that explanation…
Thanks so much in advance!
Cheers
Rod

CVAs (compiled vertex arrays) are no longer used on recent hardware.

It could speed things up depending on the graphics card. Memory alignment is the key here. If you pass 4 32-bit values, then every vertex starts aligned whether the memory segment is 32, 64 or 128 bits wide. When you pass 3 values, each new vertex won’t be aligned to a new segment. That’s all.
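Concretely, something like this (a sketch; the fourth float is pure padding):

#include <GL/gl.h>

// pad xyz with one unused float so every vertex starts on a 16-byte boundary
struct PaddedPosition
{
    GLfloat x, y, z, w;   // w is never read, it only keeps the 16-byte stride
};

void setPaddedPointer(const PaddedPosition *v)
{
    glVertexPointer(3, GL_FLOAT, sizeof(PaddedPosition), &v[0].x);  // still 3 components, stride 16
}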

Originally posted by Rodrix:
GL_VERTEX_ARRAY is always enabled, and each vertex will be four floats. The fourth float is just for padding purposes so that each vertex will exactly fill an aligned 16 byte block suitable for SIMD optimizations.
Is that true?! Do you recommend passing 4*sizeof(GLfloat) instead of 3*sizeof(GLfloat) when passing 3-component vertices?
Why should this speed things up!? I really don’t understand that explanation…
Rod

At the time the Quake3 engine was new, most cards did not have HW acceleration of vertex transforms, so they were calculated on the CPU, ideally using SSE or similar AMD instructions. Many float SSE instructions are designed to operate on four floats at once, and there is a performance penalty for memory access if the four floats are read from or written to an address not aligned to 16 bytes. This is what John Carmack is talking about.

That’s true. Q3 came out in 2000, I think.
CVA is certainly archaic. Use VBO if you want your stuff to be in VRAM. Tell it that your VBO is static.
Additionally, glDrawRangeElements is preferred over glDrawElements.
The vertex format can be xyz, but your entire vertex should be multiples of 32 bytes.

I have a GDC paper,
GDC2004_PracticalPerformanceAnalysis.pdf, and some other PDFs that mention this number.
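For what it’s worth, the setup I mean is roughly this (a sketch using GL 1.5 style buffer objects; on Windows these entry points still need to be fetched as extensions):

#include <GL/gl.h>

// upload once into a static VBO, then draw with glDrawRangeElements so the
// driver knows the index range up front
void uploadAndDraw(const void *vertices, GLsizei vertexCount, GLsizei vertexSize,
                   const GLushort *indices, GLsizei indexCount)
{
    GLuint vbo;
    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexCount * vertexSize, vertices, GL_STATIC_DRAW);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, vertexSize, (const void *)0);  // offset into the VBO

    glDrawRangeElements(GL_TRIANGLES, 0, vertexCount - 1,
                        indexCount, GL_UNSIGNED_SHORT, indices);
}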

Originally posted by Humus:
Originally posted by Rodrix:
Could you explain what fillrate limited means?

Basically you’re limited by the amount of pixels rendered, rather than the amount of vertices.

Humus, thanks for your replies.
The cookie metaphor was great ;)

I implemented the various suggested improvements (unsigned bytes for color, and glLockArraysEXT) and now I am working on the fill rate limitation, which appears to be the main bottleneck. Any advice on how to improve this?
-One thing, I guess, is to reduce the amount of texture loading used throughout my program, which I am doing now.

Any other suggestions?

Thanks so much!
Cheers,
Rod

V-man wrote:
The vertex format can be xyz, but your entire vertex should be multiples of 32 bytes.
Perhaps this is just a slight misunderstanding, but that’s not what the document says.

It says that if you’re shuffling data over AGP, use “multiples of 32 byte sized vertices”.

The way I read it is that if you’re on AGP (and now we’re venturing way outside OpenGL and into hardware-specific optimizations for bus transactions, for a bus that’s being phased out), you should submit your data in a form that maximizes the throughput of that particular bus. I.e. if you have only xyz in your vertices, it means you should submit them in batches of multiples of 8 vertices (8*12 = 96 bytes = 3 bus transactions, 32 bytes each).

If supporting older (e.g. Radeon 7000, TNT2) or less specialized (such as Intel integrated 915, 925 or similar) h/w that performs TnL on the CPU, submitting data well aligned for the CPU will affect that/those stage(s) of the pipeline, but besides CPU and/or system RAM specific behaviours, I think today’s and yesterday’s GPUs (back to the Radeon 9200 and GeForce… 1?) handle just about any 32-bit aligned data the same (someone with insight here, feel free to chime in if this assumption is wrong).

What can matter is the alignment of the starting address (in system memory) of a submitted batch of data. Using immediate calls (glVertex & co) the driver should handle this. Mapped buffers should already be page aligned (and on Windows, due to the way its memory manager works, I’m almost 100% sure you’ll even get them 64KB-aligned).

I think that leaves only the “upload” style functions (e.g. BufferData), where source data could be mis-aligned from a cache-line, bus transfer, or even DMA perspective.

By that, I think I’ve left “off-topic” in the dust for this thread, so I’ll stop here. Just to round off: I’m not saying alignment isn’t still an issue, but something tells me it often isn’t the AGP memory transaction requirements that are the issue anymore. :)