Use of C++ structures with OpenGL?

struct {
    float a; // 32 bits
    byte  b; // 32 bits, although it only uses 8
    float c; // 32 bits
};

That is certainly not correct with Visual C++, GCC, or the Intel C++ compiler… or all the drivers I have written to date would not work.

MSVC 6 will add three bytes of padding between b and c, unless you alter the packing manually, of course.

That may be… the only driver I wrote with MSVC 6 was made to map 16-bit registers. But GCC and Intel C++ won't do that (I just checked by looking at the generated ASM).

Wow. Well, I either just spent a lot of time fixing something to the point that it's broken, or making it better; I'm not quite sure which now. But from what I have grasped, if I simply use a struct { float, float, float }, then I should be able to do this, especially if I use sizeof(PointStruct) as my stride. Is this correct?

I'm also looking at the fact that I create a pointer to begin with, then use that pointer to dynamically allocate an ARRAY of PointStructs, so those memory addresses are all stored inline. Thus, if I simply use sizeof() to know my stride, I can still do this. My only concern is the padding imposed by the compiler. But you have said that on 32-bit systems, using floats, this should not be a problem. But on a 64-bit system I will be a full 32-bit pad off. Currently most platforms are 32-bit, so I should be fine. Is this correct?

Originally posted by LostInTheWoods:
Wow. Well, I either just spent a lot of time fixing something to the point that it's broken, or making it better; I'm not quite sure which now. But from what I have grasped, if I simply use a struct { float, float, float }, then I should be able to do this, especially if I use sizeof(PointStruct) as my stride. Is this correct?
Yep.
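For reference, a minimal sketch of that setup, assuming a plain three-float struct (PointStruct is the name from the question; the draw() helper and its parameters are made up for illustration):

#include <GL/gl.h>

struct PointStruct {
    float x, y, z; // three floats, no padding expected on common ABIs
};

// Hypothetical helper: submit an array of PointStructs as a vertex array,
// using sizeof(PointStruct) as the stride so any padding is stepped over.
void draw(const PointStruct *points, int count)
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, sizeof(PointStruct), &points[0].x);
    glDrawArrays(GL_POINTS, 0, count);
    glDisableClientState(GL_VERTEX_ARRAY);
}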

I'm also looking at the fact that I create a pointer to begin with, then use that pointer to dynamically allocate an ARRAY of PointStructs, so those memory addresses are all stored inline. Thus, if I simply use sizeof() to know my stride, I can still do this. My only concern is the padding imposed by the compiler. But you have said that on 32-bit systems, using floats, this should not be a problem. But on a 64-bit system I will be a full 32-bit pad off. Currently most platforms are 32-bit, so I should be fine. Is this correct?
Power-of-two boundaries are probably safe.

I.e. even on a 64-bit system (like the upcoming x86-64), a struct with two float members won't contain any padding.

AMD likes to call that 'natural alignment': your data members only have to be aligned to a sizeof(type) boundary.

e.g.

struct {
    ubyte  a, b; // offsets 0, 1
    ushort c;    // offset 2
    float  d;    // offset 4
    double e;    // offset 8
};

This case is mostly guaranteed.

The problem is this:

struct {
    short a;
    float b;
    short c;
};

a will be at offset 0, b will be at offset 4 (!), but c can go either to offset 2 or to offset 8, depending on compiler 'cleverness'. If the compiler ignores 'natural alignment', they may even end up packed in declaration order.

So to make life a little easier:
1) Explicitly arrange for natural alignment
2) Sort members largest to smallest wherever possible

Strategy one would yield

struct {
    short a;
    short c;
    float b;
};

Works fine.

Strategy two (compiler paranoia) would yield

struct {
    float b;
    short a;
    short c;
};

Thanks, and good night

but c can go either to offset 2 or to offset 8, depending on compiler 'cleverness'

I believe the specification requires that members appear in memory in the same order as they are specified in the structure. Therefore the compiler is not allowed to place c between a and b. Not sure, but almost.
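One way to settle it on any given compiler is to print the offsets with offsetof; a quick sketch (the Mixed name is just for illustration):

#include <cstddef>
#include <cstdio>

struct Mixed {
    short a;
    float b;
    short c;
};

int main()
{
    // On a typical 32-bit ABI this prints a=0 b=4 c=8 size=12:
    // members stay in declaration order, padding fills the gaps.
    std::printf("a=%zu b=%zu c=%zu size=%zu\n",
                offsetof(Mixed, a), offsetof(Mixed, b),
                offsetof(Mixed, c), sizeof(Mixed));
    return 0;
}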

Originally posted by OldMan:
Architecture-dependent boundaries on x86 are 8-bit… not 32…

32-bit packets are faster to move than 24-bit ones… but not faster than 16-bit or 8-bit ones on an x86 processor.

I'm not questioning your programming skills (I've never programmed drivers in Win32, only DOS, and very simple ones). (OT) Is there any source of information on the net that can help me start playing with that? I miss programming my SB Pro.

IIRC from the Intel docs, any memory access reads 32 bits from memory and is 32-bit aligned. If, by any chance, a variable (16 bits or more) crosses a 32-bit boundary, two memory reads are necessary. This can make 16-bit operations slower than the equivalent aligned 32-bit ones.

I'm not sure if 8-bit vars need to be aligned this way (since they always need only one memory access), but if the next member would cross the boundary, padding is inserted like in the example I've shown.

About the order of struct members: I've never come across a case where the compiler rearranged the order for me. Maybe that only happens when you optimize for size instead of speed.

Originally posted by t0y:
IIRC from the Intel docs, any memory access reads 32 bits from memory and is 32-bit aligned. If, by any chance, a variable (16 bits or more) crosses a 32-bit boundary, two memory reads are necessary. This can make 16-bit operations slower than the equivalent aligned 32-bit ones.
What did you mean there?

If all types are naturally aligned, the shorter types are better. They potentially conserve cache space and require fewer memory reads. It might work out equal, but they are never worse. It's also very nice if you just have to fit some data structure into a given alignment (say, you want each object in a large array to occupy 64 bytes and be perfectly cache-line aligned).
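A sketch of that 64-byte trick, assuming a modern compiler (alignas and static_assert are C++11 spellings; compilers of the era used __declspec(align(64)) or __attribute__((aligned(64))) instead), with a made-up Particle type:

#include <cstdint>

struct alignas(64) Particle {
    float pos[3];
    float vel[3];
    float mass;
    float age;            // 8 floats = 32 bytes so far
    std::uint8_t pad[32]; // explicit padding out to one cache line
};

static_assert(sizeof(Particle) == 64, "one cache line per particle");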

I also guess that the Intel doc may have referenced 32 bytes, not bits? That would be the P3 data cache line size.

Or did you mean

0x00 .
0x01 .
0x02 .
0x03 X
0x04 X
0x05 X

^^ That sounds like the first sentence (a 16-bit value crossing a 32-bit boundary), but the next sentence puzzled me:

>>> This can make 16-bit operations slower than the equivalent aligned 32-bit ones. <<<

OK, but then 32-bit accesses that are not aligned will also cause slowdowns.

Also, I think there is something seriously wrong with this statement. When you allocate memory, very often you don't get 32-bit aligned memory. Pretty much 99% of apps out there are not making sure their memory accesses are aligned.

With SSE and MMX, not having 16-byte alignment will lead to slowdowns. At least I think it was 16 bytes. I think that's what you were thinking of.

V-man

I’ve never seen a compiler re-order fields in a struct. I don’t know whether that’s even legal according to the C spec.

Typically, each machine architecture has an alignment requirement to be able to load from memory into a register. If your data doesn’t meet this alignment requirement, you’ll either trigger an alignment exception (very slow), or just fail, depending on hardware and/or OS.

For example, on 32-bit PowerPC systems, all loads to integer registers must be on 32-bit boundaries, all single-precision floating point loads must be on 32-bit boundaries, all double-precision floating point loads must be on 64-bit boundaries, and all vector loads must be on 128-bit boundaries. Integer and FP loads will generate alignment exceptions handled by the OS if their address is misaligned; the vector unit chooses the closest 128-bit aligned address and silently loads from that.

The compiler will pad the fields of your struct so that they meet the alignment requirements of your architecture. For example, in the case already mentioned with a float, a char, and another float in a struct in that order, there will be three bytes of padding around the char on almost all current architectures (including Intel, PowerPC, 32-bit MIPS, …). Which side of that char the padding goes on depends on the endianness of the machine in question, so you shouldn't rely on it in any way.

I can’t speak to other compilers in this regard, but GCC aligns the entire struct based on the alignment of the first field. That means that the length of the struct will be padded to be a multiple of the alignment requirement of the first field. For example, a struct containing a double followed by a float will have 32 bits of padding at the end.

In summary, the specific case of { float, float, float } should be 12 bytes on most current common architectures, but you can't really rely on that. To be safe, you should use arrays of GLfloats (since there's no guarantee that float even maps to GLfloat).
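For instance, the safe flat-array version might look like this (a minimal sketch; the data and the drawTriangle() helper are made up):

#include <GL/gl.h>

// Tightly packed GLfloats, 3 components per vertex; stride 0 tells GL
// the data is contiguous, so struct padding never enters the picture.
static const GLfloat verts[] = {
    0.0f, 0.0f, 0.0f,
    1.0f, 0.0f, 0.0f,
    0.0f, 1.0f, 0.0f,
};

void drawTriangle()
{
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawArrays(GL_TRIANGLES, 0, 3);
    glDisableClientState(GL_VERTEX_ARRAY);
}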

I think that structs get padded, but whether they are aligned properly, I don't know. It's pretty compiler-dependent, I guess.

When I did some SSE coding, I had exceptions being raised on movaps instructions, and when I checked the addresses, they weren't 16-byte aligned. I had to search for the nearest aligned location in the large array that I allocated (using new). The array type was float (32-bit float), but I still had to make sure I was getting proper alignment.

I use VC++ 6. What's that easy-to-use compiler flag again?

V-man
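A sketch of that search-for-the-nearest-aligned-address trick, assuming the 16-byte alignment that movaps needs (the function name is made up; modern code would just use an aligned allocator):

#include <cstddef>
#include <cstdint>

// Over-allocate, then round the pointer up to the next 16-byte boundary.
// The caller must keep (and eventually delete[]) 'raw', not the aligned
// pointer.
float *allocAligned16(std::size_t count, float *&raw)
{
    raw = new float[count + 4];                      // 4 floats = 16 spare bytes
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw);
    p = (p + 15) & ~static_cast<std::uintptr_t>(15); // round up to multiple of 16
    return reinterpret_cast<float *>(p);
}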

Anyone who has done SIMD/SSE programming knows how #pragma align works. It's a requirement of the fast versions of the SSE read/write instructions. Padding in a struct is not guaranteed, but it is predictable. You are only compiling for one platform anyway. Usually ( :

Devulon

This is getting a bit OT.

Anyway, the variables in a structure will always stay in the same order you defined them in. But the compiler is free to align them as it wants.

I think #pragma align is specific to MSVC, so you should use __m128 and friends when programming in SSE, because the compiler automatically takes care of the alignment for those types.
When trying to align a whole array there is _mm_malloc() or _aligned_malloc(), depending on your compiler; it's not portable, of course.
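Usage is straightforward; a minimal sketch (the alignment argument of 16 matches what movaps expects):

#include <xmmintrin.h>

void example()
{
    // _mm_malloc pairs with _mm_free, not with free().
    float *data = static_cast<float *>(_mm_malloc(1024 * sizeof(float), 16));
    // ... aligned SSE loads/stores on data ...
    _mm_free(data);
}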

But anyway, to come back to the topic, I don't really see the point of using an array of "Vector3f-like" structures instead of simply using an array of floats; really, I mean it brings more trouble than it helps.
Like OneSadCookie said, an array of GLfloats should be used.


In any case, padding your vector3 out with a fourth element will usually make things more efficient for CPU computations, rather than less (better cache-line alignment).

The GL drivers will be optimized for this case, too, because that’s how Q3A submits its vertices.

So you're saying I should add another component to my vector structure to make it faster? How does that work?

The cache works in blocks; I'm not sure about the size on common architectures, but I believe 32 and 64 bytes per block are usual. If a block is not in the cache (I'm talking about the processor cache, by the way), you have a cache miss and the processor has to access main memory to fetch that block, which can be expensive. If you keep your data structures of a size such that N structures fit exactly in a cache block, then with proper alignment you can store your structures so that they never cross two blocks. If they cross two blocks, you get two cache misses when that single structure is not in the cache (note: if one of the blocks is in the cache because of a previous access, there's only one miss, but that's one too many anyway). Adding an extra element to the structure can save expensive cache misses.
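To make that concrete, a small sketch (assuming a 64-byte cache line; the type names are made up):

// A 12-byte vec3 packs 5 1/3 per 64-byte line, so some vertices straddle
// two lines; a 16-byte padded vec4 packs exactly 4 per line and, given a
// 16-byte-aligned array, never crosses a line boundary.
struct Vec3 { float x, y, z;      }; // sizeof == 12
struct Vec4 { float x, y, z, pad; }; // sizeof == 16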

Originally posted by Bob:
The cache works in blocks; I'm not sure about the size on common architectures, but I believe 32 and 64 bytes per block are usual. If a block is not in the cache (I'm talking about the processor cache, by the way), you have a cache miss and the processor has to access main memory to fetch that block, which can be expensive. If you keep your data structures of a size such that N structures fit exactly in a cache block, then with proper alignment you can store your structures so that they never cross two blocks. If they cross two blocks, you get two cache misses when that single structure is not in the cache (note: if one of the blocks is in the cache because of a previous access, there's only one miss, but that's one too many anyway). Adding an extra element to the structure can save expensive cache misses.

Sorry, but that's completely bogus.

First of all, you get 33% more storage eaten up by the data. That alone causes 33% more cache misses. Not that cache misses matter much in streaming-type stuff anyway…

Proper SSE or 3DNow-optimized code can handle arrays of 3-element vectors without penalty. And if you're not running software T&L, this doesn't even matter…

Bottom line: you're wasting memory and bandwidth for nothing.

Originally posted by OneSadCookie:
I can’t speak to other compilers in this regard, but GCC aligns the entire struct based on the alignment of the first field. That means that the length of the struct will be padded to be a multiple of the alignment requirement of the first field.

What the C spec says here is that the compiler must ensure sizeof(struct) is padded so that the struct can be safely used as an element of an array.
So, when you declare 'my_struct arr[10];', then at any index 'i' all members of 'arr[i]' must be properly aligned, according to the rules you described.
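In modern terms the rule is that sizeof(T) is always a multiple of the alignment of T, so arr[i] is aligned whenever arr[0] is. A sketch (alignof and static_assert are C++11 spellings):

struct my_struct {
    short a; // offset 0
    float b; // offset 4 (2 padding bytes after a)
};           // sizeof == 8, a multiple of the 4-byte alignment

static_assert(sizeof(my_struct) % alignof(my_struct) == 0,
              "array elements stay properly aligned");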

First: MSVC by default aligns on natural alignment up to 32 bits, and thus will pad float-char-float with 3 bytes between the char and the second float. This can be changed with compiler options and/or #pragmas; GCC has something similar with its type attribute syntax. Just running any type through the compiler will show you this.

Program:

#include <stdio.h>

struct foo {
    float a;
    char  b;
    float c;
};

int main()
{
    foo *f = 0; // null base pointer, so member addresses print as offsets
    printf("%lx, %lx, %lx, %lx\n",
           (unsigned long)&f->a, (unsigned long)&f->b,
           (unsigned long)&f->c, (unsigned long)sizeof(foo));
    return 0;
}

Output (both using GCC 3.0.3 for i686 and MSVC 6.0 sp5):

0, 4, 8, c

Whoever said it didn't, and that he'd just checked, obviously hadn't, or had checked something totally different (it was kind of vague, that comment).

Second: the native alignment size on any modern x86 is 32 bits, and smaller alignment may cost you performance. malloc(), as implemented in the MSVC runtime library, will do its darndest to return 32-bit-aligned data. If you access data "randomly" (rather than streaming through it), it makes sense to pad out to powers of 2, and to make sure your array starts on the same alignment, to be cache-optimal about your accesses.

Third: 16-bit accesses are typically slower than 32-bit on a modern x86 because, if nothing else, each instruction using 16-bit registers needs a size prefix byte. Also, on not-so-modern x86 processors you get a partial register stall if you mix 32- and 16-bit code without properly indicating that you don't care about the upper bits by clearing the register with xor reg,reg.

Fourth: It is not possible to get optimal SSE throughput using non-padded 3-element vertices. The shuffle instruction ties up the SSE execution unit for three (3!) cycles, and is thus more expensive than an add or multiply. (Btw: the P-III can only decode a single SSE instruction per clock, and Athlon XPs aren't any faster at SSE than at regular FP :( )

There may be cases where you can code your loop to "just work"(tm) with 3-interleaved vertex arrays, but that's not the norm. If you can save memory and still be efficient, by all means do so, but there are many cases where padding actually does matter, all depending on your data access pattern (see the sketch after this post).

Fifth: writing 3-aligned float triplets to AGP memory is a recipe for disaster, as the Pentium III has only 6 line fetch buffers (doubling as write combiners) and will evict a partially filled one at the first hint of running low; thus, you REALLY want to be writing full, aligned 32-BYTE quantities at a time when going to AGP memory. No can do if your input or data format is only 12-byte aligned. Well, unless you write 96 bytes at a time, but at that point you're all out of LFBs to get data in from L2 or RAM in the first place…

Now, let’s return to our regularly scheduled hand-wringing over the total absence of released OpenGL drivers supporting ARB_fragment_program in hardware.
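To illustrate the fourth point above, a minimal sketch of the load-side difference (the function names are made up; xmmintrin.h is the SSE intrinsics header):

#include <xmmintrin.h>

// Padded {x, y, z, pad} vertex, 16-byte aligned: one aligned load.
__m128 loadPadded(const float *v)
{
    return _mm_load_ps(v);
}

// Tightly packed {x, y, z}: an unaligned load is needed (and it reads one
// float past z, which must be valid memory); real code would also have to
// shuffle components into place, which is where the extra cycles go.
__m128 loadPacked(const float *v)
{
    return _mm_loadu_ps(v);
}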

Sorry for being such a pain

but

2) Yup, I was talking about streaming.

3) MOVZX reads 8 or 16 bits and writes a full 32-bit register, thus no partial register stall. Every non-half-dumb compiler should use it nowadays.

4) I basically concur. It depends, heavily. Something worth trying: instead of creating permutations of the stream, you can prepare permutations of the static data (matrices, typically). This will be too much for the register file and will take reg,mem instructions instead of reg,reg. No worries though; it'll easily stay resident in the L1 cache.

5) See #4. Also try to do computations to temp memory (read: small L1-cache-sized buffers) and shovel them out via MMX.