SIMD data structures to OpenGL VBOs

Recently, I’ve been attempting to port some of my code to SIMD (specifically, SSE). SSE is a good fit, as I have a great many functions doing math on all the vertices in a mesh.
My problem is how to organize the data. The Intel documentation I’ve been reading indicates that the obvious array-of-structures (AoS) data format is not the most efficient one to use with SSE; a structure-of-arrays (SoA) format, or even a hybrid format, is better. The problem is that the suggested SoA and hybrid formats are very different from the AoS format of VBOs. So, what’s the recommended route to take?

You can’t store x components separately from y components with vertex arrays, so your output needs to be interleaved (AoS). Intel might wish the world to be SoA, but by and large, it isn’t.
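One possible way to reconcile the two layouts, sketched here under assumptions, is to keep SoA working arrays for the math and interleave them into AoS just before the data goes to the vertex array. The names `VertexAoS` and `interleave_soa_to_aos` are hypothetical, chosen only for illustration.

```c
#include <stddef.h>

/* AoS: one struct per vertex, the interleaved layout a vertex array expects. */
typedef struct { float x, y, z; } VertexAoS;

/* Interleave separate x/y/z arrays (SoA) into n AoS vertices. */
void interleave_soa_to_aos(const float *xs, const float *ys, const float *zs,
                           VertexAoS *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].x = xs[i];
        out[i].y = ys[i];
        out[i].z = zs[i];
    }
}
```

The extra pass costs memory bandwidth, so whether it pays off probably depends on how much math is done per vertex before the upload.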

You can still use SSE. Just make sure to put enough padding (or other data) in to make sure all the important bits (vertex data, normal data, etc) are 16-byte aligned.
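As a rough illustration of the padding idea, here is a minimal C11 sketch of an interleaved vertex in which position and normal each occupy their own 16-byte slot. The type `PaddedVertex` and the helper `is_aligned16` are hypothetical names, not part of OpenGL or any library.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Interleaved (AoS) vertex: each attribute is padded to four floats
   so that it starts on a 16-byte boundary. */
typedef struct {
    alignas(16) float position[4]; /* x, y, z + one padding float */
    alignas(16) float normal[4];   /* x, y, z + one padding float */
} PaddedVertex;

_Static_assert(sizeof(PaddedVertex) == 32, "two 16-byte slots per vertex");
_Static_assert(offsetof(PaddedVertex, normal) == 16, "normal starts on a 16-byte boundary");

/* Returns 1 if p is 16-byte aligned, 0 otherwise. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}
```

With a layout like this, the stride passed to `glVertexPointer` and `glNormalPointer` would be `sizeof(PaddedVertex)`. The buffer holding the vertices also has to start on a 16-byte boundary for the aligned moves to be legal.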

A vertex-by-matrix multiply with the appropriately swizzled data is really quite simple:

  1. move matrix into emm2, emm3, emm4 and emm5

  2. move vertex into emm0

  3. xor emm6,emm6

  4. move emm0 to emm1

  5. swizzle emm1 to broadcast the “x” component to all 4 components

  6. multiply emm1, emm2

  7. add emm6, emm1

  8. move emm0 to emm1

  9. swizzle emm1 to broadcast the “y” component to all 4 components

  10. multiply emm1, emm3

  11. add emm6, emm1

  12. move emm0 to emm1

  13. swizzle emm1 to broadcast the “x” component to all 4 components

  14. multiply emm1, emm2

  15. add emm6, emm1

  16. add emm6, emm5 (defaulting to “1” for w)

Note that swizzles have a 3-instruction latency, and multiplies and adds each have a 2-instruction latency. The multiply unit can get through one emm multiply per 2 cycles, and the adder can get through one emm add per 2 cycles. With the right scheduling, the inner loop (steps 4-16, plus a store) should run in something like 25 cycles. The hardest thing is breaking the dependency on the swizzles; you can use an extra register to do it. This assumes the loop is working out of L1 cache; pre-fetching and streaming stores ought to make sure of that.

The rules I described are Pentium III rules, but result in good performance for other architectures, too.
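For readers who prefer intrinsics to hand-written assembly, the vertex-by-matrix steps can be sketched roughly as follows. This is only an illustration: the function name `transform_point` is hypothetical, the matrix is assumed to be stored as four 16-byte-aligned columns (the way OpenGL lays out matrices), and w is treated as 1 as in the last step. Register allocation and scheduling are left to the compiler.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* cols: 16 floats, four matrix columns, 16-byte aligned (assumed layout).
   in/out: 16-byte-aligned x, y, z, w vertices; the input w is ignored
   and treated as 1. */
void transform_point(const float *cols, const float *in, float *out)
{
    __m128 c0 = _mm_load_ps(cols + 0);
    __m128 c1 = _mm_load_ps(cols + 4);
    __m128 c2 = _mm_load_ps(cols + 8);
    __m128 c3 = _mm_load_ps(cols + 12);
    __m128 v  = _mm_load_ps(in);

    /* Broadcast each component to all four lanes (the "swizzles"). */
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));

    /* result = x*col0 + y*col1 + z*col2 + col3 */
    __m128 acc = _mm_mul_ps(x, c0);
    acc = _mm_add_ps(acc, _mm_mul_ps(y, c1));
    acc = _mm_add_ps(acc, _mm_mul_ps(z, c2));
    acc = _mm_add_ps(acc, c3);

    _mm_store_ps(out, acc);
}
```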

Thank you for all the detailed information; swizzling it is…

Points for whoever spots the typo :slight_smile:

(Hint: it’s line 14)

Originally posted by jwatte:
[b]Points for whoever spots the typo :slight_smile:

(Hint: it’s line 14)[/b]
It’s full of typos. Should be xmm instead of emm. But you probably mean the 2 vs 4 copy&paste glitch :wink:

Speaking of swizzles, one thing I really miss in SSE and SSE2 is horizontal adds. That’s the most obvious thing to accelerate very common stuff like dot products. But it only made it into SSE3 for some reason. The lack of horizontal adds makes coding SSE much more inconvenient for most tasks, and you’re forced to swizzle way more than you’d like to. It should have been there from the start, like it was in 3DNow.
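To make the swizzle overhead concrete, here is a minimal sketch of a 4-component dot product that uses only original SSE instructions, so the horizontal sum has to be built from two shuffles and two adds. The function name `dot4` is hypothetical.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Dot product of two 4-float vectors without a horizontal-add instruction. */
float dot4(__m128 a, __m128 b)
{
    __m128 p = _mm_mul_ps(a, b);                              /* x,   y,   z,   w   */
    __m128 s = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 3, 0, 1)); /* y,   x,   w,   z   */
    __m128 t = _mm_add_ps(p, s);                              /* x+y, x+y, z+w, z+w */
    __m128 u = _mm_shuffle_ps(t, t, _MM_SHUFFLE(1, 0, 3, 2)); /* z+w, z+w, x+y, x+y */
    return _mm_cvtss_f32(_mm_add_ss(t, u));                   /* lane 0: x+y+z+w    */
}
```

Two of the five operations exist only to move data between lanes, which is the overhead a horizontal add would remove.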

The reason horizontal adds weren’t in SSE or SSE2 is that the Intel CPUs internally use 64-bit busses, forwards, and register files. This means that operating on an XMM register takes two cycles! (Well, insofar as “cycles” are defined in that architecture… it gets hairy at that level :slight_smile:)

Anyway, this means that they didn’t have the necessary interconnect between the two halves of the XMM registers to do a correct dot product efficiently. (On this topic: I think the swizzles are very special – and they take longer because of this)

I whole-heartedly agree that a crosswise add was long overdue. Note that 3DNow only did 2 elements wide, so they didn’t have the forwarding problem, but instead you quickly run out of available registers; it’s not SIMD enough to be worth it IMO.

>>> it’s not SIMD enough to be worth it IMO.<<<<

I have done benchmarking with vec3 dot products. FPU was faster than SSE. SSE wastes time with the shuffles.
For vec4, it was just a tiny bit faster.

Have any of you tried this?

When you multiply by more than one matrix (such as when CPU skinning), the shuffles amortize over more operations. The actual case I was timing (starting four years ago now!) was multi-matrix blending of both vertices and normals, with streaming stores to AGP VAR memory. At the time, we got a nice improvement over all the other mechanisms. (We used a slightly different shuffle, and swizzled the matrix instead when generating it, btw, which saved one shuffle.)

Also, on Pentium hardware, there’s more of a difference between SSE and x87; the Athlon line is known for being quite good at x87 instruction execution. Which hardware were you using?

Originally posted by V-man:
[b]>>> it’s not SIMD enough to be worth it IMO.<<<<

I have done benchmarking with vec3 dot products. FPU was faster than SSE. SSE wastes time with the shuffles.
For vec4, it was just a tiny bit faster.

Have any of you tried this?[/b]
Well, I don’t get FPU beating SSE, but I often get 3DNow beating SSE, even when I’m working on vec4 stuff and I don’t need many shuffles, such as for instance in a vec4 lerp that you’d think SSE would be faster on.
That’s on an Athlon64; I don’t know if it’s any different on an Athlon XP.

Originally posted by jwatte:
When you multiply by more than one matrix (such as when CPU skinning)
I did the test 2 weeks ago, and of course, the SSE version was actually vec4 with w = 0 (or I will have to find a way to discard w)
Since I was using 10 million vertices :
XYZ = 114 MB
XYZW = 152 MB

and it was an Athlon XP.
I know the P4 FPU is weak and probably still is with the Prescotts. Intel wants everyone to use SSE.

So FPU was about 6% faster.

I have a certain algorithm that does a lot of vec3 dot products so I wanted to see if it’s worth going to SSE.

>>>I often get 3DNow beating SSE<<<

Sounds good. By how much does it beat it?

Discarding the “w” means that your source data is not 16-byte aligned, and thus you have to fetch with MOVUPS instead of MOVAPS. That can eat up a noticeable part of your speed, once you’re at the point that memory throughput matters.
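In intrinsics terms, the difference looks roughly like this: `_mm_load_ps` corresponds to MOVAPS and requires a 16-byte-aligned address, while `_mm_loadu_ps` corresponds to MOVUPS and accepts any address. The helper names `load_padded` and `load_packed` are hypothetical.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Padded xyzw data: vertex i starts at byte 16*i, so an aligned load works
   as long as the base pointer is 16-byte aligned. */
__m128 load_padded(const float *xyzw, size_t i)
{
    return _mm_load_ps(xyzw + 4 * i);
}

/* Packed xyz data: vertex i starts at byte 12*i, which is usually not a
   multiple of 16, so an unaligned load is required. The fourth lane picks up
   whatever float follows the vertex, so the array needs one float of slack
   after the last vertex. */
__m128 load_packed(const float *xyz, size_t i)
{
    return _mm_loadu_ps(xyz + 3 * i);
}
```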

We stored a “1” for w for the position, and “0” for w for the normal, and disallowed non-uniform scale in our animations, and interleaved normals and position in one vertex array – I think you can see why :slight_smile:

I did animation playback (hermite interpolation) and skinning palette computation, in C, SSE and 3DNow.

The biggest benefit of switching to any of the SIMD instruction sets, on any processor, was being able to use a streaming store.
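For reference, a streaming (non-temporal) store in SSE intrinsics is `_mm_stream_ps`. The sketch below is only an illustration of the call pattern, with a hypothetical function name `scale_stream` and a trivial scale standing in for the real per-vertex work; both pointers are assumed to be 16-byte aligned.

```c
#include <stddef.h>
#include <xmmintrin.h> /* SSE intrinsics */

/* Scale n 4-float vectors from src into dst, writing the results with
   non-temporal stores so they bypass the cache. */
void scale_stream(float *dst, const float *src, size_t n, float s)
{
    __m128 k = _mm_set1_ps(s);
    for (size_t i = 0; i < n; ++i) {
        __m128 v = _mm_load_ps(src + 4 * i);
        _mm_stream_ps(dst + 4 * i, _mm_mul_ps(v, k));
    }
    _mm_sfence(); /* order the streamed writes before later stores */
}
```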

The biggest downer when using SSE was that palette matrices had to be transposed after the SSE computation to fit into 3 shader constants each.
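A sketch of what that transpose step might look like, assuming the computed matrix is held as four columns and is affine, so its last row is (0, 0, 0, 1) and can be dropped. The function name `pack_palette_matrix` is hypothetical; `_MM_TRANSPOSE4_PS` is the standard macro from `<xmmintrin.h>`.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* cols: 16 floats, four matrix columns, 16-byte aligned (assumed layout).
   out:  12 floats, three matrix rows, 16-byte aligned, ready to be
         uploaded as three 4-component shader constants. */
void pack_palette_matrix(const float *cols, float *out)
{
    __m128 r0 = _mm_load_ps(cols + 0);
    __m128 r1 = _mm_load_ps(cols + 4);
    __m128 r2 = _mm_load_ps(cols + 8);
    __m128 r3 = _mm_load_ps(cols + 12);

    _MM_TRANSPOSE4_PS(r0, r1, r2, r3); /* r0..r3 now hold the matrix rows */

    _mm_store_ps(out + 0, r0);
    _mm_store_ps(out + 4, r1);
    _mm_store_ps(out + 8, r2);
    /* r3 is (0, 0, 0, 1) for an affine matrix and is not stored. */
}
```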

Some results from my experience

straight C ~700 cycles/joint
SSE, Pentium4 ~550 cycles/joint
SSE, Athlon64 ~450 cycles/joint
3DNow, Athlon64 ~400 cycles/joint

The work to be done per joint was about the equivalent of 500 SSE instructions.

Originally posted by V-man:
Sounds good. By how much does it beat it?
Depends on what I’m doing, but something like 20% is not uncommon. I just released a demo that does some 3DNow/SSE/FPU stuff, and with 3DNow it runs about 15% faster than with SSE.

Jesus Christ! I know most of these terms but the processes discussed are like.

( stuff here )

( my head way down here )

I just got my degree a couple weeks ago, but I’ve been programming since grade school, and working with OpenGL for three to four years now. Should I know all this stuff already? I’m always paranoid about being caught way behind on knowledge, since staying with the game is so important in this industry.

I have the full set of architecture manuals from AMD and Intel, covering the AMD64 and IA32 architectures; the AMD book kind of subsumes the 32-bit stuff too. I should probably brush up on my CPU extensions. And my basic assembly too… Ugh, I have so much to learn…

Oh yeah I start my first programming job on Monday too. Bwa ha. Cheers!

Yeah, you’re basically doomed. Spreadsheet macros only for you from now on.

:wink:

Don’t tell me that!

/me hyperventilates

Borderline freakout here… Have you seen the screenshots of AOE3? I’m still working with multitexturing and plain blending… I really need to write some vertex and fragment programs, and then LEARN how to do GLshaders because that’s the way things are going.

And then there’s that other API I should probably learn…

For GLSL, play with Shader Designer at http://www.typhoonlabs.com/
GLSL is much easier than vertex and fragment programs IMHO.

I was kidding. Here’s some career advice:

The career path of “graphics guru” is actually fairly narrow, has a lot of competition, and changes every three years.

The career path of “engineer with depth in many areas, ability to focus on necessities, and ability to figure out what’s needed to deliver” actually looks a lot better in many cases.

So, don’t sweat the details. Write the code you think is fun. If you’re puttering around with your own hobby stuff, you WILL catch up with the people who do things “for real” because that takes 10 times longer. Once you’re at parity, THEN you do something for real to show that you’re not just all experiments.

Originally posted by Humus:
Speaking of swizzles, one thing I really miss in SSE and SSE2 is horizontal adds.
Here’s one potentially faster way to do a horizontal add. It employs the fact that the *ss instructions don’t have special alignment requirements/penalties for mem operands:

<source in xmm0.xyzw>
movaps membuf, xmm0
addss xmm0, membuf+4
addss xmm0, membuf+8
addss xmm0, membuf+12
<result in xmm0.x>
 

Note that the last addition is not needed for 3-component operation.
Of course, the membuf would stay in the cache all the time or, better, be forwarded.
There is an irony in that working with memory is more flexible than working with registers in this case; surely Intel’s designers could have done a better job.
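The same store-then-scalar-add idea can be written with intrinsics, roughly as below. The function name `hsum_via_memory` is hypothetical, and an optimizing compiler is free to emit different instructions than the assembly above, so treat this as a sketch of the technique and not a guaranteed instruction sequence.

```c
#include <xmmintrin.h> /* SSE intrinsics */

/* Sum the four lanes of v by spilling it to memory and adding the upper
   three floats back in with scalar (ss) adds. */
float hsum_via_memory(__m128 v)
{
    _Alignas(16) float buf[4];
    _mm_store_ps(buf, v);                           /* movaps membuf, xmm0   */
    __m128 s = _mm_add_ss(v, _mm_load_ss(buf + 1)); /* add y (byte offset 4)  */
    s = _mm_add_ss(s, _mm_load_ss(buf + 2));        /* add z (byte offset 8)  */
    s = _mm_add_ss(s, _mm_load_ss(buf + 3));        /* add w (byte offset 12) */
    return _mm_cvtss_f32(s);                        /* result is in lane 0    */
}
```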