SIMD data structures to OpenGL VBOs

l_belev · January 9, 2005, 4:33am

Noticed the typo?

zeckensack · January 9, 2005, 4:38am

Originally posted by l_belev:
Noticed the typo?
Yes
+12

l_belev · January 9, 2005, 5:14am

right

Zengar · January 9, 2005, 9:12am

Athlon XP supports SSE as so-called vector-path opcodes, which are fairly inefficient; that’s why your FPU was faster than SSE. On Athlon XP, 3DNow! should be used.
However, Athlon 64 has native support for SSE and SSE2, so it would be faster to use these opcodes instead of FPU/3DNow!

At least I think so.

Humus · January 9, 2005, 12:24pm

Well, that’s not what I’m seeing. 3DNow consistently beats SSE on Athlon64. Even when you’re working with vec4s and there are many more instructions in the 3DNow path.

imported_jwatte · January 9, 2005, 4:37pm

I’m not surprised that Athlon 64 runs 3DNow fast. It’s their instructions – for sure, nobody else will do it!

The real question is in what cases the Athlon 64 with 3DNow out-runs a Pentium 4 EE with SSE3, and vice versa. You can bet Intel implements SSE as well as they can.

Humus · January 10, 2005, 8:00pm

Yup. In my experience SSE on P4 is roughly the same speed as 3DNow! on Athlon64 when comparing similar CPU speeds (like 3.2GHz vs 3200+). In my MetaBalls demo, which relies heavily on SSE and 3DNow for performance, my 3.2GHz P4 laptop runs at 130fps using the SSE path, while my Athlon64 3200+ runs at 125fps using 3DNow and 110fps using SSE. So it’s more a case of the Athlon64 being slow (well, slower anyway) on SSE than being particularly fast on 3DNow.

Tzupy · January 11, 2005, 12:18am

Hi, I have two points to make:

When talking about Athlon 64 3200, it would be nice to specify which of the three flavors it is:
2.0 GHz, 1 MB, single-channel; 2.2 GHz, 512 KB, single-channel; 2.0 GHz, 512 KB, dual-channel.
Even if the integrated memory controller of the Athlon64 is great, one shouldn’t rely heavily on
it: prefetch techniques should still be used, especially for the dual-channel Athlon64s; in theory,
one would then get the most performance by using both SSE(2) and 3DNow.
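
A minimal sketch of what such a prefetch technique can look like, written in C with the SSE prefetch intrinsic rather than hand-written assembly. The function name sum_with_prefetch and the prefetch distance are assumptions made for illustration, not something from this thread; the right distance depends on the CPU and memory system and has to be measured. On compilers without SSE the hint simply compiles away.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
/* Hint the CPU to pull the cache line holding p into all cache levels. */
#define PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#else
/* No SSE available: the hint becomes a no-op. */
#define PREFETCH(p) ((void)0)
#endif

/* Assumed values for illustration: 16 floats = one 64-byte cache line,
   and we prefetch 64 floats (four lines) ahead of the current position. */
static const size_t FLOATS_PER_LINE = 16;
static const size_t PREFETCH_AHEAD = 64;

/* Hypothetical example: sum an array while prefetching data ahead of use. */
static float sum_with_prefetch(const float *data, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        /* Issue one hint per cache line, and only for addresses inside the array. */
        if (i % FLOATS_PER_LINE == 0 && i + PREFETCH_AHEAD < n)
            PREFETCH(data + i + PREFETCH_AHEAD);
        sum += data[i];
    }
    return sum;
}
```

For a simple linear walk like this the hardware prefetcher may already do the job, so the explicit hint tends to matter more for access patterns the hardware cannot predict.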

One more thing: in 64-bit mode, there would be 8 more XMM registers available, and an SSE(2)
performance increase can be expected. Does anyone know something specific on this?

system · January 11, 2005, 5:53am

(like 3.2GHz vs 3200+)
The AMD is clocked lower, hence it does more work per cycle.
(you didn’t OC, did you?)

And the exact system spec will matter, because your metaballs demo may benefit from the cache and such.

I would like to know by what % one is superior to the other (factoring out cache and memory performance)

Humus · January 11, 2005, 7:28pm

Originally posted by Tzupy:
Hi, I have two points to make:

When talking about Athlon 64 3200, it would be nice to specify which of the three flavors it is:
2.0 GHz, 1 MB, single-channel; 2.2 GHz, 512 KB, single-channel; 2.0 GHz, 512 KB, dual-channel.

Even if the integrated memory controller of the Athlon64 is great, one shouldn’t rely heavily on
it: prefetch techniques should still be used, especially for the dual-channel Athlon64s; in theory,
one would then get the most performance by using both SSE(2) and 3DNow.

One more thing: in 64-bit mode, there would be 8 more XMM registers available, and an SSE(2)
performance increase can be expected. Does anyone know something specific on this?
I’m using the 2.2GHz version. In my demo I’m not so much dependent on memory performance but rather on raw computation performance. Extra registers would only help if you need more of them. I also believe the use of the extra 8 registers creates larger code because of a prefix byte, but I’ll have to verify that.

Tzupy · January 12, 2005, 4:48am

Humus, you are correct: there’s a REX prefix involved with the usage of the new registers, but
I doubt it will have a significant impact on performance. Here is my reason for needing more
registers: it is possible to write ASM code that processes both step n and step n+1, interleaved.
The purpose is to try to hide instruction latency when your algorithm has step n+1 immediately
dependent on step n. The drawback is that you need double the number of registers you needed
without this interleaving. I have been using this technique since I had a 486, many years ago, and was
running into memory limitations in some crude texturing code. More recently, a population count routine
implemented with MMX benefited about 15% from the interleaving, compared with the AMD
implementation in the x86 Code Optimisation Guide (I think I should have done better).
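
The interleaving idea can be sketched in C instead of assembly. This is not the MMX code mentioned above; it is a minimal illustration that assumes the well-known bit-parallel population count, and the names popcount32 and popcount_interleaved are made up for this example. Two independent words go through the same chain of steps side by side, so each step on b never has to wait for the step just issued on a, at the cost of twice the temporaries (registers). A modern compiler and out-of-order core may reorder things on their own, so any gain has to be measured.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain version: each step depends directly on the previous one. */
static uint32_t popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit sums  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit sums  */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit sums  */
    return (x * 0x01010101u) >> 24;                   /* add 4 bytes */
}

/* Interleaved version: two words per iteration, steps alternated so that
   consecutive operations belong to independent dependency chains. */
static uint32_t popcount_interleaved(const uint32_t *data, size_t n)
{
    uint32_t total_a = 0, total_b = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        uint32_t a = data[i];
        uint32_t b = data[i + 1];

        a = a - ((a >> 1) & 0x55555555u);
        b = b - ((b >> 1) & 0x55555555u);
        a = (a & 0x33333333u) + ((a >> 2) & 0x33333333u);
        b = (b & 0x33333333u) + ((b >> 2) & 0x33333333u);
        a = (a + (a >> 4)) & 0x0F0F0F0Fu;
        b = (b + (b >> 4)) & 0x0F0F0F0Fu;

        total_a += (a * 0x01010101u) >> 24;
        total_b += (b * 0x01010101u) >> 24;
    }
    if (i < n)                      /* odd element left over */
        total_a += popcount32(data[i]);

    return total_a + total_b;
}
```

Calling popcount_interleaved(buffer, count) returns the total number of set bits in the buffer; the separate totals keep the two chains independent right up to the final addition.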

zeckensack · January 12, 2005, 6:31am

Originally posted by Humus:
Extra registers would only help if you need more of them. I also believe the use of the extra 8 registers creates larger code because of a prefix byte, but I’ll have to verify that.
It’s still better than spilling to memory from a code density pov. Memory references make opcodes longer by at least one byte.

Zengar · January 12, 2005, 8:22am

Typical SIMD instructions with a REX prefix will likely be 4 bytes (if my memory doesn’t trick me): one REX byte, two opcode bytes and a ModR/M byte. So they have a good chance of fitting the decode window of the Athlon. Pentiums will have more problems, I guess, as they have only one decode unit (again, if my memory doesn’t trick me).

I’m just a hobby assembler writer(writing compilers and such stuff) so don’t rely on my words.

l_belev · January 13, 2005, 12:02pm

On the Pentium 4, prefixes are no longer a concern, since the processor “re-compiles” the incoming code stream into its internal representation and stores it in that form in an internal trace cache. Since most time-consuming code in real-life situations sits in loops, which usually fit completely in the trace cache, the original machine code (including prefixes, etc.) does not matter at all.
That is the P4, which is Intel. I don’t know what the case is with AMD, but I suppose they would go the same way. The only obstacle I can think of would be patent-related problems, but I guess AMD has enough money to work around them.
Generally there’s no need to worry about the prefixes. IMO the extra registers are the best thing in AMD64, not the 64 bits.

Humus · January 13, 2005, 8:54pm

Yeah, extra registers are definitely a good thing, though the need isn’t as dramatic for SIMD as for regular ALU instructions. Back in the old days when I wrote a raycasting engine for the 486 I remember the hell of trying to squeeze everything into the registers. Not only did you have just 8, but you also lost the stack and frame pointers, so you essentially had 6. When you do SSE and such you can use the xmm registers for the math and the ALU registers for pointers, counters etc., so you have a lot more freedom than when you have to do both the math and increment counters and pointers with only 6 registers (or 7 if you compile with frame pointer omission). So it’s not that often (at least for the stuff I’m doing) that I really need more registers. For compiled code, though, the extra ALU registers will probably do wonders for performance.
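
That split between xmm registers for the math and general-purpose registers for pointers and counters shows up even in compiler-intrinsics code. A minimal sketch, assuming an x86 compiler with SSE enabled; add_floats is a hypothetical name used only for this illustration, and a scalar path is kept so the code still builds elsewhere.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* out[i] = a[i] + b[i] for n floats.
   The index i and the three pointers live in general-purpose registers,
   while the vector data (va, vb, their sum) lives in xmm registers. */
static void add_floats(float *out, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* unaligned load of 4 floats */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail (and full fallback when SSE is unavailable). */
    for (; i < n; ++i)
        out[i] = a[i] + b[i];
}
```

The loop needs only two or three xmm registers, which fits the point above that SIMD code runs out of registers less often than pure integer code does.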

xanatose · January 14, 2005, 6:21pm

Since SSE is off-topic for OpenGL, but the topic is already being discussed here, I wonder if someone knows a link to a good tutorial on SSE and SSE2.

I know how to use asm up to the Pentium, but don’t know my way around MMX, SSE and SSE2. I downloaded the manuals from Intel, but would really need a good tutorial on the subject to get up to speed.

Tzupy · January 14, 2005, 10:25pm

I’m not sure about a tutorial on SSE, but if you download the AMD x86 Code Optimisation Guide you’ll find examples of code implemented with MMX and 3DNow. There are also several Intel papers, like ‘Application tuning for SSE’, ‘Antialiasing implemented using SSE’, etc.

imported_jwatte · January 16, 2005, 8:09pm

Humus: you don’t need the frame pointer, because you can get all your locals off the stack pointer. So we’re up to 7 registers.

Then, you don’t need the stack pointer, if you store it in a global variable and turn off interrupts.

Humus · January 22, 2005, 9:40am

Yeah, that’s what I said (“7 if you compile with frame pointer omission”).