Fast normalization - slightly off topic

Punchey · February 12, 2001, 11:03am

Is there a faster way to normalize a vector than just this?

len = sqrt(vect.x*vect.x + vect.y*vect.y + vect.z*vect.z);
vect.x /= len;
vect.y /= len;
vect.z /= len;

I remember a poster earlier mentioning some faster methods of finding the length of a vector without using sqrt.

Zeno · February 12, 2001, 11:22am

This would be faster:

len = 1.0 / sqrt(vect.x*vect.x + vect.y*vect.y + vect.z*vect.z);

vect.x *= len;
vect.y *= len;
vect.z *= len;

Multiplies are significantly faster than divides. Of course, the slowest part of the above code is still the square root. Nvidia has some fastmath routines on their site that include a faster square root. Other “tricks” like Nvidia’s are floating around the web, too. They’re not as accurate or as universal as just calling sqrt(), so be aware of this.

– Zeno

davepermen · February 12, 2001, 11:25am

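void __forceinline __fastcall normalizeASM(float* v)
{
static float f=0; // scratch slot in memory (static, so not thread-safe)
static const float one=1.0f;
__asm{
mov eax,dword ptr[v] // eax = address of the vector
fld dword ptr[eax] // st0 = x
fmul dword ptr[eax] // st0 = x*x
fstp f // f = x*x
fld dword ptr[eax+4] // st0 = y
fmul dword ptr[eax+4] // st0 = y*y
fadd f // st0 = x*x + y*y
fstp f
fld dword ptr[eax+8] // st0 = z
fmul dword ptr[eax+8] // st0 = z*z
fadd f // st0 = x*x + y*y + z*z
fsqrt // st0 = length
fstp f
fld one
fdiv f // st0 = 1/length
fstp f // f = 1/length
fld dword ptr[eax]
fmul f
fstp dword ptr[eax] // x *= 1/length
fld dword ptr[eax+4]
fmul f
fstp dword ptr[eax+4] // y *= 1/length
fld dword ptr[eax+8]
fmul f
fstp dword ptr[eax+8] // z *= 1/length
}
}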

That’s the one I wrote. It could be faster without the temporary f, but I don’t know how to store from a floating-point register into a general-purpose register (something like fstp ebx is not allowed). Anyone?

Humus · February 12, 2001, 11:32am

If you use 3DNow! or SSE you can do a fast and quite accurate 1/sqrt approximation in only two or so clock cycles.
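
For the SSE side, a hedged sketch using compiler intrinsics instead of hand-written asm might look like the following. The function names approx_rsqrt and normalize3f_approx are made up for this example, and the fallback path exists only so the code still builds where SSE is not available. The sketch makes no claim about cycle counts.

```c
#include <math.h>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RSQRT 1
#endif

/* Approximate 1/sqrt(x). With SSE this maps to the RSQRTSS instruction,
   whose relative error Intel documents as at most 1.5 * 2^-12. */
static float approx_rsqrt(float x)
{
#ifdef HAVE_SSE_RSQRT
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
#else
    return 1.0f / sqrtf(x);  /* portable fallback: exact rather than fast */
#endif
}

/* Normalize a 3-float vector in place using the approximation. */
static void normalize3f_approx(float *v)
{
    float inv_len = approx_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= inv_len;
    v[1] *= inv_len;
    v[2] *= inv_len;
}
```

The result is only accurate to about 11 or 12 bits, so the normalized vector's length will be close to 1 rather than exactly 1.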

Punchey · February 12, 2001, 11:49am

Thanks for all the info, guys! So, Humus, how would you implement such an optimization? In asm?

Zeno · February 12, 2001, 12:01pm

Humus -

I’m sure a lot of people on here would LOVE to get a 2-cycle sqrt approx (I would). Do you know where we could find it?

Davepermen -

Have you done any benchmarking of your algorithm vs. regular sqrtf()? Could you post the results?

Thanks,
–Zeno

Humus · February 12, 2001, 2:26pm

Yeah, I’d do it in asm. It would be quite a short function. I haven’t done any 3DNow! stuff for some time, so I don’t have the instruction names in my head (it was something like PFI2D, I think … or something similarly weird). But the 3DNow! docs are freely available on AMD’s homepage, and the SSE equivalent is a free download from Intel too.

Zeno · February 12, 2001, 2:54pm

Cool. So now all I need to do is download tens of megabytes of Intel and AMD processor specs, study them, learn assembly, and maybe come up with a mathematical trick for approximating a square root (if such instructions aren’t built-in). That shouldn’t take long.

– Zeno

Humus · February 13, 2001, 1:41am

They are built-in. I could smash up a small piece of code later tonight when I get time …

lgrosshennig · February 13, 2001, 1:51am

Please have a look at:
http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/000211.html

(you may have to reconstruct the link)

Regards

LG

DFrey · February 13, 2001, 3:01am

Here is a 3DNow! vector normalization:

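#include <AMD3D/amd3dx.h> // 3DNow! opcode macros

void Normalize3f_3DNow(float *vec)
{
_asm
{
femms // enter MMX/3DNow! state; old FP register contents are discarded
mov eax, dword ptr [vec]
movq mm0, [eax] // mm0 = (x, y)
movq mm3, mm0 // keep a copy of (x, y) for the final scale
pfmul (m0,m0) // mm0 = (x*x, y*y)
movd mm1, [eax+8] // mm1 = (z, 0)
movq mm4, mm1 // keep a copy of z
pfmul (m1,m1) // mm1 = (z*z, 0)
pfacc (m0,m0) // mm0 = (x*x+y*y, x*x+y*y)
pfadd (m0,m1) // low half of mm0 = x*x+y*y+z*z
pfrsqrt (m1,m0) // mm1 = 15-bit estimate of 1/sqrt(low half of mm0)
movq mm2,mm1
pfmul (m2,m2) // mm2 = estimate squared
pfrsqit1 (m2,m0) // Newton-Raphson refinement, step 1
pfrcpit2 (m2,m1) // Newton-Raphson refinement, step 2: refined 1/length
punpckldq mm2,mm2 // broadcast the low half to both halves
pfmul (m3,m2) // scale x and y
movq [eax],mm3
pfmul (m4,m2) // scale z
movd [eax+8],mm4
femms // leave MMX/3DNow! state
}
}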

I believe I simply copied that routine from the 3DNow! SDK. To get maximum performance when working on a bunch of vectors, it is much better to inline most of the above code and use prefetch (or prefetchw) to fetch the next vector into cache while working on the current one. Oh, and you can drop the pfrsqit1, pfrcpit2, and one pfmul instruction if 15-bit precision is good enough.
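
The pfrsqit1/pfrcpit2 pair refines the pfrsqrt estimate with a Newton-Raphson step, which is why dropping it leaves only the raw estimate. A plain C sketch of the textbook iteration is below. The name refine_rsqrt is hypothetical, and the hardware's internal arithmetic may differ from this formula.

```c
/* One Newton-Raphson step for y ~ 1/sqrt(x):
       y' = y * (1.5 - 0.5 * x * y * y)
   Each step roughly doubles the number of correct bits in the estimate. */
static float refine_rsqrt(float x, float y)
{
    return y * (1.5f - 0.5f * x * y * y);
}
```

For example, starting from a rough guess of 0.49 for 1/sqrt(4), one step gives roughly 0.4997, much closer to the exact 0.5.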

Relic · February 13, 2001, 3:09am

There is a fastmath.cpp source on NVIDIA’s developer relations page containing an approximation method for sqrt(): http://www.nvidia.com/Marketing/developer/devrel.nsf/ProgrammingResourcesFrame?OpenPage

Humus · February 13, 2001, 4:26am

DFrey:
While your code probably works, I don’t understand why you start with FEMMS. It should only be at the end. You signal that you’re exiting the multimedia state before you enter it?

Relic · February 13, 2001, 5:37am

No, that’s OK.
Read the 3DNow! specs that come with the SDK:

“Like the EMMS instruction, the FEMMS instruction can be used to clear the MMX
state following the execution of a block of MMX instructions. Because the MMX
registers and tag words are shared with the floating-point unit, it is necessary to clear
the state before executing floating-point instructions. Unlike the EMMS instruction,
the contents of the MMX/floating-point registers are undefined after a FEMMS
instruction is executed. Therefore, the FEMMS instruction offers a faster context
switch at the end of an MMX routine where the values in the MMX registers are no
longer required. FEMMS can also be used prior to executing MMX instructions where
the preceding floating-point register values are no longer required, which facilitates
faster context switching.”

DFrey · February 13, 2001, 5:41am

That’s how it was in the 3DNow! SDK. From my understanding, they tacked it onto the beginning just to put the MMX registers into a known (undefined) state. I understand perfectly why it is on the end, and thought it odd at first when I saw it at the start too. But the white paper on it says the FEMMS instruction is to facilitate “Faster Enter/Exit of MMX or floating-point state”.

Humus · February 13, 2001, 9:41am

Hmm … that’s cool.

One learns something new each day.