# Fast normalization - slightly off topic

Is there a faster way to normalize a vector than just this?

len = sqrt(vect.x*vect.x + vect.y*vect.y + vect.z*vect.z);
vect.x /= len;
vect.y /= len;
vect.z /= len;

I remember a poster earlier mentioning some faster methods of finding the length of a vector without using sqrt.

This would be faster:

len = 1.0 / sqrt(vect.x*vect.x + vect.y*vect.y + vect.z*vect.z);

vect.x *= len;
vect.y *= len;
vect.z *= len;

Multiplies are significantly faster than divides. Of course, the slowest part of the above code is the square root. Nvidia has some fastmath routines on their site that include a faster square root. There are also other “tricks” like Nvidia’s floating around the web. They’re not as accurate or as universal as just calling sqrt(), so be aware of this.

– Zeno

void __forceinline __fastcall normalizeASM(float* v)
{
static float f=0;               // scratch: length, then 1/length
static const float one=1.0f;
__asm{
mov eax,dword ptr[v]
fld dword ptr[eax]
fmul dword ptr[eax]             // st0 = x*x
fld dword ptr[eax+4]
fmul dword ptr[eax+4]           // st0 = y*y
faddp st(1),st                  // st0 = x*x + y*y
fld dword ptr[eax+8]
fmul dword ptr[eax+8]           // st0 = z*z
faddp st(1),st                  // st0 = x*x + y*y + z*z
fsqrt
fstp f                          // f = length
fld one
fdiv f
fstp f                          // f = 1/length
fld dword ptr[eax]
fmul f
fstp dword ptr[eax]             // x *= 1/length
fld dword ptr[eax+4]
fmul f
fstp dword ptr[eax+4]           // y *= 1/length
fld dword ptr[eax+8]
fmul f
fstp dword ptr[eax+8]           // z *= 1/length
}
}

That’s the one I wrote… It could be faster without the f, but I don’t know how to store from a float register into a standard register…
(meaning fstp ebx, which isn’t allowed…) Anyone?

If you use 3DNow! or SSE you can do a fast and quite accurate 1/sqrt approximation in only two or so clock cycles.
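
For the SSE side, here is a hedged sketch using compiler intrinsics rather than hand-written asm. `_mm_rsqrt_ss` from `<xmmintrin.h>` returns a reciprocal square root estimate good to roughly 12 bits, and one Newton-Raphson step refines it. The names `rsqrt_sse` and `normalize3_sse` are made up here, and the code assumes an x86 target with SSE enabled and a non-zero input vector:

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approximate 1/sqrt(a) for a > 0: hardware estimate plus one
   Newton-Raphson step, y1 = y0 * (1.5 - 0.5 * a * y0 * y0). */
static float rsqrt_sse(float a)
{
    float y0 = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(a)));
    return y0 * (1.5f - 0.5f * a * y0 * y0);
}

/* Normalize a 3-float vector in place; assumes non-zero length. */
static void normalize3_sse(float *v)
{
    float inv_len = rsqrt_sse(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= inv_len;
    v[1] *= inv_len;
    v[2] *= inv_len;
}
```

Dropping the refinement line leaves only the raw estimate, which trades accuracy for a few fewer instructions.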

Thanks for all the info, guys! So, Humus, how would you implement such an optimization? In asm?

Humus -

I’m sure a lot of people on here would LOVE to get a 2 cycle sqrt approx (I would). Do you know where we could find it?

Davepermen -

Have you done any benchmarking of your algorithm vs. regular sqrtf()? Could you post the results?

Thanks,
–Zeno

Yeah, I’d do it in asm. It would be quite a short function. I haven’t done any 3DNow! stuff for some time, so I don’t have the instruction names in my head (it was something like PFI2D I think … or something else weird). But the 3DNow! docs are freely available on AMD’s homepage, and the SSE equivalent is free for download from Intel too.

[This message has been edited by Humus (edited 02-12-2001).]

Cool. So now all I need to do is download tens of megabytes of Intel and AMD processor specs, study them, learn assembly, and maybe come up with a mathematical trick for approximating a square root (if such instructions aren’t built-in). That shouldn’t take long.

– Zeno

They are built-in. I could smash up a small piece of code later tonight when I get time …

http://www.opengl.org/discussion_boards/ubb/Forum3/HTML/000211.html

(you may have to reconstruct the link)

Regards

LG

Here is a 3DNow! vector normalization:

#include <AMD3D/amd3dx.h> // 3DNow! opcode macros

void Normalize3f_3DNow(float *vec)
{
_asm
{
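femms                   // fast switch into MMX/3DNow! state
mov eax, dword ptr [vec]
movq mm0, [eax]         // mm0 = (x, y)
movq mm3, mm0           // keep a copy of (x, y)
pfmul (m0,m0)           // mm0 = (x*x, y*y)
movd mm1, [eax+8]       // mm1 = (z, 0)
movq mm4, mm1           // keep a copy of z
pfmul (m1,m1)           // mm1 = (z*z, 0)
pfacc (m0,m0)           // mm0 = (x*x+y*y, x*x+y*y)
pfadd (m0,m1)           // low half of mm0 = x*x+y*y+z*z
pfrsqrt (m1,m0)         // mm1 = 1/sqrt estimate (about 15 bits)
movq mm2,mm1
pfmul (m2,m2)           // mm2 = estimate squared
pfrsqit1 (m2,m0)        // Newton-Raphson refinement, step 1
pfrcpit2 (m2,m1)        // refinement, step 2: mm2 = refined 1/sqrt
punpckldq mm2,mm2       // copy the result into both halves
pfmul (m3,m2)           // scale x and y
movq [eax],mm3
pfmul (m4,m2)           // scale z
movd [eax+8],mm4
femms                   // fast switch back out of MMX state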
}
}

I believe I simply copied that routine from the 3DNow! SDK. To get maximum performance when working on a bunch of vectors it is much better to use most of the above code inlined and use prefetch (or prefetchw) to fetch the next vector into cache while working with the current vector. Oh and you can lose the pfrsqit1, pfrcpit2, and one pfmul instruction if 15 bit precision is good enough.

[This message has been edited by DFrey (edited 02-13-2001).]

There is a fastmath.cpp source on NVIDIA’s developer relations page containing an approximation method for sqrt(): http://www.nvidia.com/Marketing/developer/devrel.nsf/ProgrammingResourcesFrame?OpenPage
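
For reference, one of the tricks circulating on the web is the bit-level estimate often called the "fast inverse square root". This is not necessarily what fastmath.cpp does, and that file is not reproduced here. A sketch, assuming `float` is a 32-bit IEEE 754 value; `fast_rsqrt` is a name picked for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(a) for a > 0 using the well-known integer
   "magic constant" estimate plus one Newton-Raphson step.
   Assumes float is a 32-bit IEEE 754 value. */
static float fast_rsqrt(float a)
{
    uint32_t bits;
    float y;

    memcpy(&bits, &a, sizeof bits);     /* reinterpret float as integer */
    bits = 0x5f3759dfu - (bits >> 1);   /* crude initial estimate */
    memcpy(&y, &bits, sizeof y);

    return y * (1.5f - 0.5f * a * y * y);  /* one refinement step */
}
```

With a single refinement step the result is typically within a few tenths of a percent of the true value, which is the accuracy trade-off mentioned earlier in the thread. Whether it actually beats sqrtf() on a given CPU and compiler needs to be benchmarked.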

DFrey:
While your code probably works, I don’t understand why you start with FEMMS. It should only be at the end. Aren’t you signaling that you’re exiting the multimedia state before you enter it?

No, that’s ok.
Read the 3Dnow! specs coming with the SDK:

“Like the EMMS instruction, the FEMMS instruction can be used to clear the MMX
state following the execution of a block of MMX instructions. Because the MMX
registers and tag words are shared with the floating-point unit, it is necessary to clear
the state before executing floating-point instructions. Unlike the EMMS instruction,
the contents of the MMX/floating-point registers are undefined after a FEMMS
instruction is executed. Therefore, the FEMMS instruction offers a faster context
switch at the end of an MMX routine where the values in the MMX registers are no
longer required. FEMMS can also be used prior to executing MMX instructions where
the preceding floating-point register values are no longer required, which facilitates
faster context switching.”

That’s how it was in the 3DNow! SDK. From my understanding, they tacked it onto the beginning just to put the MMX registers into a known (undefined) state. I understand perfectly why it is at the end, and thought it odd at first when I saw it at the start too. But the white paper on it says the FEMMS instruction is to facilitate “Faster Enter/Exit of MMX or floating-point state”.

Hmm … that’s cool.

One learns something new each day.