Fast sqrt implementation

Hi,
I am trying to implement fast sqrt algorithm:

It’s C++ code:

#include <stdio.h>
#include <math.h>
 
float f(float n) {
    n = 1.0f / n;
    long i;
    float x, y;
 
    x = n * 0.5f;
    y = n;
    i = *(long *)&y;
    i = 0x5f3759df - (i >> 1);
    y = *(float *)&i;
    y = y * (1.5f - (x * y * y));
 
    return y;
}

How can I write this in glsl? Since, long is not a valid data type in glsl.

  1. That algorithm calculates the reciprocal (inverse) of the square root.
  2. It’s likely to be significantly slower than just calling the GLSL inversesqrt function. FWIW, it’s also likely to be slower than just using 1.0f/sqrtf(x) on any modern CPU.
  3. The appropriate type is int. The original code was written for DOS where int was 16 bits and long was 32 bits. In GLSL, int is 32 bits.
  4. GLSL doesn’t support pointers, so you can’t perform type-punning like that. However, it does have intBitsToFloat and floatBitsToInt which achieve the same result.
1 Like

Thank You very much for your reply. I was trying to use this instead of sqrt() function in GLSL. Is it possible to speed it up using any other algorithm?

I guess I framed my last question wrong.
So, if I use sqrt() function , is it faster than using this algorithm even if I use inversesqrt() function?

Use the functions. If you want to multiply by the square root, use sqrt. If you want to divide by the square root (which is probably the more common case), multiply by the result of inversesqrt (if it’s available; it was added in GLSL 1.3).

I implemented it this way:

float sqrt_new(float n) {
    float i;
    i = inversesqrt(1.0f/n); 
    return i;
}

But, this didn’t improve the speed greatly as compared to the sqrt() function.
Another piece of code I found online was using assembly language.

double inline __declspec (naked) __fastcall sqrt14(double n)
{
	_asm fld qword ptr [esp+4]
	_asm fsqrt
	_asm ret 8
} 

Is it possible to write assembly language in glsl?

It wouldn’t. If you want the square root, use sqrt; you won’t find an alternative that’s faster. inversesqrt exists if you want to divide by the square root; division is often slower than multiplication, so y*inversesqrt(x) may be faster than y/sqrt(x).

No. Although you can view a disassembly of a shader with e.g. AMD’s Shader Analyzer or Nvidia’s nSight.

1 Like

Beating a compiler’s optimizer for a CPU is already a task that most people fail at. For the majority of simple optimizations, compilers already know how to get it done.

Beating a GPU compiler’s optimizer is going to be so much harder. In the desktop space, x86 is dominant, and optimizing for AMD is mostly not much different from optimizing for Intel. With GPUs… that isn’t the case. Different GPUs can be incredibly different across companies. GPUs from different generations (or even within the same generation for cheaper/more expensive hardware) can be radically different. What was faster on one may be incredibly slow on another.

Furthermore, by the nature of how GPUs do their work, you have to take a holistic approach to optimizing if you ever want to get anything done. You can’t just optimize sqrt; you need to optimize when you do each of the individual computations. Some GPUs can hide scalar operations by bundling them with some existing vector operation (if the vector operation is 3 component or less). That would make the sqrt computations take no extra time at all, but it requires finding an appropriate vector operation to hide it within. And that depends on how the rest of the shader is written.

You might hide it in a dot product in one shader, a cross product in another shader, a matrix multiply in a third, etc.

If such optimizations are possible, your GPU compiler is likely very good at finding ways to rearrange the shader code to hide the computation. Unless you are an engineer at AMD/NVIDIA/Intel, you are not going to beat the compiler in the vast majority of cases, even if you had access to their assembly.

In short, you’re not going to beat the compiler at generating code for the various GLSL standard functions. So just use them and stop worrying about it.

1 Like

The first question I always ask is can I re-arrange the math to use x^2 instead of sqrt… The next option is an approximation using a limited iteration approach, sqrt will return the most precise answer required but you may not need that.

Search for Newton-Raphson Method:

See https://medium.com/@parthipannatkunam/square-root-approximation-techniques-fcfd46ed0e5c#:~:text=%20Square%20Root%20Approximation%20Techniques%20%201%20Exhaustive,using%20the%20bisection%20search%20technique%2C%20we...%20More%20

sqrt is generally going to be pretty fast in GLSL, probably faster than most approximation techniques you could code manually.

While I’m sure those algorithms are fine for CPUs (though the fact that you don’t show a performance test of this vs. std::sqrt is concerning), for GPUs they’re not really appropriate. The fact that they involve conditional branches at all (even a simple one) is troubling from a GPU perspective.

From my knowledge of GPU’s a SQRT function is just an unrolled loop to get results as close as needed to the specification for SQRT… if you can live with lower precision write your own unrolled loop, I have done this and it works just fine. Transcendental, SQRT, POW are all done the same with with unrolled iteration loops.

But I always try to eliminate SQRT, POW and any other time wasting functions from algorithms as the first step.

As an example there is no difference between comparing magnitudes if SQRT is left out…