Mukund
October 20, 2011, 3:02am
#1
Hello all,

This is my fragment shader:

```
varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;
void main(void)
{
    float dis = sqrt((fragPos.y - lightPos.y) *
                     (fragPos.y - lightPos.y)
                   + (fragPos.x - lightPos.x) *
                     (fragPos.x - lightPos.x));
    if (dis <= 5.0)
        gl_FragColor = color * vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = color * vec4(0.5, 0.5, 0.5, 1.0);
}
```

Can anyone please tell me how costly this is?
I just want to check for a circular region and assign colors. Please let me know if there is a better way to do the same.

Thanks a lot!

aqnuep
October 20, 2011, 3:55am
#2
It shouldn’t be that expensive. Reciprocal square root is a single instruction on modern GPUs, which means sqrt() should take no more than two; however, the second one is a division, which may be a bit expensive.

You may use inversesqrt() instead and change your comparison, but I don’t think you should worry about the square root calculation. The branching (if) is actually much more costly.

Mukund
October 20, 2011, 4:17am
#3
Thanks for the reply aqnuep.

>>Actually the branching (if) is much more costly.

Well, I need to do the check for every fragment. Any idea how I can make it better?

Thanks!

mbentrup
#4
aqnuep is absolutely correct, except sqrt() is usually implemented as inversesqrt() + multiplication, not inversesqrt() + division, so it should be quick.

One thing you should keep in mind though is that sqrt is not vectorized on most GPUs, so computing a sqrt(vec4) requires 4*2 instructions.

aqnuep
October 20, 2011, 4:35am
#5

mbentrup:

aqnuep is absolutely correct, except sqrt() is usually implemented as inversesqrt() + multiplication, not inversesqrt() + division, so it should be quick.

Yes, you are right. And, as usual, if the GLSL compiler is smart enough it may figure out that no division or multiplication is needed at all, or that a single multiplication is enough. It is also true that sqrt(vec4) will most probably require 4 instructions for the reciprocal square root, but it may not require 4 instructions for the multiplication.

system
October 20, 2011, 5:16am
#6
This is 4 subtractions, 2 multiplications, 1 addition, 1 inversesqrt, 1 inverse (because sqrt might be an inversesqrt followed by a 1/x).

TOTAL = 9 clock cycles

```
float dis = sqrt((fragPos.y - lightPos.y) *
(fragPos.y - lightPos.y)
+(fragPos.x - lightPos.x) *
(fragPos.x - lightPos.x)
);
```

This is 1 subtraction, 1 dot product, 1 inversesqrt.

TOTAL = 3 clock cycles

```
vec2 result = fragPos.xy - lightPos.xy;
float result2 = dot(result, result);
float dis = inversesqrt(result2);
```

and then you change your “if (dis <= 5.0)”

Mukund
October 20, 2011, 5:29am
#7
Thanks V-man, aqnuep, mbentrup.

@V-man

>> and then you change your “if (dis <= 5.0)”

I didn’t quite get you. Change that to what?
Thanks!
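
One plausible reading of V-man’s hint, offered as an assumption rather than as his confirmed intent: inversesqrt() returns 1/distance, so the test "dis <= 5.0" becomes "1/dis >= 1/5.0", with the inequality flipped. The names `invDis` and `INV_RADIUS` below are illustrative and do not come from the thread.

```glsl
varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

// Hypothetical constant: reciprocal of the 5.0 radius used in the thread.
const float INV_RADIUS = 1.0 / 5.0;

void main(void)
{
    vec2 d = fragPos.xy - lightPos.xy;
    // inversesqrt(x) is 1/sqrt(x), so invDis is the reciprocal of the distance.
    float invDis = inversesqrt(dot(d, d));
    // distance <= 5.0 is equivalent to 1/distance >= 1/5.0
    if (invDis >= INV_RADIUS)
        gl_FragColor = color * vec4(1.0, 1.0, 1.0, 1.0);
    else
        gl_FragColor = color * vec4(0.5, 0.5, 0.5, 1.0);
}
```

Note that the GLSL specification leaves inversesqrt(x) undefined for x <= 0, which happens here when a fragment sits exactly on the light position. Comparing the squared distance against 25.0 avoids that case.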

aqnuep
October 20, 2011, 6:05am
#8
Maybe you can play with some math like min/max/clamp/ceil/floor to get the values 0.5 or 1.0 based on whether dis is greater than 5.0 or not.

Most probably even multiple ALU instructions will be faster than a conditional.
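
A minimal sketch of what such min/max/clamp/ceil/floor math could look like, assuming the same varyings and uniform as the original shader. The names `outside` and `k` are made up for illustration, and whether this actually beats the if depends on the compiler and the hardware.

```glsl
varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    float dis = length(fragPos.xy - lightPos.xy);
    // ceil(dis - 5.0) is <= 0.0 inside the circle and >= 1.0 outside it;
    // clamp() squeezes that into exactly 0.0 or 1.0.
    float outside = clamp(ceil(dis - 5.0), 0.0, 1.0);
    // 1.0 inside the circle, 0.5 outside, with no if statement.
    float k = 1.0 - 0.5 * outside;
    // Only RGB is scaled; alpha stays color.a, as in the original shader.
    gl_FragColor = vec4(color.rgb * k, color.a);
}
```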

Mukund
October 20, 2011, 6:27am
#9
Hmm, I came up with this one:

```
float val = step(dis, 5.0);
gl_FragColor = mix( color * vec4(0.5, 0.5, 0.5, 1.0),
color * vec4(1.0, 1.0, 1.0, 1.0),
val);
```

Is this better? step would internally have to do a comparison, right? So is the loss of cycles inevitable, or can it be avoided somehow?
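
Both arguments of mix() in the snippet above are the same `color` scaled by a constant, so the blend can be folded into a single scalar factor. The following is a rough sketch under that observation, not a measured optimization; `k` is an illustrative name, and length() stands in for the hand-written sqrt.

```glsl
varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;

void main(void)
{
    float dis = length(fragPos.xy - lightPos.xy);
    // step(edge, x) returns 0.0 when x < edge and 1.0 otherwise,
    // so step(dis, 5.0) is 1.0 exactly when dis <= 5.0.
    float k = 0.5 + 0.5 * step(dis, 5.0);
    // RGB is scaled by 1.0 inside the circle and 0.5 outside; alpha is untouched.
    gl_FragColor = vec4(color.rgb * k, color.a);
}
```

This produces the same colors as the mix() version while multiplying `color` only once.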

aqnuep
October 20, 2011, 6:28am
#10
FYI, Groovounet just posted on twitter about an ALU technique that can be used for conditional elimination: http://developer.amd.com/documentation/articles/pages/New-Round-to-Even-Technique.aspx

Groovounet
#11
Ahah: many people seem to have missed that the built-in function mix also has a version that takes a bool type.

genType mix(genType, genType, genBType);

So this is enough:

```
// The bool selector must have as many components as the operands, hence bvec4.
gl_FragColor = color * mix(
    vec4(0.5, 0.5, 0.5, 1.0),
    vec4(1.0, 1.0, 1.0, 1.0),
    bvec4(dis <= 5.0));
```

Mukund
October 20, 2011, 7:23am
#12
@aqnuep : Thanks!

@Groovounet : Well, GLSL version 1.20 (the one I’m using) doesn’t seem to support it. But yeah, I didn’t know we could use mix that way from 4.0 onwards.
Thanks!

Ilian Dinev
#13
Ahhh, I didn’t realize that you were using GLSL 1.20.

But then again, GPUs generally have predicated execution.

So, the fastest version should be:

```
varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;
void main(void)
{
    vec2 tmp = fragPos.xy - lightPos.xy;
    float disSq = dot(tmp, tmp);
    float col = 0.5;
    if (disSq <= 5.0 * 5.0) col = 1.0;
    gl_FragColor = vec4(color.xyz * vec3(col), 1.0);
}
```

And if some GPUs can do a single-cycle compare against 0.0 together with a conditional move, then:

```
varying vec4 color;
varying vec3 fragPos;
uniform vec4 lightPos;
// for scalar-ISA gpus
void main(void)
{
    vec2 tmp = fragPos.xy - lightPos.xy; // 2 fsub = 2 cycles
    float col = 0.5; // mov, 1 or 0 cycles, see below
    float disSq = tmp.x*tmp.x + (tmp.y*tmp.y - 25.0); // fmad, fmad = 2 cycles
    if (disSq <= 0.0) col = 1.0; // 1 cycle; some GPUs might merge the "col = 0.5" mov above into this
    gl_FragColor = vec4(color.xyz * vec3(col), 1.0); // 3 fmul, 1 mov = 4 cycles
}
```

Things get funny when some GPUs can do an fmul and an fmad together in a single cycle, though.

system
October 21, 2011, 3:54am
#16
"gl_FragColor = vec4(color.xyz * vec3(col), 1.0); // 3 fmul, 1 mov = 4 cycles. "

That should be 1 MUL and 1 MOV:
gl_FragColor.xyz = color.xyz * col;
gl_FragColor.w = 1.0;

Modern hardware also supports multiple direct writes to gl_FragColor.

yuriks
October 28, 2011, 7:47am
#18
No one seems to have mentioned it, but the OP can simply square 5 and compare the squared distance with 25 instead, doing away with the sqrt entirely.

sqrt_1
October 28, 2011, 5:42pm
#19
I think Ilian Dinev’s code above does this.
