I have a performance-intensive fragment shader, and want to optimise the speed. I have various ways of performing some of the maths, and was wondering if the trade-offs were obvious (before I go and try it myself).
Does anybody have any information about the relative time cost of sqrt() vs acos()? I presume they are both implemented as table lookups, so shouldn’t cost much.
Similarly, does anybody have any information about the relative time cost of sin/cos/sqrt etc. vs adds, muls and the like? How many muls does one sqrt cost?
Lastly, does anybody have any information about the relative cost of texture lookups? Precomputing functions into textures for lookup is often done to increase performance, but how complex do the functions have to be before this is worth doing?
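For context, here’s a hypothetical fragment of the kind of maths I mean (the angle computation is just an illustration, not my actual shader):

```glsl
// Legacy GLSL fragment shader: angle between surface normal and light
// direction. normalize() costs an inverse sqrt; acos() is the
// transcendental I'm wondering about.
varying vec3 normal;
varying vec3 lightDir;

void main()
{
    float cosAngle = dot(normalize(normal), normalize(lightDir));
    float angle = acos(clamp(cosAngle, -1.0, 1.0)); // acos vs. sqrt cost?
    gl_FragColor = vec4(vec3(angle / 3.14159265), 1.0);
}
```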
Thanks in advance,
i’ve been playing with nvidia’s “fx composer” (very cool), which has a shader perf window with an asm dump, plus instruction and cycle counts. not for glsl directly, but nvidia uses the cg compiler for glsl so it might give you a good ballpark for isolated functions like this.
you can use nvidia performance tools without composer, but it’s kinda handy
as for glsl, there’s nothing in the spec that i can see that would make such an analysis possible. perhaps i missed it.
nvidia has traditionally favored the lut, while ati prefers the math. alas, it’s an implementation thing.
the standalone nVidia shader performance tool http://developer.nvidia.com/object/nvshaderperf_home.html supports GLSL.
The cost of the instructions can be evaluated only in the context of the entire shader, on specific hardware with a specific driver. For example, while some operation may take a long time, the compiler may be able to schedule instructions in such a way that the calculation is done in an otherwise unused unit simultaneously with another necessary calculation, or via some fast path in the hardware, so part of the time cost of that instruction may disappear, or the instruction may even be free.
The same thing goes for texture sampling. If your shader is already texturing-heavy, then even a complex calculation may be better than storing the function inside a texture, while if your shader uses only a small amount of texturing, the fetch may be better than the calculation. This also depends on the way your function parameters change across the rendered primitive. If they are very random, the sampling may thrash the texture cache and the calculation may be better, even though for smoothly varying parameters the sampling would win. And of course, different hardware may behave in exactly the opposite way in your shader.
As Humus pointed out, sampler lookups can be free a lot of the time, so it “can” be a good idea to store complex functions as sampler lookups. You’ll have to evaluate your bottlenecks (keeping target hardware in mind) before resorting to that.
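A minimal sketch of the lookup-table idea, assuming you bake the function (say, acos) into a 1D texture on the CPU side with linear filtering enabled (the name acosTable is made up for illustration):

```glsl
// Fragment shader: replace acos(x) with a 1D texture lookup.
// acosTable is assumed to hold acos() sampled over [-1, 1],
// uploaded once on the CPU with GL_LINEAR filtering so the
// hardware interpolates between table entries for free.
uniform sampler1D acosTable;

float acosLUT(float x)
{
    // remap x from [-1, 1] into [0, 1] texture coordinate space
    return texture1D(acosTable, x * 0.5 + 0.5).r;
}
```

Whether this beats the built-in acos() depends entirely on the hardware and on how texture-bound the rest of the shader already is, as discussed above.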
Other than that, I have noticed that a lot of the time you can get away with interpolated results in the fragment shader with little or no loss in accuracy. The GPU Programming Guide for nVidia’s hardware is a good reference for the general “rules of thumb” of shader optimization.
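A sketch of that interpolation trick, assuming a simple distance-attenuation term (names are illustrative): evaluate the expensive function per vertex and let the rasterizer interpolate it, accepting that a linearly interpolated varying is only an approximation whose quality depends on tessellation.

```glsl
// Vertex shader: compute the expensive term per vertex instead of
// per fragment; the fragment shader just reads the varying.
varying float attenuation;
uniform vec3 lightPos;

void main()
{
    vec3 eyePos = vec3(gl_ModelViewMatrix * gl_Vertex);
    float d = distance(eyePos, lightPos); // sqrt per vertex, not per fragment
    attenuation = 1.0 / (1.0 + 0.1 * d + 0.01 * d * d);
    gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}
```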
Thanks for all of that - very useful information and much appreciated advice.