GLSL function execution time estimation

At first glance the following question looks silly, but after trying to answer it I realized it is quite difficult (or even impossible).

How can we estimate execution time of some function inside a shader?

I have made a sample vertex shader like this:

#version 330
out float out_val;
void main(void)
{
    out_val = someFun(gl_VertexID * 1e-6);
}

allocated an 80 MB buffer for transform feedback, and wrapped glDrawArrays() between glQueryCounter() calls:

glQueryCounter(m_nStartTimeID, GL_TIMESTAMP);
glDrawArrays(GL_POINTS, first, count);
glQueryCounter(m_nEndTimeID, GL_TIMESTAMP);
// later, read back the GLuint64 timestamps (in ns) once the results are available
glGetQueryObjectui64v(m_nStartTimeID, GL_QUERY_RESULT, &nStartTime);
glGetQueryObjectui64v(m_nEndTimeID, GL_QUERY_RESULT, &nEndTime);

and called it with count = 1e7.

Can you guess what happens? The elapsed time does not depend on the complexity of the function at all. There is a fixed setup portion (about 14.7 us on my laptop) and a portion that depends directly on the number of vertices (about 22.5 ms for 1e7 vertices).

Does anybody have any suggestion on measuring GLSL function execution time?

In fact, I need to compare the efficiency of several implementations, so absolute values are not important. On the other hand, I don’t want to measure the execution time of the whole application with each implementation applied, since that is quite specific and subject to implementation-dependent optimizations.

Thank you in advance!

Measuring the performance of an independent function in a vacuum is pointless. GLSL is not C, where you could expect the performance of a particular function to be invariant with other changes. In shader compilation, functions will be inlined, instructions will be statically reordered to hide latencies for various operations, and so forth.

You can never assume that a function X which is faster than function Y in your vacuum test will always be faster in your application. Once you put it in your real shader(s), it may be faster or it may be slower.

For example, let’s say you have some function that does purely math stuff. So it has some particular performance X. And let’s say you have another function that does a texture fetch, then a small number of math computations. It has some particular performance Y.

It is entirely possible that, when you call one after the other in your real shader, the overall performance is not X+Y. It could even be as small as max(X, Y), since the math can execute while the texture fetch is in flight.

So the exercise you propose is simply not useful. If you want to optimize a shader, you’re going to have to do so in the actual context of the overall code you’re trying to make faster. The only thing you can test is how long it takes to execute it.

Also, if you’re measuring shader performance, why are you using transform feedback?

Can you use Nsight from NVIDIA? It gives you the CPU and GPU time for every OpenGL call.
It looks like this (but better formatted :biggrin-new:):

Event  Description                                                      CPU Duration (ns)  GPU Duration (ns)
1523   glDrawArrays(GL_TRIANGLES, GLint first = 0, GLsizei count = 54)  17685              4192
1524   glBindVertexArray(GLuint array = 0)                              1153               0

Also, Alfonse is right; I cannot believe how much overlapping GPU instructions can do. I grew up when CPUs executed only one instruction at a time, and that is not true any more.

Thank you, guys!

My question was a consequence of late-night desperate thinking.
In fact, the case is quite clear.

If there are no dependencies or pipeline stalls, the only parameter I could measure is the single-step interval.
Let’s assume we have M processing units and want to execute N function calls.
The whole processing time is equal to:

setup_time + (ceil(N/M) - 1) * single_step_time + full_pipeline_execution_time

setup_time - constant time that does not depend on the problem size
single_step_time - single clock interval
full_pipeline_execution_time - function execution time

Considering the above, I could calculate the function execution time, but it is a very short interval (far below 1 us) that cannot be measured precisely.

Just to be sure that all 1e7 executions are done correctly. :wink:

Yes, but the result will be the same. :slight_smile:

Thanks again!
Conclusion: the only thing we can measure is pipeline stalls and dependencies, not the execution time!

If the difference is too small to measure you could loop inside the shader (but I would make sure something changes each iteration, or the compiler might optimise the loop away).

If you don’t use Nsight, I’d suggest you use a fragment shader on a fullscreen triangle (to avoid high primitive-setup costs, the <= 4 primitives per cycle setup limit, and transform-feedback setup/memory-write costs), with manually-unrolled looping and care not to let the compiler optimize stuff out.
Things that can skew results are texture fetches (longest stall), access to limited ALU units (trigonometry), and register bank clashes (fmad r0, r4, r8, r12). I guess they will appear to have a +1 cycle execution time, in perfect circumstances. (if the other warps happen to not use those resources).
Still, the vast majority of instructions will be fmad-like, executing in a single cycle (effectively, even if they have a latency of 10-20 cycles). So, you can infer how many simple instructions a GLSL function consists of.

You can often accurately measure the minimum effective execution time of high-latency, limited-resource ops by padding them with simple ALU ops in a ratio, e.g. loop(10){ 1 fsin, 8 fmad } if the trigonometry units are 8x fewer than the ALUs. The same goes for texture fetches, except that you have to make the texture small enough to fit in cache, with nice access patterns.

Interesting idea, but I’m going on a trip so I will try it next week.

In the testing shader I don’t have any drawing; I even explicitly call glEnable(GL_RASTERIZER_DISCARD). The whole transformation is done in the vertex shader.

I’m not sure I understood the rest of your suggestion. First, I want to use GLSL, not assembly language, so I have no control over which instructions are executed. Furthermore, GLSL compilers are very aggressive in their optimizations.

What you call the ALU is actually the SFU (special function unit), used for transcendental functions. Addition, multiplication and logical operations are done on the SPU/DPU (some GPUs use the same logic for both single and double precision, like Fermi, while others have separate DP units, like Kepler). The SFU count is pretty high on modern GPUs, so I don’t think they can cause any trouble.
