Measuring GLSL shader program running time

As the subject says, I wish to measure the running time of a shader program when it executes.

I know of GetTickCount() … but the GPU the shader runs on is a set of processors, and the individual processors may have different completion timings. (I'm not sure about GLSL, but in CUDA this is the case, and synchronization is used there.) …

Q1. Will GetTickCount() give me the proper time?

Q2. Following is an extract of the shader setup and execution:


{
   ...

   // create shader objects
   ..

   // read source
   ..

   // compile shaders
   ..

   // create program object
   ..

   // attach compiled shader objects
   ..

   // do linking
   ..

   // use / execute program
   glUseProgram(p);

   // load variables -- uniform and user defined
   ...

   // declare draw buffers
   ..

   // quad draw
   drawQuad(w, h);

} 
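
In case it helps, here is a rough sketch of what the elided steps look like (the identifiers vsSource, fsSource, "threshold" and drawQuad() are placeholders, and the GL 2.0 entry points are assumed to be loaded, e.g. via GLEW -- this is illustrative rather than the exact code):

// create and compile the two shader objects (sources already read from file)
GLuint vs = glCreateShader(GL_VERTEX_SHADER);
GLuint fs = glCreateShader(GL_FRAGMENT_SHADER);
glShaderSource(vs, 1, &vsSource, NULL);
glShaderSource(fs, 1, &fsSource, NULL);
glCompileShader(vs);
glCompileShader(fs);

// create the program object, attach the compiled shaders and link
GLuint p = glCreateProgram();
glAttachShader(p, vs);
glAttachShader(p, fs);
glLinkProgram(p);

// use / execute the program and load the uniforms
glUseProgram(p);
glUniform1f(glGetUniformLocation(p, "threshold"), threshold);

// quad draw
drawQuad(w, h);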
 

Should I measure the time around the installation/execution of the shader program, or around the drawQuad() call, as shown in the extract??

Thanks.

GetTickCount will give you only ~16 ms precision. Its source code is basically:
DWORD GetTickCount(){ return g_LastThreadTimesliceTime; }

RDTSC is basically the most precise way, but thread affinity should be set to a specific core and the program must keep the CPU multiplier at its maximum. So it's useful only when doing research.
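
If you do go the RDTSC route, a minimal sketch (MSVC-specific; the __rdtsc() intrinsic and the work() callback here are just illustrative) looks like this:

#include <windows.h>
#include <intrin.h>   // __rdtsc()

unsigned __int64 measureCycles(void (*work)(void))
{
    // pin this thread to core #1 so both TSC reads come from the same core
    SetThreadAffinityMask(GetCurrentThread(), 1);

    unsigned __int64 start = __rdtsc();
    work();                              // the code being measured
    return __rdtsc() - start;            // elapsed CPU cycles
    // cycles / core-clock-in-Hz = seconds, valid only while the clock stays at max
}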

The best way to measure shader performance is probably to put it in a useful scene and do many iterations while keeping FPS <= 60 (a common mistake is to compare framerates like 2500 vs 3300). There are many per-frame and per-drawcall initializations, which skew results a bit, so you'll have to find a way to compensate. On top of that, stuff like scene complexity/overdraw/texture size can trash framebuffer/texture/ROP caches and skew results a lot.
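
As a rough sketch of that kind of measurement (renderFrame() is a placeholder for drawing the real scene, including the buffer swap, and timeGetTime() needs winmm.lib):

#include <windows.h>
#include <mmsystem.h>   // timeGetTime()
#include <GL/gl.h>      // glFinish()

// Average the cost of a frame over many frames, so per-frame overhead and
// timer granularity average out. A coarse timer is fine over a multi-second run.
double averageFrameMs(void (*renderFrame)(void), int numFrames)
{
    DWORD t0 = timeGetTime();
    for (int i = 0; i < numFrames; ++i)
        renderFrame();                   // draws the scene and swaps buffers
    glFinish();                          // make sure the GPU has really finished
    return (double)(timeGetTime() - t0) / numFrames;
}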

Thanks Illian for replying. Following your advice I came across a post at
http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=260896#Post260896

There they talk about timeGetTime() and setting the thread affinity to a CPU core. I do not have experience using thread affinity. Can you guide me to useful tutorials that have snippets?

Secondly, since a GPU is multi-threaded in approach, with multiple processors acting on the stream, will I still require thread affinity?

I am sorry I am naive about thread-affinity.

SetThreadAffinityMask(GetCurrentThread(),1); // makes the thread only use core#1

That’s it :slight_smile:

GPUs are not just simply multithreaded; there's a lot going on under the hood, but it's not controllable, so don't worry. Still, it'd be best to know the basics of your GPU, usually unveiled by sites like www.beyond3d.com, or in ATi's case via the open-source drivers and docs.

Agreed, and that is what I have been doing. The code extract posted earlier is called by a function in which a huge number of images are stacked in a 3D fashion. The shader then processes each image slice and returns.

Earlier I was calling the extract in a loop, iterating over the number of slices. That seemed too slow, so I came up with the idea of stacking them all and using an FBO for the calculations. That has improved the speed, but when I compare the timings on the GPU [using the shader] to the normal CPU, there is a difference of 10:1 the wrong way around, i.e. the CPU is 10 times faster.

My CPU is a dual-core AMD and my GPU is a GeForce 8800GS.

Then I came to know through CUDA that the function I am using for time measurement may not be accurate. But on second thought, if the programming follows the traditional graphics pipeline with a GLSL shader, then the threads are all independent. This prompted me to settle on GetTickCount().

Surprisingly, the CPU is faster than GPU !!!

The only way your CPU could be faster than that GPU is if your shader does a huge number of incoherent branches, or if the algorithm and its datasets are in direct conflict with the way a GPU works.
For all graphics/shading/gpu-friendly-computing stuff, I am certain my humble GF8600 overpowers my beasty C2D E8500 @3.8GHz + DDR3@1.6GHz timing 7-7-7-20 :slight_smile:

The shader code does not use extensive branching, and where it does, it uses GLSL built-in functions such as clamp(), which I believe are optimized (I posted a thread long ago to confirm this).

Basically, my code reads in a texture (RGBA, compatible with the 8800GS, with GL_FLOAT data type and GL_RGBA_FLOAT32_ATI as the internal format), compares the value at a texel to a uniform variable passed to the shader, and then changes it if the texel value is greater than the uniform's value.

I am also using clamp() to force a min or max value if the texel value falls out of range.
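
To make that concrete, the fragment shader is conceptually along these lines (a simplified sketch kept as a source string; the uniform names, the replacement value and the clamp range are illustrative, not the real code):

const char *fsSource =
    "uniform sampler2D slice;                               \n"
    "uniform float threshold;                               \n"
    "uniform float newValue;                                \n"
    "void main()                                            \n"
    "{                                                      \n"
    "    vec4 texel = texture2D(slice, gl_TexCoord[0].st);  \n"
    "    if (texel.r > threshold)                           \n"
    "        texel.r = newValue;                            \n"
    "    gl_FragColor = clamp(texel, 0.0, 1.0);             \n"
    "}                                                      \n";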

This is all reasonable IMO.

Regarding my Q2, where I asked for the exact place to measure time: should it be


t = GetTickCount();
   glUseProgram(p);
t = GetTickCount() - t;

Or just around the quad draw that provides the texel values, i.e.


t = GetTickCount();
   drawQuad();
t = GetTickCount() - t;

???

Thanks for your time.

A short question that will help me understand this:

I have an 8800GS, which has 96 stream processors, and I have a loop that iterates 2000 times.

If I use a shader, will this loop run in 2000/96 passes?

Please clarify.

Bottleneck detected. Just before invoking the shader, I am making a stack of slices and attaching them to the FBO in a loop. Inside the loop, after each attachment, I was checking for errors using the error function. Once I removed that check, the application became fast.
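
For reference, the attachment loop is now roughly like this (a sketch only: fbo, stackTex, numSlices and the one-attachment-per-slice layout stand in for the actual setup, and the EXT_framebuffer_object entry points are assumed to be loaded):

void attachSlices(GLuint fbo, GLuint stackTex, int numSlices)
{
    glBindFramebufferEXT(GL_FRAMEBUFFER_EXT, fbo);

    for (int i = 0; i < numSlices; ++i)
    {
        // attach slice i of the 3D texture stack
        glFramebufferTexture3DEXT(GL_FRAMEBUFFER_EXT,
                                  GL_COLOR_ATTACHMENT0_EXT + i,
                                  GL_TEXTURE_3D, stackTex, 0, i);
        // the per-iteration glGetError() that used to be here was the bottleneck
    }

    // a single completeness check after the whole loop is enough while developing
    if (glCheckFramebufferStatusEXT(GL_FRAMEBUFFER_EXT) != GL_FRAMEBUFFER_COMPLETE_EXT)
    {
        // report the error once
    }
}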

But still the problem with GetTickCount() remains.

I shall try to read more about QueryPerformanceCounter, and yes, as you suggested, the GeForce 8 programming guide also mentions a few things.

Shall trouble you again :slight_smile:
Thank you for your time.

This is not gonna work:

t = GetTickCount();
glUseProgram(p); // or drawquad()
t = GetTickCount() - t;

It has nothing to do with GetTickCount/QueryPerformanceCounter, but simply with the fact that when you call drawquad() etc., the driver doesn't execute it straight away; it sticks the command on a "do this later" list and then returns, so in effect you're measuring something that has nothing to do with the drawing.
To actually perform what's on that "do this later" list you have to call glFinish() or glFlush(); SwapBuffers does one of these, by the way.

Just a reminder, note what I warned about GetTickCount. Let’s inspect this code:


t1 = GetTickCount();

glUseProgram(p); 
drawquad();
glFinish(); // must have

t2 = GetTickCount() - t1;


t2 will almost always be 0!! Or 16, or 32, etc. The GetTickCount() result changes only when a CPU core switches between threads. I'd rather not explain in depth and nitpick; just try it for yourself. Then avoid GetTickCount :slight_smile:
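
A sketch of the same measurement using QueryPerformanceCounter instead (drawquad() and p are the draw helper and program object from the earlier posts; the glFinish() is still required so the GPU work is actually done before the second timestamp):

#include <windows.h>

double measureDrawMs(GLuint p)
{
    LARGE_INTEGER freq, t1, t2;
    QueryPerformanceFrequency(&freq);   // ticks per second

    QueryPerformanceCounter(&t1);

    glUseProgram(p);
    drawquad();
    glFinish();                         // wait for the GPU to finish

    QueryPerformanceCounter(&t2);

    return (t2.QuadPart - t1.QuadPart) * 1000.0 / (double)freq.QuadPart;
}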

Thank you, Illian, for the pointer. glFinish() forces the queued commands to complete, and I have used it in connection with glutSwapBuffers().

I relied on tutorial examples and missed glFinish().