Timing transform feedback

Hi all,
I want to know the amount of time taken by the transform feedback mechanism. Currently, I am doing it like this (in pseudocode):


start timer;
    glBindVertexArray(...);
    glBindBufferBase(...);
    glEnable(GL_RASTERIZER_DISCARD);    // disable rasterization
    glBeginTransformFeedback(GL_POINTS);
    glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN, query);
        glDrawArrays(GL_POINTS, 0, MAX_MASSES);
    glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
stop timer;

This, however, returns 0 seconds. Is this the correct way?

OpenGL operations are typically asynchronous. They will be executed sometime after you actually call those commands.

If you want to know how long something takes on the GPU, use ARB_timer_query, which was made core in GL 3.3.
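
For reference, a minimal GL_TIME_ELAPSED query looks roughly like this (just a sketch; it assumes a 3.3 context and the usual GL and stdio headers, and error checking is omitted):

GLuint query;
GLuint64 ns = 0;
glGenQueries(1, &query);

glBeginQuery(GL_TIME_ELAPSED, query);
// ... the GL commands you want to time ...
glEndQuery(GL_TIME_ELAPSED);

// Note: asking for GL_QUERY_RESULT blocks until the result is available.
glGetQueryObjectui64v(query, GL_QUERY_RESULT, &ns);
printf("GPU time: %.3f ms\n", ns / 1000000.0);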

Thanks for the prompt reply Alfonse.

OK, I have got my times. One question, though: I think that for a correct time calculation I should call glFinish to make sure every GL call is finished before the call to glEndQuery(GL_TIME_ELAPSED).
If I don’t call glFinish, the times are significantly different. Should my timing calculation call glFinish?

You need to call glFlush, to make sure that the time-elapsed token isn’t issued late, but you don’t need glFinish.

ok thanks for that.

No, you don’t need to call glFinish()/glFlush() at all!

The purpose of ARB_timer_query is to allow timing without disturbing asynchronous execution. The commands mentioned above (glFinish()/glFlush()) are very heavyweight synchronization fences.

glGetQueryObject*() is a blocking function, so it is not very useful to call it immediately after glEndQuery(GL_TIME_ELAPSED). Call glGetQueryObject*() as late as possible, or even better in the next frame.
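
For example, here is a rough sketch of the “read it in the next frame” approach using two query objects (not from the original post; it assumes the usual GL and stdio headers, with the query creation done once at init):

// at init:
GLuint queries[2];
glGenQueries(2, queries);
int frame = 0;

// every frame:
glBeginQuery(GL_TIME_ELAPSED, queries[frame & 1]);
// ... the pass you want to measure ...
glEndQuery(GL_TIME_ELAPSED);

if (frame > 0) {
    GLuint64 elapsed = 0;
    // fetch the result recorded in the previous frame; it is normally ready by now
    glGetQueryObjectui64v(queries[(frame + 1) & 1], GL_QUERY_RESULT, &elapsed);
    printf("GPU time, previous frame: %f ms\n", elapsed / 1000000.0);
}
++frame;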

Hi Aleksander,
Thanks for the info. Now this is confusing for me since removing the glFlush call gives a significantly different time.

That’s because he doesn’t understand the issue.

When you execute an OpenGL command, it may not yet be put into the GPU command stream. It may wait in an internal buffer somewhere until the GPU’s command stream is empty or nearly so.

If the command stream completely empties, then some part of the GPU is idle. If the timer query token was not placed into the command stream yet, then you will be measuring the time it takes for the token to be placed into the stream in addition to the time it takes for the commands to execute.

Hence the need to flush. And do note that the flush should happen after the glEndQuery call, not before it.

Maybe I didn’t understand the problem, but how can you be sure that you did? :slight_smile:

I assumed that there are other commands executed after glEndQuery(), and also that there is at least a SwapBuffers() call, which would internally call glFlush().

Having a glFlush() call before glEndQuery() wouldn’t give the correct time, since glFlush() would also be included in the measured range.

The previous two sentences are contradictory. Or I have misunderstood something again. Well, there is a command buffer, and it is flushed when it is full or when glFlush()/glFinish() is called. That is a consequence of the old and well-known client/server organization of OpenGL. Can you post a link to that “stream-based” solution that pulls commands from the client’s internal buffers? It is quite a new concept to me.

I can also say that there is no need to call glFlush. ARB_timer_query provides a transparent way to measure the server-side time for any particular set of commands, without affecting the rest of the code (at least, that is how I understand it, as the extension spec doesn’t say anything about requiring a glFlush to get accurate measurements).

Having a glFlush() call before glEndQuery() wouldn’t give the correct time, since glFlush() would also be included in the measured range.

I didn’t say anything about calling glFlush before glEndQuery. Indeed, I said the opposite: “And do note that the flush should happen after the glEndQuery call, not before it.”

The previous two sentences are contradictory. Or I have misunderstood something again.

This Wiki article explains how synchronization works.

Just because you call a function does not mean that the corresponding GPU command has been issued to the GPU. That’s the point I’m getting at. If the GPU’s command buffer empties while there are commands waiting to be processed, then that stall will be part of the timing.

Now granted, one might expect glEndQuery to perform a flush internally.

I didn’t say that you had said that. I just wanted to emphasize why someone shouldn’t do that.

This Wiki article is written by you. Are you working for some hardware vendor? If not, what is the source of those claims?

I agree with that. That’s quite clear.

That’s a little bit odd, because it is not quite clear why those pending commands are not pulled in if the command queue is empty. For me, “the story about client/server organization” is more “digestible”.

Now granted, one might expect glEndQuery to perform a flush internally.

glEndQuery() doesn’t perform a flush internally. The purpose of the timer query extension is to avoid any synchronization stalls. Unless one calls glGetQueryObject*() and the counter is not ready, there are no stalls at all.

This Wiki article is written by you. Are you working for some hardware vendor? If not, what is the source of those claims?

The OpenGL specification.

That’s a little bit odd, because it is not quite clear why those pending commands are not pulled in if the command queue is empty.

Because the CPU has to put them there. If the command queue is full when you call an OpenGL function, then the driver can’t put the corresponding command there. Therefore, the driver must wait until sometime later. Even if the driver is threaded, that doesn’t ensure that it will have a timeslice available when the queue starts to empty.

The purpose of the timer query extension is to avoid any synchronization stalls.

It most certainly isn’t. The point is to get accurate timings for OpenGL operations. GPU timings. A flush is a CPU stall, not a GPU stall.

Also, either glEndQuery performs a flush internally or you do one yourself; there’s no other way to get accurate GPU timings.

The glBeginQuery call will attempt to put some kind of token into the command stream that tells the GPU to start the clock. The glEndQuery call must therefore put a token into the command stream that causes it to stop the clock. The only way to get an accurate timing from one to the other is to ensure that there are no GPU stalls between the begin and the end (other than those caused by the regular processing of the GPU commands, of course).

And there’s only one tool for doing that: a flush. Halt execution of the user’s code while constantly polling the GPU command queue, putting tokens in as fast as possible until all have been issued to the queue.

Is it possible that the driver has some mechanism for doing so that doesn’t stall the CPU? Possibly. But something’s going to have to ensure that there are no GPU issuing delays.

This is ultimately the same reason you have to use glFlush when you create fence objects with ARB_sync: to ensure that the token is added to the command stream in reasonable time.
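
For comparison, the ARB_sync pattern being referred to looks roughly like this (a sketch only; the 16 ms timeout is an arbitrary value chosen for illustration):

GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
glFlush();   // make sure the fence token actually reaches the GPU command stream

// ... later ...
GLenum status = glClientWaitSync(fence, 0, 16000000);   // timeout in nanoseconds
if (status == GL_TIMEOUT_EXPIRED) {
    // the GPU has not reached the fence yet
}
glDeleteSync(fence);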

Thanks for a healthy discussion, Alfonse, aqnuep and Aleksander.
OK, so this is how I am doing it now. Is this correct?


glBeginQuery(GL_TIME_ELAPSED, t_query);
    glBindVertexArray(vaoUpdateID[writeID]);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, vboID_Pos[readID]);
    glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 1, vboID_Vel[readID]);
    glEnable(GL_RASTERIZER_DISCARD);
    glBeginTransformFeedback(GL_POINTS);
        glDrawArrays(GL_POINTS, 0, MAX_MASSES);
    glEndTransformFeedback();
    glDisable(GL_RASTERIZER_DISCARD);
glEndQuery(GL_TIME_ELAPSED);
glFlush();
// get the query result
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);
printf("Time Elapsed: %f ms\n", elapsed_time / 1000000.0);

It should work as you would expect. The only question is the necessity of the glFlush after glEndQuery; however, let’s not argue about that.
Maybe you can try it with and without the glFlush a few times and tell us the results. I know that it wouldn’t prove anything if the query results were the same, but it could prove Alfonse’s theory if there are big differences between the two (more precisely, if without glFlush the value is noticeably higher).

I beg you to point me to the file/chapter/page where that is defined in the spec. You have used a description of how drivers might implement command execution. That is not prescribed by the spec, so it shouldn’t be part of it. The word “driver” is rarely used throughout the spec, and it is easy to find all occurrences. The states you use in the Wiki articles are not part of the spec either, although your classification of states is pretty reasonable. I’m just curious whether that is really implemented somewhere.

It depends on what you want to measure. It is used for measuring GPU time, but, as you said, there is no guarantee that the same set of instructions will have the same execution time over multiple invocations. There are a lot of driver optimizations, as well as other circumstances, that can have an impact on execution time.

Why should this stall the CPU? Unlike glFinish(), which is a blocking function, glFlush() just flushes the command buffer. Of course, strictly speaking about execution time, there is the time the driver needs to execute the command and flush the command buffer, but the penalty on the CPU side is probably not great. It would have to be measured. :confused:

The reason for using glFlush() in synchronization across multiple contexts is the fact that we have to deal with totally independent GL servers. Each rendering context is a GL server “per se”, with its own command buffer. It is even possible to wait forever if you are blocked on another thread whose associated context has only a few commands issued (and waiting to be flushed).

It’s OK if you don’t care about overall performance. Try to measure the time with and without glFlush() and report the differences; I’m wondering whether there are any. You can improve performance if you remove glFlush() and defer glGetQueryObjectui64v() as much as possible. The best solution is to display the results from the previous frame.

Hi,
As far as the results are concerned, there isn’t much difference if I remove the glFlush call.

The results might differ if you are waiting for something in your code and don’t issue GL commands after the feedback code you want to measure. That’s what Alfonse wanted to say. But it is a rather pathological case. In the spirit of good graphics programming, you shouldn’t block drawing while waiting for some other calculation to finish (or, even worse, for user input). So our assumption about your code was right. I’m glad for that. :slight_smile:

It’s okay if you don’t care about overall performance (e.g. the code won’t be included in the finished app), but querying the result with

glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);

directly after ending the query will cause a stall until all the commands issued before ending the query are fully complete. Including your own glFlush() or glFinish() call should have no effect on performance, since querying the result with glGetQueryObjectui64v + GL_QUERY_RESULT is almost like issuing a glFinish() call: it can’t return until all the previous commands are finished.

You either want to query the result at a later stage:

glBeginQuery(GL_TIME_ELAPSED,t_query);
...
glEndQuery(GL_TIME_ELAPSED);
// ... some time much later, perhaps next frame ...
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);

Or you can check, without stalling, whether the result is available before actually asking for it, using:

glGetQueryObjectiv(t_query, GL_QUERY_RESULT_AVAILABLE, &available);

where you could do something on the CPU in a loop while waiting for the result to become available.
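
For example (a rough sketch of that polling pattern; do_other_cpu_work() is a hypothetical placeholder for whatever useful CPU work you have):

GLint available = 0;
while (!available) {
    glGetQueryObjectiv(t_query, GL_QUERY_RESULT_AVAILABLE, &available);
    if (!available)
        do_other_cpu_work();   // hypothetical placeholder for useful CPU-side work
}
// the result is now ready, so this will not stall
glGetQueryObjectui64v(t_query, GL_QUERY_RESULT, &elapsed_time);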