GPU/CPU parallelism

Hmm,

I’m not a very advanced OpenGL coder and my knowledge may be limited, but what’s the use of blocking the CPU waiting for the GPU to finish rendering, when a simple SwapBuffers call would just be queued in a FIFO buffer and let the CPU start calculating the next frame?

The question can be read as: isn’t calling glFinish the same as wasting CPU time waiting for the GPU?
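For concreteness, here is the kind of main loop I have in mind (a minimal sketch; hdc, update_simulation() and draw_scene() are hypothetical placeholders, and a current double-buffered WGL context is assumed):

  /* Non-blocking loop: no glFinish, so commands queue in the driver's
     FIFO and the CPU can start on the next frame while the GPU renders. */
  for (;;)
  {
      update_simulation();  /* CPU work for the next frame */
      draw_scene();         /* issue GL commands (queued, not yet executed) */
      SwapBuffers(hdc);     /* typically queued as well, not a hard wait */
  }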

regards,

Originally posted by Korval:
Well, a TNT doesn’t have a T&L unit. The only kind of parallelism you can hope to get is based on fillrate. So, if you’re drawing lots of small polys, you can’t get much parallelism.

In any case, use VBOs. That should provide you with as much parallelism as you can get.
Huh?
As you said, a TNT doesn’t have T&L. VBO offers an abstract way to store geometry in card memory, which is only useful for T&L cards. For non-T&L, if geometry data resides on the card, it would have to travel back to system memory for transformation.
So?

“The question can be read as: isn’t calling glFinish the same as wasting CPU time waiting for the GPU?”

I don’t think anyone said you should use glFinish. The talk is about using glFlush to initiate the DMA transfer and get the GPU rendering WHEN you want, while you continue using the CPU as the GPU does its job.
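Something like this, I mean (a minimal sketch; draw_scene() and do_cpu_work() are hypothetical placeholders):

  draw_scene();      /* queue the GL commands for this frame */
  glFlush();         /* kick off execution/DMA without blocking the CPU */
  do_cpu_work();     /* CPU runs in parallel while the GPU renders */
  SwapBuffers(hdc);  /* present once everything is queued */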

And I think it’s better to use SwapBuffers instead of wglSwapBuffers. I saw some slowness on old hardware with wglSwapBuffers. But I don’t want to debate which to use here.

Originally posted by nystep:
The question can be read as: isn’t calling glFinish the same as wasting CPU time waiting for the GPU?

OK, having already stuck my foot in my mouth as one of the original “glFlush blocks” people (see above; I did look it up in MSDN and realized my mistake afterwards, which I should have done before posting), I’m kinda nervous saying anything, but for what it’s worth:

Yep, glFinish would “waste” CPU time waiting on the GPU. You would only use it if:

  1. You had nothing better to do on the CPU anyway, and:
  2. You needed to synchronize the CPU and GPU for some simulation-correctness reason (and apparently NV_fence is much nicer, although I know nothing about it), or:
  3. You wanted to know how long it took the GPU to do something. If you just profiled the rendering code without putting a glFinish at the end, you would only be profiling how long it took the CPU to issue (or possibly just to buffer) the commands.
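A minimal sketch of case 3 (assuming a current GL context; now_ms() and draw_scene() are hypothetical helpers):

  double t0 = now_ms();
  draw_scene();   /* issue the GL commands you want to measure */
  glFinish();     /* block until the GPU has actually executed them */
  double t1 = now_ms();
  printf("GPU time: %.2f ms\n", t1 - t0);  /* without the glFinish you'd
                                              only measure submission time */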

All of which sounds like “normal” programs wouldn’t need glFinish.

Yep, glFinish is not everyone’s cup of tea, but I have advocated it in the past to cut latency and/or sync input for improved consistency. It led to a big discussion/disagreement with a driver guy here, but people who write drivers and go for best fps above all else don’t share my priorities :-).

Your main loop will definitely change depending on your priorities. This becomes even more critical if you do anything that blocks on graphics, because a nonblocking swap and a big FIFO hide a lot of these issues.
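For instance, a latency-oriented loop might look like this (just a sketch of my priorities, not a general recommendation; read_input(), update() and draw_scene() are hypothetical):

  for (;;)
  {
      read_input();      /* sample input as late as possible */
      update();
      draw_scene();
      SwapBuffers(hdc);
      glFinish();        /* drain the FIFO: caps the number of queued frames
                            and cuts input latency, at the cost of raw fps */
  }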

Originally posted by zeckensack:
As you said, a TNT doesn’t have T&L. VBO offers an abstract way to store geometry in card memory, which is only useful for T&L cards. For non-T&L, if geometry data resides on the card, it would have to travel back to system memory for transformation.

Which is why appropriate TNT drivers that implement VBOs will not store them in video or AGP memory, regardless of what usage hints you pass. Which is, of course, the whole point of having hints.

As I said, if you use VBOs, you will get as much parallelism as you can get on the hardware you’re using. That it happens to be very little on some hardware doesn’t change the fact that you’re getting what you can get.
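For what it’s worth, the usage is the same everywhere (a minimal sketch with the ARB_vertex_buffer_object entry points; verts and vert_count are hypothetical):

  /* Static geometry in a VBO: the driver chooses suitable storage for the
     hardware (system memory on a TNT, video/AGP memory on T&L cards). */
  GLuint vbo;
  glGenBuffersARB(1, &vbo);
  glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
  glBufferDataARB(GL_ARRAY_BUFFER_ARB, sizeof(verts), verts,
                  GL_STATIC_DRAW_ARB);        /* hint: write once, draw many */
  glEnableClientState(GL_VERTEX_ARRAY);
  glVertexPointer(3, GL_FLOAT, 0, (void *)0); /* offset into the bound VBO */
  glDrawArrays(GL_TRIANGLES, 0, vert_count);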

Originally posted by MacReiter:
All of which sounds like “normal” programs wouldn’t need glFinish.

Any program that needs to read something back from GPU buffers, based on previous work, has to use glFinish. Unless you benchmark so precisely that you know for sure the information will be available after n frames/cycles/whatever on all relevant hardware. I’ll let others decide how “normal” this behaviour is.

Finish() is not necessary before a ReadPixels(), because ReadPixels() implicitly finishes up to the point where you read.
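In other words (a minimal sketch, assuming a current GL context):

  /* No explicit glFinish needed: glReadPixels blocks until all prior
     rendering affecting the read region has completed. */
  GLubyte pixels[64 * 64 * 4];
  glReadPixels(0, 0, 64, 64, GL_RGBA, GL_UNSIGNED_BYTE, pixels);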

If you’re trying to read “the screen” rather than the GL framebuffer using some out-of-band API, you’re probably in for trouble, as different implementations may use different mechanisms to show you the framebuffer (including video overlay!)

Originally posted by jwatte:
Finish() is not necessary before a ReadPixels(), because ReadPixels() implicitly finishes up to the point where you read.

For normal (hmm, what is normal?) readback functionality I think this is true, but extensions such as NVIDIA’s PDR bend these rules somewhat. Or relax them, as the spec puts it. Maybe there are other extensions that act similarly.

Yep, OpenGL is consistent on readback; there’s no need to glFinish. Blocking is already implied by something like a readback call, since the readback must wait for rendering to complete before fetching pixels. It doesn’t just implicitly flush, it guarantees all relevant processing, including fragment processing, is complete. I doubt there are any implementations that do anything smarter than waiting on all pending fragment processing here. glFinish would be a bad thing to do immediately before readback, since it would introduce the additional delay of transporting the ReadPixels command to the graphics hardware, which would otherwise happen during rendering. (Not talking about extensions.)

Originally posted by zeckensack:
Huh?
As you said, a TNT doesn’t have T&L. VBO offers an abstract way to store geometry in card memory, which is only useful for T&L cards. For non-T&L, if geometry data resides on the card, it would have to travel back to system memory for transformation.
So?

VBOs can still provide all of the software-T&L optimizations that the ill-defined Compiled Vertex Arrays were used for in the past.

Originally posted by roffe:
For normal (hmm, what is normal?) readback functionality I think this is true, but extensions such as NVIDIA’s PDR bend these rules somewhat. Or relax them, as the spec puts it. Maybe there are other extensions that act similarly.

Yes, for PDR, you have to use fences to determine when a ReadPixels operation has completed.
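Roughly like this (a sketch only; it assumes pdr_buffer was allocated with wglAllocateMemoryNV and made current with glPixelDataRangeNV(GL_READ_PIXEL_DATA_RANGE_NV, ...), and that NV_fence is available):

  GLuint fence;
  glGenFencesNV(1, &fence);
  /* With PDR active, this ReadPixels returns immediately instead of blocking */
  glReadPixels(0, 0, w, h, GL_BGRA, GL_UNSIGNED_BYTE, pdr_buffer);
  glSetFenceNV(fence, GL_ALL_COMPLETED_NV);
  /* ... do other CPU work here ... */
  glFinishFenceNV(fence);    /* or poll glTestFenceNV(fence) instead */
  /* pdr_buffer now holds the pixels */
  glDeleteFencesNV(1, &fence);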

The PBO (Pixel Buffer Object) extension - whose specification has unfortunately not yet commenced - will allow safe, asynchronous pixel transfers in a high-performance, portable way.

Bother your local ARB representative if you’d like to see action on this extension.

Thanks -
Cass