GPU/CPU parallelism

How should the rendering be done so that the parallelism can be maximised?
Is this possible on old cards (TNT)?

I tried to find out the time taken by the swapBuffers call when I was drawing a lot of stuff. I got 0 or 1ms :s

drawstuff();
glFlush();
// should the code that's meant to run in parallel be inserted here?
swapBuffers();

Well, a TNT doesn’t have a T&L unit. The only kind of parallelism you can hope to get is based on fillrate. So, if you’re drawing lots of small polys, you can’t get much parallelism.

In any case, use VBOs. That should provide you with as much parallelism as you can get.
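For what it's worth, a rough sketch of what that looks like with ARB_vertex_buffer_object (the extension available at the time); verts and numVerts are just placeholders for your own vertex data:

GLuint vbo;
glGenBuffersARB(1, &vbo);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, numVerts * 3 * sizeof(GLfloat),
                verts, GL_STATIC_DRAW_ARB);         /* hand the vertices to the driver once */

/* per frame: draw straight out of the buffer object */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vbo);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid*)0);  /* offset into the bound VBO, not a pointer */
glDrawArrays(GL_TRIANGLES, 0, numVerts);

Because the data lives in driver-managed memory, the card can pull it by DMA while the CPU moves on to other work.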

The best way to get parallelism is to flush the pipeline just before you draw each frame. This means that all the drawing is being done by the card while you're doing update calculations (physics, AI, etc.). The order is very simple:

while( !quit )
{
    update( AI, physics, … )
    glFlush()
    swapbuffers
    draw( everything )
}

That is a really atrocious main loop structure.

Why would you flush AFTER you do your compute?

On top of this you call swapbuffers immediately after, which has an implicit flush anyway?

Bad.

Not my post, but thought I’d toss in my understanding/suggestion anyway:

You HAVE to do glFlush AFTER the calculations, or else you get no parallelism at all. The glFlush forces resynchronization between the GPU and CPU. So if you draw out your scene and then immediately glFlush before your calculations, you just stall the CPU until the GPU can catch up. Then the GPU sits there doing nothing while the CPU updates the AI. That’s why the AI calculations occurred BEFORE the glFlush.

Actually, while I know that the GPU can do a lot of work without needing further CPU intervention, there must be some limit to it. If so, the other glXxxxx commands will be forced to stall along the way. I wonder if it wouldn't be better to interleave your processing in between large portions of your rendering code to distribute the load. Of course, the better way to do that would be to have your rendering code and your parallel code in separate threads, so that if the rendering code had to stall waiting for the GPU, the parallel code could get work done. If you want frame synchronization, you can still use a mutex/semaphore/condition variable/event or whatever else is handy to keep the two threads synchronized on a per-frame basis.

An extremely vague version would look like this:

void RenderThread(void)
{
    while (!done)
    {
        RenderWorld();                                    // issue all GL drawing calls
        glFlush();
        glSwapBuffers();                                  // i.e. SwapBuffers on your window
        WaitForSingleObject(AiReadyEvent, INFINITE);      // wait for the AI thread to finish its frame
        SetEvent(RenderReadyEvent);                       // release the AI thread for the next frame
    }
}

void AIThread(void)
{
    while (!done)
    {
        UpdateAiSystems();                                // game/AI update for the next frame
        SetEvent(AiReadyEvent);                           // tell the render thread this frame is ready
        WaitForSingleObject(RenderReadyEvent, INFINITE);  // wait for the render thread to catch up
    }
}

OK, there’s probably all kinds of hideousness there. It would probably be better to have a third thread that received notifications as Render and AI became ready, and then released an event when both were ready. I was just trying to get the shortest possible version out there.

PLEASE NOTE: if you try to use the above code, note that the two threads perform their wait/signal in opposite orders. This is necessary to avoid deadlock. I do not know if the order I chose is “optimal”, or if it even matters. This approach does not scale up nicely to more than 2 threads. There really should be some kind of “manager” thread, like I said…

Also, thread programming is not for the faint of heart. Debugging gets entertaining. I wouldn’t recommend adding threading to an otherwise single-threaded application just for this parallelism. But if you’re multithreaded anyway, what the heck.

Lastly, it is possible that thread context switching will be so slow that you won’t be able to get anything useful done during the short stalls anyway. You’d just have to test and find out. Compare the render/calculate/flush single threaded performance to the multithreaded performance and see which is better.

As for SwapBuffers doing an implicit glFlush, I don’t know. I can imagine a way in which it wouldn’t need to, but using such a system would make time synchronization fairly difficult. I chose to ignore the issue in the preceding code.

Of course, I'm just a hobbyist OpenGL programmer, so any of you who do this “for real” can feel free to point out all of the things I've overlooked or misunderstood.

Mac

Korval, I’ll try to make a fillrate intensive app and see how much of parallelism I can get.

paulc, I don’t understand why a glFlush() has to be issued before swapBuffers. Shouldn’t it be issued before update(AI physics … ) so that the card will start executing?

Also, the way I understand it, glFlush returns immediately so it wouldn't stall the CPU; swapBuffers and glFinish will stall the CPU. I was assuming that flushing the pipeline would start the rendering, and that when the swapBuffers call is made there would be very little stalling if most of the stuff has already been rendered.

Is “parallel time” the time taken by the swapBuffers call issued immediately after the drawing code?

Thanks for your replies, but I’m more confused now.


Originally posted by tarantula:
paulc, I don’t understand why a glFlush() has to be issued before swapBuffers. Shouldn’t it be issued before update(AI physics … ) so that the card will start executing?

glFlush does not tell the GPU to start doing its job, but waits until it finishes it. So it's to be called after you do your game computations, never before.

BTW, there is no point in calling glFlush right before swapbuffers, as the latter will call glFlush itself! Only call swapbuffers and get rid of the glFlush in your loop:
while ( !finished )
{
    Render
    UpdateGame
    Swap
}

There is understandably some confusion on this as the GL spec is pretty vague in this area. Basically, all it guarantees is that glFlush will cause the commands you just queued to complete “sometime” and that glFinish won’t return until all commands are complete.

So… what does this really mean? In my experience, calling glFlush sends any queued commands to the hardware but doesn’t wait until the commands are finished before returning. glFinish also sends any queued commands to the hardware and DOES wait until the commands are finished before returning. So calling glFlush won’t stall the CPU but glFinish probably will.

As others have mentioned, SwapBuffers does an implicit glFlush, NOT an implicit glFinish, so in most cases it won’t stall the CPU. The few cases where it will stall the CPU are where the CPU has so many frames queued up that it makes sense to throttle back for interactivity reasons.
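One way to convince yourself of the difference is to time the two calls around a heavy frame. This is only a sketch: drawEverything() and now_ms() stand in for your own drawing code and whatever millisecond timer you have handy (timeGetTime, for instance):

drawEverything();                /* queue up a frame's worth of commands */

double t0 = now_ms();
glFlush();                       /* hands the commands to the hardware and returns right away */
double t1 = now_ms();
glFinish();                      /* blocks until the hardware has actually finished */
double t2 = now_ms();

/* t1 - t0 should be close to zero; t2 - t1 is roughly how long the GPU was
   still busy, i.e. time the CPU could have spent on AI/physics instead of waiting. */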

Hope this makes sense.

– Ben

Originally posted by tfpsly:
glFlush does not tell the gpu to start doing its job, but wait until it finishes it.

No, glFlush does not block.
Personally I like facts. From http://wwws.sun.com/software/graphics/OpenGL/manpages/glFlush.html

DESCRIPTION
Different GL implementations buffer commands in several different locations, including network buffers and the graphics accelerator itself. glFlush empties all of these buffers, causing all issued commands to be executed as quickly as they are accepted by the actual rendering engine. Though this execution may not be completed in any particular time period, it does complete in finite time.

Because any GL program might be executed over a network, or on an accelerator that buffers commands, all programs should call glFlush whenever they count on having all of their previously issued commands completed. For example, call glFlush before waiting for user input that depends on the generated image.

NOTES
glFlush can return at any time. It does not wait until the execution of all previously issued GL commands is complete.

cry

Why are there so many people believing that glFlush blocks? Seems to be in the top list of OpenGL myths…

Y.

Hmm… now swapBuffers is not a blocking call? Doesn’t swapBuffers return after the buffers are swapped?? Then, the drawing must be completed before swapping and hence swapBuffers must stall.

Btw, I do understand what glFlush does. But I am not sure where I can get the parallelism from. Somebody please tell me how the parallelism can be achieved. jwatte? dorbie?

Originally posted by roffe:
No, glFlush does not block…

interesting. Thanks!

Originally posted by tarantula:
Hmm… now swapBuffers is not a blocking call? Doesn’t swapBuffers return after the buffers are swapped?? Then, the drawing must be completed before swapping and hence swapBuffers must stall.

Of course it does!

Btw, I do understand what glFlush does. But I am not sure where I can get the parallelism from. Somebody please tell me how the parallelism can be achieved. jwatte? dorbie?

It comes from the fact that while the GPU is finishing its job, the CPU is free and can be used for whatever you want it to do. Then when you're finished, swap the buffers (stalling the CPU until the GPU is done) and both processors will be synchronized.

Originally posted by tarantula:
But I am not sure where I can get the parallelism from.

Achieving good parallelism is hard. You must benchmark your app extensively to find bottlenecks and move operations around, so you find a good equilibrium between CPU and GPU workload. If using only glFlush/glFinish for synchronization you are left with:
i) send work to gpu
ii) do variable amount of cpu work
iii) block, more cpu or more gpu work

By using extensions such as NV_fence you can poll the GPU for partial completion, which lets you do the following (roughly sketched after the list):
i) send work to gpu
ii) some cpu work
iii) poll gpu
iv) more cpu work
v) poll,block,whatever
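Something like this, assuming the NV_fence entry points are available on your driver; sendWorkToGpu(), doSomeCpuWork() and doMoreCpuWork() are placeholders for your own code:

GLuint fence;
glGenFencesNV(1, &fence);

sendWorkToGpu();                           /* issue this frame's drawing commands */
glSetFenceNV(fence, GL_ALL_COMPLETED_NV);  /* fence completes once everything before it is done */
glFlush();                                 /* make sure the commands actually get kicked off */

doSomeCpuWork();

while (!glTestFenceNV(fence))              /* non-blocking poll: has the GPU reached the fence? */
    doMoreCpuWork();                       /* keep the CPU busy until it has */

/* or, if there is nothing left to do: glFinishFenceNV(fence) blocks until the fence completes */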

NAME
glFinish - block until all GL execution is complete

That must be what we're looking for. Before reading the thread, I must say I believed glFlush blocked too.

Originally posted by tarantula:
Hmm… now swapBuffers is not a blocking call? Doesn’t swapBuffers return after the buffers are swapped?? Then, the drawing must be completed before swapping and hence swapBuffers must stall.

You can queue up the swap just like you can queue up any other rendering call…

– Ben

“No, glFlush does not block…”

It blocks until commands are sent to the server, which is pretty quick on a PC.

But I'm not sure this function is needed. I think that on PCs, as soon as a command is called it is executed, or executed as soon as possible.

Take this dumb example :

glBegin(GL_TRIANGLES);
glVertex3f(0.0, 0.0, 0.0);
glVertex3f(1.0, 0.0, 0.0);
glVertex3f(1.0, 1.0, 0.0);
glVertex3f(2.0, 2.0, 0.0);
glVertex3f(5.0, 5.0, 0.0);
glVertex3f(3.0, 3.0, 0.0);
glEnd();

When the third glVertex is called, the first triangle is rendered. When the sixth glVertex is called, the second triangle is rendered.

This is something I observed in software mode, but it may not be true for hw.

For the case of glDrawRangeElements, I imagine the whole arrays must be uploaded before execution begins.

Originally posted by V-man:
“No, glFlush does not block…”

It blocks until commands are sent to the server, which is pretty quick on a PC.

But I'm not sure this function is needed. I think that on PCs, as soon as a command is called it is executed, or executed as soon as possible.

Even on PCs that's not really the case.

Normally commands are put into a buffer, be it the DMA buffer itself (uncached memory, normally AGP write-combined) or a temporary buffer (cached memory) which will need to be copied to the real DMA buffer sometime.

In both cases you must have some granularity to initiate DMA transfers (initiating them on each command you put into the buffer is a no-no).

So you DO need glFlush to tell the driver that now is the right time to initiate a DMA transfer, otherwise the DMA transfer won’t begin until you’ve run out of space in the buffer or until you’ve reached some granularity hardcoded in the driver (or not so hard-coded if the driver has some load-balancing heuristics).

Regarding whether wglSwapBuffers blocks or not (i.e. whether it calls glFinish internally or not), that depends on the OS (yes, Win9x behaves differently from Win2k, and both differently from WinNT), on registry settings, and on some wacky things the driver can do to avoid glFinish being called from wglSwapBuffers' internal code (by tracing back the call stack).


I wouldn’t count on a lot of parallelism on a TNT2.

I WOULD count on a lot of parallelism on a GeForce2.

Note that the cards are likely to start rendering even before you call glFlush()/SwapBuffers(), unless the cards are tile renderers like the i845. If you issue some geometry, and the card is idle, the driver might as well go ahead and kick it off.

To get parallelism, you don’t really need to do anything special, as long as you issue vertex array geometry with “common case” geometry states. Doing any kind of read/get, or excessively uploading textures and stuff, will probably cause blocking/stalls/less parallelism.

If you were to try to get parallelism out of a TNT2, you’d have to do something like rendering all your small geometry first (where small triangle fill might overlap with the transform of the next thing) and then draw your large geometry (walls, sky box, whatever) and call glFlush() before starting the calculation for the next frame.
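In loop form that suggestion would look roughly like this (the draw/update functions are placeholders for your own code):

while (!quit) {
    drawSmallDetailGeometry();   /* lots of small polys: mostly transform work on a TNT2 */
    drawLargeFillGeometry();     /* walls, sky box, etc.: fill-bound, keeps the card busy longest */
    glFlush();                   /* kick the card off... */
    calculateNextFrame();        /* ...while the CPU does the physics/AI for the next frame */
    swapbuffers();
}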

Of course, if all you do is:

forever() {
    calculate
    draw
    swapbuffers
}

Then that’s pretty much what you’re doing, anyway.

It's already been said, but flush is non-blocking and swapbuffers performs an implicit flush. When you know this, a lot of it is self-evident, although big FIFOs etc. make it somewhat moot. It does still depend on where your bottlenecks are, and that in turn depends on your hardware.

I seem to value latency more than most, and would do some things in my loop that you may not, which make these issues more critical.

FYI, on Windows (and Linux) swapbuffers will block IF there's another swapbuffers in the queue. FIFOs are large and can in fact store several frames in some instances, hosing your latency, so the policy is to limit the FIFO to one frame. This varies with implementations. Issue so much as a glNormal after a swap on IRIX and you block. There's a good reason for this, but it's lost on people who don't even sync to vertical retrace.
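As an aside, on Windows syncing to vertical retrace can be requested from code through the WGL_EXT_swap_control extension, assuming the driver exposes it (just a sketch):

/* Assumes WGL_EXT_swap_control is advertised by the driver. */
typedef BOOL (APIENTRY *PFNWGLSWAPINTERVALEXTPROC)(int interval);
PFNWGLSWAPINTERVALEXTPROC wglSwapIntervalEXT =
    (PFNWGLSWAPINTERVALEXTPROC)wglGetProcAddress("wglSwapIntervalEXT");

if (wglSwapIntervalEXT)
    wglSwapIntervalEXT(1);   /* 1 = wait for vertical retrace on each SwapBuffers, 0 = don't */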

Parallelism comes from things like FIFOs on the card that can store data and commands, DMAs performed by the card, and display-list memory on the card and in your mapped AGP memory. One of the advantages of a GPU is that your CPU is only concerned with dispatch, not T&L, and even then it isn't busy with it because you're smart about dispatch. Even without a GPU you could benefit from graphics parallelism while the card is busy with 'setup' and fragment processing. Even with the best card you have to be careful not to do anything that would block the CPU, of course (even the smallest glReadPixels, for example), but other things might do it too, and implementations and extensions can add their own quirks.