Streaming framebuffer readback

My program streams vertices into the GPU and streams the rendered (and antialiased) framebuffer back.

For vertex streaming I use the technique recommended by Rob: http://www.opengl.org/wiki/Buffer_Object_Streaming
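For reference, the per-frame streaming pattern that page recommends looks roughly like this (a sketch only; the buffer name and sizes are mine):

```c
/* Orphan the old storage, then fill fresh memory; the driver can keep
   using the old allocation for in-flight draws. */
glBindBuffer(GL_ARRAY_BUFFER, streamVbo);
glBufferData(GL_ARRAY_BUFFER, BUF_SIZE, NULL, GL_STREAM_DRAW); /* orphan */
glBufferSubData(GL_ARRAY_BUFFER, 0, vertexBytes, vertices);
```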

Now I want to maximise the performance of the framebuffer readback.

This is what I currently do:
Init:
create first renderbuffer with multisampling
create a framebuffer and attach AA renderbuffer to it
create second renderbuffer without multisampling
create another framebuffer and attach non-AA renderbuffer to it
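In code, that init looks roughly like this (a minimal sketch; the names, sizes, and sample count are mine, and error/completeness checks are omitted):

```c
GLuint aaRbo, aaFbo, resolveRbo, resolveFbo;

/* Multisampled renderbuffer + framebuffer to render into. */
glGenRenderbuffers(1, &aaRbo);
glBindRenderbuffer(GL_RENDERBUFFER, aaRbo);
glRenderbufferStorageMultisample(GL_RENDERBUFFER, SAMPLES, GL_RGBA8, WIDTH, HEIGHT);
glGenFramebuffers(1, &aaFbo);
glBindFramebuffer(GL_FRAMEBUFFER, aaFbo);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, aaRbo);

/* Single-sample renderbuffer + framebuffer to resolve into. */
glGenRenderbuffers(1, &resolveRbo);
glBindRenderbuffer(GL_RENDERBUFFER, resolveRbo);
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, WIDTH, HEIGHT);
glGenFramebuffers(1, &resolveFbo);
glBindFramebuffer(GL_FRAMEBUFFER, resolveFbo);
glFramebufferRenderbuffer(GL_FRAMEBUFFER, GL_COLOR_ATTACHMENT0, GL_RENDERBUFFER, resolveRbo);
```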

Frame start:
Bind AA framebuffer to GL_FRAMEBUFFER
Clear to white

~Draw stuff~

Frame end:
Bind non-AA framebuffer to GL_DRAW_FRAMEBUFFER
Bind AA framebuffer to GL_READ_FRAMEBUFFER
Resolve AA framebuffer into non-AA using glBlitFramebuffer
Bind non-AA framebuffer to GL_FRAMEBUFFER
Bind output pixel buffer to GL_PIXEL_PACK_BUFFER
Copy non-AA framebuffer to PBO using glReadPixels
Map output PBO
memcpy output image from PBO to heap-allocated buffer
Unmap output PBO
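The frame-end sequence in code, roughly (a sketch; GL_BGRA as the readback format is an assumption about the driver's fast path, and `output` is the heap buffer):

```c
/* Resolve the multisampled image into the single-sample framebuffer. */
glBindFramebuffer(GL_READ_FRAMEBUFFER, aaFbo);
glBindFramebuffer(GL_DRAW_FRAMEBUFFER, resolveFbo);
glBlitFramebuffer(0, 0, WIDTH, HEIGHT, 0, 0, WIDTH, HEIGHT,
                  GL_COLOR_BUFFER_BIT, GL_NEAREST);

/* Asynchronously read the resolved image into the PBO... */
glBindFramebuffer(GL_FRAMEBUFFER, resolveFbo);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, 0);

/* ...then map it; this is where the CPU can stall. */
void *src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
memcpy(output, src, WIDTH * HEIGHT * 4);
glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);
```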

~Do something with output~

Repeat

Currently my ‘do something with output’ is just saving the result to a PNG file, but later it will change to storing the result (as a PNG) in a memory or file cache.

So what algorithms, techniques, tricks are out there for speeding this up?

The only one I have found is to have two PBOs and ping-pong them. Does using more than two provide more of a boost, or does the driver internally keep several of them when you ‘orphan’ the PBO?

What about multiple sets of framebuffer/renderbuffers?

The fastest I managed to get so far was by using several PBOs (I think it was 5).

Another interesting thing: when I tried calling glBufferData with NULL before glReadPixels, it was a lot slower than not doing it.

Bind output pixel buffer to GL_PIXEL_PACK_BUFFER
Copy non-AA framebuffer to PBO using glReadPixels
Map output PBO
memcpy output image from PBO to heap-allocated buffer
Unmap output PBO

The moment you map the output PBO for reading, the CPU thread will stall until the entire scene is rendered and the blitting has occurred.

A much better way to do this is to wait until the absolute last possible minute to map the buffer object. That is, there comes a moment when you need the data in your hands on the CPU. Do not map the buffer until that moment arrives. And push that moment off in your code for as long as possible.
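One way to push that moment off without guessing (my own addition, assuming GL 3.2 or ARB_sync is available) is to drop a fence right after the readback and only map once it has signaled:

```c
/* Right after glReadPixels into the PBO: */
GLsync fence = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);

/* ...do as much CPU work as possible here... */

/* Only when the pixels are actually needed: */
glClientWaitSync(fence, GL_SYNC_FLUSH_COMMANDS_BIT, 1000000000ull); /* timeout in ns */
glDeleteSync(fence);
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
void *src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY); /* should no longer stall */
```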

The only one I have found is to have two PBOs and ping-pong them. Does using more than two provide more of a boost, or does the driver internally keep several of them when you ‘orphan’ the PBO?

Orphaning doesn’t really work for PBOs, at least not the way you intend to use them. Orphaning allows OpenGL to read from the memory after you’ve stopped caring about accessing it. Since you do care about accessing it (indeed, you’re about to map it), you can’t just reallocate the buffer after the read.

So you need multiple PBOs. How many depends on how many frames behind you are.

Also, don’t map the buffer like this. If the only thing you’re going to do when you map a buffer is perform a single, direct memcpy (either to write into it or read from it) from/to a piece of memory, just use glBufferSubData or glGetBufferSubData. Let OpenGL do the memcpy for you; it’ll be no slower and may be rather faster.
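Putting those two points together, a sketch of a ring of PBOs where the oldest pending readback is fetched with glGetBufferSubData (NUM_PBOS and the names are mine; each PBO is created once up front with glBufferData and GL_STREAM_READ):

```c
#define NUM_PBOS 2
GLuint pbos[NUM_PBOS];
long frame = 0;

/* Each frame, after the resolve blit: start a readback into the newest PBO. */
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[frame % NUM_PBOS]);
glReadPixels(0, 0, WIDTH, HEIGHT, GL_BGRA, GL_UNSIGNED_BYTE, 0);

/* Fetch the oldest PBO, written NUM_PBOS - 1 frames ago; its transfer
   should have completed by now, so this copy shouldn't stall. */
if (frame >= NUM_PBOS - 1) {
    glBindBuffer(GL_PIXEL_PACK_BUFFER, pbos[(frame + 1) % NUM_PBOS]);
    glGetBufferSubData(GL_PIXEL_PACK_BUFFER, 0, WIDTH * HEIGHT * 4, output);
}
frame++;
```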

What about multiple sets of framebuffer/renderbuffers?

That’s probably not going to help. The GPU can only do one thing at a time, so there won’t be a synchronization event where it sits waiting for something.

The synchronization comes from when you decide to access the data you just read into the buffer object.

The fastest I managed to get so far was by using several PBOs (I think it was 5).

If you’ve got a delay of 5 buffer objects, you must not be doing very much on the CPU.

The reason I thought I might need multiple framebuffers is that frame 2’s draw will wait for frame 1’s glReadPixels to finish using the framebuffer it needs to draw into.

I have tried testing it on the target machine rather than my desktop to see how I’m doing.

On my test machine with a Tesla M2050, Intel Xeon X5570 2.93GHz:
(For a 256x256 tile with 1 triangle - just trying to test transfer speeds)
1 PBO: 2598.592921 Tiles per sec
2 PBO: 3462.203011 Tiles per sec
3 PBO: 3445.377595 Tiles per sec
4 PBO: 3477.882499 Tiles per sec
5 PBO: 3473.232226 Tiles per sec

Seems like 2 PBOs is the fastest now.

I need to get this to at least 5000-6000 Tiles per sec, as my desktop can render far more complicated tiles on the CPU using the Anti-Grain Geometry library at ~4000 Tiles per sec. And my desktop is only a 1.86GHz Core 2 with a GeForce 9400 GT!

(For a 256x256 tile with 1 triangle - just trying to test transfer speeds)

You really shouldn’t be rendering and DMA-ing such small things. A 256x256x32-bit texture, with ~3400 per second, means you’re getting about 850MB per second of data transfer (256 × 256 pixels × 4 bytes ≈ 256KB per tile, × ~3400 tiles/sec).

PCIe x8 version 1.0 gives you a maximum transfer rate of 4GB/sec total (that is, upload + download combined). You’re hitting about a quarter of that. Considering that you’re not going in the preferred direction, that’s good.

However, you’re likely not measuring anything more than overhead. You may be rendering 3400 tiles, but that’s no excuse for clearing the framebuffer 3400 times. Or doing an antialiasing resolve 3400 times. Or paying whatever per-transfer overhead there is 3400 times. And so forth.

What happens if you render 16 tiles to a 1024x1024 buffer? That’s an order of magnitude fewer DMA requests, antialiasing resolves, state changes, and other overhead. Or maybe 64 tiles in a 2048x2048 buffer. You might even be able to push up to 4096x4096 (though the size of your multisample buffers is starting to become a problem at that point).
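As a sketch of what that might look like (the grid layout and draw_tile are hypothetical; aaFbo is assumed to now be backed by a 1024x1024 multisampled renderbuffer):

```c
/* Render 16 tiles into one large framebuffer by moving the viewport,
   then do a single resolve blit and a single glReadPixels for all of them. */
glBindFramebuffer(GL_FRAMEBUFFER, aaFbo);
glClear(GL_COLOR_BUFFER_BIT); /* one clear covers all 16 tiles */
for (int ty = 0; ty < 4; ++ty) {
    for (int tx = 0; tx < 4; ++tx) {
        glViewport(tx * 256, ty * 256, 256, 256);
        draw_tile(ty * 4 + tx); /* hypothetical per-tile draw */
    }
}
```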

I need to get this to at least 5000-6000 Tiles per sec, as my desktop can render far more complicated tiles on the CPU using the Anti-Grain Geometry library at ~4000 Tiles per sec.

You can “need to” all you want; that doesn’t mean it’s going to happen.

Don’t forget that GPUs are optimized for data going in one direction: from the CPU to the GPU. Going backwards isn’t exactly the fast path. It’s not unacceptably slow, but you can’t expect it to be everything you want it to be. What you get is what you get.

You really shouldn’t be rendering and DMA-ing such small things. A 256x256x32-bit texture, with ~3400 per second, means you’re getting about 850MB per second of data transfer.

Funny, I had the same idea on the way home. I will try this out tomorrow.

What I am not sure about, though, is whether to render, say, 4x4 tiles or 1x16.

4x4 provides smaller drawing batches (tiles are adjacent), but makes it a lot harder to extract individual tiles from the PBO.

1x16 has more draw batches, but the tile data is contiguous in the PBO.
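For what it’s worth, pulling one tile out of the 4x4 layout is only a strided copy; a sketch, assuming a 1024-pixel-wide, 4-bytes-per-pixel image mapped from the PBO:

```c
#include <string.h>

enum { TILE = 256, GRID_W = 1024, BPP = 4 };

/* Copy tile (tx, ty) from the big mapped image into a compact buffer:
   one memcpy per row, with rows GRID_W * BPP bytes apart in the source. */
static void extract_tile(const unsigned char *img, int tx, int ty,
                         unsigned char *out)
{
    const unsigned char *src =
        img + ((size_t)ty * TILE * GRID_W + (size_t)tx * TILE) * BPP;
    for (int row = 0; row < TILE; ++row)
        memcpy(out + (size_t)row * TILE * BPP,
               src + (size_t)row * GRID_W * BPP,
               TILE * BPP);
}
```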

Going backwards isn’t exactly the fast path. It’s not unacceptably slow, but you can’t expect it to be everything you want it to be. What you get is what you get.

Why is backwards slow? Doesn’t CUDA rely on fast speeds both to and from the card? As this is a Tesla card, I would expect both directions to be optimised, unless you need a Quadro card to unlock the full backwards copy speed.

I also found an NVIDIA document describing how to use the dual copy engines on the Fermi Quadro cards.

In CUDA, Tesla cards support dual copy engines, but I am not sure whether they are also enabled for OpenGL…

Quadro cards are not as suitable as Teslas for servers, and they are slower at CUDA than Teslas, but NVIDIA do advertise one card, the C2070Q, as having the Quadro engine… whatever that means (as you obviously don’t have a display with which to use 90% of Quadro features, like stereo or 30-bit output…).

The document there also seems to hint that draw jobs from multiple contexts/threads are serialized. Why would that be? The Fermi cards have multiple job schedulers that let you run several CUDA kernels at once, so it is strange that you can’t also draw to different framebuffers simultaneously.

4x4 provides smaller drawing batches (tiles are adjacent)

What does adjacency have to do with batches? Even if you’re drawing triangle strips, you can just primitive-restart to the next tile. If you can batch between tiles at all, then you can batch equally well in either arrangement.
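For reference, the primitive-restart approach looks roughly like this (a sketch, assuming GL 3.1+; the index layout is mine):

```c
/* One index buffer holds every tile's strip, with a restart index between
   tiles, so all the tiles go out in a single draw call. */
glEnable(GL_PRIMITIVE_RESTART);
glPrimitiveRestartIndex(0xFFFF);
/* indices: [tile 0 strip] 0xFFFF [tile 1 strip] 0xFFFF ... */
glDrawElements(GL_TRIANGLE_STRIP, indexCount, GL_UNSIGNED_SHORT, (void *)0);
```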

Doesn’t CUDA rely on fast speeds to and from the card?

No, it doesn’t. CUDA, and all other GPGPU applications, are predicated on a fundamental presupposition. For a given task:

GPU parallel computation time + upload/download time to the GPU < CPU (possibly parallel) computation time

The longer and more parallel-friendly the computations are, the more likely this inequality is to hold. The shorter and simpler the computations (say, scan-converting a single triangle at a tiny resolution), the less likely it is.

The Fermi cards have multiple job schedulers that let you run several CUDA kernels at once, so it is strange that you can’t also draw to different framebuffers simultaneously.

And what good would that do? Unless there are rendering resources going to waste (ROPs, shader processors, etc.), there would be no advantage to effectively breaking a GPU into two pieces that don’t talk to one another. GPUs have a great deal of logic, in both hardware and software, dedicated to keeping as many of their computational resources active as possible. That is far more effective than threading rendering to different buffers (since very few rendering applications would use it).