Nvidia Dual Copy Engines

Very interesting read:

Good to finally get high performance asynchronous data streaming to and from device memory.

But, the thing that annoys me is the note on page 14: “Having two separate threads running on a Quadro graphics card with the consumer NVIDIA® Fermi architecture or running on older generations of graphics cards the data transfers will be serialized resulting in a drop in performance.”

Why in hell is this not enabled for consumer products (if, and I may be wrong here, the hardware feature is present on all high-end Fermi chips)? Texture streaming is extremely important there too. I am working in a scientific visualization context, and while we do have access to Quadro boards, we cannot afford a lot of these cards for every workstation where we develop and demonstrate large volume and image rendering software. The Fermi Quadro boards are currently extremely expensive, so access to them is almost impossible for us.

The data transport to the GPU is almost always the main bottleneck for us, so the decision to cut this feature (next to quad-buffer stereo) is very sad. And I can imagine that D3D, at least for some games, will make use of the extra copy engines… So D3D gets stereo rendering (OK, I know, no QBS, but still) and the other cool GPU features.

Sorry, but I get mad at such decisions.

Thanks for sharing the link. I had been waiting for this paper for some time.

Actually, I am quite disappointed by their approach. I’d like to be able to use one OpenGL context/thread and still use both copy engines. I would rather see a solution that lets us issue memory transfers and graphics commands from the same context and still have them run in parallel. That would help everyone without requiring any changes in the code.

Creating an OpenGL context just for memory transfers is strange. It looks like a workaround to me, especially since PBOs are designed to be asynchronous. I understand the issue with one thread: it would break in-order pipeline execution. Maybe something like Direct3D command buffers could make this better.
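For comparison, the whitepaper’s two-context scheme boils down to roughly the following (a sketch only, not NVIDIA’s actual code; `makeCurrent`, `transferContext`, and the `Job` queue are hypothetical stand-ins for the platform-specific context plumbing):

```cpp
// Sketch of the dual-context streaming scheme: a second GL context that
// shares objects with the render context, driven from its own thread,
// issuing nothing but transfers.
void transferThreadMain(JobQueue& jobs)
{
    makeCurrent(transferContext);   // hypothetical: wglMakeCurrent/glXMakeCurrent
    for (;;) {
        Job job = jobs.pop();       // blocking pop, fed by the render thread

        // Upload through a PBO; on Quadro this runs on a copy engine in
        // parallel with the render context's draw calls.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, job.pbo);
        glBufferSubData(GL_PIXEL_UNPACK_BUFFER, 0, job.bytes, job.src);
        glBindTexture(GL_TEXTURE_2D, job.tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, job.width, job.height,
                        GL_BGRA, GL_UNSIGNED_BYTE, nullptr);

        // Fence so the render thread knows when the texture is usable.
        job.done = glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0);
        glFlush();                  // submit the transfer to the GPU
    }
}
// Render thread, before sampling job.tex:
//     glWaitSync(job.done, 0, GL_TIMEOUT_IGNORED);
```

The fence/flush pair is the important part: without it the render thread has no way to know the copy engine has finished the upload.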

Also, since this is an Nvidia-specific optimization, wouldn’t it be better to use CUDA/OpenGL interop with two streams issuing async memcpys from/to pinned memory, so we could make use of both copy engines with only one OGL context?
I have not thought it through thoroughly, but that should work…
Even better, this would also work on Teslas, which by the way are more economical…
This of course would not work on AMD cards, whereas the code from the whitepaper, although a complex optimization for a simple problem, works on AMD too…
But wait, OCL supports OGL interop, and I think even the dual DMA copy is usable from the OpenCL world, but I can’t be sure as I don’t own a Quadro/Tesla to test…
So the best solution seems to be OCL/OGL interop, which should provide these benefits:
*one OGL context
*also works on the Tesla line
*also works on AMD
(though you do have to manage an OCL context)

Perhaps there could be some problems due to “hard” synchronization between the OCL/OGL interop, but I think the not-yet-implemented OCL 1.1 and OGL 4.1 advanced OGL/OCL interop should fix all possible issues…
What do you think?
Can someone at Nvidia speak about my reasoning?

Seconded, and for exactly the same reason.

Beyond this, a high-end consumer Fermi (GTX 480) being out-benched by 2.6× on data transfers by a last-gen card (GTX 285) is embarrassing (Re: slow transfer speed on fermi cards). At least make it as good as the last-gen boards.

Quadro card with consumer Fermi? That’s an odd modifier - so there’s Quadro cards and Quadro cards?

And we have to guess which it is?


By “high-end consumer Fermi” I meant high-end “consumer GPU” (i.e. GeForce, as opposed to their “professional GPU” line: Quadro) with a chipset based on the “Fermi” chip line.

They do have a professional-line (Quadro) Fermi-based GPU, but I wasn’t referring to those.

And as for guessing: while they do tell you on the NVidia pages what is “Fermi”-based, for more detail search the web (reviews, Wikipedia, etc.). GFxxx chipset codenames are Fermi.

He was referring to the original Nvidia statement in my initial post, where they differentiate Fermi Quadros and consumer Fermi Quadros.

Ah, yeah. That is confusing.

I have been digging deeper into the DMA-engine stuff from Nvidia. What confuses me are the following points:

  • The white paper states that a single-threaded application using PBOs to transfer data to the GPU (upload) does not overlap the data transfer with rendering, due to an internal context switch. Is this right? I assumed that using PBOs I was able to overlap not only CPU work with transfers but also GPU rendering work with transfers.

  • The dual copy engines are only available on Quadro cards: does this mean I have a single copy engine on my GeForce to do one-way overlapped transfers?

  • According to the white paper I need to use a separate thread and GL context to use the copy engines, as they are separate internal entities running GL contexts in parallel?

Maybe someone has already worked with the copy engines on GeForce and Quadro hardware and can give me some insight into these issues (or someone at Nvidia can clarify some points).


Hi guys,

I have spent some time on this problem too, here are my findings.

GeForce family cards are not able to transfer and draw at the same time in OpenGL!

Here is a picture from NVIDIA Nsight: http://outerra.com/images/sc3_tex_upload.png. The green box is the glTexSubImage2D call, issued every fifth frame. As you can see, frame 978 is longer, and the main part of the transfer is hidden inside the draw call time. If the transfer were parallel, the frame time would be the same. The texture is not used in any draw call, so there is no implicit synchronization issue.

In CUDA, parallel transfer and kernel execution is possible: http://outerra.com/images/cuda_transfers.png (red is the kernel, green/grey the download). The transfer can be an upload or a download, it doesn’t matter.

OpenGL data upload (textures and buffers) works at full speed on the GeForce family, which means ~5GB/s on PCIe 2.0 and 2.5GB/s on PCIe 1.1. It seems to be the same speed as CUDA, according to bandwidthTest.exe --memory=pinned.

Texture download (glReadPixels) is limited on the GeForce family to an almost unusable ~0.9GB/s on PCIe 2.0, and on an older system with PCIe 1.1 it is ~0.4GB/s. This is very sad, especially because it is NOT a hardware limitation: in CUDA I get a download speed of 3GB/s on PCIe 2.0 and 1.7GB/s on old PCIe 1.1. The problem is that the download is not GPU-side async, so it can really slow down application performance.

OpenGL buffer download seems to work at full speed on GeForce. The fastest way to download a texture to CPU memory is to call glReadPixels into a buffer allocated in VIDEO memory (usage GL_STATIC_COPY) and then call glCopyBufferSubData into a buffer in CPU pinned memory (usage GL_STREAM_READ).

A few values for a 1280x720 RGBA8 download (3.6MiB, PCIe 1.1):
fast download (glReadPixels into video-memory buffer + copy): 0.7 + 2.12 = 2.82ms
direct glReadPixels only: 8.85ms

All of this was tested on an NVIDIA GTX 460 1GB (driver 285). It would be nice to find someone to make such tests for Direct3D.

Wow, just wow! That is an amazing trick/hack using GL_STATIC_COPY and glCopyBufferSubData.

It made a huge difference to the pixel readback performance of my program.

I think someone should put this trick into the VBO wiki.

I guess that by using a GPU and CPU buffer with glCopyBufferSubData you are emulating the standard CUDA memcpy situation which would explain why it is fast.

I guess if you have a Quadro or whatever then the driver will do this pinned memcpy for you, but this seems to work nicely on GeForce.

One more thing to test, l_hrabcak: what is the performance of glReadPixels into a PBO, then using CUDA GL sharing to read it back to the CPU with cudaMemcpy?

The CUDA readback was already tested in this thread [1], reading back the framebuffer content directly using CUDA, and it gave very good results. It would be very interesting to see whether a CUDA memcpy on a separate CUDA stream would allow overlapping drawing and copying on GeForces…

[1] http://www.opengl.org/discussion_boards/ubbthreads.php?ubb=showflat&Number=291855#Post291855
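For anyone wanting to try the cudaMemcpy readback, the interop side boils down to roughly this (a sketch under the assumption that `pbo` is the GL pixel-pack buffer already filled by glReadPixels, `hostPinned` came from cudaHostAlloc, and `stream` is a dedicated CUDA stream; error checking omitted):

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// One-time: register the GL PBO with CUDA.
cudaGraphicsResource* res = nullptr;
cudaGraphicsGLRegisterBuffer(&res, pbo, cudaGraphicsRegisterFlagsReadOnly);

// Per frame: map the PBO, get a device pointer, async-copy to pinned host.
void*  devPtr = nullptr;
size_t mapped = 0;
cudaGraphicsMapResources(1, &res, stream);   // syncs with prior GL work
cudaGraphicsResourceGetMappedPointer(&devPtr, &mapped, res);
cudaMemcpyAsync(hostPinned, devPtr, mapped, cudaMemcpyDeviceToHost, stream);
cudaGraphicsUnmapResources(1, &res, stream);
cudaStreamSynchronize(stream);  // or poll cudaStreamQuery to stay asynchronous
```

Whether the copy actually overlaps GL rendering on GeForce is exactly the open question of this thread; the sketch only shows the call order.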

Almost every buffer memory block with usage DYNAMIC or STREAM is page-locked (pinned) memory. There are some limitations on how much memory can be pinned; it all depends on current system resources. The slow glReadPixels transfer is not caused by non-pinned (pageable) memory: for example, transfer to/from pinned memory is (PCIe 1.1) 2.5/1.7 GB/s and for pageable memory 1.3/1.0 GB/s, both faster than the 0.4GB/s I get with glReadPixels. On Quadro, or in CUDA, you can use a glReadPixels equivalent at full speed, which is even faster than this copy trick/hack.
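The pinned vs. pageable gap is easy to measure with the CUDA runtime alone; a minimal sketch (host-to-device only; the numbers depend on chipset and PCIe generation):

```cpp
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

int main() {
    const size_t n = 64 << 20;                // 64 MiB test block
    void *dev = nullptr, *pinned = nullptr;
    void *paged = malloc(n);                  // ordinary pageable memory
    cudaMalloc(&dev, n);
    cudaHostAlloc(&pinned, n, cudaHostAllocDefault);  // page-locked memory

    cudaEvent_t t0, t1; float ms = 0.0f;
    cudaEventCreate(&t0); cudaEventCreate(&t1);

    // Pageable host -> device: the driver stages through an internal
    // pinned buffer, which costs bandwidth.
    cudaEventRecord(t0);
    cudaMemcpy(dev, paged, n, cudaMemcpyHostToDevice);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("pageable: %.2f GB/s\n", n / ms / 1e6);

    // Pinned host -> device: the DMA engine reads directly, and the copy
    // could also be issued asynchronously on a stream.
    cudaEventRecord(t0);
    cudaMemcpy(dev, pinned, n, cudaMemcpyHostToDevice);
    cudaEventRecord(t1); cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("pinned:   %.2f GB/s\n", n / ms / 1e6);

    cudaFree(dev); cudaFreeHost(pinned); free(paged);
    return 0;
}
```

This is essentially what bandwidthTest.exe does with --memory=pinned vs. --memory=pageable.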

One more interesting thing: GPU-side transfers like glCopyBufferSubData (both buffers with usage GL_STATIC_COPY) are slower too; on a GeForce GTX 460 the transfer rate is only 10GB/s, but in CUDA it is 87GB/s. Texture update with glTexSubImage2D from a buffer in GPU memory runs at 6-10GB/s; the speed depends on the pixel format, RGBA8 and BGRA8 being the fastest.

I recommend reading the CUDA documentation; many things there about PC and GPU architecture are common to OpenGL and CUDA, like this memory stuff.

I didn’t test OpenGL and CUDA cooperation yet. But in this case the performance will be the same, because OpenGL buffer copies between CPU and GPU memory are as fast as in CUDA. A direct copy from a framebuffer object to CPU memory with CUDA should be faster, though.

A more interesting test would be to use CUDA for asynchronous GPU transfers to speed up OpenGL applications, i.e. uploading textures and geometry in parallel with scene rendering. This should really help engines that need to stream data to and from the GPU during gameplay, because in the current GeForce OpenGL implementation, while glTexImage2D or glTexSubImage2D runs, the GPU is wasting power and doing nothing.
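If anyone takes this up, the upload side could be sketched like this (hypothetical fragment; `glTex`, `copyStream`, and the pinned source buffer are assumed to exist, and per the interop docs no GL call may touch the texture while it is mapped):

```cpp
#include <cuda_runtime.h>
#include <cuda_gl_interop.h>

// One-time: register the GL texture with CUDA for write access.
cudaGraphicsResource* texRes = nullptr;
cudaGraphicsGLRegisterImage(&texRes, glTex, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsWriteDiscard);

// Per update, on a dedicated stream so a copy engine can do the DMA:
cudaGraphicsMapResources(1, &texRes, copyStream);
cudaArray_t arr = nullptr;
cudaGraphicsSubResourceGetMappedArray(&arr, texRes, 0, 0);  // mip 0, layer 0
cudaMemcpy2DToArrayAsync(arr, 0, 0,
                         hostPinned, rowPitchBytes,  // pinned source + pitch
                         widthBytes, height,         // copied region
                         cudaMemcpyHostToDevice, copyStream);
cudaGraphicsUnmapResources(1, &texRes, copyStream);
// Keep the map/unmap window short: GL use of the texture is undefined
// while it is mapped to CUDA.
```

Whether this actually overlaps with the render context on GeForce, or stalls it at the map call, is the thing to measure.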

Any volunteers for this test? :)

I did some CUDA tests:

Reading from a PBO in GPU memory via CUDA to the CPU is the same speed as the glCopyBufferSubData trick.

Reading directly from the renderbuffer via a CUDA array to the CPU was significantly faster than the glCopyBufferSubData trick.

I now have a 3x speedup over the older GL-3-only PBO code :-). Did anyone find out how RAGE uses glReadPixels (using gDEBugger or Parallel Nsight)? I would not be surprised if they use CUDA too, since the GPU transcoding stuff for RAGE was written by NVIDIA engineers using CUDA (rather than OpenCL, which pissed off ATI gamers).

l_hrabcak & Chris Lux:
I was thinking yesterday of experimenting with asynchronous transfers but am still trying to figure out how to do it.

The problem is that on CUDA you have to use cuGraphicsMapResources which states “This function provides the synchronization guarantee that any graphics calls issued before cuGraphicsMapResources() will complete before any subsequent CUDA work issued in stream begins.”

This indicates that this function will stall the GL context. It also says that using GL commands between map and unmap produces undefined results… which means you can’t officially async-overlap GL and CUDA.

Now what I want to know is what happens when you create a second GL context on another thread for CUDA transfers and keep issuing rendering commands on the first. Will cuGraphicsMapResources stall both contexts?

It is really good to know that it works, thanks for testing.

It’s a big problem to report a bug to NVIDIA and actually get an answer, so we cannot expect help here; we are not id Software. It is sad, because AMD, which has a lot of problems with OpenGL, is much more cooperative, and thanks to this cooperation with developers, even small ones, they finally managed to release usable OpenGL drivers. In NVIDIA’s case it is more about marketing strategy, and the only way to change it is to help AMD create better drivers.

I made a few tests on a modified CUDA simpleGL example (more points and lower FPS, around 150, to keep the GPU busy all the time), and it looks like the sync is not the issue here; it doesn’t cause a real sync. According to NVIDIA Nsight, I see a lag of two frames, which should not be possible with a sync. http://outerra.com/images/simpleGL_mod_nsight.png

But there is an overhead due to context switching (0.4ms on my Intel Q6600; it should be less than half that on an i5), so it should be better to “offload” it into another thread. The idea with the second thread is probably the only way to work efficiently with CUDA and GL, but we need to test a few different approaches to be sure.

Btw, if you are referring to the “page resolver” here, you can resolve page IDs on the GPU side and read back only the IDs. In that case the download should be a few kilobytes.

Indeed, I believe the CUDA part is not for readback but to directly upload (custom) compressed texture data from the CPU and decompress it with CUDA into texture data suitable for the GPU (S3TC, etc.).

Sorry, trying to work this out. I’m using the ‘traditional’ PBO method now: I render to an FBO and glReadPixels into a PBO set up as GL_READ_ONLY. It sounds like you are saying that a two-step process is better? Do you have a snippet showing the order of calls, please? Are you using two PBOs, one ‘on the GPU’ and one with CPU access?


Yes, two PBOs:

        // Setup: mOutputBuffer lives in video memory (GL_STATIC_COPY),
        // mCopyBuffer in CPU-side pinned memory (GL_STREAM_READ).
        glBindBuffer(GL_PIXEL_PACK_BUFFER, mOutputBuffer);
        glBufferData(GL_PIXEL_PACK_BUFFER, mOutputBufferSize, NULL, GL_STATIC_COPY);

        glBindBuffer(GL_COPY_WRITE_BUFFER, mCopyBuffer);
        glBufferData(GL_COPY_WRITE_BUFFER, mOutputBufferSize, NULL, GL_STREAM_READ);

And I have this code to do the copy:

    // 1) glReadPixels packs the framebuffer into the video-memory PBO (fast path).
    glBindFramebuffer(GL_FRAMEBUFFER, mOutputFramebuffer);
    glBindBuffer(GL_PIXEL_PACK_BUFFER, mOutputBuffer);
    glReadPixels(0, 0, mWidth, mHeight, GL_RGBA, GL_UNSIGNED_INT_8_8_8_8_REV, 0);

    // 2) Copy from the video-memory buffer into the CPU-side buffer over PCIe.
    glBindBuffer(GL_COPY_WRITE_BUFFER, mCopyBuffer);
    glCopyBufferSubData(GL_PIXEL_PACK_BUFFER, GL_COPY_WRITE_BUFFER, 0, 0, mOutputBufferSize);

    // 3) Read the result into client memory (this call synchronizes).
    glGetBufferSubData(GL_COPY_WRITE_BUFFER, 0, mOutputBufferSize, mOutput);