Slow transfer speed on Fermi cards

The performance degradation is due to NVIDIA's power-saving feature (a.k.a. PowerMizer). When the driver detects low load, it decreases the GPU or memory clock. You can disable it in the NVIDIA Control Panel. You can monitor the clocks using GPU-Z to make sure the tests are running at maximum speed.

Ah, I did not notice that before… thanks for the hint.

I played a bit more with the benchmark to maximize the readback throughput.

With the PBO glReadPixels I achieved:


glReadPixels(0, 0, 1280, 720, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));


PBO glReadPixels: 2.35 ms 1497.91 MiB/s (memcpy: 0.49 ms 7143.67 MiB/s) total: 2.84 ms 1238.27 MiB/s
glTexSubImage2D: 1.12 ms 3132.71 MiB/s
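
For context, the call above is issued while a pixel buffer object is bound to GL_PIXEL_PACK_BUFFER, so BUFFER_OFFSET(0) is a byte offset into the PBO rather than a client pointer. A minimal sketch of the surrounding sequence; the handle pbo and destination pointer mem are illustrative names, not taken from the snippet above:

// start the asynchronous readback into the bound PBO
glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, 1280, 720, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));

// map the PBO and copy the pixels to client memory
// (presumably the "memcpy" time in the results above)
void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (src)
{
    memcpy(mem, src, 1280 * 720 * 4);
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);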

I think 1.2 GiB/s is a joke on an x16 PCI Express card. So I tried glGetBufferSubData instead of a map + memcpy for the readback:


glGetBufferSubData(GL_PIXEL_PACK_BUFFER, 0, 1280*720*4, mem);


PBO glReadPixels: 0.06 ms 63069.52 MiB/s (memcpy: 3.75 ms 937.86 MiB/s) total: 3.80 ms 924.11 MiB/s
glTexSubImage2D: 1.31 ms 2693.62 MiB/s

Under 1 GiB/s… Is there anything that can be done to increase the readback throughput?
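
For reference, a sketch of the glGetBufferSubData variant under the same assumptions (pbo and mem are illustrative names):

glBindBuffer(GL_PIXEL_PACK_BUFFER, pbo);
glReadPixels(0, 0, 1280, 720, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));

// the driver copies from the PBO straight into client memory; this call blocks
// until the preceding glReadPixels has finished writing the buffer
glGetBufferSubData(GL_PIXEL_PACK_BUFFER, 0, 1280 * 720 * 4, mem);
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);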

Quadro 6000, 266.58, Windows 7 x64 (Core i7 980, 12GiB RAM)


glReadPixels: 2.06 ms
PBO glReadPixels: 0.66 ms (memcpy: 0.52 ms) total: 1.18 ms
glTexSubImage2D: 1.13 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 0.44 ms) total: 0.49 ms
glCopyTexSubImage2D: 0.06 ms
glGetTexImage: 4.42 ms

memcpy speed: 7053 MBytes/sec

Total frame: 16.70 ms  (total transfer: 8.21 ms)

GeForce GTX 285, 266.58, Windows 7 x64 (Core i7 980, 12GiB RAM)


glReadPixels: 3.14 ms
PBO glReadPixels: 3.11 ms (memcpy: 0.30 ms) total: 3.41 ms
glTexSubImage2D: 2.32 ms
PBO glTexSubImage2D: 0.06 ms (memcpy: 0.49 ms) total: 0.56 ms
glCopyTexSubImage2D: 0.04 ms
glGetTexImage: 9.30 ms

memcpy speed: 12103 MBytes/sec

Total frame: 27.75 ms  (total transfer: 16.44 ms)

GeForce GTX 580, 266.58, Windows 7 x64 (Core i7 980, 12GiB RAM)


glReadPixels: 8.30 ms
PBO glReadPixels: 2.36 ms (memcpy: 0.55 ms) total: 2.91 ms
glTexSubImage2D: 1.20 ms
PBO glTexSubImage2D: 0.05 ms (memcpy: 0.48 ms) total: 0.53 ms
glCopyTexSubImage2D: 0.06 ms
glGetTexImage: 4.44 ms

memcpy speed: 6762 MBytes/sec

Total frame: 28.68 ms  (total transfer: 16.24 ms)

THIS is another joke brought to you by NVIDIA. I do not think the dual copy engines are doing anything here; the way this benchmark works, they are not needed (no parallel transfer is required).

Here is another test from my own software:
http://h-4.abload.de/img/readback_bench9tgh.png


Explanation:
read: orphan buffer0, bind it to PIXEL_PACK_BUFFER, glReadPixels
copy: map buffer1, memcpy the image data, unmap buffer1
tex: not used
swap(buffer0, buffer1)

The GPU times are taken using timer queries, the CPU times using performance counters.
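
A minimal sketch of that read/copy/swap loop, with a GL timer query around the read as described above; the names buf[2] (two PBO handles), query (created with glGenQueries), dst, width, and height are assumptions and not taken from the original code:

// "read": orphan buffer0, bind it to PIXEL_PACK_BUFFER, glReadPixels
glBeginQuery(GL_TIME_ELAPSED, query);
glBindBuffer(GL_PIXEL_PACK_BUFFER, buf[0]);
glBufferData(GL_PIXEL_PACK_BUFFER, width * height * 4, NULL, GL_STREAM_READ);  // orphan
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));
glEndQuery(GL_TIME_ELAPSED);

// "copy": buffer1 holds the previous frame's readback and should be finished by now
glBindBuffer(GL_PIXEL_PACK_BUFFER, buf[1]);
void* src = glMapBuffer(GL_PIXEL_PACK_BUFFER, GL_READ_ONLY);
if (src)
{
    memcpy(dst, src, width * height * 4);   // CPU time measured with performance counters
    glUnmapBuffer(GL_PIXEL_PACK_BUFFER);
}
glBindBuffer(GL_PIXEL_PACK_BUFFER, 0);

// swap(buffer0, buffer1)
GLuint tmp = buf[0]; buf[0] = buf[1]; buf[1] = tmp;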

It is clear that the on-device copy from the framebuffer to the pack buffer is much slower on Fermi GeForce cards (see the GPU time of the read step).

Edit: Maybe we have to wait and see how they optimize this for Rage, and then we can go back to doing whatever id does, simply because the drivers do it better for them.

NVIDIA is well aware of this “problem”. They want to increase Quadro sales, and most games do not need to read pixels back.

The only workaround I found is to use CUDA. I also tried OpenCL, but there was no speedup there. CUDA can copy an FBO renderbuffer to system memory at the same speed as a Quadro achieves with a PBO.

I know that they are aware of this, but as a developer it is frustrating to have to work around these issues, especially since they were not there in the previous generations. A transfer that is 2.5 times slower because of artificial throttling is just stupid. And introducing CUDA into the software just to get data to the host quickly is also insane… The transfer will only stay fast with CUDA until NVIDIA sees us abusing that API just for this purpose.

What I would like to see is a definitive list of the features that are cut or artificially broken just to sell Quadros.

As I said, id Software's Rage will depend on fast readback. So maybe it will be enabled in an application profile specifically for Rage (if they even use OpenGL in the release version).

Such things are so frustrating…

P.S. Do you have a small code snippet for a fast FBO-to-PBO/host-memory transfer using CUDA or OpenCL (even if it is not faster than the current approach)?

Maybe rename your binary to ‘rage.exe’ and see what happens :wink:

CUDA workaround:

headers:


#include <cuda.h>
#include <cuda_runtime_api.h>
#include <cuda_gl_interop.h>
#include <cudaGL.h>

#pragma comment(lib, "cudart.lib")

struct cudaGraphicsResource *cudaGfxRes;

Initialization:


    // init CUDA
    cudaError_t cErr = cudaGLSetGLDevice(0/*GPU number*/);

Memory allocation:


    // for best performance, allocate pinned memory
    void* ptr;
    cudaError_t cErr = cudaMallocHost(&ptr, sizeInBytes);

Registering the OpenGL renderbuffer with CUDA:


// this is done only once
cudaError_t cErr = cudaGraphicsGLRegisterImage(&cudaGfxRes,  
                           renderBufferId, GL_RENDERBUFFER, 
                           cudaGraphicsMapFlagsReadOnly);

Data transfer:


cudaError_t cErr;
struct cudaArray* cArray;

cErr = cudaGraphicsMapResources(1, &cudaGfxRes);
cErr = cudaGraphicsSubResourceGetMappedArray(&cArray, cudaGfxRes, 0, 0);
cErr = cudaMemcpyFromArray(dstMemPtr, cArray, 0,0, 
                   sizeInBytes,  cudaMemcpyDeviceToHost);
cErr = cudaGraphicsUnmapResources(1, &cudaGfxRes);
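
Not shown in the post above, but for completeness: once readback is no longer needed, the interop registration and the pinned buffer should be released again (same cudaGfxRes and ptr as above):

// release the interop registration and the pinned host buffer
cudaError_t cErr = cudaGraphicsUnregisterResource(cudaGfxRes);
cErr = cudaFreeHost(ptr);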

Thanks!

Could you also post the OpenCL equivalent? I want to take a look at the exact performance differences.

Regards
-chris

I am not an OpenCL expert, and at this moment I am not quite sure about my implementation. I’d rather not post it here and give a wrong impression of OpenCL. If I find an OpenCL implementation that matches the speed of CUDA, I will come back.

I am not very familiar with CUDA. Can you register an unsigned byte FBO and do the cudaMemcpy? Is this what you have done?

It seems only FLOAT32 or unsigned INT formats are supported with cudaGraphicsGLRegisterImage().

If this really works, then glCopyTexSubImage2D() + rendering a textured quad into the FBO + cudaMemcpy() from the FBO to the CPU would still be faster than using glReadPixels() on the standard framebuffer…

@def - you can transfer GL_RGBA8 render buffers.
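
For example, a minimal sketch of creating and registering such a renderbuffer (rb, width, and height are illustrative names, not from the thread):

GLuint rb;
glGenRenderbuffers(1, &rb);
glBindRenderbuffer(GL_RENDERBUFFER, rb);
glRenderbufferStorage(GL_RENDERBUFFER, GL_RGBA8, width, height);

// attach rb to the FBO color attachment as usual, then register it once with CUDA
cudaError_t cErr = cudaGraphicsGLRegisterImage(&cudaGfxRes, rb, GL_RENDERBUFFER,
                                               cudaGraphicsMapFlagsReadOnly);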
