Slow transfer speed on fermi cards

Yes, that’s right: glReadPixels must block. So you are not measuring the performance of the DMA transfer alone, but of rendering plus the DMA transfer. That is the problem with the benchmark.

True, though the render time for something so trivial should be negligible overhead.

However, if there was a card nowadays where rendering a cube was really, really slow (1-5 ms), then I’d agree your point would result in a significant timing difference. But as-is the overhead should be pretty small.

And just to confirm here, I made this change (putting a glFinish() before sampling the start time) and did not see any timing difference.

I put that in on short notice just to avoid the usual “your timing is not correct” reply. Unfortunately for me I confused it with glFinish().
And yes, in theory it should be called just before and just after the GL call being measured. Who knows, maybe someone wants to benchmark software rendering…
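For anyone who wants to reproduce the timing pattern discussed above, here is a minimal sketch in C. It is not taken from transferBench: the names `time_call_ms`, `sync_fn`, `work_fn`, `fake_sync` and `fake_work` are hypothetical, and the callbacks only stand in for glFinish() and the GL call under test so the harness can be exercised without a GL context.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback types: in a real benchmark, sync would wrap
   glFinish() and work would wrap the GL call being measured. */
typedef void (*sync_fn)(void);
typedef void (*work_fn)(void *user);

/* Wall-clock time in milliseconds (C11 timespec_get). */
static double now_ms(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec * 1000.0 + (double)ts.tv_nsec / 1.0e6;
}

/* Drain pending work, time only `work`, then drain again so that any
   asynchronous work queued by `work` falls inside the interval. */
static double time_call_ms(sync_fn sync, work_fn work, void *user)
{
    assert(sync != NULL && work != NULL);
    sync();                       /* e.g. glFinish(): finish earlier rendering */
    double start = now_ms();
    work(user);                   /* e.g. glReadPixels(...) */
    sync();                       /* e.g. glFinish(): wait for the call itself */
    return now_ms() - start;
}

/* Stand-ins used only to exercise the harness without a GL context. */
static int sync_calls = 0;
static void fake_sync(void) { ++sync_calls; }
static void fake_work(void *user) { *(int *)user += 1; }
```

With real callbacks, the first sync keeps the cost of rendering the cube out of the measured interval, which is the concern raised earlier in the thread.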

your application crashes on my laptop with radeon 7500 mobility.

Assuming this isn’t a spambot, of course it doesn’t work with a Radeon 7500, which is an OpenGL 1.3/Direct3D 8.1 card from about the year 1437. :slight_smile:

Out of curiosity I tried this on my laptop’s 230M and got comparable results. However, switching format from GL_RGBA to GL_BGRA caused performance to almost double. Changing the type from GL_UNSIGNED_BYTE to GL_UNSIGNED_INT_8_8_8_8_REV gave a more subtle increase, but I’m assuming that the driver is recognising it’s a 32-bit format and optimizing accordingly.

It would be interesting to see benchmark results with a changed format and type for the troublesome hardware.
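To make the format point concrete, here is a small sketch in C. It assumes, without confirmation from NVIDIA, that the framebuffer is stored as BGRA internally; under that assumption, a GL_RGBA readback needs a per-pixel red/blue swap somewhere along the way, while GL_BGRA can be copied straight through. The function names are hypothetical. The packing helper follows the OpenGL definition of GL_UNSIGNED_INT_8_8_8_8_REV, where the first component of the format sits in the lowest 8 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert tightly packed 8-bit RGBA pixels to BGRA in place by swapping
   the red and blue bytes of every pixel. */
static void rgba8_to_bgra8(uint8_t *pixels, size_t pixel_count)
{
    assert(pixels != NULL || pixel_count == 0);
    for (size_t i = 0; i < pixel_count; ++i) {
        uint8_t *p = pixels + 4 * i;
        uint8_t r = p[0];
        p[0] = p[2];   /* blue moves to byte 0 */
        p[2] = r;      /* red moves to byte 2 */
    }
}

/* Pack one pixel the way GL_UNSIGNED_INT_8_8_8_8_REV lays out GL_BGRA:
   blue in the lowest 8 bits, then green, then red, alpha in the highest. */
static uint32_t pack_bgra_8888_rev(uint8_t r, uint8_t g, uint8_t b, uint8_t a)
{
    return (uint32_t)b | ((uint32_t)g << 8) |
           ((uint32_t)r << 16) | ((uint32_t)a << 24);
}
```

On a little-endian CPU the packed value is stored in memory as the bytes B, G, R, A, the same layout as GL_BGRA with GL_UNSIGNED_BYTE. That would fit the observation that changing the type had a much smaller effect than changing the format.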

Attached updated transferBench with these mods (and Linux Makefile): transferBench_src4.zip

Results:

GTX285, 260.19.04b drivers, 2GHz Nehalem EP CPU:

glReadPixels: 1.88 ms
PBO glReadPixels: 0.92 ms (memcpy: 1.64 ms) total: 2.55 ms
glTexSubImage2D: 3.55 ms
PBO glTexSubImage2D: 0.07 ms (memcpy: 1.67 ms) total: 1.74 ms
glCopyTexSubImage2D: 0.02 ms
glGetTexImage: 8.60 ms

memcpy speed: 2252 MBytes/sec

Total frame: 23.69 ms (total transfer: 14.80 ms)

GTX480, 260.19.04b drivers, 2GHz Nehalem EP CPU:

glReadPixels: 4.82 ms
PBO glReadPixels: 3.47 ms (memcpy: 1.12 ms) total: 4.59 ms
glTexSubImage2D: 4.74 ms
PBO glTexSubImage2D: 4.92 ms (memcpy: 1.11 ms) total: 6.03 ms
glCopyTexSubImage2D: 0.08 ms
glGetTexImage: 9.97 ms

memcpy speed: 3303 MBytes/sec

Total frame: 37.81 ms (total transfer: 25.49 ms)

So faster than before, but still a 2.6X slowdown on GTX480 vs. GTX285 (before, was 3.8X slowdown).


I guess this improvement is only due to faster memcpy (which I cannot explain).

Here is the evidence: 2.6 / 3.8 ≈ 2252 / 3303 (both are about 0.68).
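The arithmetic can be checked with a short C sketch. It assumes the benchmark moves one 1280x720 RGBA frame (3,686,400 bytes, the size used in the code posted later in this thread) and that "MBytes" means 10^6 bytes; both are assumptions about transferBench, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Throughput in MB/s (1 MB = 10^6 bytes) for `bytes` moved in `ms`
   milliseconds. */
static double mbytes_per_sec(size_t bytes, double ms)
{
    assert(ms > 0.0);
    return ((double)bytes / 1.0e6) / (ms / 1000.0);
}

/* Absolute difference between the two ratios a/b and c/d. */
static double ratio_gap(double a, double b, double c, double d)
{
    assert(b != 0.0 && d != 0.0);
    double diff = a / b - c / d;
    return diff < 0.0 ? -diff : diff;
}
```

For example, `mbytes_per_sec(1280 * 720 * 4, 1.64)` gives roughly 2248 MB/s, close to the 2252 MBytes/sec reported for the GTX285 run (the displayed 1.64 ms is rounded). `ratio_gap(2.6, 3.8, 2252.0, 3303.0)` is below 0.01, so the two ratios agree to about two decimal places.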

Could anyone download CUDA-Z (google it) and run it on both the 480 and the 285 in the same PC? It reports memory transfer performance statistics when using CUDA. They should match the OpenGL numbers (with and without PBO).

I thought I would jump into the fray. I too have been chasing this problem for about a week or so. Bought a 465 to test with our s/w. Linux, Centos 4.7. Our 9800 and 280 GTX cards ran circles around the 465 until we stopped the glread stuff, then the render speeds made sense.

I have been hanging out at nvidia’s site trying to get answers… no luck. Wanted to try CUDA-Z, but my OS libraries are out of date. So I am building a new OS disk… Anybody have anything new to report? --Mike

For what it is worth, I also tested a GTX 480 and got the same results.

My card is a GTX 460 which uses the Fermi GF104 GPU, a derivative of the Fermi GF100 GPU used in the GTX 465/470/480. This card unfortunately also suffers from slow glReadPixels speed, and seems to be a lot worse than the GF100 cards. :frowning: In transferBench, I get 22ms speed for glReadPixels in the beginning, but the odd thing is that if I let it run for about half a minute, it will improve to 18ms but at the same time, PBO glReadPixels and glGetTexImage become slightly slower by about 2ms.

As for the memcpy speed comparison between transferBench and CUDA-Z, CUDA-Z gives slower speeds (5900MB/s pinned, 4700MB/s pageable) than transferBench (6900MB/s) for my GTX 460.

Well, turns out that I had Vsync on when I ran transferbench last time and got 22ms for glReadPixels. I tried it again after turning off Vsync this time and got much better speed at around 8ms. I guess it’s still slow though.

Just out of curiosity I tried installing the Quadro 260.78 beta drivers for my GTX 460 by modding the INF file, and I’m sad to say that it changed nothing at all with regard to glReadPixels speed. So, either it’s a hardware limitation in the 400-series cards, or the Quadro drivers are smart enough to know that they are not running on an actual Quadro card and therefore don’t enable the Quadro-specific performance boost. I think it’s the latter.

GTX480, 261.00 dev drivers (devdriver_3.2_winvista-win7_64_261.00_general)

Nvidia Corp/GeForce GTX 480/PCI SSE2 4.1.0
Card MFG: ASUS

Intel Core i7 x980 3.33GHz 12.0GB RAM Win7 64bit
glReadPixels: 7.11 ms
PBO glReadPixels: 2.32 ms (memcpy: 0.46 ms) total: 2.77 ms
glTexSubImage2D: 1.09 ms
PBO glTexSubImage2D: 0.04 ms (memcpy: 0.50 ms) total: 0.52 ms
glCopyTexSubImage2D: 0.04 ms
glGetTexImage: 3.93 ms

memcpy speed: 7923 MBytes/sec

Total frame: 18.74 ms (total transfer: 14.91 ms)

Anyone heard anything back from Nvidia about this major problem?

THANK YOU!
Wayland Strickland

GTX 580 card result: glReadPixels = 4.8 ms

Any comments?
My Mobility Radeon 5850 gets 3.8 ms…

For reference: Mobility Radeon 4530, Windows 7 x64, HP DV7 laptop

glReadPixels: 5.86 ms (temporarily rises to about 6 ms)
PBO glReadPixels: 5.29 ms (memcpy: 1.62 ms) total: 6.92 ms
glTexSubImage2D: 4.06 ms
PBO glTexSubImage2D: 0.18 ms (memcpy: 1.85 ms) total: 2.03 ms
glCopyTexSubImage2D: 0.05 ms
glGetTexImage: 4.84 ms

memcpy speed: 2269 MBytes/sec

Total frame: 36.90 ms (total transfer: 19.71 ms)

I ran my own tests.
I tested a GeForce GTX 260 and a GeForce GTX 460 in the same computer (Xeon based). Driver 260.99.

glReadPixels is about 10 times slower on the GTX 460 than on the GTX 260.
glReadPixels with PBO is about 2.5 times slower on the GTX 460 than on the GTX 260.

glTexSubImage with PBO is about the same on both the GTX 260 and the GTX 460.

Then I tested CUDA with OpenGL. I copied the renderbuffer content from GPU to CPU memory using CUDA. I used plain (pageable) memory, not page-locked (pinned) memory.
The CUDA performance on the GTX 260 was about the same as glReadPixels+PBO.
The CUDA performance on the GTX 460 was 2.5 times higher than glReadPixels+PBO and equal to the performance on the GTX 260!
If anybody wants to see the CUDA code, I can post it here.

Conclusion
The GTX 460 is capable of transferring data from GPU to CPU at the same or higher speed than the older GTX 260. There is no hardware limitation. The current OpenGL driver cannot utilize the full transfer speed.

This implies two possibilities:
A: There is a driver bug and NVIDIA will fix it one day.
B: This behaviour is on purpose. (Does anybody have a Fermi-based Quadro?) I think the Fermi-based Quadro will not suffer this performance loss.

NVIDIA, tell us the truth please.

Well, luckily I bought the cheapest Quadro Fermi yesterday, and it is much faster than a high-end GeForce Fermi for ReadPixels and GetTexImage:

i7 960 at 3.2GHz, Vsync forced off. Both of these cards are in the same machine at once.

Quadro 600 (yes, not 6000) as a headless card on PCIe 2.0 at 8x:

glReadPixels: 1.65 ms
PBO glReadPixels: 1.18 ms (memcpy: 0.77 ms) total: 1.95 ms
glTexSubImage2D: 2.60 ms
PBO glTexSubImage2D: 2.95 ms (memcpy: 0.78 ms) total: 3.73 ms
glCopyTexSubImage2D: 0.07 ms
glGetTexImage: 7.65 ms

memcpy speed: 4784 MBytes/sec

Total frame: 26.95 ms (total transfer: 15.05 ms)

GTX 470 as the primary monitor:

glReadPixels: 23.21 ms
PBO glReadPixels: 2.43 ms (memcpy: 0.83 ms) total: 3.26 ms
glTexSubImage2D: 2.63 ms
PBO glTexSubImage2D: 2.69 ms (memcpy: 0.67 ms) total: 3.35 ms
glCopyTexSubImage2D: 0.06 ms
glGetTexImage: 12.81 ms

memcpy speed: 4455 MBytes/sec

Total frame: 50.11 ms (total transfer: 42.70 ms)

And a quote from http://developer.nvidia.com/object/opengl_driver.html

  1. Will functionality marked as deprecated be slow on NVIDIA hardware?

No. NVIDIA understands that features on the deprecated list are critical to the business of a large part of our customer base. NVIDIA will provide full performance, and will support, tune, and fix any issues, for any feature on the deprecated list. This means that all the functionality in the ARB_compatibility extension and Compatibility profile will continue to operate at maximum performance.

This thing (the disabling of ‘professional’ features, or the crippling of the actual hardware capabilities, depending on how you want to see it) seems to be happening with every new generation of hardware…
I can understand custom profiles for some applications, but the crippling of some selected set of features is another thing completely.

I will post the results of the transferbench on a quadro 3800 in a few days.

I do not think that deprecation has anything to do with the slow transfer. Even the transfer with PBO (which is not deprecated) is slow.

I’ve already tested a Quadro FX 3800. It is fast (even faster than GeForce). But this card is not Fermi based. The problem is only with Fermi cards.

Fermi GeForce. Fermi Quadro is fine.

Hi,
I experimented with the source code from this thread to investigate download rates. I am on a GTX 480 with r266 drivers. I noticed a very strange and disturbing effect. After letting the benchmark run for more than a few seconds, I noticed that the PBO glReadPixels performance dropped. The download time nearly doubled, from ~2.5ms to ~4.8ms. This happens reproducibly every time after 10-15s of runtime.

I suspect the driver is moving the buffer object to another memory region after some usage analysis, and that the new region is actually worse than the first one. I came across this behavior of the nvidia drivers some time ago on another project, but never found a way around it…

Has anyone else noticed this behavior on other GPUs and drivers?

I commented out the other parts of the benchmark. This is the code I used:


	// PBO glReadPixels
	QueryPerformanceCounter(&start_ticks);
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, pbo[0]);
	glBufferData(GL_PIXEL_PACK_BUFFER_ARB, 1280*720*4, NULL, GL_STREAM_READ);
	glReadPixels(0, 0, 1280, 720, GL_RGBA, GL_UNSIGNED_BYTE, BUFFER_OFFSET(0));
	//glReadPixels(0, 0, 1280, 720, GL_BGRA, GL_UNSIGNED_INT_8_8_8_8_REV, BUFFER_OFFSET(0));
	void* mem = glMapBuffer(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY); // blocks until data is available
	glFlush();
	QueryPerformanceCounter(&ende_ticks);
	pbo_rp_ms = ((double) ende_ticks.QuadPart - (double) start_ticks.QuadPart) / frequenz.QuadPart * 1000.0;

	QueryPerformanceCounter(&start_ticks);
	memcpy( dump1, mem, 1280*720*4 );
	QueryPerformanceCounter(&ende_ticks);
	rp_memcpy_ms = ((double) ende_ticks.QuadPart - (double) start_ticks.QuadPart) / frequenz.QuadPart * 1000.0;

	glUnmapBuffer(GL_PIXEL_PACK_BUFFER_ARB);
	glBindBuffer(GL_PIXEL_PACK_BUFFER_ARB, 0);
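
To make a step change like the one described above easier to spot, and to compare it across GPUs and drivers, the per-frame timing could be averaged over fixed-size windows and logged. The following is only a sketch of such a helper in C. The names `window_mean`, `window_mean_init` and `window_mean_add` are hypothetical and not part of transferBench.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates per-frame timings and yields one mean per fixed-size window. */
typedef struct {
    size_t window;   /* frames per window */
    size_t count;    /* frames accumulated in the current window */
    double sum_ms;   /* sum of timings in the current window */
} window_mean;

static void window_mean_init(window_mean *w, size_t window)
{
    assert(w != NULL && window > 0);
    w->window = window;
    w->count = 0;
    w->sum_ms = 0.0;
}

/* Adds one sample. Returns 1 and stores the window mean in *mean_ms when
   a window completes, otherwise returns 0. */
static int window_mean_add(window_mean *w, double ms, double *mean_ms)
{
    assert(w != NULL && mean_ms != NULL);
    w->sum_ms += ms;
    if (++w->count < w->window)
        return 0;
    *mean_ms = w->sum_ms / (double)w->window;
    w->count = 0;
    w->sum_ms = 0.0;
    return 1;
}
```

Feeding `pbo_rp_ms` into `window_mean_add` once per frame and printing each completed mean with a timestamp would show whether the jump from ~2.5ms to ~4.8ms happens at a fixed point in the run.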