NV40 glReadPixels..

Curious that there is such an interest in glReadPixels and glDrawPixels all of a sudden.

You guys aren’t expecting a huge leap for sync reads, right? Even if it is an NV40, or even on PCI-Ex with a native interface…

Originally posted by V-man:
[b]Curious that there is such an interest in glReadPixels and glDrawPixels all of a sudden.

You guys aren’t expecting a huge leap for sync reads, right? Even if it is an NV40, or even on PCI-Ex with a native interface…[/b]
actually, if you go through the forums on the web, the request for it has always been there.

and, nutty, you don’t get it, do you? async or sync doesn’t matter if the data transfer rate is just too low. try to get decent fps if you want to read back each frame at full res. you simply can’t, at any really useful res, on any card right now. and why? for NO reason. if you need it to be nonblocking and don’t have an extension for that, use a thread. i’ve done that, and it works great. but the raw data transfer rate is so low that you can’t get anything useful out of it.
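just to show what i mean with the thread approach, here’s a rough sketch. the makeSharedContextCurrent() helper is hypothetical (on win32 that would be a second context shared via wglShareLists); the main thread copies the frame into a shared texture with glCopyTexSubImage2D and then signals the reader, so the blocking download never happens in the rendering thread:

/* sketch: do the blocking readback in a worker thread that owns its own,
   shared GL context, so the rendering thread never stalls on it.
   makeSharedContextCurrent() is a hypothetical platform helper. */
#include <pthread.h>
#include <GL/gl.h>

#define W 720
#define H 486

extern void makeSharedContextCurrent(void);   /* hypothetical: bind 2nd shared context */

GLuint          frameTex;                     /* texture shared between the contexts */
GLubyte         pixels[W * H * 4];
pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;
int             frameAvailable = 0;

void *readerThread(void *arg)
{
    (void)arg;
    makeSharedContextCurrent();               /* second context, sharing objects */
    for (;;) {
        pthread_mutex_lock(&lock);
        while (!frameAvailable)
            pthread_cond_wait(&ready, &lock); /* wait for the next finished frame */
        frameAvailable = 0;
        pthread_mutex_unlock(&lock);

        /* the blocking download happens here, not in the rendering thread */
        glBindTexture(GL_TEXTURE_2D, frameTex);
        glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_UNSIGNED_BYTE, pixels);
        /* ... process pixels ... */
    }
    return NULL;
}

/* rendering thread, once per frame, after glCopyTexSubImage2D into frameTex: */
void signalFrameDone(void)
{
    glFinish();                               /* make sure the copy has completed */
    pthread_mutex_lock(&lock);
    frameAvailable = 1;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}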

of COURSE nonblocking async readback helps scheduling it better, but it does NOT help to gain the NEEDED bandwidth to actually get your data back. and there are many scenarios where this is not only useful but required.

but the biggest reason why it should work is simple: agp defines high-speed readback. it’s in the specs. it’s entirely doable. there is NOTHING preventing it. i BET it doesn’t really take much more work for the vendors.

but now that pcie is underway, i don’t think they really care anymore. how old is agp now? it’s been around for years, and in all that time NO gaming gpu has EVER supported the fast readback that was defined right from the start. depressing.

but the biggest reason why it should work is simple: agp defines high-speed readback. it’s in the specs. it’s entirely doable. there is NOTHING preventing it. i BET it doesn’t really take much more work for the vendors.
That reminds me of something that nVidia’s Matt mentioned back in the early AGP 2x/4x days. He mentioned something about most motherboards not supporting AGP fully correctly, thus requiring various work-arounds or giving generally poor performance. Some AGP 4x boards wouldn’t get past 2x performance, even in AGP performance tests. I wonder if it is possible that this is really a motherboard hardware problem more than anything. After all, if none of the AGP implementations today (as before) fully implement the spec, then it’s kind of hard for graphics card vendors to make their cards use that functionality.

I’m not saying that this is the case; I don’t have enough information either way. But there is some precedent for motherboards impeding the performance of various graphics operations.

Originally posted by V-man:
You guys aren’t expecting a huge leap for sync reads, right? Even if it is an NV40, or even on PCI-Ex with a native interface…
What’s the point in having bi-directional bandwidth if it’s not going to be used?

I doubt the NV40 will show much/any improvement but I expect future cards to.

Having said that, the Quadro version of the NV40 does appear to have 5x readback performance, as I mentioned earlier.

That reminds me of something that nVidia’s Matt mentioned back in the early AGP 2x/4x days
That’s correct, but that thread was about people installing their nvidia drivers and finding that the control panel says PCI mode instead of AGP something-X, and also about the drivers disabling write combining on VIA systems.

I think it’s the same situation with ATI and perhaps others.

I don’t know jack about GPU design and know next to nothing about drivers, but it’s quite possible that the reason you get poor readback is neither AGP nor lazy driver writing. It could be that the GPU doesn’t like a glFinish.

Who knows …

Having said that, the Quadro version of the NV40 does appear to have 5x readback performance, as I mentioned earlier.
Could you pinpoint the page and location?
I only saw game performance benchmarks.

AGP 5x? That would mean a little over 1 GB/s.

Keep in mind that if it’s async performance, then it’s not a true measure of how fast a glFinish followed by a readback can occur.

I want to see numbers with sync reads.
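For reference, a sync-read measurement of that kind is simple to set up. A rough sketch (context and window setup not shown; POSIX gettimeofday, arbitrary buffer size):

/* time a fully synchronous readback: drain pending GL work first,
   then measure only the glReadPixels call and derive MB/s.
   assumes a current GL context; sizes are arbitrary. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <GL/gl.h>

static double nowMs(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

void measureSyncRead(int w, int h)
{
    GLubyte *buf = malloc((size_t)w * h * 4);

    glFinish();                        /* nothing pending: we time only the read */
    double t0 = nowMs();
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, buf);
    double ms = nowMs() - t0;          /* a read into client memory blocks until done */

    double mb = (double)w * h * 4 / (1024.0 * 1024.0);
    printf("sync read: %.2f ms, %.1f MB/s\n", ms, mb / (ms / 1000.0));
    free(buf);
}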

I’m not sure where the article was exactly, but translate this page into English and search for the word “read”:
http://news.hwupgrade.it/12266.html

Async performance would be a silly thing to measure. I really don’t think they are referring to that. Apart from anything else async readback is a driver feature and not specific to the FX4000.

sync or async doesn’t matter. this is just an api issue. performance is a driver/hw issue. why they don’t allow agp mode for readback is beyond my understanding (except for the “uh, who cares about that? it costs money to implement, so forget it… nobody needs it anyway, and if someone does, we’ll have some nice powerpoint presentation showing him otherwise” reason).

first, we need fast readback (the latency doesn’t matter for the speed of the readback, just for how long it takes until the transfer begins. but you all know that)

it’s like me, running a celeron, knowing it’s a full-fledged p4, and knowing 90% of its power is simply there for nothing. but in my case, it’s because the hw isn’t there (the cache, i mean, is broken). in the case of agp readback, IT’S ALL THERE, THEY WERE JUST LAZY. that makes me rather aggressive…

a nice async api, that’s a completely different issue. of course that will still be needed/useful.
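just for the record, one shape such an async api could take is pixel-buffer-object style readback (ARB_pixel_buffer_object, or the older EXT variant), where glReadPixels targets a buffer object and only the later map can block. rough sketch, assuming the extension entry points are loaded; and note again that this changes the scheduling, not the raw transfer rate:

/* sketch of PBO-style asynchronous readback; assumes the
   ARB_pixel_buffer_object / ARB_vertex_buffer_object entry points are
   loaded.  it overlaps the copy with other work but does nothing for
   the raw readback bandwidth. */
#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>

static GLuint pbo;

void startAsyncRead(int w, int h)
{
    if (!pbo)
        glGenBuffersARB(1, &pbo);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    glBufferDataARB(GL_PIXEL_PACK_BUFFER_ARB, w * h * 4, NULL, GL_STREAM_READ_ARB);

    /* with a pack buffer bound, the last argument is an offset into the
       buffer, and the call can return before the data has arrived */
    glReadPixels(0, 0, w, h, GL_BGRA_EXT, GL_UNSIGNED_BYTE, (void *)0);
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}

void finishAsyncRead(int w, int h, void *dst)
{
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, pbo);
    void *src = glMapBufferARB(GL_PIXEL_PACK_BUFFER_ARB, GL_READ_ONLY_ARB);
    if (src) {
        memcpy(dst, src, (size_t)w * h * 4);  /* this is where a stall can still happen */
        glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_ARB);
    }
    glBindBufferARB(GL_PIXEL_PACK_BUFFER_ARB, 0);
}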

First: to the people who think glReadPixels is not very useful and doesn’t require much attention: there is a lot more to graphics than games these days… especially since programmable GPUs. My particular task is 3d hw rendering with OpenGL. My biggest bottleneck: glReadPixels, and I can prove to you it’s not hw, it’s not me, it’s a small amount of really bad code in the drivers.

A glaring issue with glReadPixels that hasn’t received much attention here is data conversion, i.e. pulling 16-bit per component float buffers into unsigned short images, or even flipping the channel ordering of good old 8 bits per component…
And in most of these cases you can improve on the driver performance by several orders of magnitude by pulling the data off the card in a different format and doing the conversion yourself… which is of course absurd.
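To illustrate that workaround: a rough sketch, assuming (as the numbers below suggest) that plain rgba/unsigned short is the fast path on this hardware and that “argb” simply means that byte order in client memory:

/* workaround sketch: read back in the layout the driver handles quickly
   (plain RGBA unsigned short here) and reorder to ARGB on the CPU.
   a naive loop like this costs a few ms on a 720x486 buffer, far less
   than the slow driver paths measured below. */
#include <GL/gl.h>

void readAsARGB16(int w, int h, unsigned short *px /* w*h*4 elements */)
{
    glReadPixels(0, 0, w, h, GL_RGBA, GL_UNSIGNED_SHORT, px);

    for (int i = 0; i < w * h; ++i) {
        unsigned short r = px[i * 4 + 0];
        unsigned short g = px[i * 4 + 1];
        unsigned short b = px[i * 4 + 2];
        unsigned short a = px[i * 4 + 3];
        px[i * 4 + 0] = a;             /* RGBA -> ARGB in place */
        px[i * 4 + 1] = r;
        px[i * 4 + 2] = g;
        px[i * 4 + 3] = b;
    }
}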

What’s really absurd in this respect is that the people at ATI and NVidia have spent more time exchanging email just with me personally than it would take a summer intern from a sub-par cs department to fix their drivers. Let me show you:

Create a simple glut app with a pbuffer and a timer and do some stuff like this:

resetTime();
glReadPixels(0, 0, tw, th, GL_BGRA_EXT, GL_UNSIGNED_SHORT, data16);
printf("%.1f ms\n", millisecondDeltaTime());
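For completeness, the two timing helpers above could be as simple as this (a sketch with POSIX gettimeofday; a QueryPerformanceCounter version is the Windows equivalent):

/* minimal versions of the timing helpers used in the snippet above */
#include <sys/time.h>
#include <GL/gl.h>

static struct timeval startTime;

void resetTime(void)
{
    glFinish();                        /* don't count rendering still in flight */
    gettimeofday(&startTime, NULL);
}

double millisecondDeltaTime(void)
{
    struct timeval now;
    gettimeofday(&now, NULL);
    return (now.tv_sec  - startTime.tv_sec)  * 1000.0 +
           (now.tv_usec - startTime.tv_usec) / 1000.0;
}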

What you will see is huge drops in performance for certain data types. Here are some numbers I ran on my 2.5 GHz Athlon with a Radeon 9700 and a GeForce FX 5200 on 4-channel 720x486 buffers. Check the absurd numbers for 16 bits per component on ATI… And BTW, doing any kind of conversion on the CPU on these buffers with the most simplistic method imaginable takes no more than 3.5 ms.

for a GeForce FX 5200:

time to send 8bit per component rgba to card: 3.4ms
time to send 8bit per component bgra to card: 3.8ms
time to send 8bit per component abgr to card: 3.3ms
time to send 8bit per component argb to card: 3.5ms

time to read 8bit per component rgba from card: 15.2ms :confused:
time to read 8bit per component bgra from card: 10.7ms
time to read 8bit per component abgr from card: 10.7ms
time to read 8bit per component argb from card: 10.5ms

time to read 8bit per component rgba from card as 32bit float : 11.6ms
time to read 8bit per component bgra from card as 32bit float : 26.0ms :confused:

time to send 16bit per component rgba to card as unsigned short : 19.8ms
time to send 16bit per component bgra to card as unsigned short : 21.9ms
time to send 16bit per component rgba to card as short : 19.9ms
time to send 16bit per component bgra to card as short : 21.6ms

time to send 16bit per component rgba to card as 32bit float : 13.9ms
time to send 16bit per component bgra to card as 32bit float : 19.9ms

time to read 16bit per component rgba from card as short : 20.2ms
time to read 16bit per component bgra from card as short : 22.7ms
time to read 16bit per component argb from card as unsigned short: 53.1ms :mad:
time to read 16bit per component abgr from card as unsigned short: 69.0ms

time to read 16bit per component rgba from card as 32bit float : 23.7ms
time to read 16bit per component bgra from card as 32bit float : 24.3ms

for a Radeon 9700:

time to send 8bit per component rgba to card: 5.6ms
time to send 8bit per component bgra to card: 5.8ms
time to send 8bit per component abgr to card: 40.8ms :mad:
time to send 8bit per component argb to card: 51.9ms

time to read 8bit per component rgba from card: 23.9ms
time to read 8bit per component bgra from card: 19.3ms
time to read 8bit per component abgr from card: 18.9ms
time to read 8bit per component argb from card: 19.0ms

time to read 8bit per component rgba from card as 32bit float : 70.9ms
time to read 8bit per component bgra from card as 32bit float : 70.8ms

time to send 16bit per component rgba to card as unsigned short : 20.5ms
time to send 16bit per component bgra to card as unsigned short : 16.0ms
time to send 16bit per component rgba to card as short : 20.6ms
time to send 16bit per component bgra to card as short : 16.2ms

time to send 16bit per component rgba to card as 32bit float : 21.1ms
time to send 16bit per component bgra to card as 32bit float : 13.0ms

time to read 16bit per component rgba from card as short : 983.7ms :mad: :confused: :mad:
time to read 16bit per component bgra from card as short : 1002.3ms

time to read 16bit per component rgba from card as 32bit float : 985.2ms
time to read 16bit per component bgra from card as 32bit float : 973.3ms

time to read 16bit per component rgba from card as short : 983.7ms
time to read 16bit per component bgra from card as short : 1002.3ms

time to read 16bit per component rgba from card as 32bit float : 985.2ms
time to read 16bit per component bgra from card as 32bit float : 973.3ms
That’s getting a little degenerate :eek: Is there some kind of delaying for-loop going on in there? :wink:

While converting between integer and floating-point is a “slow” operation, there’s something more going on in that ATi driver than just data conversion, or even data download. Even at PCI bus speeds, a full second is enough to read a few hundred MB of data. So it isn’t the transfer. And data conversion, as you pointed out, can’t take that long. So it isn’t data conversion. What’s left? Some nonsense?
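To put rough numbers on that (assumed bus figures, not measurements):

/* back-of-the-envelope check: even over plain PCI, a 720x486 buffer at
   16 bits per component is nowhere near a second of transfer time.
   the bus figures below are assumptions for the sake of the argument. */
#include <stdio.h>

int main(void)
{
    const double bytes  = 720.0 * 486.0 * 4 /* channels */ * 2 /* bytes per component */;
    const double pciBw  = 133e6;   /* classic PCI, roughly 133 MB/s */
    const double agp1Bw = 266e6;   /* AGP 1x, roughly 266 MB/s */

    printf("buffer size:     ~%.1f MB\n", bytes / 1e6);              /* ~2.8 MB  */
    printf("PCI transfer:    ~%.1f ms\n", 1000.0 * bytes / pciBw);   /* ~21 ms   */
    printf("AGP 1x transfer: ~%.1f ms\n", 1000.0 * bytes / agp1Bw);  /* ~10.5 ms */
    return 0;
}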

Maybe ATi has been devoting its driver development resources to other things, and the read path for the more… unusual formats (you have to admit, reading 16 bits per component is a bit off the beaten path) is using very old driver code. Like pre-8500 driver code. Stuff written back in the days when ATi’s drivers were really crappy.

What’s really absurd in this respect is that the people at ATI and NVidia have spent more time exchanging email just with me personally than it would take a summer intern from a sub-par cs department to fix their drivers.
I wouldn’t be so sure. Code can get really ugly, especially if driver development teams have changed a few times over the years. Brutal hacks can form, and so on.

I’m not entirely sure what can be done to improve glReadPixels performance. Let me state that another way: both ATi and nVidia make their money off of gamers. As such, those features/optimizations have driver development priority. Correcting the more atrocious cases would likely take a week or so of one driver programmer’s time, possibly to rearchitect the read-pixel path. Of course, since that time could go to game performance enhancements or supporting features for new hardware, it is likely that it won’t get allocated any time soon.

The best way to get ATi and nVidia on board is to get game developers to try to use glReadPixels. Granted, because it is slow, they won’t do it, so it becomes a catch-22. Either that, or one of the two sides goes ahead and allocates the time to improve performance, which would force the other side to do the same to stay competitive.

“The Quadro FX 4000 is based on the NV40GL, which is a superset of the NV40 but with additional hardware and software features”
http://www.xbitlabs.com/news/video/display/20040428145354.html

So that probably means that the readpixel performance improvement won’t be seen in consumer cards based on the NV40. Unfortunately, readpixel speed tends to be seen as something only useful for ‘professionals’.

I’m just speculating.

Hi,
The ‘GL’ suffix is typical for the Quadro chips; it doesn’t mean they are so much different. Probably, at manufacture, the gaming chips and the ‘GL’ chips are 100% identical; then some features are disabled on the gaming chips, and the DRIVER makes the rest of the difference.
Here are a couple of reasons for the NV40 to have 4x readback:

  1. The PCIE version will have 4x readback through the HSI; this means that the NV40 chip has 4x capability, and hopefully the AGP 8x version will take advantage of it (if the drivers expose it).
  2. The ATI R423 (PCIE) should have 16x capability (at least theoretically), and maybe the R420 has decent readback capability, so the NV40 could look bad if it sticks with the current AGP 1x readback.

Ok, but I’d be surprised if NV’s marketing department decided not to mention it in the consumer NV40 release even though it existed. Despite fast readback being of no immediate relevance to consumers, it’s still another bullet point.

Hi,
It seems that the NV45 will have the same readback bandwidth as the NV40, since it isn’t native PCIE but has the bridge integrated.
Someone from nVidia PLEASE correct me if I’m wrong…