PCI Express - Is anything going to change?

Graphics slots will be PCI Express 16X from the very beginning (at least on the GFX card; I’d double-check the mobo you buy… FYI, it looks like NVIDIA’s first-gen cards will have a bus bridge on the GFX card to support PCI Express, probably negating many of the benefits).

It looks like the other system PCI Express slots will be 1X on consumer boards. That’s still an improvement over typical consumer mobos’ current PCI implementations, and each slot has bandwidth independent of the traffic to other slots, so again that’s an improvement.

@bunny: The bandwidth loss of “stopping the bus” is minimal compared to the big stall of reading the query results too early. That’s what you’ve been saying, right? Reading back on the bus stops download bandwidth from being used for the few cycles it takes to read?

If you have big performance problems with occlusion query, like you suggest you have, then it’s much more likely you’re querying too early (or too often) rather than the small loss caused by the AGP reading back.

If you have small-ish performance problems, then it’s likely that AGP bus bubbling is the problem, and you should find another solution to your querying need. I just reacted to your saying that occlusion query is “almost useless”; I don’t think that’s justified at all.
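The pattern I mean is roughly the following (a minimal sketch only, assuming ARB_occlusion_query and hypothetical drawBoundingBox()/drawObject() helpers, with NUM_OBJECTS up to you): issue all the queries first, submit other work, and only then read the results back.

[code]
GLuint queries[NUM_OBJECTS];
glGenQueriesARB(NUM_OBJECTS, queries);

/* 1. Issue every query up front against cheap proxy geometry. */
for (int i = 0; i < NUM_OBJECTS; ++i) {
    glBeginQueryARB(GL_SAMPLES_PASSED_ARB, queries[i]);
    drawBoundingBox(i);                       /* hypothetical helper */
    glEndQueryARB(GL_SAMPLES_PASSED_ARB);
}

/* 2. Submit other rendering work here so the GPU has time to finish. */

/* 3. Only now read the results; by this point they are usually ready,
      so the read does not stall the pipeline for long. */
for (int i = 0; i < NUM_OBJECTS; ++i) {
    GLuint samples = 0;
    glGetQueryObjectuivARB(queries[i], GL_QUERY_RESULT_ARB, &samples);
    if (samples > 0)
        drawObject(i);                        /* visible: draw the real geometry */
}
[/code]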

Today’s AGP 8x is a high performer, and not many games are maxing out the bus during gameplay.
Other types of software that need to continuously send large quantities of data to the card (like video replay) might benefit from PCI-Ex.

PCI-Ex 16x is nice and all but this is something for the future IMO. It’s not a miracle solution.

And hearing that there will be at most 1 PCI-Ex 16x slot is not an issue for me.
Not many people will have 2 (or more) dual-head cards in their system for driving 4 (or more) monitors.

In fact, on another forum I’m on, one guy wants to have 4 monitors with a PCI-Ex system.

The drivers may not be perfect, but the AGP bus just isn’t optimised for reading back data, and therefore the fundamental problem lies with the bus.

No, the primary problem is the drivers since reading back a dumb block of memory should NOT be slow when you match formats.

The same goes for writing using glDrawPixels. It should not be so slow.
Why create a texture and render a billboard when you can use DrawPixels?

Originally posted by V-man:
[b]Today’s AGP 8x is a high performer, and not many games are maxing out the bus during gameplay.
Other types of software that need to continuously send large quantities of data to the card (like video replay) might benefit from PCI-Ex.

PCI-Ex 16x is nice and all but this is something for the future IMO. It’s not a miracle solution.

And hearing that there will be at most 1 PCI-Ex 16x slot is not an issue for me.
Not many people will have 2 (or more) dual-head cards in their system for driving 4 (or more) monitors.

In fact, on another forum I’m on, one guy wants to have 4 monitors with a PCI-Ex system.

The drivers may not be perfect, but the AGP bus just isn’t optimised for reading back data, and therefore the fundamental problem lies with the bus.

No, the primary problem is the drivers since reading back a dumb block of memory should NOT be slow when you match formats.

The same goes for writing using glDrawPixels. It should not be so slow.
Why create a texture and render a billboard when you can use DrawPixels?[/b]
Note the use of the word “fundamental” in my post: a bottleneck in hardware is fundamental because it can’t be worked around; a problem with drivers is easily remedied. I’d be surprised if the problem doesn’t improve with PCI express.

I agree about glDrawPixels though; there seems little reason for that being slow.

No, the primary problem is the drivers since reading back a dumb block of memory should NOT be slow when you match formats.
So, explain precisely how the driver should fix the problem of the bus across which the data is being transferred being excruciatingly slow? Not to mention the fact that said bus does not allow bidirectional data transfer, thus every glReadPixels call will provoke a glFinish().

This is not a driver problem.

I agree about glDrawPixels though; there seems little reason for that being slow.
My guess with glDrawPixels is that it can be implemented in 2 ways.

One is to directly write pixels to the framebuffer. This, among other things, requires a full glFinish. Also, this probably violates the GL spec because it probably says that per-fragment and post-fragment processing happen on glDrawPixels as well as other methods of rendering.

The other is that, to do a glDrawPixels, they have to internally create a texture (memory alloc. Not fast), load it with your data (memcopy. Also not fast), change a bunch of per-vertex state so that they can draw a quad (state change), draw the quad, and then change the state back (state change).
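A rough sketch of that texture-plus-quad path, for illustration only: it assumes an RGBA8 scratch texture (scratchTex) already created at the right size and a 1:1 orthographic projection, and a real driver would have to save and restore far more state than shown.

[code]
void emulatedDrawPixels(GLuint scratchTex, int x, int y,
                        int width, int height, const void *pixels)
{
    /* "memcpy" the user data into the scratch texture */
    glBindTexture(GL_TEXTURE_2D, scratchTex);
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, width, height,
                    GL_RGBA, GL_UNSIGNED_BYTE, pixels);

    glPushAttrib(GL_ENABLE_BIT);          /* crude state save    */
    glEnable(GL_TEXTURE_2D);

    /* draw a screen-aligned quad where the pixels should land */
    glBegin(GL_QUADS);
    glTexCoord2f(0, 0); glVertex2i(x,         y);
    glTexCoord2f(1, 0); glVertex2i(x + width, y);
    glTexCoord2f(1, 1); glVertex2i(x + width, y + height);
    glTexCoord2f(0, 1); glVertex2i(x,         y + height);
    glEnd();

    glPopAttrib();                        /* crude state restore */
}
[/code]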

Ultimately, glDrawPixels is just not a good idea. Hardware’s designed for drawing triangle strips, not arbitrary bitmaps from main memory.

But glReadPixels performance definitely should improve. As long as there’s nothing in the hardware itself (outside of the bus) that prevents it.

jwatte: Like I said, I tried a number of things to reduce the number of queries. I was only able to make small gains. It’s possible by persisting further that I could have improved it more, but the framerate hit in the scene I was rendering was so bad that it just wasn’t worth it. Bear in mind that the usefulness of OC is dependent entirely on the type of scene. In the scene I was rendering, much of the time OC wouldn’t cull much of the geometry at all. SW rast just seems like a more robust solution for what I’m doing, and I’m certainly not alone in coming to that conclusion.

Originally posted by Korval:
[quote]No, the primary problem is the drivers since reading back a dumb block of memory should NOT be slow when you match formats.
So, explain precisely how the driver should fix the problem of the bus across which the data is being transferred being excruciatingly slow? Not to mention the fact that said bus does not allow bidirectional data transfer, thus every glReadPixels call will provoke a glFinish().
[/QUOTE]The reason why glReadPixels implies a “glFinish()” is unrelated to the bidirectional data transfer.
If your graphics card has proper support for glReadPixels, the only reason the driver needs to sync the card (“glFinish()”) is the synchronous nature of glReadPixels in the OpenGL spec: the application must have the data available by the time the glReadPixels call returns.

What is really necessary is an asynchronous glReadPixels; it’s of little use to have the fastest readback bus in the world if you have to sit idle until the current rendering has finished and your data has returned to the CPU.

By pipelining glReadPixels calls, you should be able to hide most of your latencies.

This is not a driver problem.

[quote]I agree about glDrawPixels though; there seems little reason for that being slow.
My guess with glDrawPixels is that it can be implemented in 2 ways.

One is to directly write pixels to the framebuffer. This, among other things, requires a full glFinish. Also, this probably violates the GL spec because it probably says that per-fragment and post-fragment processing happen on glDrawPixels as well as other methods of rendering.
[/QUOTE]Well that shouldn’t be a problem, because a glDrawPixels is treated as a point wrt texture sampling and color interpolation, so the whole quad gets the same color & texel values.

Anyway, directly writing things to the framebuffer is very bad, and that’s why a function like glDrawPixels - contrary to what you think - is good, because it abstracts the app from the underlying video memory layout and its use doesn’t force a pipeline flush (unlike a buffer “lock”).

The other is that, to do a glDrawPixels, they have to internally create a texture (memory alloc. Not fast), load it with your data (memcopy. Also not fast), change a bunch of per-vertex state so that they can draw a quad (state change), draw the quad, and then change the state back (state change).

There’s a third method, which is that the graphics card supports drawpixels natively, where the pixel data is supplied as fragment data, in the same way a graphics card supports “texture downloads” (those seem to be “texture uploads” for non-driver people).
glDrawPixels (or glReadPixels, for that matter) has never been a priority for consumer cards; that’s why you don’t find “fast” implementations of those, but I’m sure you can find them in workstation-class boards (DCC applications like Maya perform tons of glDrawPixels/glCopyPixels).

On the other hand, the second method doesn’t need to be slow at all. You don’t need to allocate the texture every time; you can use a scratch texture, or even a pool of them if you want to be able to pipeline glDrawPixels calls. Loading the texture with the data is a data transfer that you have to do anyway (even in the native-support case), and the state juggling & drawing a quad with that texture once it’s in video memory is fast.

Ultimately, glDrawPixels is just not a good idea. Hardware’s designed for drawing triangle strips, not arbitrary bitmaps from main memory.

I don’t agree with that; in fact, I believe that glDrawPixels is a great tool to avoid having to “lock” framebuffers around and guess which format things are really stored in, or forcing the hardware vendors to implement a given memory layout.

The reason why glReadPixels implies a “glFinish()” is unrelated to the bidirectional data transfer.
It’s true that glReadPixels has to synchronize for other reasons, but this is one of them. Not the only one, but one.

What is really necessary is an asynchronous glReadPixels
Which is what PBO is supposed to offer.
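Something along these lines, for example (a sketch only, assuming the EXT/ARB pixel buffer object extension, a buffer object pbo already created and sized for the read, width/height describing the area being read, and a hypothetical processPixels() consumer):

[code]
/* The readback targets the buffer object, so glReadPixels can return
   immediately; the copy to client memory is deferred until the map. */
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, pbo);
glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, 0); /* offset 0 */

/* ... render more, or do CPU work, while the transfer completes ... */

void *data = glMapBufferARB(GL_PIXEL_PACK_BUFFER_EXT, GL_READ_ONLY_ARB);
if (data) {
    processPixels(data);                      /* hypothetical consumer */
    glUnmapBufferARB(GL_PIXEL_PACK_BUFFER_EXT);
}
glBindBufferARB(GL_PIXEL_PACK_BUFFER_EXT, 0);
[/code]

With a small pool of such buffers you can keep several readbacks in flight and pipeline them across frames.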

There’s a third method and is that the graphics card supports drawpixels natively, where the pixel data is supplied as fragment data, in the same way a graphics card supports “texture downloads” (those seem to be “texture uploads” for non-driver people).
You’re talking about a graphics card where “fragment data” is something more than the result of per-vertex interpolation and scan conversion. I would imagine that, for most cards, this is simply not a reasonable way to build the card. Fragments are generated from the scan converter and interpolation units directly; there’s no “backdoor” that can be used to insert a fragment.

You don’t need to allocate the texture every time; you can use a scratch texture, or even a pool of them if you want to be able to pipeline glDrawPixels calls
Or, you can simply not care and simplify your driver development. The 3 applications that really want to do glDrawPixels probably don’t need to do them fast.

and the state juggling & drawing a quad with that texture once it’s in video memory is fast.
Fast? You’ve just swapped out your vertex program, as well as all of its parameters, let alone any other state that isn’t supposed to affect glDrawPixels operations. Plus, you have to put it back after the operation. State changing isn’t slow just because it’s the application asking for it; it is slow because of the stall bubbles it provokes in various pieces of the pipeline.

I believe that glDrawPixels is a great tool to avoid having to “lock” framebuffers around and guess which is the format things are really stored into or forcing the hardware vendors to implement a given memory layout
The third option is, of course, to simply not do it. Don’t do things where you need something that only glDrawPixels can do. Blit-from-memory operations are bad; that is why textures live in server memory, not client memory. glDrawPixels is just a bad idea. Any drawing method that requires any form of virtually direct access to the framebuffer (effectively writing pixels to the FB) like this is a bad idea and should be avoided at all costs.

To chime in on the occlusion query debate…

I found that the calls to occlusion query themselves take up a significant amount of time if done too often (independent of when or how you read back). So any strategy that reduces the number of queries is a good one…

Michael

Originally posted by bunny:
[b]

Also, last time I checked, glReadPixels wasn’t too hot on my GeForce 2 Pro either, although admittedly it’s an old card and I haven’t updated the drivers for 6 months. Perhaps the situation is different with newer cards and drivers?[/b]
I don’t think readback performance has changed much with newer graphics hardware. The PDR extension makes a big difference, though, because it allows readpixels to work asynchronously, so readpixels (in some scenarios) is almost free.

If we were talking about a 10% difference in performance then I would agree with you but the difference is very large. If only one vendor has fast readback then it won’t be used in games which is why all vendors should at least try to be close in terms of performance/capability. Even in its currently slow state readpixels is useful. PDR and NV’s hardware/drivers have made it much more viable.

Faster readback with pci express has the potential to radically change the way graphics engines are written imo.

Since occlusion queries are not bandwidth-bound, the only difference I can see between an AGP/PCI implementation and PCI-Express is lower latency due to less bus turnaround and synchronization. The rest is in the GPU and driver.

It would be nice if occlusion queries were more abstracted so you could actually get whatever rendering statistics about whatever part of the pipeline you wanted. Self-tuning engines would be interesting. Maybe you could also have fragment-shader-writable/queryable accumulators available as well, but that’s definitely going off-topic…

The status of the pixel pack path is pretty annoying. NVIDIA supports asynchronous readback, but only at 64-bit PCI write speeds. 3Dlabs supports 4x AGP writes, but only synchronously. ATI supports neither. Who knows; the driver could be pulling the data off the card one pixel at a time. Also, these implementations typically only have their fast paths for UNSIGNED_BYTE RGBA, BGRA, and sometimes RGB and BGR. That means you have to do all the pixel formatting yourself, in software or in a fragment program.
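A readback that tries to stay on one of those common fast paths might look like this (a sketch; readBackBGRA is just an illustrative name, GL_BGRA assumes GL 1.2/EXT_bgra, and which formats are actually fast varies by vendor and driver):

[code]
#include <stdlib.h>   /* malloc/free */

void readBackBGRA(int width, int height)
{
    GLubyte *buf = (GLubyte *) malloc(width * height * 4);

    glPixelStorei(GL_PACK_ALIGNMENT, 1);   /* tightly packed rows */
    glReadPixels(0, 0, width, height, GL_BGRA, GL_UNSIGNED_BYTE, buf);

    /* ... use buf ... */
    free(buf);

    /* Anything fancier (GL_FLOAT, luminance, GL_PACK_ROW_LENGTH strides, ...)
       risks falling off the fast path into a software conversion. */
}
[/code]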

However, all of these guys know that people are interested in optimizing glReadPixels; every time I mention it to developer relations, they claim that they’ve been hearing it more and more. Keep clamoring. I would be surprised if there WEREN’T a fast, asynchronous glReadPixels implementation available within a year. For 3Dlabs, it really is just a matter of driver development. Maybe ATI’s new PCI-Express cards will have it out of the box. Of course, that doesn’t really help those of us who need it now, so we compromise.

I bet we’ll have fast glReadPixels on AGP as well. Say NVIDIA decides (when they make their PCI-Express-native cards) to implement fast bidirectional transfers: when they slap on their HSI bridge, you get fast AGP writes and glReadPixels.

An interesting thing about PCI-Express is the potential to have multiple fast-bus video cards operating in tandem. Of course, AGP 3.0 has this as well, but no one seems to care yet, so we’ll see. It’ll get to be a bigger issue when people really start performing computation on video cards, which requires fast glReadPixels to some extent.

-Won

The problem with optimizing glDrawPixels or glTexImage etc. is that the OpenGL spec allows all sorts of data types, format conversions, alignment and swizzle operations, including memory strides, offsets, and so on. Even LUTs and convolutions are there in the imaging extensions. It is not straightforward. Sure, setting this up for a simple case should be fast & easy, but getting the kind of coverage to reliably support various formats and types (internal and external) with different memory alignments etc. is probably a huge pain in the ass. So you get fast coverage for some common stuff and a fallback code path for a lot of other stuff, unless it suddenly becomes more important than figuring out how to cheat at the next Futuremark benchmark. It is improving steadily (it seems to me); it used to really suck.

That’s life. I’m still glad I can buy the fastest most programmable graphics on the planet for $500 at “Best Buy”.