GF3 Z-occlusion performance? (Sorry, not OpenGL related)

Originally posted by Humus:
Btw, I don’t know if the reasoning that Radeon should benefit more because it has less bandwidth actually holds true. You must remember that the Radeon only has two pipelines to feed while GF3 has four. Bandwidth / rendered pixel is higher on the Radeon than on GF3.

This is true if you are comparing the Radeon to the GF2, but NVIDIA wised up for the GF3 and equipped it with four memory controllers. This means there is potentially less bandwidth wasted per clock, since you have the memory in more manageable 32-bit chunks. Less wasted bandwidth means that you can get closer to the theoretical limits of the adapter.

Funk.

Originally posted by Nutty:
[b] Humus, you ain't gonna get that unless someone breaks their NDA with NVIDIA. Divulging the inner workings of how their implementation works is a guaranteed way of getting a slap!

Nutty[/b]

Maybe I should have clarified. I'm not interested in exactly how it's implemented hardware-wise; I don't care how the wires go or how they have packed their transistors. I just want to know the basic concept of GF3's Z-occlusion: whether it's using tiles or doing it per pixel. I mean, it's nothing that can ever be useful for their competitors to know. It's like asking what size its vertex cache is. It's not gonna change the competitors' design decisions, but it may change game developers' design decisions.

Originally posted by Funk_dat:
[b] This is true if you are comparing the Radeon to the GF2, but NVIDIA wised up for the GF3 and equipped it with four memory controllers. This means there is potentially less bandwidth wasted per clock, since you have the memory in more manageable 32-bit chunks. Less wasted bandwidth means that you can get closer to the theoretical limits of the adapter.

Funk.[/b]

Ok, I can understand how you're thinking. But it only explains why the Radeon has a higher percentage gain, not why it has a higher gain in absolute numbers. And it really doesn't explain why GF3, with its more than twice as high fillrate, much faster memory and more sophisticated (I guess) memory subsystem, gets beaten by the close to one year old Radeon even though they supposedly have similar Z-buffering optimizations. Thus there has to be some fundamental difference in the way the two are handling it, and the idea that GF3 would be culling on a per-pixel basis made perfect sense to me and explains it all, as well as why GF3 doesn't benefit so much from its FastZClear as stated by some NVIDIA employee somewhere … until Matt said that "Dave [who told it to me initially] is confused".

You're right… I'm not sure why the tests came out the way they did. I wouldn't make any final judgements, though, until the hardware has actually been released, the drivers have been finalized, and an appropriate test has been run. The tests Ace's Hardware ran were kinda old (they even had a disclaimer saying the results might not be accurate). The Q3 Quaver benchmarking demo might be a good one to use.

Funk.

I just want to know the basic concept of GF3's Z-occlusion: whether it's using tiles or doing it per pixel

I thought we already established that it's doing it per pixel, given the fact that the GeForce series cards are not tile-renderer architecture boards.

AFAIK it is per pixel. I could tell you what I was told about its functionality, but technically I'm still bound under NDA, and therefore shouldn't. TBH it's not really that relevant to developers. I wouldn't worry about it.

Nutty

I don’t even know what “per pixel” means in this context.

Nutty, it sounds like you got some bad info too.

I absolutely hate the fact that the word "tiling" has been completely misinterpreted by all the web sites to mean something completely different than what it really means.

“Tiling” is just a memory layout! It has absolutely nothing to do with rendering architecture – nothing at all.

The proper name for such rendering architectures is “chunkers”, i.e., they batch up the scene into chunks, and they render one chunk at a time rather than one triangle at a time. (This also clarifies the fact that if a scene is too large, more than one chunk may be required.)

Now, a “tile” is just a name for a rectangular collection of pixels.

Am I going to say which operations we do exactly where? Absolutely not.

If someone outside of NVIDIA claims to know those kinds of details about how our rendering architecture works, they’re probably lying. A lot of people in our company don’t know.

Virtually everything I’ve seen in this thread is either (1) misinformation or (2) speculation.

As developers, all you need to know is one simple rule: draw front to back, always.

  • Matt
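
As a rough illustration of that rule, here is a minimal sketch of sorting opaque objects front to back before submitting them (this is not Matt's or NVIDIA's code; Object, drawObject and eyePos are hypothetical application-side names):

#include <algorithm>
#include <vector>

struct Object { float center[3]; /* plus whatever mesh data the app keeps */ };

void drawObject(const Object& obj);   /* app-specific draw call (hypothetical) */

static float distSq(const float a[3], const float b[3])
{
    float dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
    return dx * dx + dy * dy + dz * dz;
}

void drawSceneFrontToBack(std::vector<Object>& objects, const float eyePos[3])
{
    /* Nearest objects first, so later (occluded) fragments can be
       rejected by the depth test before any expensive shading. */
    std::sort(objects.begin(), objects.end(),
              [&](const Object& a, const Object& b)
              { return distSq(a.center, eyePos) < distSq(b.center, eyePos); });

    for (const Object& obj : objects)
        drawObject(obj);
}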

The tests Ace's Hardware ran were kinda old (they even had a disclaimer saying the results might not be accurate). The Q3 Quaver benchmarking demo might be a good one to use.

I think this site has already done it…
http://www.digit-life.com/articles/gf3asusagpv8200deluxe/q3-32-quaver.gif
http://www.digit-life.com/articles/gf3asusagpv8200deluxe/index.html

JackM


Originally posted by Humus:
And it really doesn't explain why GF3, with its more than twice as high fillrate, much faster memory and more sophisticated (I guess) memory subsystem, gets beaten by the close to one year old Radeon even though they supposedly have similar Z-buffering optimizations…

Just taking a QUICK look over the site, it seems to me the GeForce 3 severely dominates the Kyro 2 and does a pretty number on even the Radeon on EVERY test except VillageMark. Interestingly, VillageMark was (as stated) created by Imagination Technologies. Seems to me they have everything to gain by making the GeForce cards look AS BAD AS POSSIBLE. I might suspect that, not only did they tune the app to make the best use of their card, but they also might have gone out of their way at every opportunity to use every feature/renderstate/technique that would slow the GeForce down more than the Kyro. If doing something in an untraditional way gave a 5 percent hit to the GeForce but only a 2 percent hit to the Kyro, then do it that way regardless of whether it is inconsistent with the way 99% of apps do it. If, on the other hand, something else gives a bigger hit to the Kyro, well, they might just conveniently not do that.

As for the Radeon, it might just be coincidence that the Radeon happens to do better at these non-traditional things than the GeForce. You need to remember that 95% or more of games/apps render with pretty much the same styles/methodologies. Also remember that as you tune something to perform well at one task, it generally begins to perform worse at other tasks, sometimes even performing worse at obscure tasks than a completely generalized/unoptimized model. The GeForce 3 could theoretically be more tuned to the way REAL apps do things than the Radeon is, and when you get to something obscure (which VillageMark may be doing) the Radeon may just happen to perform better because it is a more generalized/unoptimized solution.

Again, this is not based on any knowledge of either card or of VillageMark, just on my speculation about why that one benchmark stands out like an eyesore among the other benchmarks.

You got a point there, LordKronos… just noticed that VillageMark was created by Img Tech

But… I think chunking architectures have potential… future games (e.g. Doom 3) will have a huge overdraw factor, and the Kyro II is very efficient at handling it.

Jack


I love this thread … old wars coming back.
I think you've gone too far. Humus just saw some weird behavior of the latest high-tech graphics card versus some older ones, and wanted a technical explanation for it. That's all.

Whooha … lots of replies!
Ok, before I start to answer everyone I must say that Paddy is right; it's not like I'm trying to prove there's something wrong with GF3.

Originally posted by Nutty:
[b] I thought we already established that it's doing it per pixel, given the fact that the GeForce series cards are not tile-renderer architecture boards.

AFAIK it is per pixel. I could tell you what I was told about its functionality, but technically I'm still bound under NDA, and therefore shouldn't. TBH it's not really that relevant to developers. I wouldn't worry about it.

Nutty[/b]

"Established", well, not entirely, but it sounds likely ATM. It also makes sense, since a switch to a tiled method would require a lot of work to redo the whole architecture. And since the S3TC bug still seems to be present in GF3 according to many sites, I guess they've reused much from GF2.

Originally posted by mcraighead:
[b]I don’t even know what “per pixel” means in this context.

Nutty, it sounds like you got some bad info too.

I absolutely hate the fact that the word "tiling" has been completely misinterpreted by all the web sites to mean something completely different than what it really means.

“Tiling” is just a memory layout! It has absolutely nothing to do with rendering architecture – nothing at all.

The proper name for such rendering architectures is “chunkers”, i.e., they batch up the scene into chunks, and they render one chunk at a time rather than one triangle at a time. (This also clarifies the fact that if a scene is too large, more than one chunk may be required.)

Now, a “tile” is just a name for a rectangular collection of pixels.

Am I going to say which operations we do exactly where? Absolutely not.

If someone outside of NVIDIA claims to know those kinds of details about how our rendering architecture works, they’re probably lying. A lot of people in our company don’t know.

Virtually everything I’ve seen in this thread is either (1) misinformation or (2) speculation.

As developers, all you need to know is one simple rule: draw front to back, always.

  • Matt[/b]

Basically, with "per pixel" I mean that it takes a pixel, loads its Z-buffer value and checks it. If it's in front it renders it, otherwise it goes on to the next.
With "tiled" I mean that it uses some small block, in the Radeon's case 8x8 if I'm correctly informed. So it takes a tile, checks the plane of the polygon that it's going to render to that tile, calculates its min & max depth within the tile, checks that against the cached min & max of the tile, and decides whether to cull, render normally, or render with write-only depth.

As a developer though, while the front-to-back rule is the most important, the implementation details may also be important when making design decisions, since the two schemes obviously perform very differently under different conditions. The "tiled" version will perform better under normal conditions where most polys cover more than a whole block, while the "per pixel" version would perform better with extremely high polygon counts where every polygon is smaller than a tile.
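
To make that concrete, here is a minimal sketch of the tile-based test described above, assuming 8x8 tiles that cache the min and max depth of their pixels and the usual smaller-is-closer depth convention (the names are made up; this is speculation about the behaviour, not vendor code):

enum TileDecision { CULL, RENDER_WRITE_ONLY_Z, RENDER_WITH_Z_TEST };

struct ZTile { float minZ, maxZ; };   /* cached depth bounds of one 8x8 block */

TileDecision classifyPolyAgainstTile(const ZTile& tile,
                                     float polyMinZ, float polyMaxZ)
{
    /* The polygon's nearest point is still behind everything stored in
       the tile: no fragment can pass, so the whole block is skipped. */
    if (polyMinZ > tile.maxZ)
        return CULL;

    /* The polygon's farthest point is in front of everything in the tile:
       every fragment passes, so depth can be written without reading. */
    if (polyMaxZ < tile.minZ)
        return RENDER_WRITE_ONLY_Z;

    /* Depth ranges overlap: fall back to the ordinary per-pixel test. */
    return RENDER_WITH_Z_TEST;
}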

Originally posted by LordKronos:
[b] Just taking a QUICK look over the site, it seems to me the GeForce 3 severely dominates the Kyro 2 and does a pretty number on even the Radeon on EVERY test except VillageMark. Interestingly, VillageMark was (as stated) created by Imagination Technologies. Seems to me they have everything to gain by making the GeForce cards look AS BAD AS POSSIBLE. I might suspect that, not only did they tune the app to make the best use of their card, but they also might have gone out of their way at every opportunity to use every feature/renderstate/technique that would slow the GeForce down more than the Kyro. If doing something in an untraditional way gave a 5 percent hit to the GeForce but only a 2 percent hit to the Kyro, then do it that way regardless of whether it is inconsistent with the way 99% of apps do it. If, on the other hand, something else gives a bigger hit to the Kyro, well, they might just conveniently not do that.

As for the Radeon, it might just be coincidence that the Radeon happens to do better at these non-traditional things than the GeForce. You need to remember that 95% or more of games/apps render with pretty much the same styles/methodologies. Also remember that as you tune something to perform well at one task, it generally begins to perform worse at other tasks, sometimes even performing worse at obscure tasks than a completely generalized/unoptimized model. The GeForce 3 could theoretically be more tuned to the way REAL apps do things than the Radeon is, and when you get to something obscure (which VillageMark may be doing) the Radeon may just happen to perform better because it is a more generalized/unoptimized solution.

Again, this is not based on any knowledge of either card or of VillageMark, just on my speculation about why that one benchmark stands out like an eyesore among the other benchmarks.

[/b]

Well, I can say that VillageMark is NOT written in any way to run worse on platforms other than Kyro. It's written to show how wonderful the deferred rendering of the Kyro cards is, but it's not written in a way that intentionally makes it run slower than it could on other cards. In fact, it uses T&L even though none of the Kyro cards supports it. If they wanted to make it look as bad as possible on other cards they would have drawn back to front, which they have confirmed it isn't doing (which is also obvious from the {Radeon | GF3} vs. GF2 scores); it draws in more of a random order.
It would of course be better with a third-party benchmark, but there are no benchmarks except this one that can show how well a card handles overdraw.

Originally posted by Humus:
Well, I can say that VillageMark is NOT written in any way to run worse on platforms other than Kyro… In fact, it uses T&L even though none of the Kyro cards supports it…

Yes, and hasn't it been shown several times that in an artificial benchmark, where the app is doing nothing but throwing polys (which I assume VillageMark is… I haven't seen it), a fast CPU can score better than when the GPU does the T&L? Perhaps T&L was included for this reason. The point is, you can't just take something and turn it into a blanket statement saying "well, they even used a feature that they don't have, so they obviously weren't trying to make anyone else look bad".

If they wanted to make it look as bad as possible on other cards they would have drawn back to front, which they have confirmed it isn't doing

Also remember that a card like the Radeon can take advantage of strict back-to-front rendering. If you go strictly back-to-front, there is a high probability that each polygon rendered will be closer than the zMin for the corresponding z-tile, therefore the card can use a write-always mode (instead of read-compare-write) when updating the z-buffer. So they confirmed it wasn't back to front. Perhaps at the beginning of each frame, they just throw in enough close-up, tiny polys to wreak havoc on any card's z-tiling, and then they go back-to-front. If they did this, they would still be telling the truth.

So why would the Radeon perform better than the GeForce 3? Even IF the GeForce 3 has z-tiling (which I don't know if it does), perhaps it uses a tile size that is better optimized for what most apps do. In a realistic app, it might be better (I'm just speculating) to have a 16x16 tile size rather than the Radeon's 8x8. Then, for larger occluded polys, it could discard more fragments with fewer tests than you could on an 8x8 tile system. If this were the case (and again… I'm just making it up), a 16x16 pixel z-tile system would be more susceptible to a malicious benchmark (one that throws in a few tiny close-up polys, then renders back to front) than an 8x8 tile system would be, because a few "malicious" polys would "corrupt" a higher percentage of the tiles.

When any company writes a benchmark to show off their own product, they are counting on you to make these types of broad assumptions: "oh, their test must be valid, because otherwise they would have done … instead"

And again, I just want to clarify, I'm not making any type of statement about the benchmark… I really don't know much about it. I'm just trying to play devil's advocate here and point out what the benchmark MIGHT possibly be doing to give the Kyro card the advantage.

Well, the purpose of the benchmark is to show the advantage of deferred rendering on Kyro cards. The reason they included T&L is probably to show that it doesn't have as much impact or something, but then on the other hand VillageMark doesn't exactly contain a whole lot of polys. But there is really no reason to think that they have intentionally written the application in such a way that it performs badly on other cards, except for the fact that its overdraw is huge … but that's also the point of the benchmark. They try to show that as overdraw increases the advantage of deferred rendering is huge, and since overdraw will increase over time Kyro must be the card of the future … sort of. So it should be taken with a grain of salt, just as you should take TreeMark and similar benchmarks, but I don't see any reason to believe that they've intentionally made efforts to make it slow on other cards. In fact, the claim that it draws in a random order can easily be verified on Radeon cards by comparing scores with Hierarchical Z on and off. The performance difference between the two is quite large.

I just wanted to comment on TreeMark, since it was mentioned…

TreeMark is basically a DrawElements performance test. I don’t know if we ever released its code, but things don’t get much more straightforward – set up the vertex arrays (vertex, normal, texcoord), set the T&L state, and render some geometry using DrawElements.

  • Matt
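
For reference, that kind of DrawElements path might look roughly like the sketch below (an assumed, generic example, not TreeMark's actual source):

#include <GL/gl.h>

void drawMesh(const GLfloat* positions, const GLfloat* normals,
              const GLfloat* texcoords, const GLushort* indices,
              GLsizei indexCount)
{
    /* Vertex arrays: position, normal, texcoord. */
    glEnableClientState(GL_VERTEX_ARRAY);
    glEnableClientState(GL_NORMAL_ARRAY);
    glEnableClientState(GL_TEXTURE_COORD_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, positions);
    glNormalPointer(GL_FLOAT, 0, normals);
    glTexCoordPointer(2, GL_FLOAT, 0, texcoords);

    /* T&L state: let the fixed-function pipeline transform and light. */
    glEnable(GL_LIGHTING);
    glEnable(GL_LIGHT0);

    /* One indexed draw call for the whole batch. */
    glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);

    glDisableClientState(GL_VERTEX_ARRAY);
    glDisableClientState(GL_NORMAL_ARRAY);
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);
}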

Bringing this topic up again with some interesting results.
I've written a small benchmarking utility (which can be found here: http://hem.passagen.se/emiper/3d.html) and I have some overdraw tests with fixed overdraw factors (3 and 8). I have three drawing modes: strictly back to front, strictly front to back, and random order. I posted this on the forums over at beyond3d.com and the results people got were quite interesting. It seems that indeed GF3's Z-occlusion isn't any less efficient than the Radeon's HyperZ. GF3 performed around 3-3.5x better front to back than back to front with an overdraw factor of 8; the Radeon gained 2.5-3x.
Why Radeons perform so well in VillageMark, though, I'm not sure, but I recall that it uses 3 texture layers, which may be an important factor.
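
A sketch of how such a fixed-overdraw test could submit its layers, assuming it simply draws N viewport-covering quads at distinct depths in the chosen order under a standard perspective projection looking down -z (this is not the actual utility's source):

#include <GL/gl.h>
#include <algorithm>
#include <random>
#include <vector>

enum DrawOrder { FRONT_TO_BACK, BACK_TO_FRONT, RANDOM_ORDER };

void drawOverdrawLayers(int overdraw, DrawOrder order)
{
    std::vector<float> depths;
    for (int i = 0; i < overdraw; ++i)
        depths.push_back(-1.0f - float(i));   /* layer 0 is nearest to the eye */

    if (order == BACK_TO_FRONT)
        std::reverse(depths.begin(), depths.end());
    else if (order == RANDOM_ORDER) {
        std::mt19937 rng(1234);               /* fixed seed for repeatability */
        std::shuffle(depths.begin(), depths.end(), rng);
    }

    glEnable(GL_DEPTH_TEST);
    glDepthFunc(GL_LEQUAL);

    for (float z : depths) {
        /* Quads oversized so each layer covers the whole viewport. */
        glBegin(GL_QUADS);
        glVertex3f(-100.0f, -100.0f, z);
        glVertex3f( 100.0f, -100.0f, z);
        glVertex3f( 100.0f,  100.0f, z);
        glVertex3f(-100.0f,  100.0f, z);
        glEnd();
    }
}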

Anyway, another interesting result, as shown by people's posts over at beyond3d.com, that you, Matt, may be interested in: while every other card got pixel & texel fillrate results very close to its theoretical values, GF3 didn't. It got around 600 Mpixel/s and 1200 Mtexel/s, while GF2 cards get close to 800 Mpixel/s and 1600 Mtexel/s. I guess it may be a driver issue, perhaps not doing a page flip but rather a buffer copy?

Talking about benchies …
I made a little fillrate benchmark by overdrawing semi-transparent polygons.
No HyperZ here; it's even almost CPU independent (I score the same on my new Athlon 1GHz as on my old PII-300). The whole compiled display list is executed for 10 seconds.
http://paddy.io-labs.com/rtfog.zip
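
Such a test might be structured roughly like this (an assumed sketch, not the actual rtfog source): compile a display list of blended, viewport-covering quads, then call it repeatedly over the timed interval.

#include <GL/gl.h>

GLuint buildFillrateList(int layers)
{
    GLuint list = glGenLists(1);
    glNewList(list, GL_COMPILE);

    /* Semi-transparent layers: every quad is actually blended into the
       framebuffer, so Z-occlusion tricks can't skip the work. */
    glEnable(GL_BLEND);
    glBlendFunc(GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA);
    glDepthMask(GL_FALSE);                 /* no depth writes either */

    glColor4f(1.0f, 1.0f, 1.0f, 0.1f);
    glBegin(GL_QUADS);
    for (int i = 0; i < layers; ++i) {
        /* With identity matrices these coordinates span the viewport. */
        glVertex2f(-1.0f, -1.0f);
        glVertex2f( 1.0f, -1.0f);
        glVertex2f( 1.0f,  1.0f);
        glVertex2f(-1.0f,  1.0f);
    }
    glEnd();

    glDepthMask(GL_TRUE);
    glEndList();
    return list;                           /* glCallList(list) in the timed loop */
}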