OT: GeForce3 slow?

Yes, dorbie, it seems I celebrated my GF’s speed a bit too early…

I’ve just updated my Via drivers, but it doesn’t make any difference…

There’s a section in the GeForce FAQ about enabling fast writes ( it needs to be enabled in the Windows registry ),
http://www.tweak3d.net/faq/faq.cgi#sw:drv:fw

I just tried this - wcpuid3 reported fast writes as supported and enabled ( reboot was required after using the .reg file ).

Strangely enough, the performance on the Athlon system did not change.

I might be wrong, but I read something ages ago saying NVIDIA drivers refused to use SBA/fast writes on AMD systems, regardless of how it was set up, due to stability problems.

Enabling fast writes on my machine, though, causes a guaranteed lockup within about 10 minutes.

I’m quite confused by the BenMark results. 24 million on a GF2, was it? I only get 38 million on my GF4. I thought it would be loads more, but it wasn’t. That was tri-strips with AGP 4x.

Nutty

A somewhat related tech tip: if you’re having problems with AGP in Windows 2000, be sure to install at least the first and second service packs. I was repeatedly getting completely bizarre texture glitches in just a few programs, and neither installing new graphics drivers, tweaking the settings, nor completely reinstalling the OS did any good, but installing the service packs fixed the problem.

I thought I’d have to wait longer to see the word “only” and “38 million” used together.

> I only get 38million on my gf4

Try sending vertices that only have x/y/z, each component as a 2-byte signed short (GL_SHORT). Once you turn off rasterizing and load the simplest possible vertex shader (or use the fixed pipeline in vanilla mode), you may find you’re vertex transfer bound, at which point sizeof(vertex) starts to matter.
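Something along these lines, as a rough sketch (it assumes a GL context already exists, skips error checking, and leaves the vertex data uninitialized):

[code]
/* Sketch: measure raw vertex transfer with the smallest possible vertex -
   x/y/z only, as 2-byte signed shorts (6 bytes per vertex).               */
#include <windows.h>
#include <GL/gl.h>

#define NUM_VERTS (3 * 100000)              /* independent triangles       */
static GLshort verts[NUM_VERTS * 3];        /* fill with real data first   */

void drawTransferTest(void)
{
    /* "Turn off rasterizing": mask out all color and depth writes so the
       back end does as little work as possible.                           */
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);

    /* Fixed pipeline in vanilla mode: no lighting, no extra attributes.   */
    glDisable(GL_LIGHTING);
    glDisableClientState(GL_NORMAL_ARRAY);
    glDisableClientState(GL_COLOR_ARRAY);
    glDisableClientState(GL_TEXTURE_COORD_ARRAY);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_SHORT, 0, verts); /* tightly packed GL_SHORTs    */

    glDrawArrays(GL_TRIANGLES, 0, NUM_VERTS);
}
[/code]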

OK, I used the GeForce Tweak utility at http://www.geforcetweak.com/ to enable fast writes and a few other goodies, and I now get 27 million where I used to get 16 million on my GeForce3. The registry patch for fast writes didn’t do it for me. Thanks for the advice.

I also benchmarked a GeForce4 Ti4600 and it only got ~17 million. Obviously it needs the tweaks too.

Bruno, I’m curious, which OEM supplied your card, and did you apply any tweaks or registry edits beyond installing drivers? What’s your mobo?

[This message has been edited by dorbie (edited 05-14-2002).]

I have an ASUS GeForce3, but I have never used their drivers ( always NVIDIA’s reference drivers ). Maybe that’s one reason to start using ASUS’s drivers ( I doubt it, but just maybe ). I’ll try the tweak utility too and see if that helps.

> I thought I’d have to wait longer to see the word “only” and “38 million” used together.

Yeah, but it’s a far cry from the 100-odd million it can supposedly do. Given that optimal tri-strips equate to roughly one triangle per vertex (an N-vertex strip yields N - 2 triangles), its vertex throughput was also only about 38 million.

I can only assume that jwatte is correct and I’m transfer bound.

Anyone know if switching from SDR to DDR main ram has any impact on AGP throughput?

I seem to recall a thread where someone was trying to achieve the theoretical figure of the GF4, but the method of attaining it had changed between the GF3 and the GF4. Which IMHO is quite dirty. For the same reason that I loathe CD-ROM manufacturers! 50x speed!! Yeah right!! But don’t get me started on that.

Nutty

You want 100-something million? Try this:

  • Use triangle strips
  • Use VAR
  • Make all indices the same

Done.

This will give you the maximum number of triangles the triangle setup unit (in this case better called ‘backface/degenerate culler’) can process.
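In OpenGL it looks roughly like this (only a sketch: it assumes a Windows GL context, that the NV_vertex_array_range entry points have already been fetched with wglGetProcAddress, and skips all error checking):

[code]
/* Rough sketch: stress only the triangle setup / cull unit.  All indices
   reference the same vertex, so every "triangle" in the strip is degenerate
   and gets culled; the post-T&L cache absorbs the transform work and
   nothing is rasterized.                                                   */
#include <windows.h>
#include <GL/gl.h>
#include <string.h>

#define GL_VERTEX_ARRAY_RANGE_NV 0x851D   /* from the NV extension header  */
#define NUM_INDICES (1 << 16)

typedef void * (APIENTRY *PFNWGLALLOCATEMEMORYNVPROC)(GLsizei, GLfloat, GLfloat, GLfloat);
typedef void   (APIENTRY *PFNGLVERTEXARRAYRANGENVPROC)(GLsizei, const GLvoid *);
extern PFNWGLALLOCATEMEMORYNVPROC  wglAllocateMemoryNV;   /* via wglGetProcAddress */
extern PFNGLVERTEXARRAYRANGENVPROC glVertexArrayRangeNV;

static GLushort indices[NUM_INDICES];     /* every entry stays 0           */

void drawSetupLimitedStrip(void)
{
    /* One vertex in video memory (priority 1.0) is all the strip needs.   */
    GLfloat *vmem = (GLfloat *)wglAllocateMemoryNV(3 * sizeof(GLfloat),
                                                   0.0f, 0.0f, 1.0f);
    vmem[0] = vmem[1] = vmem[2] = 0.0f;

    glVertexArrayRangeNV(3 * sizeof(GLfloat), vmem);
    glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);

    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, vmem);

    memset(indices, 0, sizeof(indices));  /* "make all indices the same"   */
    glDrawElements(GL_TRIANGLE_STRIP, NUM_INDICES, GL_UNSIGNED_SHORT, indices);
}
[/code]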

Using BenMark on a GF4 I only get about 37 million. I’ve been able to get 54 million using my own OpenGL-based test.

– Tom

Originally posted by Nutty:
[b]I seem to recall a thread where someone was trying to achieve the theoretical figure of the GF4, but the method of attaining it had changed between the GF3 and the GF4. Which IMHO is quite dirty. For the same reason that I loathe CD-ROM manufacturers! 50x speed!! Yeah right!! But don’t get me started on that.

Nutty[/b]

What, you don’t like having 50x speed for the last 10 secs of a CD?
At least the P-CAV drives are starting to appear now.

DDR system RAM vs. SDR system RAM does play a role in AGP throughput, since AGP 4x + PC133 = ~1066 MB/sec, and DDR should be close to ~1200 MB/sec (or more?).
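(For what it’s worth, the raw peaks work out as: AGP 4x = 66 MHz × 4 transfers/clock × 4 bytes ≈ 1066 MB/sec, and PC133 SDR = 133 MHz × 8 bytes ≈ 1066 MB/sec, so with SDR the AGP port can in theory saturate system memory on its own; DDR doubles the transfers per clock, which is where the extra headroom comes from.)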

I think fast writes should be on by default in all reference drivers, for both NVIDIA and ATI cards. I recall NVIDIA saying that forcing sideband addressing and fast writes to coexist is not only redundant, but may cause stability issues and performance loss.
Besides, SBA is only AGP 2x, not 4x, right?

Originally posted by Tom Nuydens:
[b]Using BenMark on a GF4 I only get about 37 million. I’ve been able to get 54 million using my own OpenGL-based test.

– Tom[/b]

Yes, I too got around 60 M polys/sec with display lists and large triangle strips. Of course, this is also 60 M vertices/sec. With VAR and a ‘tight’ mesh you can get ‘more’ because of vertex re-use and the post-T&L cache.
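By ‘tight’ mesh I mean something like a regular grid indexed so that adjacent triangles share vertices; a rough sketch (the vertex data and VAR setup aren’t shown):

[code]
/* Sketch of a 'tight' indexed grid mesh: a W x H grid has W*H vertices but
   2*(W-1)*(H-1) triangles (roughly two triangles per vertex), and
   neighbouring triangles share vertices, so with VAR plus the post-T&L
   cache many vertices only get transformed once.                          */
#include <windows.h>
#include <GL/gl.h>

#define W 128
#define H 128

static GLushort gridIndices[(W - 1) * (H - 1) * 6];

void buildGridIndices(void)
{
    int x, y, n = 0;
    for (y = 0; y < H - 1; y++) {
        for (x = 0; x < W - 1; x++) {
            GLushort v = (GLushort)(y * W + x);
            /* two triangles per cell, reusing the shared edge vertices    */
            gridIndices[n++] = v;       gridIndices[n++] = v + 1;
            gridIndices[n++] = v + W;
            gridIndices[n++] = v + 1;   gridIndices[n++] = v + W + 1;
            gridIndices[n++] = v + W;
        }
    }
    /* Then: glDrawElements(GL_TRIANGLES, n, GL_UNSIGNED_SHORT, gridIndices) */
}
[/code]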

Dorbie,

> Bruno, I’m curious, which OEM supplied your card, and did you apply any tweaks or registry edits beyond installing drivers? What’s your mobo?

My card is an Asus; it was one of the first to show up on the market. I got it at a Taiwan store in July 2001 or something…
Yeah, I use the tweak utility too.
At first, I remember I was really worried because my 3DMark was really slow, and I wasn’t even able to run the tests because somehow I couldn’t get Direct3D to work correctly with it, but as new drivers started to appear, I got it working.

Bruno

edit: <this is part 1>

The method of attaining peak performance might change depending on the absolute performance level. The API has to evolve to make new performance levels possible as the older hardware bottlenecks are eliminated, and to address situations like dispatching many smaller primitives.

edit: <this is part 2>

I’m not very impressed by NVIDIA’s silence on this thread. This seems like a dirty secret, and they’ve only reinforced that impression by offering no help here when they must have known what the issue was from post one. They’re just washing their hands and hoping that nobody will notice the huge performance deficit as they market the peak numbers.

To expect users to download a 3rd party tweak to make their card reach advertised performance or get lucky with an OEM is unreasonable. This situation stinks.

[This message has been edited by dorbie (edited 05-15-2002).]

Huh, dorbie, what are you talking about? I didn’t reply to this thread because I didn’t see anything specific worth replying to. (Now, there are some threads that I simply won’t reply to at all, except perhaps to correct a gross falsehood. Such threads typically relate to our competitors’ products, or our future unannounced products, or other such things. If you want to ask me a question or otherwise expect a reply from me, don’t post it in a thread like that.)

The matter was clarified in another thread. People can measure three things, all of which are very different from one another. One is vertices per second. One is triangles per second. And the last is indices per second.

The last is hard to measure because you will probably just hit the triangle limit. (But in theory a given chip can only process indices at some rate, even if they all get cache hits for post-T&L results.) So let’s ignore it.

The triangle rate is strictly a matter of triangle setup. Different chips have different setup rates. The original GF3, for example, would hit a setup limit at 40 Mtris/s. (You can deduce clocks/triangle from such a number; I leave this as an exercise for the reader.) Of course, it’s entirely conceivable that setup rates might depend on the attributes that need setup (since setup typically involves things like computing d/dx, d/dy for a triangle).
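(Worked example, assuming the original GF3’s 200 MHz core clock: 200 MHz / 40 Mtris/s = 5 clocks per triangle.)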

Another number is the vertex rate. This can depend heavily on the T&L modes in use, of course. It can also be limited by the size of the vertices, if those vertices are coming via AGP. Typically, peak vertex rates are measured in cases where AGP is not a bottleneck (in the limit, you may need to use shorts for vertices or use video memory) and the only computation needed is a transform from object space to window space.

The trick is that you often can’t measure vertex rates using triangle strips, only triangle rates, because with strips the triangle setup load per vertex is effectively tripled compared to independent triangles. So if your GF3 was running at 40 Mtris/s with long triangle strips, it would be incorrect to claim that its vertex rate was 40 Mverts/s. Instead, its vertex rate would have to be measured by drawing independent triangles, where triangle setup is not a bottleneck.
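As a rough illustration, the same vertex data batched two ways (just a sketch; timing and context setup are omitted, and the vertex format would matter in practice):

[code]
/* Sketch: the same N vertices can be setup-bound or vertex-bound depending
   on how they are batched.  GL_VERTEX_ARRAY is assumed to be enabled.     */
#include <windows.h>
#include <GL/gl.h>

#define N 60000    /* vertices submitted per draw call */

void measureTriangleRate(const GLfloat *verts)
{
    glVertexPointer(3, GL_FLOAT, 0, verts);
    /* Long strip: roughly one triangle per vertex, so triangle setup is
       the first thing to saturate -- this measures the Mtris/s limit.     */
    glDrawArrays(GL_TRIANGLE_STRIP, 0, N);     /* N - 2 triangles */
}

void measureVertexRate(const GLfloat *verts)
{
    glVertexPointer(3, GL_FLOAT, 0, verts);
    /* Independent triangles: three vertices per triangle, so the setup
       load per vertex drops to a third and T&L (Mverts/s) becomes the
       bottleneck instead.                                                 */
    glDrawArrays(GL_TRIANGLES, 0, N);          /* N / 3 triangles */
}
[/code]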

Once you keep all this in mind, the benchmark numbers all make sense. The short of it is that the GF4 Ti ends up with about twice the vertex rate (higher clock speed + architectural changes) and a moderately higher triangle rate (higher clock speed) than a GF3 Ti500. In many cases, vertex rates will be over twice as high.

As for fast writes and AGP sideband addressing, those are completely different issues. The decisions we have made w.r.t. these features reflect such things as our need to work around certain chipset bugs, for example. The same is true of falling back to AGP 2x on some platforms. However, if you’re using a GF4 Ti on the right motherboard, I believe you should end up with both FW and SBA on by default, as well as AGP 4x.

Which motherboard is that? To be honest, I have no idea. I have enough other things to keep track of… and I’m still stuck at AGP 2x myself with my BX system, so it doesn’t affect me.

Anyhow, please cut out the conspiracy theories. (If the conspiracy is no more than that marketing departments always choose the largest-looking number possible to promote a product, that’s hardly a conspiracy; if our marketing wasn’t doing that, then I’d have a complaint.)

– Matt

I think there is something wrong with this BenMark5 benchmark (it’s a Direct3D app, isn’t it?).
On my prehistoric Celeron 333, AGP 2x, with a GeForce2 MX, I got 3.5 Mtris/sec at 640x480.
When I use that NVIDIA SphereMark (OpenGL) program I get 13 Mtris/sec no problem, and when I use my own little OpenGL fullscreen demo that renders 130 teapots with static VAR in video memory I get up to 20 Mtris/sec; when I render 130 spheres I also get around 13 Mtris/sec, all on the same machine (the models are teapot and sphere4 from the DirectX SDK).
I use fullscreen 640x480x32 with one directional light, using GL_TRIANGLES with glDrawElements.
What’s the deal with this BenMark5?
How many tris do you guys get with SphereMark?
Thanks, mproso.
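P.S. The Mtris/sec figures are just triangles submitted divided by wall-clock time, roughly like this (drawTeapot, teapotTriCount and getSeconds are placeholders for the demo’s own model and timer code):

[code]
/* Sketch of the measurement: draw the instances, count the triangles
   submitted, and divide by elapsed wall-clock time.                       */
#include <windows.h>
#include <GL/gl.h>

#define NUM_INSTANCES 130

extern void   drawTeapot(void);      /* one glDrawElements(GL_TRIANGLES,...) */
extern long   teapotTriCount;        /* triangles per model                  */
extern double getSeconds(void);      /* wall-clock timer                     */

double benchmarkOneFrame(void)       /* returns Mtris/sec for one frame      */
{
    int i;
    double t0 = getSeconds();
    for (i = 0; i < NUM_INSTANCES; i++)
        drawTeapot();
    glFinish();                      /* make sure the GPU is actually done   */
    return (NUM_INSTANCES * teapotTriCount) / (getSeconds() - t0) / 1.0e6;
}
[/code]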

Sigh, Matt, it’s you who’s confusing two or three issues.

The first part of my post relates to the API issue raised in the more recent discussion, and frankly I don’t really care about that. Nothing about what I wrote is incorrect and I don’t need to be told these are two separate issues when I addressed them entirely separately in my post.

The second part relates to the bull**** of having to download a tweaker to get the full performance from a GeForce card. For example, a GeForce4 Ti 4600 delivering 17 million tris when it should be nearer fifty. It looks like very few, if any, have managed this without tweaks & hacks, on a range of motherboards and drivers. God help Joe Public, who NVIDIA hopes just won’t notice the performance delta on his crippled hardware. In the meantime nobody from NVIDIA comments on this. This isn’t a conspiracy theory, it’s a fair description of the situation.

Most of the performance deltas being discussed relate to the same software on the same hardware measuring the same thing, before & after “tweaks” have been applied. I don’t really care about someone complaining about the finer points of 38 million tris vs some even higher number with different data types, state or whatever, I used to deal with exactly the same customer issues at SGI and I can sympathize but that doesn’t excuse the other issue.

Maybe you should be more up front with people buying these cards. The fact that you are using AGP 2x is completely irrelevant; I don’t care if you have a Riva 128 on your desk. The issue is my GeForce4 languishing at 1/3 of its geometry performance without fast writes, and the dearth of information from the manufacturer on how to remedy that.

[This message has been edited by dorbie (edited 05-15-2002).]

At least for OpenGL, fast writes are unlikely to have that sort of performance impact. I don’t know about D3D. But it could easily be something very simple: for example, we might disallow video memory vertex buffers in certain situations (which this app might be hitting) unless fast writes are available, and then the app might hit some AGP limit. I have no idea, to be honest.

But the simple fact is that we disable fast writes in some cases because we have no choice. There are a lot of chipsets out there that have bugs where data gets corrupted if you use fast writes. Say what you want about customer issues, but I think customers would rather not experience system hangs due to data corruption…

I’m not about to get in the business of telling you exactly which products are broken – but these data corruption bugs are real, and when we hit them, we have a choice between (A) disable fast writes on certain platforms or (B) don’t ship the product at all to retail because some user somewhere might plug it into the wrong motherboard. I think it’s obvious that we are going to choose (A).

– Matt

Fast writes do appear to be the issue, at least that’s the main thing I explicitly enable using the tweaker, but there may be some other arcane thing going on under the covers.

Choice A is a reasonable one only if you inform customers. Allowing them to make an informed decision about their level of acceptable risk would be even better. You deliberately overrode my explicit BIOS settings, and I have to go and find a 3rd-party utility (WCPUID3) just to discover that you did that.

It’s not as if this is a marginal issue; we’re talking about a 3x performance delta under some circumstances. Just look at the confusion at the start of this thread, and most people contributing to this thread don’t get confused as easily as your average retail customer. It looks like nobody here has lucked out and got this to work spontaneously with their hardware. What do we have to do, buy an nForce mobo?

I’m skeptical about the criteria for tracking and testing this. There are lots of chipsets, even more mobos, and still more BIOS patches that (for example) patch AGP stability issues in AMD chipsets.