How to render 8 pixels per clock with NV3x?

I can only render up to 4 pixels per clock on NV3x cards (tested with a Quadro FX 2000 and a GeForce FX 5900 Ultra). I’m using my own little GLUT benchmark program, which just renders simple polygons, either untextured or with a 2D texture.

On an ATI FireGL X1 clocked at 325 MHz, I expect up to 8 × 325 = 2600 MPixel/s, and the benchmark shows 2462 MPixel/s. With the FX 5900 Ultra I get just under 4 × 450 = 1800 MPixel/s, exactly half of the expected 8 × 450 = 3600 MPixel/s.
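The arithmetic behind those numbers, as a quick sanity check (pipe counts and clocks are the ones quoted above):

```python
def peak_fill_mpixels(pixels_per_clock, core_mhz):
    """Theoretical peak fill rate in MPixel/s: pixels per clock times core clock in MHz."""
    return pixels_per_clock * core_mhz

# FireGL X1: 8 pipes at 325 MHz -> 2600 MPixel/s (benchmark measured 2462)
firegl = peak_fill_mpixels(8, 325)

# FX 5900 Ultra at 450 MHz: 3600 MPixel/s if it really did 8 pixels/clock,
# but 1800 MPixel/s at 4 pixels/clock, which matches the measurement
fx_expected = peak_fill_mpixels(8, 450)
fx_measured_cap = peak_fill_mpixels(4, 450)
```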

I heard that the FX cards can only do 4 color and 4 Z operations per clock, but 8 texture operations. I switched off the depth test, and I don’t need the fragment’s color; I only need the result of the texture lookup (GL_REPLACE). Is it possible to render more than 4 pixels per clock under these restrictions?

Or is there ANY program that proves the existence of the “8 pixels per clock rendering engine” (from http://www.nvidia.com/docs/lo/2692/SUPP/NV_QuadroFX_0306.pdf)? Any program that renders more than 4 pixels per clock somehow?

Johnny

Maybe they were saying that about the Z-fill pass when you do multi-light per-pixel lighting… I don’t remember exactly. Marketing crap, as usual.

Did you disable filtering?

I believe you’re getting it backwards, i.e. it can do 8 color+Z (simultaneously) per clock, but only 4 textures.

– Tom

AFAIK you only get 8 pixels per clock on GeForce FX if you render Z/stencil only. This is a logical choice in a way, since it helps multisample performance as well. However, I don’t think you’ll get 8 pixels per clock with 8 textures used. You won’t get that on a 9700 either, since there’s just one TMU per pixel pipe. There are numerous old threads on Beyond3D that go over this in excruciating detail. Try searching for “NV30 pipeline” or something similar.

Exactly. That’s why most people say it’s still not an 8-pixel-per-clock card, only 4 pixels…

That’s btw what nvidia calls the UltraShadow thingy… twice the speed for non-color writes.

That’s a Good Thing™ for Doom 3… a Bad Thing™ for shader-intensive situations…

Guess why Futuremark ran so badly while Doom 3 runs so well on nvidia…

The only way to get 8 ‘pixels’ per clock is
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
This has nothing to do with the number of applied textures.
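As a toy model (hypothetical, just restating the rule from the posts above as code): color writes on means 4 pixels per clock; color writes masked off means 8.

```python
def nv3x_pixels_per_clock(color_writes_enabled):
    """Toy model of the NV3x behavior described in this thread:
    the 8-pixel mode only kicks in for z/stencil-only rendering,
    i.e. when color writes are masked off with glColorMask."""
    return 4 if color_writes_enabled else 8

assert nv3x_pixels_per_clock(True) == 4   # normal color rendering
assert nv3x_pixels_per_clock(False) == 8  # z/stencil-only pass
```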

It’s 4x2 with bilinear texture samplers, just like Geforce 3/4Ti. Two samplers are combined for trilinear filtering (=> texturing speed is cut in half).

Ok, I see. So this “8 pixels per clock” is indeed just marketing crap from nVidia. It’s disappointing, because ATI has 8 “real” pipelines, and I can’t get enough of them for volume rendering.

Thank you all for your help!

Dave, I think the UltraShadow thingy is actually marketeering for the ability to reduce stencil fill overhead through clipping. I can’t find the link, but I seem to recall some sort of z-clipping extension being mentioned somewhere recently.

FWIW, high-speed z-only is not so great for the first pass, since any decent engine with good content will do ambient + emission when it writes the initial z value, so it won’t be z-only. If you’re still on the fast path for z+stencil, then it’s REALLY significant, since a heck of a lot of your fill performance comes from stencil shadow volume overdraw.

BTW, IMHO nvidia has obviously done some great work making stencil shadows robust & fast. They’ve been really aggressive and innovative in introducing hardware capabilities & software research to make this stuff work well, and released code to get everyone up to speed on it. You’ve gotta at least give them that, dave.

[This message has been edited by dorbie (edited 07-02-2003).]

Unless, of course, you prefer shadow maps, like me. Then you see these optimizations as just a waste of nVidia’s resources. I’d rather have good render-to-texture performance at 8 pixels per clock with 32-bit floating-point luminance textures. nVidia’s hardware gives me none of those (poor ARB_render_texture performance, and only 4-channel floating-point texture support).

:-), I happen to think stencil and image based shadows can play well together. I like them both.

I found that link BTW:
http://www.nvidia.com/docs/lo/2968/SUPP/UltraShadow_050903_v3.pdf

Of course an extension like this could be prone to heinous abuse in benchmarks if one were dishonourable enough to handcode the clipping bounds along fixed paths where the competition did all the rendering requested of them by the benchmark/timedemo.

Well I’m partial to shadow maps myself, but nvidia has better support for them than most other vendors (provided you use ARB_shadow and integer depth textures). They have special “filter after r-coord compare” hw which means PCF shadow maps are available at the cost of a bilinear texture lookup. Not too shabby. ATI just does it in the fragment shade if I’m not mistaken (but then there fragment shader is very fast…). In the future you’ll want to do it in the shader probably to be able to vary the filter kernel size with occluder-receiver distance and other tricks, but for todays cards it seems so much cheaper to to fixed function filtering (and no, I haven’t benchmarked it). And manual bilinear filter in the fragment shader needs something like four texture lookups and a MAD, excluding texcoord calcs. Now, back to GL2 hacking

Here’s the extension:
http://www.nvidia.com/dev_content/nvopenglspecs/GL_EXT_depth_bounds_test.txt

It hasn’t been put in the extension registry yet.

It basically gives you a screen-aligned bounding box (when used with the scissor rect) for stencil ops.
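In other words, a per-fragment test against a z range. A minimal model of what the extension does (a sketch of the semantics; note that it tests the depth value already stored in the framebuffer at the fragment’s position, not the incoming fragment’s own depth):

```python
def depth_bounds_test(stored_z, zmin, zmax):
    """EXT_depth_bounds_test: discard the fragment unless the depth
    already in the depth buffer at this pixel lies within [zmin, zmax].
    For shadow volumes, [zmin, zmax] bounds the light's influence, so
    volume fragments over geometry outside that range are culled early."""
    return zmin <= stored_z <= zmax

assert depth_bounds_test(0.5, 0.3, 0.7)      # inside the bounds: keep
assert not depth_bounds_test(0.9, 0.3, 0.7)  # outside the range: culled
```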

provided you use ARB_shadow and integer depth textures

And therein lies the great failing of that technique.

To me, the whole advantage of shadow maps vs. stencil shadows is that your shader controls the result of the comparison operation. Also, you only need N+1 passes, where N is the number of lights, as opposed to stencil shadows, which require 2N+1 passes. By using the ARB_shadow extension, you’re forcing the compare to happen on only one texture per pass, thus forcing the 2N+1 pass requirement. By putting each compare in the fragment shader, I am limited solely by the number of textures that can be bound at one time.

If I’m going to have 2N+1 passes anyway, I’d use stencil shadows and put up with the added fillrate hit. Stencil shadows lack most of the artifacts that plague shadow maps.

I don’t get it, why cant you just use depth texture lookups in the fragment program? You’ll get a nice smooth shadow value in the [0,1] range that you can use in your lighting equation any way you wish. How does this increase the number of required passes? Is there som sort of limitation on nvidia hw that says you only can bind one depth texture at a time or something? I looked over the ARB_shadow spec but it reads like you turn it on per texture unit.

EDIT: spelling…

[This message has been edited by harsman (edited 07-03-2003).]

as opposed to stencil shadows which require 2N+1 passes.

With the two-sided stencil extension, this will cost the same N+1 passes as shadow maps.

Y.

Ysaneya: How do you propose to render stenciled shadow volumes in only N+1 passes? The stencil buffer can hold shadow information for one light source only, so after rendering the shadow volumes for a single light, you have to add that light’s intensity separately. That gives 1 pass for ambient/emission and 2 passes for each light source.

flo

Shader control is too slow; you’d need multiple comparisons and multiple maps. Ideally you want this in texture hardware pre-filter, and this lets you post-filter the result for fast soft shadows. The whole issue is resolution, and to a lesser extent z image-space surface reconstruction with discontinuities.

I wouldn’t limit image based approaches to depth maps, it’s not even the most interesting currently.

With the stencil-2-sides extension, this will cost N+1 passes as shadow maps.

Oddly, no.

For stencil shadows, you first have an ambient pass. Then, you draw the shadow volumes for Light 1. Then you draw the geometry for Light 1. Next, you draw the shadow volumes for Light 2. Then, you draw the geometry for Light 2… Lastly, you draw the shadow volumes for Light N, followed by the geometry for Light N. Hence: 2N + 1.

For shadow maps, you first draw the map for Light 1. Then, you draw the map for Light 2… After that, you draw the map for Light N. Finally, you render the actual scene itself, using the N shadow maps as textures, with texture coordinate generation to project the vertex positions into each light’s space. Hence: N + 1.

Until you can have multiple stencil buffers, and do the stencil compare in fragment shaders, stencil shadows will always be 2N + 1.
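The pass-count argument above, as arithmetic (counting one shadow-volume pass plus one lighting pass per light for stencil shadows, and one depth-map render per light plus one final scene pass for shadow maps):

```python
def stencil_shadow_passes(n_lights):
    # 1 ambient pass + (shadow volume pass + lighting pass) per light
    return 1 + 2 * n_lights

def shadow_map_passes(n_lights):
    # 1 depth map render per light + 1 final pass over the scene geometry
    return n_lights + 1

assert stencil_shadow_passes(3) == 7  # 2N + 1
assert shadow_map_passes(3) == 4      # N + 1
```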

I don’t get it, why cant you just use depth texture lookups in the fragment program? You’ll get a nice smooth shadow value in the [0,1] range that you can use in your lighting equation any way you wish. How does this increase the number of required passes?

Correct me if I’m wrong (it’s been a while since I’ve read ARB_shadow), but when you bind a texture that was created as a depth texture, the depth texture is used to either replace or alter the per-fragment depth value. Since there is only one per-fragment depth value, I’m stuck with the 2N+1 approach of stencil shadows.

If you can just read from an ARB_shadow texture as though it were a regular texture, let me know. I might switch from 32-bit floating-point luminance textures.

Shader control is too slow; you’d need multiple comparisons and multiple maps. Ideally you want this in texture hardware pre-filter, and this lets you post-filter the result for fast soft shadows.

First, I’m not interested in soft shadows. Second, what is the “this” that you alledge that I want in the texture hardware pre-filter?

[This message has been edited by Korval (edited 07-03-2003).]

Correct me if I’m wrong (it’s been a while since I’ve read ARB_shadow), but when you bind a texture that was created as a depth texture, the depth texture is used to either replace or alter the per-fragment depth value. Since there is only one per-fragment depth value, I’m stuck with the 2N+1 approach of stencil shadows.

Well, consider this a correction. The ARB_depth_texture extension just defines the depth format for textures and what happens if you bind a depth texture when a tex unit is set up to expect regular RGBA texels (more or less, at least). Fragment z-replacement is left to a separate extension. ARB_shadow just lets you set a compare function; texture lookups on a unit with a depth texture bound then return 0 if the comparison of the depth texel vs. texcoord.r fails and 1 otherwise. If the texture is linearly filtered, the result is proportional to the number of successful comparisons. No cubemaps, but otherwise all is good. If you’re doing shadow maps, be sure to read the Perspective Shadow Maps paper if you haven’t already.
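That compare-then-filter behavior can be modeled like this (a sketch of the semantics only, not real GL; a GL_LEQUAL-style compare is assumed, and the filtered case is simplified to a plain average of the per-texel results):

```python
def shadow_sample(depth_texel, texcoord_r):
    """ARB_shadow with a GL_LEQUAL compare: the lookup returns 1.0 if
    texcoord.r <= stored depth texel (lit), 0.0 otherwise (shadowed)."""
    return 1.0 if texcoord_r <= depth_texel else 0.0

def filtered_shadow_sample(texels, texcoord_r):
    """With GL_LINEAR filtering, the hardware blends the per-texel
    compare results, giving a value proportional to the number of
    successful comparisons; this is what makes PCF nearly free."""
    results = [shadow_sample(t, texcoord_r) for t in texels]
    return sum(results) / len(results)

assert shadow_sample(0.7, 0.5) == 1.0
assert filtered_shadow_sample([0.7, 0.7, 0.3, 0.3], 0.5) == 0.5
```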