NV30 Extensions

nVidia has just released their Detonator 40 drivers with emulation for NV30 along with OpenGL extension specifications.
http://developer.nvidia.com/view.asp?IO=nv30_emulation

After a cursory look at their slides, I’d have to say that they’re doing some interesting per-fragment things. However, one thing caught my eye: 16 texture maps, but only 8 texture coordinates? What kind of crap is that? Sure, if you really need more than 8 texture lookups, you’re probably going to be running some dependent accesses anyway, but why not err on the side of caution? The 9700 allows for 16 independent texture coordinates.

Somehow, I still prefer the 9700, even though the NV30 allows for new features. They both have full 32-bit float registers, but the 9700 doesn’t have a 16-bit option. This is a benefit, as it keeps the 9700’s language cleaner. The derivative instructions sound like a good idea, but I’d prefer to do that on the CPU (as a pre-processing step, of course), especially considering that the height map may no longer be at its original full resolution. Not only that, those derivatives are probably a bit expensive, as they require 2 texture fetches.
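For what it’s worth, the kind of pre-processing step I mean is easy to sketch. This Python snippet is purely illustrative (the nested-list layout, the `bake_derivatives` name, and the wrap-around edge handling are my assumptions): it bakes central-difference derivatives from a height map once, offline, instead of fetching them per fragment.

```python
def bake_derivatives(height, w, h):
    """Bake height-map derivatives as a pre-processing step.

    height[y][x] is a scalar height; returns a same-sized grid of
    (dh/dx, dh/dy) central differences, with wrap-around addressing
    at the edges (an assumption -- clamping would work too).
    """
    deriv = []
    for y in range(h):
        row = []
        for x in range(w):
            dx = (height[y][(x + 1) % w] - height[y][(x - 1) % w]) * 0.5
            dy = (height[(y + 1) % h][x] - height[(y - 1) % h][x]) * 0.5
            row.append((dx, dy))
        deriv.append(row)
    return deriv
```

The result would be uploaded once as a derivative texture, at whatever resolution the height map actually ships at.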

Wow, the fragment program extension looks too good to be true…

Being able to output 8 texture coordinates from a vertex program seems like plenty to me.

Why diss the derivative functions? You cannot do per-pixel derivative calculations on the CPU (quickly). At least, not for procedural textures. The derivatives are for any arbitrary expression you calculate per pixel. I think you misunderstand how it works if you think you can emulate that.

Originally posted by Korval:
[b]After a cursory look at their slides, I’d have to say that they’re doing some interesting per-fragment things. However, one thing caught my eye: 16 texture maps, but only 8 texture coordinates? What kind of crap is that? Sure, if you really need more than 8 texture lookups, you’re probably going to be running some dependent accesses anyway, but why not err on the side of caution? The 9700 allows for 16 independent texture coordinates.

Somehow, I still prefer the 9700, even though the NV30 allows for new features. They both have full 32-bit float registers, but the 9700 doesn’t have a 16-bit option. This is a benefit, as it keeps the 9700’s language cleaner. The derivative instructions sound like a good idea, but I’d prefer to do that on the CPU (as a pre-processing step, of course), especially considering that the height map may no longer be at its original full resolution. Not only that, those derivatives are probably a bit expensive, as they require 2 texture fetches.[/b]

Actually, it’s the same on the 9700.

Ahhhh

I was just happy that there was an ARB_vertex_program that everyone would implement, and now they’ve gone and done their own NV_vertex_program2.
But aside from that, the new features look damn good.
One thing: the whole time it was said that the NV30 can do >65,000 instructions in VPs, and everybody thought, cool, more VP code than C++ code.
But now it turns out there is a limit of 256 static instructions, and only with loops and branches is it possible to reach that large number.
Also, it was nowhere mentioned how long an instruction takes, or how long executing 65,000 instructions would take.
Would be interesting to know.

Lars

I think the floating-point implementation is kinda half-hearted. No filtering, only texture_rectangle (no 1D, 2D, 3D, or cube map), no mipmapping, no blending, no texenv support.
Rendering to a high precision cubemap is certainly one of the features I want most in the new generation hardware, so this is kind of a disappointment :stuck_out_tongue:
Does anyone know if any of these restrictions hold for the R9700 too?

Humus, here I completely agree with you. I really wanted floating-point cube maps too. It really feels like a first-generation effort. I guess we can’t expect floating point to be feature-complete on the first go (well, we CAN expect it, and most of us did, but that doesn’t mean we will get it ^_^).

Well, on my old geforce2MX there are couple of new extensions:
GL_ARB_vertex_program,
GL_ARB_window_pos,
GL_NV_point_sprite,
GL_NV_pixel_data_range.
What does this last extension do?
It is not in the specs.

I guess us linux guys are at the mercy of the driver writers.

Also it was nowhere mentioned how long an instruction takes or how long it takes to do 65000 instructions

NVIDIA has always stated that each VP instruction takes exactly one clock cycle. If the GPU ran at 400 MHz, then 65,000 instructions would take 0.0001625 seconds per vertex. This translates to a little over 6,000 triangles per second if you use strips and are not limited by anything else.
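Spelling the arithmetic out (a deliberately naive model: one instruction per clock, one VP unit, no pipelining, and a hypothetical 400 MHz clock):

```python
clock_hz = 400e6          # hypothetical 400 MHz core clock
instructions = 65000      # worst-case executed VP instruction count

# One instruction per clock, single unit, no vertex streaming.
seconds_per_vertex = instructions / clock_hz
vertices_per_second = 1.0 / seconds_per_vertex

assert abs(seconds_per_vertex - 0.0001625) < 1e-12
```

With strips approaching one vertex per triangle, that works out to roughly 6,150 triangles per second under these assumptions.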

Edit:

I just realized that my calculation assumes only one VP unit and that vertices cannot be streamed through.

– Zeno

[This message has been edited by Zeno (edited 08-29-2002).]

You cannot do per pixel derivative calculations on the cpu (quickly). At least, not for procedural textures.

If, by “procedural texture”, you mean that it is something generated on the CPU and uploaded to the card each frame, I don’t see the point of not doing the derivative on the CPU.

If, by “procedural texture”, you mean a set of fragment program state that generates a color based on a “texture coordinate input” and various parameters, then yes, it has some use. I still submit that this instruction can’t be particularly fast, considering that it is using information in fragments that may not be currently rasterizing yet.

Actually, it’s the same on the 9700.

What is the same on the 9700?

NV_pixel_data_range is somewhat analogous to NV_vertex_array_range. It lets you do faster streaming of textures and asynchronous ReadPixels.

  • Matt

DDX/DDY are essential to implementing “analytic antialiasing” of shaders. This is a standard idiom in the Renderman world. (The derivative instructions aren’t really that expensive.)

Some of the standard OpenGL pipeline features make very little sense to implement for floating point. Blending is a key example. All the blending operations are predicated on “1-x” being really cheap to implement – it’s just an XOR for fixed-point. But in floating-point, 1-x requires essentially a full FP math unit, with a variable-width shifter and all! It’s unreasonable to expect that some of these pipeline stages will ever be implemented in their classic OpenGL form for floating-point framebuffers.
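The “1-x is just an XOR” claim is easy to check. A small Python sketch, using 8-bit fixed point where 1.0 encodes as 255:

```python
import struct

# In 8-bit fixed point, 1.0 is stored as 255 (all ones), so the blend
# factor (1 - x) is a bitwise complement -- no arithmetic at all.
for x in range(256):
    assert (255 - x) == (x ^ 0xFF)

# For IEEE floats there is no such shortcut: 1.0 - x needs exponent
# alignment (that variable-width shifter) plus a real add. The bit
# patterns of x and 1.0 - x are unrelated:
f2bits = lambda f: struct.unpack('<I', struct.pack('<f', f))[0]
assert (f2bits(0.75) ^ 0xFFFFFFFF) != f2bits(0.25)  # 1.0 - 0.75 == 0.25
```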

The same can be said for float texture filtering. It’s unclear that it will ever be a high-performance feature. Too many adds and multiplies, even for the simplest filter modes.

Remember that “8 texture coordinates” really means 8 texture coordinate outputs from the vertex program, i.e., 8 general-purpose interpolants. You have 16 vertex attributes going in, which is more than enough for almost anything. And you can always compute texture coordinates analytically in any way you want inside your fragment program. The key word is “interpolants” here. You get 2 color interpolants (fixed-point [0,1] range), and you get 8 generic (“texture coordinate”) interpolants.

Also, it’s not true that each instruction in a vertex program takes exactly one clock cycle. This is a rough estimate for the NV20 architecture, but you can do better or worse, depending on your exact program.

  • Matt

Ok, but support for 1D/2D/3D/cube texturing with the simplest mipmapping (just GL_NEAREST_MIPMAP_NEAREST) wouldn’t hurt.
All the necessary math is already there for standard fixed-point texels. Computing the memory location of a fetched texel is the same for any texel format. I don’t understand this limitation.

Correct me if I’m wrong, but it seems the fragment shader is powerful enough to emulate any texture dimensionality (3D, 4D, 5D, … ?) (maybe even mipmapping?) at the cost of extra instructions and some tricky packing of texels into a texture_rectangle. So a cube depth map is not such a hopeless task?
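The packing is mostly address arithmetic. A sketch of one possible layout (the function name and tiling scheme are made up for illustration): store the 3D texture’s depth slices as a grid of tiles in a single texture_rectangle, and have the fragment program compute the 2D address before its dependent texRECT fetch.

```python
def pack_3d_coord(s, t, r, slice_size, tiles_per_row):
    """Map integer 3D texel coords (s, t, r) to 2D coords in a
    texture_rectangle that holds the depth slices tiled in a grid.
    This is the address math a fragment program would compute
    before the dependent fetch."""
    tile_x = r % tiles_per_row          # which tile column the slice lands in
    tile_y = r // tiles_per_row         # which tile row
    return (tile_x * slice_size + s, tile_y * slice_size + t)
```

Filtering and mipmapping by hand would need the same trick per tap and per level, plus the lerps.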

Originally posted by mcraighead:
[b]Some of the standard OpenGL pipeline features make very little sense to implement for floating point. Blending is a key example. All the blending operations are predicated on “1-x” being really cheap to implement – it’s just an XOR for fixed-point. But in floating-point, 1-x requires essentially a full FP math unit, with a variable-width shifter and all! It’s unreasonable to expect that some of these pipeline stages will ever be implemented in their classic OpenGL form for floating-point framebuffers.[/b]

so you’re just lazy… there is no point in treating floating-point buffers as something special, and it will by no means stay this way. at least supporting a fixed-point blending factor (it has to go from 0 to 1 and no further anyway, to get the standard blending working for floats) would not be that hard… but okay, it’s work, and you have enough problems getting that out by christmas. good luck, btw…

The same can be said for float texture filtering. It’s unclear that it will ever be a high-performance feature. Too many adds and multiplies, even for the simplest filter modes.

there, for sure, it’s just in the range of 0 to 1. it is an essential feature, and the result will be that everyone just does it by hand. it doesn’t matter that much if it’s slower, but it should be supported because it makes life easier. it will for sure be one of the first functions i have to code manually, just because it’s not standard. at least bilinear would not have been that terrible…

Remember that “8 texture coordinates” really means 8 texture coordinate outputs from the vertex program, i.e., 8 general-purpose interpolants. You have 16 vertex attributes going in, which is more than enough for almost anything. And you can always compute texture coordinates analytically in any way you want inside your fragment program. The key word is “interpolants” here. You get 2 color interpolants (fixed-point [0,1] range), and you get 8 generic (“texture coordinate”) interpolants.

and you get only 4 textures if you use the standard pipeline (just to note that; it’s neither good nor bad…). btw, what i thought was really funny is that, in the end, you still get the standard register combiners. when i read that, i rofl’d… not that it’s bad, it’s just somehow ridiculous (and i thought we’d finally gotten rid of those…)

Also, it’s not true that each instruction in a vertex program takes exactly one clock cycle. This is a rough estimate for the NV20 architecture, but you can do better or worse, depending on your exact program.

how can we do better or worse depending on which instructions we use? though that they are not all the same speed is quite logical, somehow…

btw, you still don’t provide per-component negation, do you? that would be especially useful for quaternion multiplications…

oh, and is executing an instruction with one of those condition-code (LE.x) suffixes slower than without?

Originally posted by Nakoruru:
Humus, here I completely agree with you. I really wanted floating-point cube maps too. It really feels like a first-generation effort. I guess we can’t expect floating point to be feature-complete on the first go (well, we CAN expect it, and most of us did, but that doesn’t mean we will get it ^_^).

i remember the days of the gf1 and gf2… i expected full per-pixel lighting (as the marketing suggested)…
again, we CAN expect it, but we won’t get it…

i think this is really stupid… i wanted to base my code fully on floating point: setting up a 128-bit floating-point buffer, using all floating-point textures, and so on… i don’t care if it’s at 1/4 the speed of the 32-bit versions… it will be much faster than my gf2mx anyway. and the image quality is awesome (having done a lot of software rendering, i know what real floating-point math looks like… wow…)

but no, once again a generation of gpus filled with hacks… i want my gl2…

Korval,

You still cannot calculate the derivative on the CPU because the fragment program derivative is relative to window space x and y. If you calculate a derivative for a texture it will be relative to texture space s and t.
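To make the window-space point concrete: the hardware can take the derivative of any value the fragment program computes by differencing it across the 2×2 pixel quad being rasterized. A rough Python model (the quad layout is my assumption; real hardware details differ):

```python
def ddx(quad):
    """quad[y][x] holds a value computed per fragment in a 2x2 block.
    The derivative w.r.t. window-space x is a horizontal difference --
    it works for any computed expression, not just texture data."""
    return quad[0][1] - quad[0][0]

def ddy(quad):
    """Derivative w.r.t. window-space y: a vertical difference."""
    return quad[1][0] - quad[0][0]
```

A CPU pre-pass can’t reproduce this, because it has no access to the neighboring fragments’ computed values in window space.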

Hmmm, emulating floating point cube maps with a texture rectangle… That fits right into my ‘no more new fixed functionality’ philosophy.

From now on I really would prefer new program instructions over new fixed functionality, as long as it makes sense to do so. However, a few standard filters (bilinear!) implemented for floating point would not be too bad. At least in a high level shading language it would be implemented as a function call, and it doesn’t really matter if the hardware truly supports it.

Is it just me, or does it seem like with all these extensions I should call this nVidiaGL because it seems that I could write a program that uses almost no standard OpenGL by using nVidia’s extensions. The only standard thing left seems to be texture objects!

it doesn’t matter that much if it’s slower, but it should be supported because it makes life easier.

If people were willing to buy a $1000 GPU, putting enough FP MACs and dividers on the chip to perform floating point texture filtering, blending, and mipmapping wouldn’t be an issue.

In the meantime, here’s a Cg routine that will bilinearly filter a float texture for you (I haven’t tested this, but it should work):

float4 FilterFloatBuffer(samplerRECT tex, float2 uv) {
    float4 deltabase;
    deltabase.xy = frac(uv);
    deltabase.zw = floor(uv);
    float4 smp = f4texRECT(tex, deltabase.zw);
    float4 accum = (1.0.xxxx - deltabase.xxxx) * smp;
    smp = f4texRECT(tex, deltabase.zw + float2(1, 0));
    accum = accum + smp * deltabase.xxxx;
    accum = accum * (1.0.xxxx - deltabase.yyyy);
    smp = f4texRECT(tex, deltabase.zw + float2(0, 1));
    float4 tmp = smp * (1.0.xxxx - deltabase.xxxx);
    smp = f4texRECT(tex, deltabase.zw + float2(1, 1));
    tmp = tmp + smp * deltabase.xxxx;
    accum = accum + tmp * deltabase.yyyy;
    return accum;
}
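For reference, the weighting here is ordinary bilinear interpolation, which is easy to check against a plain-Python version (hypothetical row-major sample layout, unnormalized coordinates as with texture_rectangle):

```python
from math import floor

def bilinear(tex, u, v):
    """tex[y][x] are texel values; (u, v) are unnormalized coords,
    texture_rectangle style. Weights come from the fractional part."""
    x0, y0 = int(floor(u)), int(floor(v))
    fx, fy = u - x0, v - y0
    top = tex[y0][x0] * (1 - fx) + tex[y0][x0 + 1] * fx
    bot = tex[y0 + 1][x0] * (1 - fx) + tex[y0 + 1][x0 + 1] * fx
    return top * (1 - fy) + bot * fy
```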

i want my gl2

The first gl2 part doesn’t support any floating point arithmetic at all. Floating point support in GPUs is definitely a case where Moore’s law is a limitation.

[This message has been edited by gking (edited 08-30-2002).]