NV30 Extensions

Adrian,
No, but you’ll have to use render to texture with two textures (one used as a texture holding the previous results, the second as the destination, swapped every pass) and do your blending in the fragment program.
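A minimal sketch of that ping-pong loop, in plain C (set_render_target() and draw_scene_with_shader() are hypothetical placeholders for your own render-to-texture setup, not real API calls):

    void accumulate_passes(GLuint tex[2], int num_passes)
    {
        int src = 0, dst = 1;
        for (int pass = 0; pass < num_passes; ++pass) {
            set_render_target(tex[dst]);            /* hypothetical: make tex[dst] the render target */
            glBindTexture(GL_TEXTURE_2D, tex[src]); /* previous result becomes an input texture */
            draw_scene_with_shader(pass);           /* fragment program samples tex[src] and
                                                       combines it with this pass's output */
            src ^= 1; dst ^= 1;                     /* swap roles for the next pass */
        }
    }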

[This message has been edited by coop (edited 08-30-2002).]

coop, I’m with you. I want transistors to be dedicated to more and faster pixel and vertex pipelines, not to fixed-function operations.

If filtering or blending requires a whole extra set of floating-point units, then I would rather have those units’ transistors go toward building more general floating-point units for the pixel pipelines.

In a high level language, the typical filtering techniques will become standard library functions anyway, so why do we care how they are implemented?

Why would you need blending when you can render to texture? That is a much more general multipass solution.

Originally posted by Nakoruru:
Why would you need blending when you can render to texture? That is a much more general multipass solution.

I agree it’s more general than standard blending, but we often don’t need any sophisticated blending function. I think the most commonly used one is simple addition (src + dest), e.g. accumulating lights in per-pixel lighting. Using render to texture is a little too complicated for me in this case, and it doubles the memory required (unless we can render to a texture that we are texturing from). That’s why I think at least simplified blending should be supported (addition, maybe multiplication).
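For reference, with an ordinary fixed-point framebuffer that additive case is just the following; the question in this thread is what replaces it for float buffers:

    glEnable(GL_BLEND);
    glBlendFunc(GL_ONE, GL_ONE);   /* dest = src + dest: accumulate each light's contribution */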

Coop

We do have plans to make NV_vertex_program2 work well with ARB_vertex_program. However, we also want it to be backwards compatible with NV_vertex_program.

In all honesty, there isn’t a huge difference between NV_vertex_program and ARB_vertex_program. You can even use the ARB APIs to load your NV program – just call ProgramStringARB rather than LoadProgramNV.
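For what it’s worth, the call sequence described here would look roughly like this (a sketch; it assumes the ARB_vertex_program entry points are already obtained, and the helper name is made up):

    #include <string.h>   /* strlen */

    /* Load an existing "!!VP1.0" program string through the ARB entry
       points instead of glLoadProgramNV(). */
    GLuint load_nv_program_via_arb(const char *text)
    {
        GLuint id;
        glGenProgramsARB(1, &id);
        glBindProgramARB(GL_VERTEX_PROGRAM_ARB, id);
        glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                           (GLsizei)strlen(text), text);
        return id;
    }

    /* e.g. the canonical minimal NV vertex program: */
    /* load_nv_program_via_arb("!!VP1.0\nMOV o[HPOS], v[OPOS];\nEND\n"); */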

Any approach would have required a new NV extension, because NV_vertex_program2 has a large number of new instructions and features that are not present in ARB_vertex_program.

I don’t know if it’s already in the spec, but what I believe you can do is write an ARB program and put in a special “OPTION NV_vertex_program2;” statement (I don’t know if that’s the precise syntax, but it’s the right idea). This will let you write VP2 programs using more-ARB_v_p-like syntax.

I suspect this is already in both the spec and emulation driver, but I’m not 100% sure.

I think it’s pretty clear from our inclusion of this option that our intent is to make all this stuff work quite smoothly with the ARB framework. I’m just not 100% clear on what the status of everything is – I haven’t been working on this myself.

  • Matt

By the way, we even made some last-minute changes to NV_fragment_program to make it work better with the ARB_v_p framework – for example, we added numbered local parameters in addition to named parameters. We’ve put quite a bit of effort into making sure that all this stuff works together nicely.

  • Matt

By the way, I’m not saying that there will never be anything that lets you do something along the lines of blending with float buffers. There are a few different approaches, each with advantages and disadvantages.

It’s just that designing hardware is all about tradeoffs.

Let me put it another way. When was the last time anyone ever told us that they do NOT want certain features in their graphics card? If you take all the features that everyone wants and put them all in, you will find yourself up at that 10-billion-transistor count.

  • Matt

Good point Matt, but I have started to get into the habit of saying ‘please do not give me any new fixed functionality, and throw away all the legacy stuff you can.’

On another point, having only 8 texture coordinates with 16 possible textures really makes sense if you think of it as 8 ‘input’ textures and 8 ‘output’ textures (although, of course, they are more general than that).

I really do not think the added complexity of doing what would be a simple glBlendFunc in old OpenGL is a big deal. It seems like programmer laziness to me. glBlendFunc was invented because that’s all the hardware could do; now that you can mix colors from anywhere in any way you want, why would you want an old crutch like glBlendFunc? Use the tools you are given to implement your own purpose-built blend funcs.

I would rather not have it hardwired, because it obscures what your shader program does: it’s an external piece of state, so your shader program could not stand alone. To be understood, it would need the caveat ‘oh, by the way, the result is blended with the frame buffer after this.’ Why not just do something equivalent in the shader program itself?

I would like the shader to be the final word on where colors come from and where they go, and the sooner that stencil tests, alpha tests, depth tests, register combiners, and blending go away, the better.

I know that this is an extreme position, but I think it’s where we are going. However, I am a practical person, so I understand why things are the way they are now, and that some things will always be more efficient if they are hardwired.

The huge optimizations that ATI and nVidia have brought to things like the depth test probably mean that those features are here to stay.

I agree. I think the ability for us to program the blending functions ourselves in a shader program is The Right Thing. I mean, the blending possibilities would be endless. As it is now, they are of course fixed, and sometimes we have to juggle things around a bit to use the fixed-function blending modes in the way that we want. With blending being programmable, we could use some funky math formula to do a blending effect not possible otherwise. Being able to do this may be a little ways off, but I’m quite convinced it is what will need to happen eventually.

In the meantime, I can’t wait for my CineFX NV30 card. I’m so glad I can now emulate it, so I can start on some of the things I want to do that require this card (the ATI 9700 will work just as well, though). That’s if I can figure out how to do the displacement mapping stuff. I hope that feature can be emulated in the drivers right now anyway.

-SirKnight

Okay, so I have to pay for two screen-size frame buffers to do blending. But then I get as much blending as I want, in floating point. Fill rate is no concern, right? :slight_smile:

As far as per-vertex attributes go, and getting them to the fragment program, you can store all the data you want in a look-up table, previously known as a “texture”. Then you can set up the interpolators to spit out weight values for each of the three verts, rather than some post-interpolated value. I believe that with this set-up you’re theoretically limited more by the number of fragment instructions and addressable textures than by the amount of data you can pass from the vertex shader to the fragment shader.
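For example, those per-corner weights can come from simply giving each triangle vertex an identity weight and letting the rasterizer interpolate (an immediate-mode sketch; v0/v1/v2 stand for your triangle’s positions):

    /* Each corner gets a distinct unit weight; after interpolation the
       fragment program sees barycentric weights it can use to blend
       per-vertex data fetched from a texture. */
    glBegin(GL_TRIANGLES);
    glMultiTexCoord3fARB(GL_TEXTURE1_ARB, 1.0f, 0.0f, 0.0f); glVertex3fv(v0);
    glMultiTexCoord3fARB(GL_TEXTURE1_ARB, 0.0f, 1.0f, 0.0f); glVertex3fv(v1);
    glMultiTexCoord3fARB(GL_TEXTURE1_ARB, 0.0f, 0.0f, 1.0f); glVertex3fv(v2);
    glEnd();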

It probably puts a fair bit of load on that fragment processor, though. I’d better go back and start working on reducing overdraw :slight_smile: (although spiffy-Z ought to save my bacon a little bit already)

Originally posted by mcraighead:
[b]I don’t know if it’s already in the spec, but what I believe you can do is write an ARB program and put in a special “OPTION NV_vertex_program2;” statement (I don’t know if that’s the precise syntax, but it’s the right idea). This will let you write VP2 programs using more-ARB_v_p-like syntax.

I suspect this is already in both the spec and emulation driver, but I’m not 100% sure.

I think it’s pretty clear from our inclusion of this option that our intent is to make all this stuff work quite smoothly with the ARB framework. I’m just not 100% clear on what the status of everything is – I haven’t been working on this myself.

  • Matt[/b]

I have been working on this myself. :slight_smile:

If you stick an “OPTION NV_vertex_program2;” at the beginning of your ARB vertex program, the compiler should automatically accept any “!!VP2.0” constructs (condition codes, branch labels and instructions, jump tables, new instructions, and so on).
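Concretely, the program text would then start out something like this (the trivial MOV body is just a placeholder; the VP2-only constructs such as branches and condition codes would go in its place):

    /* ARB_vertex_program text with the NV_vertex_program2 option enabled */
    static const char vp2_text[] =
        "!!ARBvp1.0\n"
        "OPTION NV_vertex_program2;\n"            /* unlocks the !!VP2.0 constructs */
        "MOV result.position, vertex.position;\n"
        "END\n";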

The only things from NV_vertex_program and NV_vertex_program2 that you won’t get automatically are the pre-defined register names (R0-R15, c, v, o). For those who care, I wonder if something like “OPTION NV_program_registers” might give you something like that? :slight_smile:

This path needs to be tested more rigorously, so it is not yet documented in the NV_vertex_program2 or ARB_vertex_program specs. If anyone tries these options and gets something funky, shoot me an email.

I do believe that some sort of “blending-like thing” will show up in the future for float buffers. I just don’t think that it will necessarily work the same way as glBlendFunc.

At the bare minimum, I’d predict that you would not get the ONE_MINUS modifiers the way you do so cheaply with glBlendFunc.

Let me make a suggestion. If what your app wants to do (which is a fairly common sort of thing) is to composite N light sources on top of one another with high dynamic range, then there’s a good way to do this. (The obvious case where you need this is shadow volumes, where you only really get to do one light source at a time.)

Create a double-buffered float (probably 64-bit, since you probably don’t need full IEEE for lighting computations) pbuffer, with a depth buffer. First, render your whole scene into depth, with color disabled. Then, do all subsequent passes with depth writes off.

On the first pass, render into your “front buffer” (scare quotes to indicate that it’s not visible, because it’s a pbuffer) for the first light. Use your normal shader.

On the second pass, bind the front buffer as a texture using RTT, and use WPOS or the like as your texture coordinate, and render into the back buffer. Use a slightly modified shader; you will texture out of the front buffer and add your lighting computation into that result.

From there on, just alternate between the two buffers. When you’re done, do some sort of fancy HDR processing into your real window.
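In rough C, the pass structure is something like the sketch below. The pbuffer creation and the shaders are omitted, draw_scene_depth_only(), draw_scene(), and bind_light_shader() are hypothetical helpers, and the GL_EQUAL depth test is one common choice for lighting passes after a depth-only pass:

    /* Ping-pong light accumulation between the front and back buffers of a
       single float pbuffer, per the description above. */
    void accumulate_lights(HPBUFFERARB pbuf, int num_lights)
    {
        /* depth-only pre-pass, color writes disabled */
        glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
        glDepthMask(GL_TRUE);
        glDrawBuffer(GL_FRONT);
        draw_scene_depth_only();

        /* all lighting passes: depth writes off, only visible pixels shaded */
        glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
        glDepthMask(GL_FALSE);
        glDepthFunc(GL_EQUAL);

        GLenum dst     = GL_FRONT;              /* first light goes to the "front" */
        int    src_wgl = WGL_BACK_LEFT_ARB;     /* buffer to read back via RTT */

        for (int i = 0; i < num_lights; ++i) {
            glDrawBuffer(dst);
            if (i > 0)
                wglBindTexImageARB(pbuf, src_wgl);   /* previous sum as a texture */
            bind_light_shader(i, i > 0);             /* later passes fetch that sum at
                                                        WPOS and add this light to it */
            draw_scene();
            if (i > 0)
                wglReleaseTexImageARB(pbuf, src_wgl);

            /* swap the roles of front and back for the next pass */
            src_wgl = (dst == GL_FRONT) ? WGL_FRONT_LEFT_ARB : WGL_BACK_LEFT_ARB;
            dst     = (dst == GL_FRONT) ? GL_BACK : GL_FRONT;
        }
        /* finally: bind whichever buffer was written last and run the HDR
           tone-mapping pass into the visible window */
    }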

This is almost as good as real additive blending. It costs some extra memory, but it avoids some of the ugly synchronization problems that could show up if you were to texture out of the same surface you were rendering to, i.e., effectively “blend”. (Hint: there’s a data hazard when you have a deep graphics pipeline.) Although, in this case, I think you might actually get lucky, because you did the Z pass first and so would only hit each pixel once.

Note that the cleverness here has to do with the fact that front and back of a given pbuffer share the same depth buffer.

  • Matt

Originally posted by Nakoruru:
Good point Matt, but I have started to get into the habit of saying ‘please do not give me any new fixed functionality, and throw away all the legacy stuff you can.’

Nice post Nakoruru. I’ve been thinking this way myself, starting with the ideas they put forth in the Geforce3. I would rather they got rid of the fixed-function pipeline altogether and have the drivers build custom VP’s on the fly to emulate it, if it would free up transistors that could be used for more programmability.

I’d trade some speed for the additional flexibility, but I can also understand that end users usually only see the speed side of the equation.

– Zeno

Oh yeah. On the topic of displacement mapping, there is at least one way to accomplish this. Probably more ways exist that I haven’t thought of.

It will sound slow at first; bear with me.

Render into a float buffer surface, using whatever sort of cool fragment program computation you want to displace your vertices. Your “color” output in RGB is just your vertices’ XYZ position.

Use ReadPixels. Then point your vertex array pointers at the “pixel” data you just read back, and blast it back into the HW as vertices.

Slow because of ReadPixels? Not really, at least if you use the (new) NV_pixel_data_range extension. Use wglAllocateMemoryNV to get some video memory. ReadPixels into that video memory using what is known as a “read PDR”, and then use VAR to draw the vertices. No bus traffic required.

Your indices can just be constants that represent your surface topology.
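A sketch of that readback path, under the assumption that vidmem was allocated once with wglAllocateMemoryNV() using the video-memory hint (e.g. readFrequency 0, writeFrequency 0, priority 1.0) and is large enough, and that W, H, grid_indices, and num_indices describe your grid:

    /* Read the rendered positions straight into video memory, then draw
       from that same memory as a vertex array (read PDR + VAR). */
    void draw_displaced_grid(void *vidmem, GLsizei W, GLsizei H,
                             const GLuint *grid_indices, GLsizei num_indices)
    {
        GLsizei bytes = W * H * 3 * sizeof(GLfloat);

        /* "read PDR": ReadPixels may transfer directly into this range */
        glPixelDataRangeNV(GL_READ_PIXEL_DATA_RANGE_NV, bytes, vidmem);
        glEnableClientState(GL_READ_PIXEL_DATA_RANGE_NV);
        glReadPixels(0, 0, W, H, GL_RGB, GL_FLOAT, vidmem);   /* XYZ was rendered as RGB */

        /* the same memory, now treated as a vertex array via VAR */
        glVertexArrayRangeNV(bytes, vidmem);
        glEnableClientState(GL_VERTEX_ARRAY_RANGE_NV);
        glVertexPointer(3, GL_FLOAT, 0, vidmem);
        glEnableClientState(GL_VERTEX_ARRAY);

        /* constant indices describing the grid topology */
        glDrawElements(GL_TRIANGLES, num_indices, GL_UNSIGNED_INT, grid_indices);
    }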

  • Matt

Originally posted by mcraighead:
[b]Create a double-buffered float (probably 64-bit, since you probably don’t need full IEEE for lighting computations) pbuffer, with a depth buffer. First, render your whole scene into depth, with color disabled. Then, do all subsequent passes with depth writes off.

On the first pass, render into your “front buffer” (scare quotes to indicate that it’s not visible, because it’s a pbuffer) for the first light. Use your normal shader.

On the second pass, bind the front buffer as a texture using RTT, and use WPOS or the like as your texture coordinate, and render into the back buffer. Use a slightly modified shader; you will texture out of the front buffer and add your lighting computation into that result.

From there on, just alternate between the two buffers. When you’re done, do some sort of fancy HDR processing into your real window.[/b]

Matt, I’m glad you touched on this topic. The scenario you described is not allowed (or, at least, not guaranteed) by the WGL_ARB_render_texture specification. There is an unfortunate limitation:

(Issues section)
14. What happens when the application binds one color buffer of a pbuffer
to a texture and then tries to render to another color buffer of the
pbuffer?

If any of the pbuffer's color buffers are bound to a texture, then
rendering results are undefined for all color buffers of the pbuffer.

I must say I can’t imagine any technical reason that would justify this. And I wouldn’t be surprised if it just worked as “expected” in existing drivers, despite violating the spec. Well, the “undefined” result could happen to be exactly the same as the “expected” one, right?

I need to have several screen-size color buffers for multipass purposes. I investigated 3 options:

  1. Create n pbuffers, each with its own color + depth buffers, and use WGL_ARB_render_texture.
    This option obviously sucks, because it would ruin all the early-Z-culling benefits.

  2. Use single standard color + depth frame buffer, and do CopyTexImage a lot.
    This is what I currently do.

  3. Create one pbuffer with multiple color buffers inside (FRONT/BACK/LEFT/RIGHT/AUXi…) plus one shared depth buffer, and use WGL_ARB_render_texture.
    This would be ideal, but then I learned about the limitation mentioned above. I didn’t perform any tests, because I didn’t want my app to rely on undefined results anyway. So for now I’m staying with option 2.

My questions:
a) Is there any chance of this limitation being removed from the spec, or patched with something like a WGL_ARB_render_texture2?
b) Is it well supported (read: accelerated) to create many color buffers within a single pbuffer (FRONT/BACK/LEFT/RIGHT/AUXi…)?

Originally posted by mcraighead:
[b]Slow because of ReadPixels? Not really, at least if you use the (new) NV_pixel_data_range extension. Use wglAllocateMemoryNV to get some video memory. ReadPixels into that video memory using what is known as a “read PDR”, and then use VAR to draw the vertices. No bus traffic required.

  • Matt[/b]

Does this mean VAR will allow us to allocate more than one array of memory (one for AGP memory and one for video memory, for example)? And will it be possible to switch between those arrays quickly?

Thanks.

Matt, you made it! I could KISS you if you were a) female, hehe, and b) somewhere around here…

finally we will be able to “render into vertex buffers”… it’s actually the most awesome feature of the NV30 imho…

btw, i hope this becomes possible in DX as well, as i can’t use OpenGL everywhere…

anyways, this will rock… it’s the most advanced step forward imho somehow; it will give you the power to do extremely complex calculations fully on the gpu… hehe, can’t wait for it… hehe

I would like to discuss another side of the fragment program: performance. With the register combiner model, it was (and is) straightforward to predict the performance implications, since it is a fixed operator pipeline with a programmable routing model. If you know how many stages exist in hardware, you know where you stand (although you can get into re-iteration, where it slows to one ‘pass through’ every two clocks). With the texture shader, the situation was pretty much the same. Again, you have to keep in mind that there are only 2 TMUs (not 4), but it is again a pipeline model.

Now, with the fragment program, how can I predict performance? For example, how many TMUs are there? I guess not 16… Does each program instruction take one clock? What about stalls? What happens if I do 3 consecutive TEX (or TXD or TXP) instructions and there are only 2 TMUs? Nobody is even saying how many TMUs exist in there… What about data hazards, i.e. one instruction using the output of the previous instruction as input? Should I worry about instruction ordering to avoid data stalls, like you do when writing assembler for a CPU? I’m afraid the answer will be “use Cg, we in the back end know better than you how things really work deep down, and will optimize for it.” But does Intel keep CPU details secret and tell people to just use C/C++/VB?

it’s the most advanced step forward imho somehow

Well, it isn’t that advanced (except for the floating-point part). After all, it’s not NV30-specific. I’ll bet even a GeForce1 implements NV_PDR.

it’s quite advanced in the features it can provide… you can render into geometry (only useful if you can render floats anyway)… the possibilities are awesome, from auto-updated, gpu-animated meshes, to particles running fully on the gpu (even interacting with geometry, sort of, hehe), to actually quite helpful tools for implementing raytracing; it provides tons of new possibilities…
updating the whole water animation in hw, letting the water surface move… etc…

only useful if you can render floats anyway

Not true, especially with vertex shaders. Granted, an 8-bit per component position value doesn’t offer much precision, but it’s there, it works on older hardware, and it’s faster than rendering to a floating-point buffer, even on newer hardware.

Besides, I’ve never been particularly impressed with doing things like particle systems on the GPU. It’s a waste of resources: using something to perform a task it is not optimized for, rather than performing that task on the CPU while rendering other stuff on the GPU. Rather than wasting precious GPU time on animating a mesh, I’d rather give that mesh more vertices/effects and do the animation concurrently on the CPU. The overall graphics quality of the rendering will be better, as will overall performance.