NV30 Extensions

Originally posted by Korval:
[b] Not true, especially with vertex shaders. Granted, an 8-bit per component position value doesn’t offer much precision, but it’s there, it works on older hardware, and it’s faster than rendering to a floating-point buffer, even on newer hardware.

Besides, I’ve never been particularly impressed with doing particle systems and other such things on the GPU. It’s a waste of resources: using hardware to perform a task it is not optimized for, rather than performing the task on the CPU while rendering other stuff on the GPU. Rather than wasting precious GPU time on animating a mesh, I’d rather give that mesh more vertices/effects and do the animation concurrently on the CPU. The overall graphics quality of the rendering will be better, as will overall performance.[/b]

Yeah, depending on the quality of the data you need, you can use it even today…

It’s your problem if you’re not interested in getting physics and such stuff off the CPU as well (it’s faster on the GPU, too)…

At least it’s fun to have… we’ll find useful things to do with it for sure…

Ugh, ARB_render_texture says that? Kinda lame.

You can certainly cause bad things to happen if you aren’t careful with ARB_render_texture. If you render into one of the levels that is being textured from, then you have created a nasty data hazard. The results will vary across different hardware.

But this is probably a case where the spec should have been more careful about leaving things defined whenever possible.

One thing I’m pretty sure is left undefined by ARB_render_texture is what happens if you render into one mipmap level while texturing from another. This can be useful to implement funky mipmap generation algorithms – texture from level n and render into n+1. After all, plain old averaging of everything is not always correct.
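
For what it’s worth, the experiment would look roughly like the sketch below. It’s only a sketch, and as noted the spec doesn’t guarantee the behavior; the pbuffer is assumed to be a mipmapped render-texture pbuffer, and drawFilterQuad() stands in for whatever filtering pass you want.

[code]
/* Sketch: texture from mip level n while rendering into level n + 1 with
   WGL_ARB_render_texture.  The spec leaves this case undefined, so treat it
   strictly as an experiment.  Assumes the pbuffer's GL context is current,
   the pbuffer's texture is bound to GL_TEXTURE_2D, and w0/h0 are the
   level-0 texture size.  Extension entry points (wglext.h, obtained via
   wglGetProcAddress) and error checking are omitted. */

extern void drawFilterQuad(void);   /* placeholder for the filtering pass */

void filter_into_next_level(HPBUFFERARB pbuffer, int n, int w0, int h0)
{
    int attribs[] = { WGL_MIPMAP_LEVEL_ARB, n + 1, 0 };
    wglSetPbufferAttribARB(pbuffer, attribs);         /* render into level n+1 */

    /* clamp sampling to level n so we never read the level being written */
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_BASE_LEVEL, n);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAX_LEVEL,  n);

    wglBindTexImageARB(pbuffer, WGL_FRONT_LEFT_ARB);  /* source texels */
    glViewport(0, 0, w0 >> (n + 1), h0 >> (n + 1));   /* size of level n+1 */
    drawFilterQuad();                                 /* custom downsample */
    wglReleaseTexImageARB(pbuffer, WGL_FRONT_LEFT_ARB);
}
[/code]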

I would be 90% certain that what I proposed would work. If you can’t get it to work, let me know (via email).

On the topic of displacement mapping: the PDR/VAR algorithm I’m proposing is not without its flaws. I’m really not sold on it myself; I’m merely proposing it as something you might play with.

Hugues Hoppe has a Siggraph paper this year on geometry images. What I’m proposing is really just another geometry image algorithm.
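
Stripped of PDR/VAR (which are really just there to make the copy fast and asynchronous), the round trip looks something like the sketch below; drawDisplacementPass() is a stand-in for whatever pass writes one displaced position per pixel, and drawing GL_POINTS stands in for indexing a real mesh topology.

[code]
/* Bare-bones "geometry image" round trip: render displaced positions into
   the framebuffer, read them back, and feed them to GL again as a vertex
   array.  With an 8-bit framebuffer the positions are heavily quantized;
   a float buffer (or PDR/VAR to hide the copy) is what makes it practical. */
#include <GL/gl.h>
#include <stdlib.h>

extern void drawDisplacementPass(GLsizei w, GLsizei h);  /* placeholder */

void displace_and_draw(GLsizei w, GLsizei h)
{
    GLfloat *verts = (GLfloat *) malloc(w * h * 3 * sizeof(GLfloat));

    /* pass 1: write one displaced (x, y, z) per pixel */
    drawDisplacementPass(w, h);

    /* read the positions back; the CPU waits here for the GPU to finish */
    glReadPixels(0, 0, w, h, GL_RGB, GL_FLOAT, verts);

    /* pass 2: treat the readback as an ordinary vertex array */
    glEnableClientState(GL_VERTEX_ARRAY);
    glVertexPointer(3, GL_FLOAT, 0, verts);
    glDrawArrays(GL_POINTS, 0, w * h);   /* points only for illustration */
    glDisableClientState(GL_VERTEX_ARRAY);

    free(verts);
}
[/code]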

  • Matt

It’s your problem if you’re not interested in getting physics and such stuff off the CPU as well (it’s faster on the GPU, too)…

It may take less real time on the GPU, but, ultimately, the program loop (especially for games) needs the results of the physics. Unless you’re going to put your entire program loop up there too, in which case you’re wasting your CPU. Remember, CPUs are getting more powerful (and are certainly cheaper) than GPUs. Having the CPU compute physics/animation/AI/etc. while the GPU works on rendering will give you the same result in half the time. Not only that, you can send more stuff to the GPU, so that your graphics look much better.

Maybe for a demo, putting physics on the GPU is a good idea. But it has no real practical applications in the real world.

Originally posted by Korval:
But it has no real practical applications in the real world.

never say never…

I see quite a good use: I can simulate rain fully on the hardware, all the rain particles… so rain is now no problem, except that it needs some fillrate. What do I gain from this? I can let it rain while my CPU does heavy calculations for useful stuff… rain doesn’t hurt, except possibly the fillrate due to the blending of the raindrops…

There are ways to use it, and ways to use it usefully as well. I only brought up physics because it shows how much the features of this extension can give: for the first time you can actually process geometry on the GPU and store that information… and that is powerful…

Putting the physics of rain on the GPU seems a little out of place. That, or you have some very simplistic rain physics. Does your rain interact with the world? Does it splatter on rooftops? Can it form puddles and streams?

Originally posted by Korval:
Remember, CPUs are getting more powerful (and are certainly cheaper) than GPUs.

I could hardly disagree more with both halves of this sentence.

CPUs are not getting more powerful than GPUs. Quite the contrary: GPUs are increasing their computational lead, doubling in speed at three times the rate of CPUs. Currently, the best Pentium can perform, what, about 6 GFLOPS? Meanwhile, the best graphics card that I can find data for (the GeForce4) can perform about 120 GFLOPS.

CPUs are also not getting cheaper than GPUs. Let’s compare the latest:

Pentium 4, 2.8 GHz - $537 (Pricewatch)
Radeon 9700 - $399 (ebgames.com)

Not only is the CPU not cheaper in the absolute sense, but it also doesn’t come with 128 MB DDR RAM and has far fewer transistors on the chip itself. This makes the price disparity even greater, IMHO.

Given all this, I’d say that it’s probably wise to offload anything from the CPU that doesn’t require global interactions or complex condition testing.

– Zeno

[This message has been edited by Zeno (edited 09-01-2002).]

Originally posted by Zeno:
Currently, the best Pentium can perform, what, about 6 GFLOPS? Meanwhile, the best graphics card that I can find data for (the GeForce4) can perform about 120 GFLOPS.
Unfair comparison, IMO. First of all, it’s GOPS, without an F in between.
Most of it is blending operations (register combiners and stuff) in integer space. It’s not readily at your service.

There is certainly a lot of muscle in graphics chips, but it’s not as freely available as it is in CPUs.

That’s not saying that you shouldn’t use it when it makes sense. But the numbers can’t be compared. Benchmark it, then decide what’s better.

Originally posted by zeckensack:
Unfair comparison, IMO. First of all, it’s GOPS, without an F in between.
Most of it is blending operations (register combiners and stuff) in integer space. It’s not readily at your service.

First, what you are saying, zeckensack, is in direct opposition to what NVIDIA says. Here is a snippet from their GeForce3 press release:

The GeForce3 is the world’s most advanced GPU with more than 57 million transistors and the ability to perform more than 800 billion operations per second and 76 billion floating point operations per second (FLOPS).

Now, I know I have seen the number 120 GFLOPS pertaining to the GeForce4; I just can’t find that video right now on their web site. It’s not unreasonable that the GeForce4’s T&L is 1.6 times as fast as the GeForce3’s.

Second, it’s not an unfair comparison in the context of this discussion, since what we’re arguing about here is whether the GPU is faster than the CPU at anything it CAN do. Moving particles of rain according to a vector equation like x’ = x + v*t is certainly something the GPU would be good at.
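
Just to make the workload concrete, this is all that has to happen per raindrop each frame (the Particle layout and dt are only for illustration):

[code]
/* x' = x + v*t, once per raindrop -- the loop being argued over */
typedef struct { float pos[3]; float vel[3]; } Particle;

void update_particles(Particle *p, int count, float dt)
{
    int i;
    for (i = 0; i < count; ++i) {
        p[i].pos[0] += p[i].vel[0] * dt;
        p[i].pos[1] += p[i].vel[1] * dt;
        p[i].pos[2] += p[i].vel[2] * dt;
        /* gravity could be folded in with something like
           p[i].vel[1] += gravity * dt; */
    }
}
[/code]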

– Zeno

Originally posted by Zeno:
Now, I know I have seen the number 120 GFLOPS pertaining to the GeForce4; I just can’t find that video right now on their web site. It’s not unreasonable that the GeForce4’s T&L is 1.6 times as fast as the GeForce3’s.
Then it must be my fault. I was under the impression that this number was about ops, not flops. Looks like I was wrong. Point taken.

Second, it’s not an unfair comparison in the context of this discussion, since what we’re arguing about here is whether the GPU is faster than the CPU at anything it CAN do. Moving particles of rain according to a vector equation like x’ = x + v*t is certainly something the GPU would be good at.

– Zeno
I’ll try again.
I’d be perfectly comfortable comparing vertex shader ops with CPU ops. That’s an area where you can make tradeoffs. I’d also be comfortable with pixel shader ops, as long as they feed back into geometry. And that’s not an option on the current generation, which is where the numbers came from.

You can’t trade CPU ops for pixel shader ops because software pixel processing just isn’t an option.

It will be comparable on the next generation, but that next generation will come with a new set of numbers.

And even then I’m quite sure that a big portion of these flops will be in fixed-function hardware such as z iterators, 1/w iterators, and the triangle setup and clipping hardware, which you cannot use for anything else.

Or to offer a different perspective: 136M verts/sec seems to be the transform maximum of the GeForce4 Ti 4600 (proof). That’s 3.8 GFlops in my book*. Where’s the rest? It doesn’t get any faster when you start using more complex vertex operations. And there aren’t any other FP units under your control on a GF4.
I’m not saying the remaining flops aren’t there, I’m just saying that they’re not at your disposal as they already have specific work to do.

* assuming bare 3-float vertices and a single matrix mult per vertex (modelview/projection combined, no normal processing); a vertex matrix mult is 28 flops, 16 mults and 12 adds
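
Spelling out the arithmetic behind that figure: 136 × 10^6 verts/sec × 28 flops/vert ≈ 3.8 × 10^9 flops/sec, i.e. roughly 3.8 GFlops.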

Good point… I can’t argue with your calculation. Most of the flops seem to be tied up in the clipping/triangle-setup areas, which can’t be forced to do anything else.

Given this, I have to concede that the fastest CPUs would be better at most geometry-level stuff than the fastest GPUs, assuming that’s all they had to do. If not, benchmark, like you said.

– Zeno

Given all this, I’d say that it’s probably wise to offload anything from the CPU that doesn’t require global interactions or complex condition testing.

Why? That just takes valuable rendering time away from actual rendering. As long as my CPU time doesn’t exceed the frame time I’m targeting, offloading CPU work to the GPU is rather useless.

Not only that, things that require global interactions are pretty much everything that isn’t rendering. Using a game as an example, the AI feeds into the animations, which feed into the physics, which feeds back into the AI. Unless you can throw the entire thing up there, you won’t get any real substantive benefit out of it. And most of the really CPU-heavy stuff (outside of frustum-culling algorithms like BSPs and so forth) is in AI, animation, or physics. And, remember, these cards still live across the PCI bus. Transferring data back from them for use by the CPU is going to be a very slow proposition.

Now, if it were possible, one thing that would be good to do would be to somehow put visibility culling onto the GPU (a significant source of performance degradation/CPU bottlenecking in many games). The problem with that is that vertex programs run per-vertex, not per-object, which has the inherent problem of making them run much slower than necessary.

The thing about offloading stuff to the GPU is that the GPU is the only thing that can render. By offloading this processing there, you are guaranteed to lose rendering time.

As for the comparative expense of CPUs vs. GPUs, I was only considering Athlons. P4s are priced excessively high, even taking into account their performance gains over Athlons.

As for the power argument: FLOPS, schmops. You still have to implement it on hardware that was, fundamentally, not designed to handle this kind of processing. You don’t have a lot of data to work with, and the read-back strategy requires waiting until the read-back is finished (you can do other things asynchronously, but the read itself will still take time). It may be able to perform a great number of operations per second, but understand that many of those ops (like scan conversion and texture-coordinate interpolation) probably aren’t going to be of much use to a physics system.
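
To put the read-back point concretely, a sketch (simulate_on_gpu() and do_other_cpu_work() are placeholders):

[code]
/* Where the stall happens when the CPU needs the GPU's results back. */
#include <GL/gl.h>

extern void simulate_on_gpu(void);    /* placeholder: queues the GPU passes */
extern void do_other_cpu_work(void);  /* placeholder: work that does not
                                         depend on the GPU results */

void step_frame(GLsizei w, GLsizei h, GLfloat *results)
{
    simulate_on_gpu();      /* returns almost immediately, work is queued */
    do_other_cpu_work();    /* this part really does overlap with the GPU */

    /* the CPU now blocks until the GPU has finished the simulation pass
       and the data has come back across the bus */
    glReadPixels(0, 0, w, h, GL_RGBA, GL_FLOAT, results);

    /* only past this point can AI/physics/game logic that depends on
       'results' continue */
}
[/code]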

Originally posted by mcraighead:
Ugh, ARB_render_texture says that? Kinda lame.

Yeah, I thought ARB_render_texture was the answer to my prayers, until I read that.

Lifting this restriction would be really, really, really great!

The other thing I don’t like about this extension is the WGL_ bit. How might one go about getting such functionality running under Linux?

I dislike the whole render-texture stuff… in fact I dislike every part of that WGL stuff… it’s not portable. I want a simple render-to-texture which is portable… is that so difficult?
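
The portable fallback I know of today is the copy: render into the back buffer, then copy into the texture. A rough sketch (drawSceneToBackBuffer() is a placeholder, and tex must already be a w × h texture):

[code]
/* Portable "render to texture" without any WGL/GLX extension: draw into the
   back buffer, then copy into the texture.  Works everywhere, but you pay
   for the copy. */
#include <GL/gl.h>

extern void drawSceneToBackBuffer(void);  /* placeholder: whatever should end
                                             up in the texture */

void render_to_texture(GLuint tex, GLsizei w, GLsizei h)
{
    drawSceneToBackBuffer();
    glBindTexture(GL_TEXTURE_2D, tex);
    /* copy the lower-left w x h of the framebuffer into level 0 of tex;
       tex must already have been defined with glTexImage2D at w x h */
    glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, w, h);
}
[/code]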

The answer to that is GLX_ARB_render_texture.

  • Matt

Originally posted by mcraighead:
[b]The answer to that is GLX_ARB_render_texture.

  • Matt[/b]

Has that been ratified by the ARB yet?

The only references Google can find to it are some ARB minutes from June 2001 saying basically that the spec. hadn’t been finished, for no apparent reason.

I really don’t know.

  • Matt

Originally posted by nutball:
[b] Has that been ratified by the ARB yet?

The only references Google can find to it are some ARB minutes from June 2001 saying basically that the spec. hadn’t been finished, for no apparent reason.[/b]

There is no GLX render texture spec approved by the ARB as far as I know.

Originally posted by davepermen:
[b]Finally we will be able to “render into vertex buffers”… it’s actually the most awesome feature of the NV30, IMHO…

BTW, I hope this becomes possible in DX as well, as I can’t use OpenGL everywhere…

Anyway, this will rock… it’s the most significant step forward somehow, IMHO; it will give you the power to do extremely complex calculations fully on the GPU… hehe, can’t wait for it… [/b]

Out of interest, that is the same method P10 uses to do displacement mapping:

The displacement lookup (and optionally the tessellation) is done by the texture subsystem, and the results are left in memory where they can be read just like a regular vertex buffer. On the second pass the vertex shader picks up the displaced vertices and lights them, and then they get processed as normal. This is a good example of using the flexibility of the SIMD arrays for more than just their default purpose.

http://www.beyond3d.com/articles/p10tech/index.php?page=page6.inc

[This message has been edited by evanGLizr (edited 09-05-2002).]