Carmack .plan: OpenGL2.0

Originally posted by barthold:

OpenGL2 is the direction of the ARB (see the last ARB meeting minutes).

When will those damn minutes be online ???

Julien
human monitor of the ARB page :cry:

Blending can NEVER do dot products (except with 12 or 24 passes, I don't remember… and with clamping errors…), so you have to split the dot products across the passes and hope they aren't needed directly, and all that…

You are thinking about details too much. Think of it like a CPU. You can do whatever you want. You can pretty much dream up the math, and it works. Your only enemy is serialization.

This is where the (future) gfx HW comes in. You write the code, and the HW figures the rest out. You don't even need to know how many TMUs there are. There could be only 4 TMUs but 16 parallel math processors, or there may be 32 TMUs. Those are details we don't need to know about. We just write code as if it will be serialized, and parallelism is taken care of by the HW. That's not to say, though, that you can't write the code to be friendly to the HW, to get maximum parallelism.

In the future, I don’t think we will even be thinking of it as “passes”. There will be no more passes. We will think of it in terms of how much parallelism a piece of code will get on a particular piece of HW.

I know this sounds crazy, but I'm pretty sure this is the future. It might be 3-4 years before it takes effect, though. Or maybe I'm just crazy.

In the meantime, I'm just dreaming up ways the current drivers could possibly implement this functionality on existing HW, to kind of ease the transition.

davepermen - you should lay off the caffeine, you're sounding a little manic… and you know about as much about the internal details of a 3D card as I do (i.e. about 10%).

It's refreshing to hear this kind of vision from 3Dlabs. I agree entirely. The whole point of OpenGL 1.0 was its transparency - the programmer didn't need to know how the triangle was textured, it just was textured… just as I don't need to know how many colours a context is rendering in, I just give values from 0 to 1. The hints mechanism was designed to give hints, not explicitly tell the driver what to do.

Now, tell me why there isn't a dot product framebuffer blending mode. It would solve some problems in the short term, so give it to us!
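For what it's worth, the closest thing available today puts the dot product in the texture environment rather than in the blend stage. A minimal sketch, assuming ARB_multitexture, ARB_texture_env_combine and ARB_texture_env_dot3 are exposed (normalMap is a placeholder texture object, and the light vector is assumed to arrive range-compressed in the primary color); the blend stage itself still only offers fixed scale factors, which is exactly the complaint:

/* Dot product in the texture environment (ARB_texture_env_dot3):
   computes dot(texture RGB, primary color RGB), both treated as
   range-compressed [-1,1] vectors. */
glActiveTextureARB(GL_TEXTURE0_ARB);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, normalMap);   /* placeholder normal map */
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB,  GL_DOT3_RGB_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB,  GL_TEXTURE);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_ARB,  GL_PRIMARY_COLOR_ARB);

/* The blend stage, by contrast, can only scale and add: there is no
   dot-product blend factor or blend equation, so a "dot against the
   framebuffer" has to be faked with extra passes. */
glEnable(GL_BLEND);
glBlendFunc(GL_ONE, GL_ONE);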

> I expect that – in practice – artists
> will be writing different shaders for
> different generations of hardware.

I expect that – in practice – artists won’t be writing shaders at all. Artists will be configuring 3dsmax or Maya, or whatever, using the knobs that these tools provide. Then it’s up to programmers to turn those knob settings into shaders.

You are thinking about details too much.

Why is that so terrible? Because it tears your vision of the future apart? Well, somebody’s got to look at the details. If it’s not us, then it’s somebody else.

Think of it like a CPU. You can do whatever you want. You can pretty much dream up the math, and it works.

A graphics card is not a CPU, nor should it ever become one. At best, it will become 2 CPUs: a per-vertex processor and a per-fragment processor. Precisely what they can do will be limited for performance reasons.

You see, the closer graphics chips get to full-fledged CPUs, the faster they lose their one advantage over CPUs: performance. The only reason we aren't all writing software rasterisers is that graphics chips do the job faster. Adding all this "programmability" will simply slow them down (or drive prices up).

There are reasons why programmable texture filtering is not something hardware developers are even considering, and reasons why every "displacement mapping" technique being proposed works on vertices rather than fragments. The reason, in each case, is performance.

If you look at the Stanford shader, which is the closest thing currently available to what you're asking for, even it has limitations: it will reject some shaders on some hardware as being too complex. You'll also find that its performance leaves much to be desired, both on the compiling end and on the running end.

The only way to make something like this even remotely feasible (for shaders that will actually compile on the hardware rather than simply being rejected) is to have some mechanism that tells the user exactly what resources will be used. And I mean exactly: from the number and sizes of extra buffers that will be allocated, to the overall cycle count per pixel/vertex, to the number of texture accesses/filters that will be used.
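Purely to illustrate the kind of reporting meant here (this is invented, not a real or proposed API; every name below is hypothetical), something a driver could hand back after compiling a shader:

/* Hypothetical only: a resource report the driver could return after
   compiling a shader, so the application knows the exact cost. */
typedef struct {
    int passes;                    /* number of passes the shader was split into */
    int auxBuffers;                /* extra buffers allocated for intermediates  */
    int auxBufferBytes;            /* total size of those buffers                */
    int textureFetchesPerFragment;
    int cyclesPerFragment;         /* worst-case cycle count per fragment        */
    int cyclesPerVertex;
} ShaderResourceReport;

/* Hypothetical entry point, in the spirit of glGetIntegerv: */
void glGetShaderResourcesHYP(unsigned int shader, ShaderResourceReport *out);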

This doesn't remind me of the assembly vs. C/C++ arguments. It reminds me of arguing with people over the feasibility of a CPU/computer that natively understands C/C++. I don't mind having a compiler layer between my C and the assembly.

What I would not be averse to seeing is an off-line shader compiler that generates a "shader object" file (a .c file). The application can link the shader in and tell it to run on a given set of data (via a reasonable OpenGL-esque interface). That makes it easy to see exactly what the shader will need. Not only that, it makes it easy to go in and optimize a shader by doing certain operations a different way.

I could see each OpenGL implementer providing a back-end module for the compiler that generates optimized C code for their particular hardware.
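To make the idea concrete, the generated .c file's interface might look roughly like this; everything here is hypothetical, just the shape such a compiler's output could take, not the output of any existing tool:

/* bumpdiffuse_shader.c -- hypothetical output of an off-line shader
   compiler targeting one specific implementation. The application
   links this in and drives it through a small OpenGL-esque interface. */

typedef struct BumpDiffuseShader BumpDiffuseShader;

/* Allocates whatever GL objects (textures, combiner setups, display
   lists) the compiled shader needs on this hardware. */
BumpDiffuseShader *bumpdiffuse_create(void);

/* Binds per-material inputs by name; the compiler decided at build
   time which texture unit or combiner constant each one maps to. */
void bumpdiffuse_set_texture(BumpDiffuseShader *s, const char *name, unsigned int texObj);
void bumpdiffuse_set_vec4(BumpDiffuseShader *s, const char *name, const float v[4]);

/* Issues however many passes the compiler generated, setting GL state
   for each one and re-submitting the caller's geometry. */
void bumpdiffuse_draw(BumpDiffuseShader *s, void (*emitGeometry)(void *user), void *user);

void bumpdiffuse_destroy(BumpDiffuseShader *s);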

[This message has been edited by Korval (edited 07-01-2002).]

Hum, you got me wrong (and no coffee here…). The guy wants a general multipass shader compiler for today's hardware. Today's hardware has parts that are too restricted and too different from one another; that future hardware will drop these "faults" is pretty logical. Pixel shading and framebuffer access may well merge at some point, so that we always have access to the framebuffer "texture" if we enable blending, for example. But on today's hardware, getting a shader to run automatically over several passes is just impossible to do fast.

The Stanford shader language demos work at home as well as they do here, but they don't run smoothly at all on a GeForce2 MX, and that's for simple one-mesh demos. Doing the right optimizations to make the shaders fast on specific hardware means knowing the hardware's capabilities down to the last bit. And on today's hardware (okay, I only know the GF2 MX by heart, as I've never coded for a GF3 or such yet… sooner or later a GF4 Ti 4200, we'll see), setting up register combiners and blend funcs and everything is simply too complex for a compiler. I've helped optimize quite a lot of register combiner setups, and I've seen tons of restrictions of this and that here and there, and well… that thing is NOT like a CPU at all.

For the vertex shader callbacks, yes, they can be split into multipass, I think… you set up tex0 to tex8, for example, and the shader gets compiled into two callbacks, one for pass 0 and one for pass 1, each of which only sets up the texcoords its pass needs.

But for today's pixel shading capabilities (mostly multipass pixel shading; see the huge Doom 3 topics about how to implement a general lighting equation and you'll see where the real problems lie with today's hardware and "generic programming", which is what your shader would then have to be… sort of), I don't think there is a general AND FAST way… (Well… on the Radeon it can already be generalized quite a bit with multipass, just looking up textures in dependent ways wherever you want, I think, but on a GF3 or GF4 the dependent texture accesses with texture shaders are quite restricted anyway…)

PROVE ME WRONG.
I haven't seen any working implementation from you yet…
http://tyrannen.starcaft3d.net/PerPixelLighting

Your compiler has to generate something like that with an --optimize:speed flag: one pass for the equation I plugged in, two passes for --optimize:no_approx (that's for the GF2 MX; for GF3 and up it would then set up texture shaders and do the math in one pass with --optimize:speed, so that normalizations come out wrong, and with --optimize:accurate it would set up multipass with normalizations and render-to-texture for dependent texture accesses and all… --optimize:accurate on a GF2 MX would set up a huge multipass system to do the normalizations for all the vectors as well.)

I think you get the idea of what the problems are on TODAY'S hardware (a hand-coded sketch of the two-pass GF2-class case follows below).
On future hardware I don't see many problems… a generalized GPU like the P10 has no trouble implementing multipass and single-pass the same way…
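To make the hand-coded baseline concrete, here is roughly the two-pass GeForce2-class version being talked about above: a sketch assuming ARB_multitexture, ARB_texture_env_combine and ARB_texture_env_dot3, with normalMap, diffuseMap and drawGeometry() as placeholders.

/* Pass 1: write N.L into the framebuffer. Normal map on unit 0, light
   vector range-compressed into the per-vertex primary color (i.e.
   glColor3f(0.5f + 0.5f * L.x, ...) supplied by drawGeometry()). */
glDisable(GL_BLEND);
glActiveTextureARB(GL_TEXTURE0_ARB);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, normalMap);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB,  GL_DOT3_RGB_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB,  GL_TEXTURE);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_ARB,  GL_PRIMARY_COLOR_ARB);
drawGeometry();

/* Pass 2: modulate what is already in the framebuffer by the diffuse
   map, re-using the depth buffer laid down in pass 1. */
glColor3f(1.0f, 1.0f, 1.0f);          /* primary color no longer carries the light vector */
glBindTexture(GL_TEXTURE_2D, diffuseMap);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_MODULATE);
glEnable(GL_BLEND);
glBlendFunc(GL_DST_COLOR, GL_ZERO);   /* dst = dst * src */
glDepthFunc(GL_EQUAL);                /* only touch the pixels pass 1 wrote */
drawGeometry();
glDepthFunc(GL_LESS);
glDisable(GL_BLEND);

Multiply this out across specular, multiple lights and normalization, and the pass count and state juggling explode, which is the point being made above.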

Originally posted by davepermen:
Hum, you got me wrong (and no coffee here…).

Well in that case, DRINK SOME!
Seriously, you've made your point (something about 3D cards being hugely limited), but there are people who design the hardware discussing the 'future' of graphics co-processors here… not ranting on about the limitations of today's hardware… if people had your attitude then we'd still be using single-textured, flat polys with software T&L.

Originally posted by mcraighead:
I honestly see it as infeasible, or at least much worse in performance.

The only way I can imagine implementing it in the general case is to use a SW renderer. (And that’s not invariant with HW rendering!)

- Matt

No, it is not a problem to support this in hardware very efficiently.

In the OpenGL 2.0 spec it is a central aspect that the driver can split a complex shader program into multiple passes automatically.

This is based on auxiliary buffers for storing intermediate results. Current hardware does not support them, but it is no real problem to support them in hardware; OpenGL 2.0 is designed explicitly for it.

The OpenGL 2.0 designers know very well what they are doing. And Carmack is absolutely right: splitting complex programs automatically is the real quantum leap toward true hardware independence, so that you don't have to write different shader programs for different hardware.

For details see the OpenGL 2.0 specs.

A key concern, I think, is handling multipass transparently and efficiently. It's one thing to say you're going to use auxiliary buffers, quite another to thrash state behind the scenes between packets of data. But if you think about it, this might be a powerful argument FOR OpenGL 2.0. Implementations with big FIFOs (they all have them) can package this up at a granularity of their choosing after compiling to target their pass requirements and the persistent register count between passes (aux buffers). Other implementations with recirculation or more units can compile to a single pass.

Applications wrestle with this now. We have Carmack using destination alpha to store terms between multiple passes and losing color information because destination color is accumulating final results on one implementation, while doing it all in a single pass on another with a completely different API. There have been discussions you've already had where we've talked about using pbuffers and reprojecting fragment results back onto the database to get around these issues. This is not a pretty picture. There is no good solution today, not even a hand-coded one. In other words, applications don't have significantly better options today even with hand-crafted hacks like destination alpha as a persistent register (the magic sauce for carrying an extra value from one pass to the next). What is possible today is not the issue; we know it sucks wind.
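To make the destination-alpha hack concrete, a minimal sketch (draw calls are placeholders, and it assumes the pixel format actually has destination alpha): one pass parks a scalar term in the alpha channel, and a later pass pulls it back in as a blend factor. That single extra value is the whole "persistent register".

/* Pass A: write a scalar term (say, an attenuation or shadow factor)
   into destination alpha only, leaving color untouched. */
glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_TRUE);
glDisable(GL_BLEND);
drawAttenuationPass();                 /* placeholder: emits the term as its alpha output */
glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);

/* Pass B: add this light's contribution, scaled by whatever is
   sitting in destination alpha. */
glEnable(GL_BLEND);
glBlendFunc(GL_DST_ALPHA, GL_ONE);     /* color += src * dst.alpha */
drawLightingPass();                    /* placeholder */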

So the real debate was framed by Carmack quite well, although he pretty much dismissed the counter-arguments with a glib "get over it". His main desire (I think) is to write a single path without having to ask the hardware how he needs to split up his code to implement an effect; it's a drive for transparency. The question is how much that costs. Opinions differ, but I think the correct answer depends on your timescale. Neither option is particularly great TODAY; Cg has a slight edge because it's intentionally dysfunctional with the promise of better in future, but long term, where should we be headed?

Just as an interesting aside, at what point do Cg and OpenGL 2.0 converge?

[This message has been edited by dorbie (edited 07-02-2002).]

Originally posted by knackered:
Well in that case, DRINK SOME!
Seriously, you've made your point (something about 3D cards being hugely limited), but there are people who design the hardware discussing the 'future' of graphics co-processors here… not ranting on about the limitations of today's hardware… if people had your attitude then we'd still be using single-textured, flat polys with software T&L.

If you know me, you know I'm the last person who wants single-textured flat polys with software T&L, and it's a shame to be accused of thinking that.

He asked for driver developers to do it on today's hardware, and I just want to show how nearly impossible that is.

For future hardware and future hardware design it's no problem at all, that's logical. And I can't wait for the future hardware, because, well… with my GF2 MX I don't have many of the new fancy features… but anyway, much of it is possible even on this hardware; to code for it, though, you have to do it manually…

For future hardware the framebuffer should optionally be bindable in the fragment shader; that would solve many of today's issues (blending with register combiner power…), along with multiple buffers to draw into. But anyway… we'll see what comes. As the hardware gets more generally programmable, multipass won't really be something we touch anymore. Possibly. But not handled by the API itself, though… instead by some higher-level interface (Cg?) which does this. Direct hardware access should still be there; the rest belongs in some D3DX equivalent (GLU, GLUT, etc… a rebirth of them).

And no coffee for me, as long as I can stay awake without it…

Framebuffer as texture register solves part of the problem, but you need more than one framebuffer, and I don't just mean a pbuffer or something; you need to simultaneously OUT multiple values from texture units in a single pass to multiple destination colors. Depth should be handled similarly.

As a personal preference:
at this point your combiner (or whatever you call it) replaces the blend equation / blend func hardware completely, and you use destination color as a texture register to implement what we now call blendfunc, plus other goodies.

That would seem quite clean programmatically and would avoid the inevitable glMultiBlendFuncEXT crap someone is bound to ask for; it also makes framebuffer fragment processing another part of a texture unit, basically eliminating a big chunk of orthogonal and increasingly redundant functionality while bringing texture functionality like DOT3 to framebuffer blending.

The framebuffer just becomes one of several optional ‘persistent registers’.

Even if blending is just a special 'final combiner' initially, I'd like to see this be the direction things move in.
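Speculating in code for a moment, and to be clear this is invented notation rather than any real or proposed extension, the "destination color as just another register" idea might look something like:

/* Hypothetical only: blending folded into the combiners, with the
   framebuffer exposed as one more (read-only) input register.
   None of these tokens or entry points exist. */
glFinalCombinerInputHYP(GL_VARIABLE_A_HYP, GL_DST_COLOR_REGISTER_HYP, GL_EXPAND_NORMAL_HYP);
glFinalCombinerInputHYP(GL_VARIABLE_B_HYP, GL_TEXTURE0_ARB,           GL_EXPAND_NORMAL_HYP);
glFinalCombinerOutputHYP(GL_DOT3_RGB_HYP,  GL_FRAMEBUFFER_HYP);
/* i.e. framebuffer = dot(framebuffer, tex0): the "DOT3 blend" that
   glBlendFunc can never express, done as just another combiner stage. */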

Yeah… finally dropping the stencil buffers, alpha tests, depth tests and so on… instead we then have two RGBA buffers (both with 32 bits per component), to which we can render and from which we can read as we want… (meaning buffer2.a == depth value, buffer2.b == shadow volume count and such). Just generalize it to say we have, say, 32 textures max and, say, 8 independent grayscale 32-bit buffers (one buffer with 8 values in it, but that's a hardware implementation detail), and the first 3 of them are the screen RGB; use the rest for whatever you want…

That way we could do fancy depth-buffer shadowing with the second depth, and more or less order-independent transparency (independent per mesh is quite important).

Am I only dreaming?…

And then… dropping the rasterizer in the middle and letting the vertex shader output to the screen…

And finally, support for rendering tetraedric objects… (glBegin(GL_TETRAEDER); glEnd()).
That means rendering 4 triangles at the same time from 4 defined points, getting both the min and max depth value in the combiners, and "clamping" them into the framebuffer depth range (i.e. if max > frame.depth, then max = frame.depth). That way we could finally render real volumetric objects; fog and such, no problem anymore…

But I think that is still FAAAAAAAAAAAR away… Still, as filtering and sampling get programmable, with programmable anisotropic filtering you could already sample along a line, meaning you could sample through a volumetric texture along a line and get the result. Rendering real volumetric textures… that would be cool…

I was thinking of more than two mere color buffers. Color would just be one use of a whole bunch of general persistent registers.

As for drawing fog, I assume you mean "tetrahedra". You should look at the SGI Volumizer API; you might find it interesting. If you can output the depth value to a persistent register you can do what you want without multiple simultaneous source fragment generation. In any case, the fog volume stuff has already been done for arbitrarily shaped volumes (or the equivalent) on current hardware; there are several tricks that make this possible. Storing intermediate results to auxiliary buffers, which could then use dependent reads to apply an arbitrary function, would make it even simpler to implement. You could even wrap it in a tetrahedra interface, but it would be inefficient. Polyhedra for homogeneous or textured fog (and other gaseous phenomena) would do.

This link may help: http://www.acm.org/jgt/papers/Mech01/

[This message has been edited by dorbie (edited 07-02-2002).]

Framebuffer as texture register solves part of the problem, but you need more than one framebuffer, and I don't just mean a pbuffer or something; you need to simultaneously OUT multiple values from texture units in a single pass to multiple destination colors.

This is exactly what OpenGL 2.0 does
-> see OpenGL 2.0 specs.

Depth should be handled similarly.

As a personal preference:
at this point your combiner (or whatever you call it) replaces the blend equation / blend func hardware completely, and you use destination color as a texture register to implement what we now call blendfunc, plus other goodies.

This is also exactly what OpenGL 2.0 does
-> see OpenGL 2.0 specs.

But note that in OpenGL 2.0 there are still the standard depth test and blending units, which can be combined with fragment shaders for performance reasons, since these fixed-function units are much faster. For example, the fixed depth test can get a major speed-up from a hierarchical z-buffer.


That would seem quite clean programmatically and would avoid the inevitable glMultiBlendFuncEXT crap someone is bound to ask for; it also makes framebuffer fragment processing another part of a texture unit.

No, accessing textures is fundamentally different from accessing the framebuffer or aux buffers. For example, think of the gradient and filtering aspects.

Why, folker? Both are just arrays of pixels/texels. You can filter bilinearly from the framebuffer as well…
What I want is just to drop the framebuffers/pbuffers and textures and instead have just one thing (roughly what DX does):
gl2BindDrawBuffer(GL2_RGBA,texID);
gl2BindDrawBuffer(GL2_DEPTH,tex2ID);
drawing onto them();
gl2Finalize(texID);
gl2Finalize(tex2ID);

Or so… and then use them, draw onto them, bind them, read from them, etc…

And actually dorbie means that the framebuffer gets its values into the register combiners as per-pixel constants, meaning simply the same data you get in the blending equation, but already in the register combiners… (that's what I thought about, at least; I think dorbie did as well)

He asked for driver developers to do it on today's hardware, and I just want to show how nearly impossible that is.

I know it's probably hard to emulate a lot of shaders on today's HW. But by using a render target as an intermediate result, putting restrictions on which TMUs can depend on each other, etc., you can almost pull it off in a lot of cases.

Yeah… finally dropping the stencil buffers, alpha tests, depth tests and so on… instead we then have two RGBA buffers (both with 32 bits per component), to which we can render and from which we can read as we want…

Ohh, something I've been wanting for a while. For some things I need to do, I need two z-buffers, and I need to test against each one using different comparisons.

Rather than special-casing all these buffers (stencil, alpha, z-buffer, like you say), we will eventually need access to the framebuffer in the pixel shader pipe. Matter of fact, just get rid of the term "framebuffer": set a texture as the "active" render target in place of the active framebuffer. I could write my own stencil shader code in that case.

We can sort of do this now using render targets, but it’s not quite there yet.
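A sketch of the "sort of do this now" path, using nothing beyond copy-based render-to-texture (intermediateTex is assumed to be an already-allocated RGBA texture of the right size, and the draw calls are placeholders): pass 1 renders an intermediate term, it gets copied into a texture, and pass 2 reads it back as an ordinary texture input.

/* Pass 1: render the intermediate term into the back buffer. */
drawIntermediateTerm();                /* placeholder */

/* Snapshot it into a texture; there is no real "bind the framebuffer
   as a texture" yet, so this is a copy. */
glBindTexture(GL_TEXTURE_2D, intermediateTex);
glCopyTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 0, 0, width, height);

/* Pass 2: continue the "shader", with the intermediate result now
   available as a normal texture input on some unit. */
glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
glActiveTextureARB(GL_TEXTURE1_ARB);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, intermediateTex);
drawFinalTerm();                       /* placeholder */

The copy and the wholesale state change between passes are exactly the kind of thing a driver would have to schedule behind your back if it split shaders automatically.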

[This message has been edited by John Pollard (edited 07-02-2002).]

Originally posted by davepermen:
Why, folker? Both are just arrays of pixels/texels. You can filter bilinearly from the framebuffer as well…
What I want is just to drop the framebuffers/pbuffers and textures and instead have just one thing (roughly what DX does):
gl2BindDrawBuffer(GL2_RGBA,texID);
gl2BindDrawBuffer(GL2_DEPTH,tex2ID);
drawing onto them();
gl2Finalize(texID);
gl2Finalize(tex2ID);

Or so… and then use them, draw onto them, bind them, read from them, etc…

And actually dorbie means that the framebuffer gets its values into the register combiners as per-pixel constants, meaning simply the same data you get in the blending equation, but already in the register combiners… (that's what I thought about, at least; I think dorbie did as well)

Such a universal design would be possible, of course, but you would basically give up the performance advantages of GPUs over CPUs. The reason GPUs are much faster at 3D rendering than CPUs is that they do NOT have such a universal design.

Both texture access and framebuffer writing are optimized by taking advantage of the fact that there are no side effects, which is the key to massive pipelining and massive parallelization (16, 32, 64 texture access units / fragment units in parallel in hardware, etc.): textures are accessed randomly (optimized by texture swizzling), but are read-only -> no side effects. The framebuffer is read-modify-write, but (basically) linear, which is the reason fragment programs can access only one pixel and not neighboring pixels. If you had random read-modify-write access to textures or the framebuffer or whatever memory, you would basically give up the performance advantage of GPUs and be back to CPU software rendering. And there are no advantages compensating for this disadvantage.

The clever trick of 3D in hardware is to use techniques which can be implemented much faster than on ordinary CPUs (massive parallelization), while still trying to be as flexible as possible (-> vertex and fragment shaders).

Originally posted by dorbie:
Framebuffer as texture register solves part of the problem, but you need more than one framebuffer, and I don't just mean a pbuffer or something; you need to simultaneously OUT multiple values from texture units in a single pass to multiple destination colors. Depth should be handled similarly.

From what is already known about DX9 (see the link below), DX9-class hardware is going to support simultaneous writing to up to 4 render targets from one pixel shader (one z/stencil for all). The z/stencil test is the only thing that will work at the per-pixel level - there will be no blending, alpha test, etc. In my opinion that makes the whole multi-render-target idea pretty useless.
http://download.microsoft.com/download/whistler/WHP/1.0/WXP/EN-US/WH02_Graphics02.exe

[This message has been edited by coop (edited 07-02-2002).]

Why would you want to bilinearly filter from the framebuffer? This is not a good idea at all. First, how are you supposed to know what is in the framebuffer other than what is directly below the pixel you are rendering? Rasterizers process multiple pixels at a time in parallel, so sampling anything but the pixel directly underneath the one you are rendering would make it impossible to even know what you are going to get.

Explain to me: how do you sample an image that is in the process of being rendered by 4 or more parallel pixel pipelines?

It makes much more sense to render to a texture and then sample from that to do interesting filters. At that point you know exactly what you are sampling. Just look at the ATI demos of things like edge detection; they say not to be afraid of render-to-texture anymore.
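A sketch of that filter path, assuming the scene has already been copied into sceneTex (e.g. with glCopyTexImage2D), that ARB_texture_env_combine is available, and that the modelview/projection are identity so the quad covers the viewport (sceneTex and texWidth are placeholders): bind the same texture on two units, offset one set of texcoords by a texel, and subtract. That gives a crude (clamped) horizontal gradient, and because the copy finished before this pass started, you know exactly what you are sampling.

/* Unit 0: the copied scene, sampled at the pixel itself. */
glActiveTextureARB(GL_TEXTURE0_ARB);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, sceneTex);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE);

/* Unit 1: the same texture, sampled one texel to the right and
   subtracted from unit 0's result (clamped at zero). */
glActiveTextureARB(GL_TEXTURE1_ARB);
glEnable(GL_TEXTURE_2D);
glBindTexture(GL_TEXTURE_2D, sceneTex);
glTexEnvi(GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_COMBINE_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_COMBINE_RGB_ARB, GL_SUBTRACT_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE0_RGB_ARB, GL_PREVIOUS_ARB);
glTexEnvi(GL_TEXTURE_ENV, GL_SOURCE1_RGB_ARB, GL_TEXTURE);

/* Full-screen quad; unit 1's texcoords are shifted by one texel. */
const float x[4] = { -1.0f,  1.0f, 1.0f, -1.0f };
const float y[4] = { -1.0f, -1.0f, 1.0f,  1.0f };
const float u[4] = {  0.0f,  1.0f, 1.0f,  0.0f };
const float v[4] = {  0.0f,  0.0f, 1.0f,  1.0f };
glBegin(GL_QUADS);
for (int i = 0; i < 4; ++i) {
    glMultiTexCoord2fARB(GL_TEXTURE0_ARB, u[i], v[i]);
    glMultiTexCoord2fARB(GL_TEXTURE1_ARB, u[i] + 1.0f / texWidth, v[i]);
    glVertex2f(x[i], y[i]);
}
glEnd();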