Carmack .plan: OpenGL2.0

Originally posted by mcraighead:
It does you no good to write a single shader that runs on all platforms, but runs at 30 fps on the newest hardware and 0.1 fps on all the old hardware!
(…)
Ideally, your app runs at the same framerate on all hardware, but the complexity of the shading or the scene in general changes.

Why not introduce an LOD concept into shading? Since we now have mipmaps for lookups, we could get “mip shaders” in a future GL release. Then one shader could run at the same framerate on all platforms, given the proper LOD bias…
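
(For reference, the mipmap LOD bias I’m drawing the analogy from already exists today as EXT_texture_lod_bias - roughly like this; the “shader LOD” knob itself is purely hypothetical, of course:)

    #include <string.h>
    #include <GL/gl.h>
    #include <GL/glext.h>   /* GL_TEXTURE_FILTER_CONTROL_EXT, GL_TEXTURE_LOD_BIAS_EXT */

    /* Today's mipmap LOD bias: a positive bias nudges the hardware toward
       smaller (blurrier) mip levels, trading quality for speed. A "shader LOD
       bias" would be the analogous knob for shading complexity. */
    void SetTextureLodBias(float bias)
    {
        const char* ext = (const char*)glGetString(GL_EXTENSIONS);
        if (ext && strstr(ext, "GL_EXT_texture_lod_bias"))
            glTexEnvf(GL_TEXTURE_FILTER_CONTROL_EXT, GL_TEXTURE_LOD_BIAS_EXT, bias);
    }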

Julien.

Originally posted by Nutty:
On the 3Dlabs P10-based cards?

I thought Carmack was talking about NV prototype extensions. Maybe you’re right and I’ve misunderstood what he’s written.

Can someone clearly define the problem of passing multipass rendering over to the driver? I’m having trouble deciphering what is being said (not having much insight into low-level hardware gymnastics)…

As far as I can see:
The output of pass 1 must be available as the input to pass 2 in exactly the same form as the outputs between existing texture stages in a single pass - i.e. float texture coordinates for all units, float fragment registers… etc.
Is this correct?

Theoretically, the GL2.0 prototypes have not been reviewed by the ARB yet. But 3Dlabs must have some beta versions for developers, even though I didn’t find them on their developer site. Does anyone know where one could find them, together with the specs?

I have to partially agree with Matt here. While getting functions to automatically multipass on old hardware is a pretty thought, it is far from practical today. I see no way to make a shading setup reliably take advantage of a GeForce 4 while still scaling back to a TNT. It’s just not reasonable, because of the completely different sets of functionality.

However, I think we need to (at some point) stop writing a dozen different shader paths for everything. As I see it, the hardware coming up (GL2/DX9 compliant) should be flexible enough to do most anything we need to throw at it for quite a while. That is where we need to begin to simplify things, from that point forward. We should be able to just throw a shader at it, have it calculate intermediate results to a p-buffer if required, and render the final result to the frame buffer as if it were a magical single pass.

As far as dealing with supported features, we could still use some sort of caps bit and select how each feature is to be implemented. Something like:
I need to take the log7 of each fragment (I’m just making this up, OK). I see the hardware supports logn (any base). Well, I’ll just use that. No wait, it only supports log10. Then I want the hardware to calculate log10(fragment)/constant_value(log10(7)). Oh wait, you mean the hardware has no log support? OK, then I want it to look it up from this texture table using a dependent read. Etc.

The point is, we would get the ability to make simple, confined decisions. I can do a simple:
if (supported(LOG_BASE_N)) apply(LOG_BASE_N)
else if (supported(LOG_BASE_10)) apply …
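
Fleshed out a little on the CPU side, the fallback chain might look like this (everything here is made up, of course - the caps enum, the table, the helper name; the only real math is log7(x) = log10(x)/log10(7)):

    #include <math.h>

    /* Hypothetical caps query result; in practice this would come from the driver. */
    enum LogCaps { LOG_ANY_BASE, LOG_BASE_10_ONLY, NO_LOG_SUPPORT };

    /* One confined decision: how to get log base 7 of a value in [0,1]. */
    float logBase7(float x, enum LogCaps caps, const float lookupTable[256])
    {
        const float INV_LOG10_7 = 1.0f / 0.845098f;   /* 1 / log10(7) */

        switch (caps) {
        case LOG_ANY_BASE:
            return (float)(log(x) / log(7.0));        /* "native" any-base log */
        case LOG_BASE_10_ONLY:
            return (float)log10(x) * INV_LOG10_7;     /* log10(x) / log10(7) */
        default: {
            int i = (int)(x * 255.0f);                /* stand-in for a dependent  */
            if (i < 0) i = 0; if (i > 255) i = 255;   /* texture read into a table */
            return lookupTable[i];
        }
        }
    }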

Each decision would be confined to a single feature. We would no longer have to worry about the combinatorial explosion that we do today. Today we say: oh, that’s simple, I’ll do this calculation instead… but now I need to break it into 2+ passes. So here is an optimized 4-texture version, a 6-texture version, an 8-texture version, a 4-texture version in case this other feature isn’t supported and we have to break into 3 passes, etc.

So again, it’s still on us to decide how to do each little part of the equation, but how to get that equation into one or more passes is the driver’s responsibility.

I don’t understand the invariance issues you are talking about, Matt. If I pass the hardware a single x/y/z with 25 sets of texture coordinates, the hardware should be able to keep coming up with the same fragment Z for every pass. I understand some hardware isn’t invariant between two different render setups, but I think maybe we need to progress to the point where we say it does have to be, and the IHVs make it happen. I personally think it’s kind of ridiculous how, for some cards, even making some stupid little render setup change means you lose invariance. Maybe there is a very real reason why hardware does it that way, but “because that’s the way it has to be done” isn’t a good enough excuse for me. Make the hardware so that it doesn’t have to be done that way.

So maybe GL2/DX9 cards won’t have all the features necessary to do this, but we need to make it a top priority to say that in the next couple of revisions, we need to get these features in so we CAN do this for the long run.

The invariance issues that Matt is talking about make a lot of sense, but he’s talking about implementing this stuff on hardware that isn’t flexible enough.

If I’m not mistaken, the transition from OpenGL 1.3-1.4 to OpenGL 2.0 is done via new extensions. I think that’s what John Carmack is doing with DOOM 3. It’s just like what we’re doing today - seriously, are you using extensions or do you require full GL 1.3 support? I still bind extensions and require only GL 1.1. That’s the sort of transition I’m thinking about.
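
For instance, binding ARB_multitexture the old-fashioned way looks roughly like this (Windows/wgl shown; other platforms use glXGetProcAddressARB, and a proper check should match whole extension-name tokens rather than substrings):

    #include <string.h>
    #include <windows.h>
    #include <GL/gl.h>

    /* Function pointer for the ARB_multitexture entry point. */
    typedef void (APIENTRY *PFNACTIVETEXTUREARB)(GLenum texture);
    static PFNACTIVETEXTUREARB pglActiveTextureARB = 0;

    int BindMultitexture(void)
    {
        const char* ext = (const char*)glGetString(GL_EXTENSIONS);
        if (!ext || !strstr(ext, "GL_ARB_multitexture"))
            return 0;                               /* fall back to single texturing */

        pglActiveTextureARB =
            (PFNACTIVETEXTUREARB)wglGetProcAddress("glActiveTextureARB");
        return pglActiveTextureARB != 0;
    }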

With the right hardware, I think Matt would agree that driver multipass is a good thing (right, Matt?).

A few posts ago, when Cg first came out, I made a comment similar to the one John C made (of course, I’m no John C) regarding having the API/driver/whatever break complex shaders into multiple passes automatically.

I think the gist of all of this is that the current state of affairs regarding shaders is a mess. It’s possible to make cool shaders, but there is no guarantee that they’ll work across a wide range of hardware, and one has to program for DX shaders, NV OGL extensions, and ATI OGL extensions.

Perhaps hardware vendors need to go back to the drawing board and rethink their designs so that future hardware handles arbitrarily complex shaders and becomes more of a general-purpose processor with some specializations for fast fill rate, texture lookups, AA and other things.

Matt,

First, on the normalize thing: that was an example I used because I felt it would be one just about everyone here is familiar with. Now, with the GL2 shaders as defined, you could still make the decision to bind a normalizer cubemap if you wanted. Also, your arguments about precision are somewhat moot. The whitepapers/specification suggest setting minimal ranges/precisions for operations, which a conformant implementation would be required to comply with. As for the scientific usage issue, I would argue that those users would require additional info from implementors just as they do with today’s requirements. GL only requires floats with 17-bit-accurate mantissas for vertex transformations today.

On the multipass issue, you vastly oversimplified things. The requirements stated in the proposals would require that the results be as if the data went into the buffer in a single pass. This means that in the case of blending, an F-buffer-style implementation or writing to an intermediate buffer may be required. This would be a burden on the driver in the short term, and additionally, it would occasionally have the characteristic that ISVs would know that going beyond some limit is not a good idea on generation X of HW. This will always exist, as it is going to be quite a while before we can run million-instruction programs per fragment at 60 Hz.

You also missed my point about shaders working across multiple pieces of HW. I agree, less aggressive shaders will be necessary. On the other hand, as someone else mentioned, it then becomes a problem of writing a few less aggressive shaders rather than one for every single permutation of HW just to run as effectively as possible. Also, it allows the heaviest shaders at ship time to run as best they can on HW released after ship. I bet you have come across a fair number of apps, released before the GeForce3, that only use 2 textures per pass for what is only a 3- or 4-texture effect. I would wager a fair sum you would like an easy mechanism to collapse those into a single pass.

Now, I’ll stand up and say that I don’t necessarily agree with everything in the GL2 white papers or all the things that get said about them. It is not a cure-all, even if it sometimes gets presented that way. Yes, there is still going to be plenty of work for app developers and artists. I do think that some of what has been said here rests on misconceptions (such as the blending thing and the applicability to old HW), and I do want to make sure that this stuff gets represented in the correct light.

As for the discussion I have seen about developers needing to control exactly what is going on, I would argue that is a real stretch. Few people today program full applications in ASM, and even those who do on x86 machines aren’t really programming to the HW anymore anyway. (The instructions are decoded to micro-ops and …) Then there are systems like Java and MSIL (the .NET VM). (Now, this sort of thing gets into a holy war I don’t want, so just take this at face value.) These are ways of distributing a program that runs (reasonably) well across multiple processor architectures. I would argue that the challenge of shipping shader code that runs across NVIDIA, ATI, 3Dlabs, Matrox, Intel, etc. is very similar to the challenge of shipping code that runs well across x86, MIPS, DragonBall (Palm processors?), SPARC, etc.

As for your disagreement, that is fine; as I stated, I don’t necessarily agree with everything either. This is still a proposal, and it is up for debate. I think the overall concept is quite sound. As for your retirement from graphics, Matt, I would hate to see you go so soon.

-Evan

IT,

As for the shame of the multitude of shader interfaces, I think all the IHVs agree on this. The ARB recently adopted a standard vertex program extension. There is also a push for a standard fragment program extension.

These serve a near-term need to get stuff out the door. The GL2 stuff is a longer-term effort to virtualize the resources used by shaders.
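
To give a flavour of the near-term piece, a bare-bones ARB vertex program bind looks something like this, assuming the extension entry points have already been fetched (via wglGetProcAddress and friends) or are exposed through GL_GLEXT_PROTOTYPES; real code should also check GL_PROGRAM_ERROR_POSITION_ARB after loading the string:

    #define GL_GLEXT_PROTOTYPES   /* or fetch the ARB entry points at runtime */
    #include <GL/gl.h>
    #include <GL/glext.h>
    #include <string.h>

    /* A deliberately boring program: transform by the MVP matrix, pass the color. */
    static const char* vp =
        "!!ARBvp1.0\n"
        "PARAM mvp[4] = { state.matrix.mvp };\n"
        "DP4 result.position.x, mvp[0], vertex.position;\n"
        "DP4 result.position.y, mvp[1], vertex.position;\n"
        "DP4 result.position.z, mvp[2], vertex.position;\n"
        "DP4 result.position.w, mvp[3], vertex.position;\n"
        "MOV result.color, vertex.color;\n"
        "END\n";

    void BindBasicVertexProgram(void)
    {
        GLuint prog;
        glGenProgramsARB(1, &prog);
        glBindProgramARB(GL_VERTEX_PROGRAM_ARB, prog);
        glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                           (GLsizei)strlen(vp), vp);
        glEnable(GL_VERTEX_PROGRAM_ARB);
    }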

I don’t know exactly what Carmack is trying to say, but I can tell you what I hope he’s trying to say, and what I would like to see.

I would like to see an infinite number of simultaneous texture stages supported. When I need something done in 5 passes, and only have 4 texture stages, I have to draw the primitive a second time, to take care of the 5th pass. But sometimes this isn’t possible, because the math is impossible to work out, because the last pass is dependent on data in the framebuffer, which is screwed up by passes of previously rendered objects (I know there are ways around this in some cases, but it’s just annoying).

The solution? The HW should easily be able to support an infinite number of texture units. Let’s say I need to do 7 passes, but have 4 texture units. The card would do the first 4 passes (simultaneously), store the result somewhere, then do the last 3 passes (simultaneously), then combine these results and write to the frame buffer. This avoids a lot of re-transforming of the objects, and is just plain easier to code for.

There is no excuse that I can think of for not supporting this. You still supply all the textures, math, etc. The driver only simulates more texture units than it actually has. That’s it.

There is no excuse that I can think of for not supporting this. You still supply all the textures, math, etc. The driver only simulates more texture units than it actually has. That’s it.

Why make the driver do that work? It’s nothing the end user can’t do easily enough. Not only that, I don’t like the idea of the driver spending the time/resources that it takes to perform this operation. I would much rather know that the hardware only supports 4 multitextures and simply opt for a different shader than waste precious time doing some multipass algorithm on older hardware.
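
All I really want to write is something like this (SelectFourTexturePath and SelectCheapPath are just placeholders for my own setup code):

    #include <GL/gl.h>
    #include <GL/glext.h>   /* GL_MAX_TEXTURE_UNITS_ARB */

    void SelectFourTexturePath(void);   /* placeholder: the full-quality setup */
    void SelectCheapPath(void);         /* placeholder: the cheaper-looking fallback */

    /* Ask the hardware what it has and pick a shader myself, instead of letting
       the driver quietly turn the effect into multipass behind my back. */
    void ChooseShaderPath(void)
    {
        GLint maxUnits = 1;
        glGetIntegerv(GL_MAX_TEXTURE_UNITS_ARB, &maxUnits);

        if (maxUnits >= 4)
            SelectFourTexturePath();
        else
            SelectCheapPath();
    }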

Why make the driver do that work? It’s nothing the end user can’t do easily enough. Not only that, I don’t like the idea of the driver spending the time/resources that it takes to perform this operation. I would much rather know that the hardware only supports 4 multitextures and simply opt for a different shader than waste precious time doing some multipass algorithm on older hardware.

Well, I don’t want the hardware to have to transform and clip the geometry all over again each time I have to do extra passes. We are dealing with 50-80k+ tri scenes in DNF, and it gets expensive REALLY fast.

Second, sometimes it’s just not possible to get the math right unless you use a render target to store temp results. You can also use the alpha channel to store temp results, but that’s crazy; the driver can do a much better job in this case. They have the transformed geometry; they just need to make several passes on this data at a low level.

Think of it like memcpy. It subdivides the bytes into ints, then words, then takes care of the leftover bytes. But in this case, the HW would subdivide the work by how many texture units it has.
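
Roughly this kind of subdivision (purely illustrative - “stage” here is just an index the driver would map to real texture bindings):

    /* memcpy-style subdivision: split N requested texture stages into as many
       hardware passes as needed, each no wider than the physical unit count. */
    int SplitIntoPasses(int requestedStages, int physicalUnits,
                        int firstStagePerPass[], int stagesPerPass[])
    {
        int pass = 0;
        int start = 0;
        while (start < requestedStages) {
            int count = requestedStages - start;
            if (count > physicalUnits)
                count = physicalUnits;          /* take a full-width chunk */
            firstStagePerPass[pass] = start;
            stagesPerPass[pass]     = count;    /* leftover chunk on the last pass */
            start += count;
            ++pass;
        }
        return pass;                            /* e.g. 7 stages on 4 units -> 2 passes */
    }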

Don’t forget, though: you could still decide to do it yourself if you need to, but I doubt you would.

Can you give me an example where you’d rather do it on your own, because the hardware couldn’t do a better or similar job? Maybe I’m just overlooking something, so I’d like to hear more opinions.

Originally posted by John Pollard:
…When I need something done in 5 passes, and only have 4 texture stages, I have to draw the primitive a second time, to take care of the 5th pass.

What do you mean? If you do 5 passes you need to transform the geometry 5 times. Is there a different interpretation of the term ‘pass’ in Direct3D?

Well, it would be 5 passes with 1 texture unit, or 1 pass with 5 texture units.

A pass, in this case, is how many DrawPrimitive calls you had to make to pull off the effect (sorry to use Direct3D terminology).

When I need something done in 5 passes, and only have 4 texture stages, I have to draw the primitive a second time, to take care of the 5th pass

Heh, I can see where this was confusing. Sorry, trying to BBQ and do this at the same time.

A better way to say this would be:

When I have an effect that requires 5 TMUs, but only have 4 TMUs to use, I will have to do it in 2 passes.

They have the transformed geometry; they just need to make several passes on this data at a low level.

I don’t know where you got the idea that the driver has the transformed data, but that is incorrect. On every card since the GeForce1, when rendering with hardware T&L/vertex programs, the driver has no access to the post-T&L results. Therefore, in order to multipass, it will have to retransform the verts.

Not only that, more complex shader algorithms need special vertex shader code to mesh with the various textures in the pixel shader. The driver would have to take your shader and break it into two pieces based on which data is necessary for which textures/math ops in the pixel shader for that pass. In all likelihood, on hardware that causes a shader to fall back to multipass, you’re going to have to send different vertex data to each pass (different sets of texture coordinates, vertex colors, etc).

Also, you’re making a fundamental assumption. You’re assuming that I want the more powerful shader program run on any hardware, regardless of the cost. It may require a render-to-texture op coupled with a blitting operation. Not only that, a 5-texture algorithm on hardware that only has 4 texture units may actually require 3 (or potentially more) passes, depending on what I do with those textures and how I combine them in the shader. Each pass will need its own vertex and pixel shader code, which has to be generated on the fly from the shader code passed in.

Given that I might be writing a high-performance application (like a game), I may not want to pay the cost of a 3-pass algorithm when I could use something that doesn’t look as good but is cheaper. Not only that, I have no idea how long a particular shader is going to take; therefore, I don’t get to have fallback options for particular hardware. In short, I still have to code for the lowest common denominator, since coding for the high end guarantees that low-end users will get horrific performance.

Originally posted by John Pollard:
We are dealing with 50-80k+ tri scenes in DNF, and it gets expensive REALLY fast.

Yeah, but by that time…

I don’t know where you got the idea that the driver has the transformed data, but that is incorrect. On every card since the GeForce1, when rendering with hardware T&L/vertex programs, the driver has no access to the post-T&L results. Therefore, in order to multipass, it will have to retransform the verts.

Someone has the data; I really don’t care who. I just know I pass it in. I’m not saying things wouldn’t need to be rearranged to support this.

you’re going to have to send different vertex data to each pass

Yes, exactly. I can send UV vertex data for 8 TMUs, even though the HW only has 4 TMUs. This means that on the first pass, the HW would interpolate the first 4 UV sets, and on the last pass, interpolate the last 4. Not all that complicated.
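
Today I have to do that split by hand, something like the sketch below (plain OpenGL with ARB_multitexture; I’m assuming the two passes combine additively, and I’m leaving out rebinding the 4 units to the second set of textures - both depend entirely on the effect):

    #define GL_GLEXT_PROTOTYPES     /* or fetch glClientActiveTextureARB at runtime */
    #include <GL/gl.h>
    #include <GL/glext.h>           /* GL_TEXTURE0_ARB, glClientActiveTextureARB */

    /* One position array, 8 UV sets, 4 physical units: pass 1 feeds UV sets 0-3,
       pass 2 feeds UV sets 4-7 and is blended on top of the first pass. */
    void DrawWithEightUVSets(const float* xyz, const float* uv[8],
                             const unsigned short* indices, int indexCount)
    {
        int pass, unit;

        glVertexPointer(3, GL_FLOAT, 0, xyz);
        glEnableClientState(GL_VERTEX_ARRAY);

        for (pass = 0; pass < 2; ++pass) {
            for (unit = 0; unit < 4; ++unit) {
                glClientActiveTextureARB(GL_TEXTURE0_ARB + unit);
                glEnableClientState(GL_TEXTURE_COORD_ARRAY);
                glTexCoordPointer(2, GL_FLOAT, 0, uv[pass * 4 + unit]);
            }
            if (pass == 1) {                /* second pass: add onto the first */
                glDepthFunc(GL_EQUAL);
                glEnable(GL_BLEND);
                glBlendFunc(GL_ONE, GL_ONE);
            }
            /* The geometry goes down the pipe (and gets transformed) twice -
               which is exactly the objection above. */
            glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_SHORT, indices);
        }

        glDisable(GL_BLEND);
        glDepthFunc(GL_LESS);
    }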

You’re assuming that I want the more powerful shader program run on any hardware, regardless of the cost. It may require a render-to-texture op coupled with a blitting operation.

All the HW has to do is subdivide the shader into parts. Each part would go through the shader engine, and then the parts would get combined. I can’t think of any scenario where there wouldn’t be a solution. It just might be a little slow, worst case.

But it means less work for me, and it probably isn’t going to be any slower than any fallback case I would need to write to support that effect anyhow. In fact, I don’t see how it could be slower, since a lot of the work is no longer being duplicated.

Given that I might be writing a high-performance application (like a game), I may not want to pay the cost of a 3-pass algorithm when I could use something that doesn’t look as good but is cheaper.

You can still do this. Nothing is stopping you from coding the way you always have.

Though there is the drawback of not being able to calculate the number of cycles a particular shader is going to take (in the case where the HW had to subdivide the workload). But that is a luxury, and I can handle that. When coding, I would calculate cycles based on the assumption that the HW didn’t need to switch to fallback mode. Of course, I would also allow the user to turn a feature off if his machine didn’t have the goods.

I just think it would be really cool to write a shader targeted at the GF3, but still have it work on a GF1 (at a price, of course).

I guess I’m in dreamland, though, and I’ll just have to keep writing 10 different code paths to support the different combos of cards. Such is life…

Originally posted by Ozzy:
I thought Carmack was talking about NV prototype extensions. Maybe you’re right and I’ve misunderstood what he’s written.

John Carmack was kind enough to write some OpenGL2 shaders last week and give them a try on a Wildcat VP (the official name for a P10 board), as he stated in his .plan file.

The standard drivers that ship with a Wildcat VP do not have OpenGL2 support, the reason being that our OpenGL2 implementation is still in the early phases and we do not want to mix it with production-quality drivers. But I’ll be happy to provide an OpenGL2-capable driver to anyone with a Wildcat VP board who wants to experiment a bit. Just drop me an email.

At the ARB meeting a week and a half ago we presented a plan for the ARB to work on getting parts of the OpenGL2 white paper functionality into OpenGL in incremental steps. We presented 3 extensions and the OpenGL2 language specification. The extensions handle vertex programmability, fragment programmability, and a framework to handle OpenGL2-style objects. As a result, a working group was formed by the ARB, headed by ATI. Jon Leech should post the ARB minutes on www.opengl.org shortly, if he hasn’t already.

Barthold,
3Dlabs

John,

I think you’re on the right track. While both the NV20/25 and the R200 support some kind of loopback to extend their texture stages, one could still argue that it’s just ‘pipe combining’ and you can’t go over the total number of physical TMUs (of all pipes combined).
That’s the easy way to do it, I believe, but what do I know about these chips, really …

But proof of the existence of true loopback is the Kyro II. OK, no promoting or bashing here, we know it all. But this little thing does 8x loopback in D3D and 4x in OpenGL. And I could well imagine that it’s an arbitrary driver limitation that it doesn’t support unlimited loopback (or perhaps “close to unlimited”).

So I wonder, why can’t the big dogs do that?