Response to "Cg won't work" op-ed in The Register

Disclaimer: Note that I haven’t worked at all on Cg. Also keep in mind that the below are all personal opinions, not necessarily reflecting the views of my employer. Indeed, I’m actually on vacation from work right now, after my recent college graduation…

Eric, your post reminds me of an interesting point – the “ASM” vs. “HLL” issue, as it relates to the API. The way I see it, what makes the most sense from an API standpoint is to expose an assembly-level language from the graphics API (recall that OpenGL is supposed to be a “low-level” graphics API), and to layer a HLL compiler on top rather than integrating it into OpenGL. I think it makes very little sense to make a shading language part of the API itself.

This is the approach that we have taken with NV_vertex_program and Cg, and I think it’s the right design decision. You can precompile shaders; you can examine what the assembly looks like; you can have an API-independent runtime layer that works with more than just OpenGL; and, of course, you can still write assembly programs when you need to!

This is a personal beef of mine with the OGL2.0 proposals – I really don’t think it makes any sense to put a HLL inside the OpenGL API.

OGL1.4 is taking the right approach by exposing an assembly language from the API. The assembly language can be upgraded later, but it will certainly be functional in its initial form and a viable compiler target. So, a sufficiently inclined individual or company could write a compiler from the proposed 3Dlabs shading language to ARB_vertex_program, right now, today. The only reason to wait and make it part of “OGL2.0” is marketing.

If you wanted to be really picky, you could have a shading language as part of a “GLU 2.0”. This layer would simply call glLoadProgramNV (or the ARB_vertex_program equivalent thereof). This analogy isn’t entirely accurate because GLU generally doesn’t have a driver layer that allows different vendors to plug in their own implementation. However, this analogy does make it clear how a separate layer can work, and also illustrates how “standardization” is a straw man for putting the shading language in the base API. It’s perfectly possible to standardize a shading language and even a shading runtime that sits at any layer.
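
For example, on the application (or runtime-layer) side, loading the compiler’s output is just a matter of feeding assembly text to ARB_vertex_program. A rough sketch, nothing more; the little program string below is the kind of thing a HLL compiler might emit, and the extension-loading plumbing is omitted:

#include <stdio.h>
#include <string.h>
#include <GL/gl.h>
#include <GL/glext.h>  /* tokens; on Windows the ARB entry points themselves
                          must be fetched via wglGetProcAddress (omitted) */

/* Assembly a HLL compiler might emit: transform by the modelview-projection
 * matrix and pass the vertex color through. */
static const char *asm_text =
    "!!ARBvp1.0\n"
    "PARAM mvp[4] = { state.matrix.mvp };\n"
    "TEMP pos;\n"
    "DP4 pos.x, mvp[0], vertex.position;\n"
    "DP4 pos.y, mvp[1], vertex.position;\n"
    "DP4 pos.z, mvp[2], vertex.position;\n"
    "DP4 pos.w, mvp[3], vertex.position;\n"
    "MOV result.position, pos;\n"
    "MOV result.color, vertex.color;\n"
    "END\n";

GLuint load_vertex_program(void)
{
    GLuint id;
    glGenProgramsARB(1, &id);
    glBindProgramARB(GL_VERTEX_PROGRAM_ARB, id);
    glProgramStringARB(GL_VERTEX_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                       (GLsizei)strlen(asm_text), asm_text);

    /* If the assembly is rejected, the error position says where. */
    if (glGetError() == GL_INVALID_OPERATION) {
        GLint errpos;
        glGetIntegerv(GL_PROGRAM_ERROR_POSITION_ARB, &errpos);
        fprintf(stderr, "vertex program error at byte %d\n", errpos);
        return 0;
    }
    glEnable(GL_VERTEX_PROGRAM_ARB);
    return id;
}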

A final argument that has been made is that it is valuable somehow to not support an assembly language, because it eliminates some sort of backwards compatibility burden. But since ARB_vertex_program isn’t going away (much less NV_vertex_program or DX8), this is a burden that will already exist by developer demand. In the very worst case, you could “compile” an ARB_vertex_program into a high-level program. This assumes that the HLL has the same set of program inputs (i.e. vertex attribs) and outputs as the ARB language; but I think that’s a reasonable assumption. There’s no reason to change the semantics of input/output behavior just because you are putting in a high-level language.

Oh, what I’d give to be able to have some honest discussion of the 3Dlabs OGL 2.0 proposals… just look at the poll on this site about whether you’ve reviewed the proposals. If you’ve reviewed them, you have the choice of either “fully support[ing]” them or “want[ing] to learn more”. There is no option that lets you say that you disagree with many of the design decisions, as I do.

But I’ve already probably spoken too much about this sensitive topic…

[Then again, isn’t that precisely the problem? Those of us who’ve worked on drivers for years, who live and breathe OpenGL, who may have many criticisms and disagreements have our tongues tied for political reasons, while developers just look at the proposals and see that there’s all this stuff in them, and wouldn’t it be nice to just have every feature in the world… I see it as a set of tradeoffs and design decisions, and see what I think are the wrong ones being made, and I can’t even tell anyone what I’d like to see changed, even if they might agree with me.]

Okay, now I should really shut up.

  • Matt

I hadn’t thought of this problem (i.e. the HLL being part of the API) but reading your post I begin to think that it may not actually be such a good idea…

I may start a war here but the main problem I have with OGL 2.0 is that it is just that: a proposal.

Seeing how long it takes for the ARB to promote one single extension, I cannot see how OGL 2.0 could be made available for several years yet (I mean with proper drivers, not just a sample implementation).

As far as I am concerned, OpenGL 1.4 seems to be what I have been waiting for, and I must say that Cg looks like it could be quite helpful in this case…

That being said, that is just my opinion. After all, I am just a poor lonesome coder…

Regards.

Eric

P.S.: another thing I’ve been dreaming about is OpenML… I think I saw the first announcement about that 2 years ago. Where are they now?

Originally posted by Eric:
P.S.: another thing I’ve been dreaming about is OpenML… I think I saw the first announcement about that 2 years ago. Where are they now?

Well, the 1.0 specification was released a while ago, if that’s what you mean.

My 2 pence on Cg… the concept is great, the politics suck. A language controlled by one vendor is always going to be developed in such a way as to favour that vendor. So either it flops, or (more likely) it’s pulled into DX9/DX10 as an “official” high-level shading language, cements NV’s market lead and discourages experimentation by other vendors. (Man, I’m getting old… I can remember the days when you could still use the word “innovation” and have it mean something… :wink: )

Given that Cg occupies about the same niche as the GL2 shading language, it’s a bit disappointing that NV couldn’t lend their efforts to developing a real open standard instead of producing an encumbered de facto one.

As I’ve said before, I’m a big fan of NV. This isn’t an anti-NV rant, it’s just pointing out (as others have done) that what’s good for NV-the-company isn’t necessarily good for the users. Or for that matter, for NV tech guys, for whom a healthy competitive marketplace means great salaries and cool work to do.

Originally posted by Eric:
[b] I am sorry but I am in the real world, you aren’t: I am very conscious of the commercial issues behind what NVIDIA is doing (although I think the guys who are developing Cg are not the ones who are commercially interested in it…).

You complain about this commercial side of things while this is something you should expect these days. Who’s in DreamLand then?

Regards.

Eric[/b]

Eric, I didn’t mean any offense to you.
Why the aggressive reply? Have you shares in nvidia plc?
The more I think about this whole Cg thing, the more depressed I get (in terms of the future of opengl, not life itself! ).
OpenGL is just becoming a mongrel - a messy experiment.
Now go on, insult me!

(I’ve just discovered the bold tags, in case you haven’t noticed!)

P.S. I’m gutted about the England v Brazil match - so excuse my bad mood.

I don’t think Cg’s design really “favors” NVIDIA in any way. The profiles may be “dumbed down” to present hardware, but the language itself is just a programming language. It’s not as though the C programming language “favors” x86, Sparc, Mips, or Alpha… and it would likely be safe to say that future profiles will support larger subsets.

I suppose there is one other difference between a layered and a built-in language – handling of multipass. Unfortunately, transparent handling of multipass is very hard to do in an OpenGL driver; you’d likely need hardware support for an F-buffer, or it would only work in certain special cases. It makes more sense in a scene graph API or game engine to talk about transparent multipass. Even there, though, it can be difficult if all you have is destination alpha to carry forward intermediate results. A pbuffer or aux buffer could help.

In practice, the driver might just have to fall back to software for any shader more complicated than it can handle. Transparent multipass would be cool – sure. Feasible? Unclear.

Similar problems are faced in designing a multivendor vertex/fragment programmability API, because not every vendor has the same underlying instruction set, and because in many cases underlying restrictions can’t be revealed to end users. I am quite happy with the approach the ARB settled on for this particular issue (i.e. limitations on how “big” programs can get, and associated queries) with ARB_vertex_program.
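
(For reference, those queries look roughly like this; a sketch only, assuming the usual glext.h tokens and extension entry points are available:)

#include <stdio.h>
#include <GL/gl.h>
#include <GL/glext.h>

void report_vertex_program_limits(void)
{
    GLint max_inst = 0, max_native = 0, max_temps = 0, under_native = 0;

    glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,
                      GL_MAX_PROGRAM_INSTRUCTIONS_ARB, &max_inst);
    glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,
                      GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB, &max_native);
    glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,
                      GL_MAX_PROGRAM_TEMPORARIES_ARB, &max_temps);

    /* After a program has been accepted: does it also fit within the
     * hardware's native limits, or will the implementation fall back? */
    glGetProgramivARB(GL_VERTEX_PROGRAM_ARB,
                      GL_PROGRAM_UNDER_NATIVE_LIMITS_ARB, &under_native);

    printf("max instructions %d (native %d), max temporaries %d, "
           "current program under native limits: %s\n",
           max_inst, max_native, max_temps, under_native ? "yes" : "no");
}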

  • Matt

My 2 cents:
OpenGL 2.0 is something more than just shader languages.
For me, shaders are not even the primary priority. More important is fixing well-known OGL flaws:

  • render to texture (please recall recent threads full of bitching on wglShareLists, wglMakeCurrentRead)
  • texture objects (texture-targets are obsolete and annoying)
  • vertex array objects (no more CVA, VAR, VAO, MOB mess)
  • shader objects (no more loose bunch of states (like Nv RC or tex_env) or “exotic” Ati-style interfaces)
  • synchronisation (NV_fence & NV_occlusion_query: seems like the beginning of another mess)
  • unified object interface (one set of glGen+glDelete+glIs+glPrioritize is enough)

It is not that important to me which shading language(s) will be available with the above interface. Provided you stick to a unified interface (for attribute binding, loading constants, and so on), you might implement any shading language, high or low level:

GL2 shaders, Cg shaders, DirectX 8, 8.1, 9 shaders,
Nv VP, Arb VP, Ext VS, Nv Parse, Ati FS, Matrox FS, Nv30 FP,
or even texture_env_combine/crossbar/route

You may put some of the above into core, others into extensions, others into glu, whatever.
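
To make that concrete, here is a purely hypothetical sketch (none of these names exist in any real or proposed API): one set of create/constant-loading calls, with the shading language just a plug-in behind it.

#include <stdio.h>
#include <string.h>

/* Purely hypothetical, illustrative only. A real back end would target
 * ARB_vertex_program text, NV register combiners, a GL2-style compiler... */
typedef struct ShaderBackend {
    const char *name;
    int (*compile)(const char *source);  /* returns a back-end handle */
} ShaderBackend;

static int compile_arbvp(const char *src) { (void)src; return 1; }  /* stub */
static int compile_gl2(const char *src)   { (void)src; return 2; }  /* stub */

static const ShaderBackend backends[] = {
    { "arbvp", compile_arbvp },
    { "gl2",   compile_gl2   },
};

typedef struct ShaderObject {
    const ShaderBackend *backend;
    int handle;
} ShaderObject;

/* One creation call, whatever the language of the source text... */
ShaderObject shader_create(const char *lang, const char *source)
{
    ShaderObject obj = { NULL, 0 };
    size_t i;
    for (i = 0; i < sizeof(backends) / sizeof(backends[0]); i++) {
        if (strcmp(backends[i].name, lang) == 0) {
            obj.backend = &backends[i];
            obj.handle = backends[i].compile(source);
        }
    }
    return obj;
}

/* ...and one constant-loading call, again independent of the language. */
void shader_set_constant4f(const ShaderObject *obj, const char *name,
                           float x, float y, float z, float w)
{
    if (!obj->backend)
        return;
    printf("[%s] %s = %g %g %g %g\n", obj->backend->name, name, x, y, z, w);
}

int main(void)
{
    ShaderObject vp = shader_create("arbvp", "!!ARBvp1.0 ... END");
    shader_set_constant4f(&vp, "lightDir", 0.0f, 0.7f, 0.7f, 0.0f);
    return 0;
}
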
But the changes proposed in OGL2 are necessary. Holding them back for any reason (while DX lives free of the above problems) is harmful to OpenGL. So I’m very disappointed to see Nvidia officially declaring a “cold war” on OGL2.

As for the article at extremetech, it is funny.
At the beginning it tries to suggest that the main obstacle to seeing the GF3 & R200 utilised to their full extent is the complexity of asm-style programming.
So it is not because most people have DX7 HW, nor because DX8 HW is priced over $200.
It is all because programmers are not smart enough to write 12-line asm programs

[This message has been edited by Carmacksutra (edited 06-21-2002).]

There are things in OpenGL that could use fixing, but in many cases I disagree with the approach taken in the proposals, and in others it’s not clear that it’s really worth fixing them.

Obvious example that you alluded to: it’s lame how texture enables work in OpenGL, with the precedence and all. But it would be difficult to make any change that both improves the behavior and preserves compatibility.

Some of the things you are talking about will get fixed up simply by standardizing a low-level assembly language for vertices and fragments. (For example, with a fragment program, all the texenv state can just get ignored.)

Others are solely a function of how WGL works, and really can’t be fixed by the ARB at all (wglShareLists).

I do think OpenGL needs a unified object interface, but have issues with the specific proposal.

What I personally (again, not speaking as an NVIDIA employee, but as someone who cares about OpenGL and its future) would like to see would be for everyone to take a step back from what are currently fairly concrete proposals and instead look at design decisions. There has been little dialogue on the really tough, but really important questions: “Does this feature really belong in core OpenGL? How should this feature be exposed in the long run in OpenGL?” Once you have a concrete proposal written, it’s hard to give feedback beyond the level of “I found a typo here” or “this doesn’t make sense” without seeming really crass, i.e., “you should delete this entire section from the proposal.” It’s especially hard when everyone works at different companies and (to put it mildly) not everyone trusts everyone else.

In the short term, I think the best course for OpenGL and the ARB is clear: get OpenGL 1.4 out the door, and then (in perhaps another 6 months) start thinking about fragment programmability. It’s important that standardization processes not be rushed. Rushing things guarantees lots of unhappiness with the outcome. For example, I think it would be downright silly for the ARB to start working on an ARB_fragment_program extension today.

That’s just what you have to live with when it’s a standards body and not Microsoft. Microsoft can talk to each vendor in private and come up with some sort of compromise in advance. The ARB doesn’t work that way. Each way has advantages and disadvantages.

There are interesting meta-discussions to be had here about “how to design a good standards body”.

  • Matt

Hey Matt -

You mentioned that it is currently very difficult, if not impossible, to support a transparent multi-pass compiler.

This is probably a naive question, but why can’t we do away with the need for (most) multi-pass rendering by putting some simple looping capability into the hardware?

Now, let me explain why I think this is reasonable before everyone jumps on me

Don’t the Geforce3 & 4 use a loop-back mechanism to apply 4 textures or 8 register combiner stages? What stops you from generalizing this? Why not allow n loop backs for 2*n texture applications?

Wouldn’t this also work for vertex programs? For example, if I had a 256-instruction vertex program, couldn’t the driver just split it up into two 128-instruction programs, switch the programs when a vertex reaches the end, and run the vertex through again?

There are probably good reasons why this isn’t possible, but I’d be curious to hear them

– Zeno

Originally posted by Zeno:
This is probably a naive question, but why can’t we do away with the need for (most) multi-pass rendering by putting some simple looping capability into the hardware?

I just kind of answered this same question in another thread recently. Let me repost my response:

Pretty much. The difficulty with using a feedback into the pipeline for a second/third/etc. pair of textures is that all of these textures need to be accessible. Even texturing from video memory is painfully slow. Graphics hardware uses a texture cache to buffer small parts of a texture for extremely fast access. When you do a feedback loop, you need to do one of the following:
A) For every pixel, load the first texture set, draw, flush the texture cache & load the next texture, draw, repeat.
B) Have a mechanism for sharing the cache among multiple textures.
C) Have separate caches for each texture.

The problem with A is that it’s a performance waster. You waste tons of bandwidth loading and unloading textures from vidmem to cache. On top of that, you will have excessive idle time while waiting for the textures to load, unless you create some type of batching system in the hardware (i.e. process 100 pixels partway, then switch textures and continue). Even then it’s still a memory access hog.

The problem with B is that you then only have half the cache available for each texture stage (or if you want to loop back more than once, you only get 1/3, 1/4, etc. the amount of cache available).

C is about the best option, but I think by the time you go that far in the hardware, you are probably a large portion of the way to just making those fully separate texture units.

Perhaps Matt, working a little bit closer to the hardware than I do, can confirm my above thoughts on the matter.

SGI once designed hardware with the kind of ‘recirculation’ described; they didn’t fully implement it in the end, and I wouldn’t be surprised if someone else does. It would be easy to handle an arbitrarily large multitexture scheme with vanilla daisy chaining, but it would be more difficult to make that work with the newer crossbar/combiner style of texenv (or your equivalent), which is essential to making it work. The way stuff can be combined together, you’d probably long for more registers if you were using massively multitexture shaders, but the maximum number of textures seems to be the problem for now. Yes, at some point you’d totally thrash your cache; hopefully you’d have a bigger cache and many more registers on that hardware, but worst-case performance drops significantly at some point.

[This message has been edited by dorbie (edited 06-21-2002).]

The short answer to your question is: higher resource limits cost $$$, and no matter what you still have a finite resource limit.

There’s all the difference in the world between being able to perform a “large” number of operations per pass and an “unlimited” number per pass.

Obviously a truly “unlimited” number is impossible because computers are finite. Even a software implementation will always have some limit. So the API needs to have a way to say “no!” at some point on the grounds that “this program is too big for resource X”. (The ARB_v_p working group discussed this problem at great length…)

Okay, so you might say, “yeah, eventually that might happen, but surely you could make the limits high enough that no one will ever really hit them?” I think this evades the question (think the infamous 640KB), but, for the sake of argument, I’ll pretend to concede this point.

Let’s make the problem easier by assuming that there is no branching at all. Branching/looping only makes it harder.

Let’s also assume that each vertex and each fragment is fully independent, i.e., they don’t write to any state that affects the others. One way you could break this assumption would be to let programs write to a “constant” register, sort of like how vertex state programs do in NV_vertex_program. Another would be to let fragment programs do programmable blending or the like inside the program by allowing you to get the current pixel’s color or Z or stencil as a fragment input. Again, this stuff makes things harder, so let’s drop it for now.

You can draw a dataflow graph of any shader. Many nodes are simple math operations. There are also certain “special” nodes. For example, an “interpolator” node converts a per-vertex quantity into a per-fragment quantity. And then there are the input and output nodes. Input nodes correspond to vertex attributes, and output nodes correspond to the final shaded color (and possibly Z) of a pixel. Other nodes contain constants used by the program. You would probably use a special node to do relative addressing. Another would probably be a ‘position’ node that indicates the need for rasterization.

One very special node is a texturing node. It would take a texture coordinate as input, refer to a specific texture, and do a lookup.

With the right set of nodes like this, we can now draw every shader as a DAG.
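
To make that concrete, here is a toy sketch of the bookkeeping I’m describing; it’s nothing like real driver code, just enough to show where the resource counts come from:

#include <stdio.h>

typedef enum {
    NODE_INPUT,        /* vertex attribute */
    NODE_CONSTANT,
    NODE_MATH,         /* add, mul, dp4, ... */
    NODE_INTERPOLATOR, /* per-vertex quantity -> per-fragment quantity */
    NODE_TEXTURE,      /* coordinate in, texel out; refers to a bound texture */
    NODE_OUTPUT        /* final color (and possibly Z) */
} NodeKind;

typedef struct Node {
    NodeKind kind;
    int num_args;
    int args[3];       /* indices of source nodes */
} Node;

/* Walk a shader DAG (flat array, dependency order) and total up the
 * resources a finite piece of hardware would have to provide. */
void count_resources(const Node *dag, int n)
{
    int inputs = 0, constants = 0, interpolants = 0, textures = 0, math = 0;
    int i;
    for (i = 0; i < n; i++) {
        switch (dag[i].kind) {
        case NODE_INPUT:        inputs++;       break;
        case NODE_CONSTANT:     constants++;    break;
        case NODE_INTERPOLATOR: interpolants++; break;
        case NODE_TEXTURE:      textures++;     break;
        case NODE_MATH:         math++;         break;
        case NODE_OUTPUT:       break;  /* fixed set, not a resource */
        }
    }
    printf("attribs %d, constants %d, interpolated values %d, "
           "texture lookups %d, math ops %d\n",
           inputs, constants, interpolants, textures, math);
}

int main(void)
{
    /* Toy shader: (interpolated normal . light direction) * texture lookup */
    const Node dag[] = {
        { NODE_INPUT,        0, {0} },     /* 0: normal attrib        */
        { NODE_CONSTANT,     0, {0} },     /* 1: light direction      */
        { NODE_INTERPOLATOR, 1, {0} },     /* 2: per-fragment normal  */
        { NODE_MATH,         2, {2, 1} },  /* 3: dot product          */
        { NODE_INPUT,        0, {0} },     /* 4: texcoord attrib      */
        { NODE_INTERPOLATOR, 1, {4} },     /* 5: per-fragment coord   */
        { NODE_TEXTURE,      1, {5} },     /* 6: texel                */
        { NODE_MATH,         2, {3, 6} },  /* 7: modulate             */
        { NODE_OUTPUT,       1, {7} },     /* 8: final color          */
    };
    count_resources(dag, (int)(sizeof(dag) / sizeof(dag[0])));
    return 0;
}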

Output nodes do not constitute a resource; there are a fixed number of them possible.

Input nodes are definitely a resource that is tough, if not impossible, to virtualize. Adding more vertex attribs puts more burden on the driver and really boosts the size of RAMs in the hardware.

Interpolators are not a problem math-wise, because you can reuse a single interpolator as many times as you want, but it is still necessary to store the computed vertex somewhere for rasterization and interpolation. This is likely a fixed-size RAM.

Math operations can easily be looped, but you need to have enough temporary registers. Temporaries can get very costly – big multiported RAMs.

Textures all need to fit in RAM – bind too many textures and you’re in trouble. You also only have so many API slots for binding textures, and adding more slots may lead to other assorted HW costs.

It depends to some extent, but you probably can build something that can handle “a lot” of instructions (total DAG nodes in this framework) and constants without too much trouble.

There are two big approaches to splitting things up: spilling and breaking up the DAG into workable pieces.

There’s plenty of space for spills – video memory, for example. But hardware that can spill extra (e.g.) temporaries to video memory could get rather complicated and slow. Spills also only deal with restrictions on temporaries, in general.

Breaking up the DAG can work also, though you can construct some rather degenerate DAGs where this falls apart. Again, you need temporary storage off-chip to store results from each “pass”. F-buffers are problematic because they have unbounded size. Anything that relies on “1 fragment per pixel”, like using a pbuffer, breaks when the wrong depth test/stencil/etc. modes are used. In practice you could probably use an F-buffer and flush it whenever it fills up, but this would be worse in performance than letting the app do its own multipass, and I think you can construct degenerate cases where you will get exponential (well, at least greatly superlinear, not sure if it is really exponential) blowup of runtime.
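
If it helps, here is a toy illustration of why breaking the program up needs per-fragment intermediate storage: whatever is live across the cut must be written out by one pass and read back by the next. A pure sketch, not how a driver would actually do it:

#include <stdio.h>

#define NUM_REGS 8

typedef struct {
    int dst;     /* register written */
    int src[2];  /* registers read (-1 = unused) */
} Instr;

/* Split a straight-line program at instruction "cut" and count the values
 * that are live across the split: each one is something pass 1 must spill
 * to off-chip storage, per fragment, and pass 2 must read back. */
int count_live_across_cut(const Instr *prog, int n, int cut)
{
    int defined_before[NUM_REGS] = {0};
    int used_after[NUM_REGS] = {0};
    int i, j, live = 0;

    for (i = 0; i < cut; i++)
        defined_before[prog[i].dst] = 1;
    for (i = cut; i < n; i++)
        for (j = 0; j < 2; j++)
            if (prog[i].src[j] >= 0)
                used_after[prog[i].src[j]] = 1;
    for (i = 0; i < NUM_REGS; i++)
        if (defined_before[i] && used_after[i])
            live++;
    return live;
}

int main(void)
{
    /* r0..r3 are computed early and all consumed late, so cutting in the
     * middle means four values must be carried across for every fragment. */
    const Instr prog[] = {
        { 0, { -1, -1 } }, { 1, { -1, -1 } }, { 2, { -1, -1 } }, { 3, { -1, -1 } },
        { 4, { 0, 1 } },   { 5, { 2, 3 } },   { 6, { 4, 5 } },
    };
    int n = (int)(sizeof(prog) / sizeof(prog[0]));
    printf("values live across a cut at instruction 4: %d\n",
           count_live_across_cut(prog, n, 4));
    return 0;
}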

So is it impossible to do all this stuff? No, not impossible. Is it a good use of fixed hardware resources? Definitely not. These hardware resources are better spent putting more math units in the chip. And even then, say you did build enough stuff to support “absurdly big” shaders. Then old hardware would still have little choice but to fall back to software, which really isn’t what you want, and you still will hit that limit somewhere and need the API to reject certain sets of programs.

Once you recognize that the API needs to be able to reject programs, the nature of the problem domain changes. Now you can ask the real questions here, like “how many temporary registers are really useful? Does anyone really want 256 4f registers?”

If you truly want unlimited flexibility, you might as well be using a CPU. Graphics is all about making the common case fast.

  • Matt

Nobody is dumb enough to really suggest unlimited textures. When they say unlimited (nobody actually wrote that) they mean considerably more than, ohh… 4, for example. Straw-man implementations of unbridled complexity don’t mean that some number like 32, 64 or 128 wouldn’t be interesting. It’s actually simple to imagine a situation where every few textures (or even just a few interpolated parameters: same textures, more combiners for most params) buys you a full bump-mapped, shadowed fragment light source in a single pass, just as a for instance. Recirculation just becomes one interesting way to implement this.

[This message has been edited by dorbie (edited 06-21-2002).]

dorbie,

Yep. I agree. 4 is definitely not enough for the long run, and obviously there’s going to be a limit somewhere no matter what. But the developer may want to write shaders without regard to hardware limits (and still get “good” performance even over those limits), and that’s certainly a valid desire.

  • Matt

Dorbie & LordKronos:

I see the problem with cache thrashing…I should have thought of that.

Matt:

As Dorbie mentioned, you made a bit of a straw man out of my suggestion by assuming that I meant ‘n’ had to be able to go to infinity. I realize that there are limits that are going to come into play somewhere… whether it’s not enough RAM or the shader is just too slow for its intended purpose.

Also, I meant to suggest loop-back only in the parts of the pipeline that are currently programmable, so I didn’t mean to suggest that an interpolator needs to loop back as well (this is currently in a “hidden” part of the pipeline). That wouldn’t really be useful anyway, since its input is per vertex and its output is per pixel.

Anyway, tell me if I understood the rest correctly:

Spills. I sorta see what you’re saying here. Some features, like stenciling, currently require multi-pass rendering simply because the stencil buffer must be complete before it can be used to affect further rendering.

Similarly, if you want to be able to stream vertices through the program and not do one vertex at a time, you have to have somewhere to store a batch of vertices between application of the first and second programs. I can definitely see this being a pain in the butt, especially since you don’t know how many vertices are coming.

Thanks for the detailed answers

– Zeno

On the subject of HLL vs ASM - I think making a low-level language part of the core API is wrong.

  1. It may not map well to all hardware. Imagine a TTA (transport-triggered architecture), where there is only one instruction: MOV. The output of one unit is fed directly as input into another unit. A unit may perform a multiplication or a texture read, work simply as a temporary register, or change the instruction pointer if the parameter passed to it is 0. The current register-based low-level shaders will not work very well with that. This is just an example.
  2. It may not scale well with future hardware. PS1.0 shaders are not as optimal as PS1.4 shaders on ATI hardware.
  3. Transparent handling of multipass. I am sure some vendors will have it implemented in their drivers, and this is a killer feature.

And the shaders are not the most important thing. The current state of OpenGL is a mess; OpenGL 2 is the only way out of the issues with synchronisation, render-to-texture, vertex arrays… If somebody knows of any other plans to handle these problems in a clean, platform-independent way, please tell me. That some people may not like how textures work in OpenGL 2.0 (what is wrong with that anyway?) is a lame excuse not to put all that goodness in the hands of programmers. If there is a platform-independent way to do something that most ISVs and IHVs agree on (I think this is the case with OpenGL 2.0), it MUST be implemented, even if you don’t agree with it.

Originally posted by mcraighead:
That’s just what you have to live with when it’s a standards body and not Microsoft. Microsoft can talk to each vendor in private and come up with some sort of compromise in advance. The ARB doesn’t work that way.

I think I’m right - NVidia aims to be the Microsoft of OpenGL. It may be a layer on top of OpenGL, but it’s a layer that does all the difficult stuff. Stuff outside this layer (in OpenGL) would simply point the state machine at array bases and call the compiled shaders. If it becomes popular with developers (and why shouldn’t it?), then NVidia will have effective control over the core features of OpenGL. Maybe this will be a good thing - ATI, Matrox and most definitely 3dlabs won’t agree, though.
Come back SG, all is forgiven!

Originally posted by mcraighead:
Obvious example that you alluded to: it’s lame how texture enables work in OpenGL, with the precedence and all. But it would be difficult to make any change that both improves the behavior and preserves compatibility.
Maybe this is too maximalistic an approach?
I don’t think it would be useful in practice to mix old and new commands when both control the same part of the GL machine (e.g. texture binding).

I think texture binding is not a hopeless case for designing reasonable interoperability of 1.x and 2.0 objects (in short: introduce a new target that has highest priority for 1.x commands and is invisible to 2.0 commands).

In case of any really tough conflict, the simplest (and IMHO the best) way would be to make that particular group of state completely disjoint between 1.x and 2.0. So 1.x commands work with 1.x state, and 2.0 commands work with 2.0 state.

Of course this would not allow using any 1.x tex_env or NvRC with 2.0 texture objects. But this is just one more reason to include textual versions of tex_env, NvRC and AtiFS in 2.0 shaders. In general, I think legacy functionality should be upgraded to fit into 2.0, not the reverse (crippling OGL2 to fit legacy).

Originally posted by mcraighead:
Others are solely a function of how WGL works, and really can’t be fixed by the ARB at all (wglShareLists).

I meant that with 2.0 buffer objects, the problems with the mentioned wgl*** calls (when used with RTT) would not exist.

I was thinking about how Cg is supposed to be supported, and about the benefits of vendor-specific extensions.

Is Cg supposed to take over this whole shading language mess? The Cg website says that if something is not supported, then it bypasses it or something. Cg will have to be kept up to date all the time.

The nice thing about extensions was that each vendor showed one way of doing things. Maybe we, the developers, should vote for which one is better (for extensions that do similar things, of course).

V-man

Matt, yes: clearly, to free implementations ultimately from the shackles of a fixed pipeline, you need more stores to pass data between passes, which hopefully translates to fewer restrictions (or actually greater performance) on compiled code of arbitrary complexity. The ultimate difference between this and a recirculation scheme is that the multipass scheme might remain fragment-cache coherent regardless of complexity, but to be equivalent it will require much larger framebuffers to store more data over the full frame. Recirculation might thrash the cache but requires less framebuffer memory, although it would need significantly more registers. That seems to be the real tradeoff (ignoring geometry for now).

[This message has been edited by dorbie (edited 06-22-2002).]

Is there some reason why we can’t have a framebuffer dot-product blending mode? And MAX/MIN logic? You know, with colour-compressed vectors, etc.?