ARB June Meeting

Now, the issue of each vendor having their own extensions to expose features common only to their hardware will always be there; writing some code specific to a particular card will never completely go away. Most of it should be standard, though, the basic core stuff, and the deal with ARB_vp and ARB_fp is great. It’s a standard, and all cards, if they want to be worth a poo, will support them. But still having a few extra vendor-specific extensions is not a bad thing. It’s just that whole entire code paths should not be vendor specific, like how things had to be done in Doom 3. This kind of thing even goes on in the CPU world (I don’t mean the entire-code-path thing). We have Intel and AMD fighting each other just like we have NVIDIA and ATI. At one point we had these AMD chips with 3DNow!: stuff to help applications run faster and better if you were running an AMD chip, while the people on Intel chips would just be running through the standard path. Dang. But don’t despair just yet, Intel had their own fancy “extensions” like MMX, SSE, and the like. So here we have two CPUs with the same basic instruction set, but at the same time each had its own enhancements for those who wanted to use them for extra performance to make their program better than the rest. This is just like it is right now with GPUs, and it’s a trend I don’t see dying off any time soon. It’s actually a good thing in many ways.

-SirKnight

Originally posted by Korval:
and I really don’t like the idea of ‘texture’ accesses in vertex shaders

hm… why? you don’t like displacement mapping? or other features? you can have a fully LODded terrain, just map your height texture onto it, and voilà, always correctly heightmapped…

or a hw-animated water texture can be used for the displacement, too…

tons of features actually!
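For concreteness, here is a minimal CPU-side sketch of the per-vertex lookup such a shader would perform for heightmapped terrain; the grid layout, nearest-neighbour sampling and scale factor are illustrative assumptions, not anyone's actual implementation:

    #include <stddef.h>

    typedef struct { float x, y, z; } Vec3;

    /* Displace each vertex of a regular grid upwards by a value sampled
       from a greyscale height map, i.e. the same per-vertex lookup a
       vertex shader with texture access could do on the GPU. */
    void displace_grid(Vec3 *verts, size_t count,
                       const float *heightmap, int width, int height,
                       float scale)
    {
        for (size_t i = 0; i < count; ++i) {
            /* nearest-neighbour sample at the vertex's (x, z) position,
               assuming the grid spans [0, width) x [0, height) */
            int u = (int)verts[i].x;
            int v = (int)verts[i].z;
            if (u < 0) u = 0; else if (u >= width)  u = width  - 1;
            if (v < 0) v = 0; else if (v >= height) v = height - 1;
            verts[i].y = heightmap[v * width + u] * scale;
        }
    }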

and for hw vendors it will be more useful for the future, too… why? because in the end, vs and ps should, and so could, be implemented the very same way… that could mean, for example, in a fillrate-intensive situation you could use 6 of your 8 pipelines for the pixel shading and 2 for the vertex shading… in a vertex-intensive situation you could use 5 for vs, 3 for ps…

and, if you want to use hw to assist raytracing for example, you could use all 8 for raytracing…

this could come, the pipeline-“sharing”… and could be quite useful for gaining performance… reusing resources that is

just think of it… all your vertexshaders would just go and support the pixelshaders…

woooooooohhhhh now THAT would rock

Well, you have fewer interpolators than texture units, so what’s your option?

Anyhoo, to correct your Intel analogy: Intel won’t be adding 3DNow! instruction support to their compiler any time soon. The analogy is strained anyway because other vendors could optimize if they chose to (or at least had a choice once); they won’t, of course, because they’re firmly in a different shader camp. Reasoning by analogy is rarely very useful.

I know you have ARB_fp etc., but that doesn’t mean all hardware does the same thing, or has a 100% instruction match (especially into the future), or the same optimal program length, register use, or texture unit count. This is all pretty obvious.
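For instance, even under the shared ARB_fragment_program interface those differences are queryable per implementation. A small sketch, assuming a current GL context and that the extension entry points are available (e.g. via GL_GLEXT_PROTOTYPES or an extension loader):

    #include <stdio.h>
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Same ARB_fragment_program "standard", very different ceilings per card. */
    void print_fp_limits(void)
    {
        GLint total, native, temps, texunits;

        glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                          GL_MAX_PROGRAM_INSTRUCTIONS_ARB, &total);
        glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                          GL_MAX_PROGRAM_NATIVE_INSTRUCTIONS_ARB, &native);
        glGetProgramivARB(GL_FRAGMENT_PROGRAM_ARB,
                          GL_MAX_PROGRAM_TEMPORARIES_ARB, &temps);
        glGetIntegerv(GL_MAX_TEXTURE_IMAGE_UNITS_ARB, &texunits);

        printf("max instructions: %d (native: %d), temporaries: %d, texture image units: %d\n",
               total, native, temps, texunits);
    }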

You have to be pretty blinkered to think that a multiple choice of compilers, offering either partial hardware coverage or proprietary half-hearted support, is a good thing.

IF that’s how things pan out.

SirKnight, this shader extension is NOT in the core, it’s an optional ARB extension; it never got enough votes in the second vote to put it in the core.

This probably puts things in a different light for you.

Originally posted by SirKnight:
I don’t think it makes much sense to have a HLSL built into OpenGL. There should only be assembly language like shading extensions (like ARB_vp and ARB_fp) in the core and the HLSL should be “outside” like how Cg is. To me this seems like an obvious thing to do but some don’t see it that way. OpenGL should be kept a “low level” graphics API and anything else you need, any kind of helpers like HLSLs, should be just like utilities outside the api and compile to what is in the core.
NVIDIA’s company line. Gosh. Allow me to disagree (as I do all the time now, it seems)

Nailing down the ‘assembly’ language is bad. If you require a certain assembly interface for the high level, layered compiler to work, you restrict hardware implementations to that exact assembly interface. Hardware is too diverse to do that. Much more than in CPU land.

Remember ATI_fragment_shader vs NV_register_combiners? You’d need one of these to make full use of the NV2x and R200 generation. What you’re proposing is somewhat akin to restricting yourself to ARB_texture_env_combine. You gain portability but lose flexibility on both targets.

One of the very reasons for high level languages is the opportunity to eliminate diverse middle interfaces, to say goodbye to multiple codepaths, and still get the best possible hardware utilization.

This is why, IMO, the assembly style interface should best be hidden and never even be considered for exposure again.

I’m no expert on Direct3D but from what I have seen, that’s pretty much how D3D is. All of these extra things are a part of the D3DX library of helpers.
And the DX Graphics model didn’t work out too well. Futuremark, anyone? MS subsequently did a new PS2_a profile. Guess why that just had to happen …

Having a HLSL built in the core makes about as much sense as having C++ built into our CPUs. No, what we have is an assembly language defined for our processors which has a 1:1 mapping to its machine code instructions <…>
Yeah, right. Try running an x86 executable on a Mac. Then come back and try a more appropriate analogy for the point you wish to make. Sheesh.

Meanwhile, I’ll take your analogy and use it for my own POV:
C++ can be compiled for x86, for PowerPC, for Sparc. If the code in question doesn’t touch upon OS peculiarities, it’s all a matter of selecting the right compiler target.
You don’t compile C++ to x86 ASM, and then try and do a second compile step to produce a Mac binary.

There is no industry standard assembly representation that’d do justice to all targets. Everyone who tells you otherwise must have been smoking something hallucinogenic.

you don’t like displacement mapping?

What displacement mapping?

Real displacement mapping involves shifting the location of objects per-fragment. Doing it per-vertex is nothing more than some kind of hack.

In any case, vertex shaders can’t do the really hard part of displacement mapping anyway: the tessellation. And, if the designers are smart, they never will (tessellation should go into a third kind of program that feeds vertices to a vertex program, so that they can run async). So, in order to do automatic displacement mapping, you still have to do a render-to-vertex-array to tessellate. Since you’re writing vertex data from your fragment program, you may as well use its texturing facilities to do the displacement.

Now, I do like the idea of binding arbitrary memory to a vertex shader. However, this is different from a texture access.

If you have a 16x1 texture, accessing the texel value 3.5 has some meaning. With bilinear filtering, that means accessing a blend of 1/2 of texel 3 and 1/2 of texel 4.

This has absolutely no meaning for the kind of memory I’m talking about. For example, let’s say I bind a buffer of memory that contains matrices for skinning to a vertex shader. The way this should work is that it only takes integer values as arguments. Matrix 3.5 has no meaning. And a blend of the 16-float values that matrix 3.5 represents would be the absolute wrong thing to do.
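A small worked sketch in C of the distinction being drawn here (the 16-entry texture and the 3.5 index follow the example above; the matrix palette and its size are illustrative):

    /* 16x1 float texture: "fetching index 3.5" with bilinear filtering
       means blending half of texel 3 with half of texel 4. */
    float tex1d_bilinear(const float tex[16], float index)
    {
        int   i0 = (int)index;                  /* 3   */
        int   i1 = (i0 + 1 < 16) ? i0 + 1 : 15;
        float w  = index - (float)i0;           /* 0.5 */
        return (1.0f - w) * tex[i0] + w * tex[i1];
    }

    /* A palette of skinning matrices bound to a vertex shader: only whole
       indices are meaningful.  "Matrix 3.5", a componentwise blend of two
       4x4 matrices, would be exactly the wrong thing per the argument above. */
    typedef struct { float m[16]; } Matrix4;

    const Matrix4 *fetch_matrix(const Matrix4 *palette, int count, int index)
    {
        /* integer lookup, no filtering semantics at all */
        return (index >= 0 && index < count) ? &palette[index] : &palette[0];
    }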

Also, textures are not updated frequently. And, when they are, they are usually updated via a render-to-texture or a copy texture function, not from main memory data. However, 9 times out of 10, memory bound to a vertex shader is updated every time the shader is used. So, you don’t want to use the texture accessing functionality with it; instead, you want an API more akin to VBO (you could even use a buffer object for the binding, since the API works so very well for the kinds of things you’ll try to do).
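As a rough sketch of that more VBO-like usage pattern: the buffer-object calls below are the real ARB_vertex_buffer_object API (assuming the entry points have been resolved), while the final "bind this buffer to the vertex shader" step is purely hypothetical, since no such extension exists:

    #include <GL/gl.h>
    #include <GL/glext.h>

    #define BONE_COUNT 64

    /* One-time setup: a buffer object sized for the skinning palette. */
    GLuint create_matrix_buffer(void)
    {
        GLuint buf;
        glGenBuffersARB(1, &buf);
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);
        /* STREAM_DRAW: we respecify the contents every frame */
        glBufferDataARB(GL_ARRAY_BUFFER_ARB, BONE_COUNT * 16 * sizeof(float),
                        NULL, GL_STREAM_DRAW_ARB);
        return buf;
    }

    /* Per-frame: upload the freshly animated matrices... */
    void update_matrix_buffer(GLuint buf, const float *matrices)
    {
        glBindBufferARB(GL_ARRAY_BUFFER_ARB, buf);
        glBufferSubDataARB(GL_ARRAY_BUFFER_ARB, 0,
                           BONE_COUNT * 16 * sizeof(float), matrices);

        /* ...then hand the buffer to the vertex shader for integer-indexed
           access.  This call is hypothetical; it stands in for whatever
           binding point such an extension would define. */
        /* glBindVertexShaderBufferEXAMPLE(buf); */
    }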

because in the end, vs and ps should and so could be implemented the very same way… that could mean for example in a fillrate intensive situation, you could use 6 of your 8 pipelines for the pixelshading, and 2 for the vertexshading… in a vertexintensive situation you could use 5 for vs, 3 for ps…

Yeah, this makes sense. Especially considering how they are on very different ends of the pipeline. And they would be processing data fed from different places. And 1001 other major differences between vertex and fragment shaders that make this a horrible idea from a hardware implementation standpoint.

Plus, for optimal performance, you want to pipeline vertex shaders like a CPU: deep pipelining with a sequence of instructions all being processed at once. For a fragment shader, you want to pipeline like pixel pipes: wide pipelining, with multiple copies of the same instruction being called at the same time. Why?

Because vertex programs must operate sequentially. The setup unit has to get each vertex in turn. It does no good to spit out 2 or 3 vertices at once; indeed, this is incredibly bad for a short vertex shader (shorter than it takes the setup unit to process 2 or 3 verts). Also, it complicates the setup logic, as it now must somehow know the order of these triangles. Each fragment of a single triangle, however, is completely independent of the others, so it makes sense to do them in parallel.

Nailing down the ‘assembly’ language is bad.

Odd. Intel, apparently, thought that this was a very good idea (until recently with IA64, but AMD is taking up the reins). Allow me to explain.

A CISC chip like most Intel chips works by emulating an instruction set. The P4 reads x86 instructions, converts them (using a program written in what they call ‘microcode’) into native instructions, and then executes those native instructions.

The idea behind this concept is that you can compile programs to a single assembly language that can be run on multiple generations of a processor. Which is why code compiled for a 286 still runs on a Pentium 4 (to the extent that the OSes allow it).

If you take the analogy to graphics cards, the card itself would be the underlying generation of hardware. The assembly would represent the x86 instruction set. The microcode is the assembler that runs when you create the program. So, really, SirKnight’s idea is nothing more than a modern instruction set.

Granted, ARB_vertex_program and ARB_fragment_program are not quite good enough to immortalize as a finalized ISA-equivalent. However, it is hardly fair to say that this idea is bad; after all, it is the basis of why your computer works today (unless you’re not using a PC).

You might say that graphics hardware is evolving faster than CPUs did. However, wanting to stick to the x86 ISA didn’t stop Intel from exposing MMX or SSE instructions; they were simply extensions to the ISA. Unless there is a foreseeable change that fundamentally alters how the assembly would look (outside of merely adding new opcodes), there isn’t really a problem.
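To ground the analogy in the actual API: under ARB_fragment_program the "ISA-level" text is handed to the driver, and the per-hardware translation happens when the program string is specified. A minimal C sketch, assuming a current GL context and resolved extension entry points; the program itself just passes the interpolated colour through:

    #include <string.h>
    #include <GL/gl.h>
    #include <GL/glext.h>

    /* The "portable ISA": the same text works on any card exposing the
       extension; the driver's back end turns it into whatever the card
       natively runs. */
    static const char fp_source[] =
        "!!ARBfp1.0\n"
        "MOV result.color, fragment.color;\n"
        "END\n";

    GLuint load_fragment_program(void)
    {
        GLuint prog;
        glGenProgramsARB(1, &prog);
        glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, prog);
        /* This is where the per-hardware "assembler" runs, analogous to
           the microcode translation step described above. */
        glProgramStringARB(GL_FRAGMENT_PROGRAM_ARB, GL_PROGRAM_FORMAT_ASCII_ARB,
                           (GLsizei)strlen(fp_source), fp_source);
        return prog;
    }

Whatever the underlying chips actually execute internally, this text is what the application ships.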

However, there is one really good thing that comes out of glslang being part of drivers: shader linking. Supposedly, you can compile two vertex shaders and link them such that one shader will call functions in the other. In a sense, compiled shaders are like .obj files, and the fully linked program is like a .exe.

Of course, with a rich enough assembly spec (richer than either of the ARB extensions), you could still have this facility, where you would give the driver an array of shaders to compile together. The assembly would have to retain function names in some specified fashion. At that point, granted, nobody will want to write code in the assembly anymore, but that’s OK.
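For reference, the linking model described above looks roughly like this through the ARB_shader_objects / ARB_vertex_shader interface. A minimal sketch that assumes the extension entry points have been resolved, skips all error checking, and uses two made-up shader sources where one object calls a function the other defines:

    #include <GL/gl.h>
    #include <GL/glext.h>

    /* "Object file" #1: calls a function it does not define... */
    static const char *caller_src =
        "vec4 shade(vec3 n);\n"                 /* resolved at link time */
        "void main() {\n"
        "    gl_Position   = ftransform();\n"
        "    gl_FrontColor = shade(gl_Normal);\n"
        "}\n";

    /* "Object file" #2: ...which this one provides. */
    static const char *callee_src =
        "vec4 shade(vec3 n) {\n"
        "    return vec4(max(dot(normalize(n), vec3(0.0, 0.0, 1.0)), 0.0));\n"
        "}\n";

    GLhandleARB link_vertex_shaders(void)
    {
        GLhandleARB a = glCreateShaderObjectARB(GL_VERTEX_SHADER_ARB);
        GLhandleARB b = glCreateShaderObjectARB(GL_VERTEX_SHADER_ARB);
        GLhandleARB prog;

        glShaderSourceARB(a, 1, &caller_src, NULL);
        glShaderSourceARB(b, 1, &callee_src, NULL);
        glCompileShaderARB(a);              /* separate "compile" steps...  */
        glCompileShaderARB(b);

        prog = glCreateProgramObjectARB();
        glAttachObjectARB(prog, a);
        glAttachObjectARB(prog, b);
        glLinkProgramARB(prog);             /* ...and one "link" step, much
                                               like .obj files into a .exe */
        return prog;
    }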

One of the very reasons for high level languages is the opportunity to eliminate diverse middle interfaces, to say goodbye to multiple codepaths, and still get the best possible hardware utilization.

So, why do you support glslang, when it clearly doesn’t offer this (as I have mentioned in other threads)? Outside of that library of yours that you are writing, which has very little difference from Cg’s low-end profiles, ultimately.

[This message has been edited by Korval (edited 07-27-2003).]

Originally posted by Korval:
For example, let’s say I bind a buffer of memory that contains matrices for skinning to a vertex shader. The way this should work is that it only takes integer values as arguments. Matrix 3.5 has no meaning. And a blend of the 16-float values that matrix 3.5 represents would be the absolute wrong thing to do.

Also, textures are not updated frequently. And, when they are, they are usually updated via a render-to-texture or a copy texture function, not from main memory data. However, 9 times out of 10, memory bound to a vertex shader is updated every time the shader is used. So, you don’t want to use the texture accessing functionality with it; instead, you want an API more akin to VBO (you could even use a buffer object for the binding, since the API works so very well for the kinds of things you’ll try to do).

Your skinning example supports the validity of GL2’s multi-index-array concept (criticized, and now abandoned). You could have 2 index arrays: one for vertices and one for matrices, effectively sharing each matrix between a group of vertices. Could be great for batching (as the GDC’03 document states, we may expect the importance of batching to increase with each new HW generation), and more powerful than packing the matrices into spare constant regs.
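A tiny C sketch of the data layout that two-index-array idea implies; since the GL2 interface for it was never finalized, the struct names and the CPU-side resolve loop below are purely illustrative:

    typedef struct { float x, y, z; }  Position;
    typedef struct { float m[16]; }    Matrix4;

    /* One index stream per attribute: vertices and matrices are indexed
       independently, so a whole group of vertices can share one matrix
       without duplicating it per vertex. */
    typedef struct {
        const Position *positions;
        const Matrix4  *matrices;
        const unsigned *position_indices;   /* one entry per drawn vertex */
        const unsigned *matrix_indices;     /* one entry per drawn vertex */
        unsigned        count;
    } MultiIndexedMesh;

    /* CPU-side resolve loop, standing in for what the hardware would do. */
    void transform_all(const MultiIndexedMesh *mesh, Position *out)
    {
        for (unsigned i = 0; i < mesh->count; ++i) {
            const Position *p = &mesh->positions[mesh->position_indices[i]];
            const Matrix4  *m = &mesh->matrices[mesh->matrix_indices[i]];
            /* column-major 4x4 times (x, y, z, 1) */
            out[i].x = m->m[0]*p->x + m->m[4]*p->y + m->m[8]*p->z  + m->m[12];
            out[i].y = m->m[1]*p->x + m->m[5]*p->y + m->m[9]*p->z  + m->m[13];
            out[i].z = m->m[2]*p->x + m->m[6]*p->y + m->m[10]*p->z + m->m[14];
        }
    }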

I’m with those who are waiting for süperb buffers. IMO it should be prioritized even over GLslang.

Yeah, right. Try running an x86 executable on a Mac. Then come back and try a more appropriate analogy for the point you wish to make. Sheesh.

Running an x86 exe on a Mac? What the hell are you talking about? I never said that a program written in some HL language like C++ would run on ANY CPU. Show me where I said that. I’d love to know.

I don’t think you understood a word I said. You are saying that I said things I did not.

You don’t compile C++ to x86 ASM, and then try and do a second compile step to produce a Mac binary.

Well duh! I never said you could do that! But if you are making a program to run on an x86 CPU, then when you compile it, it does compile to x86 assembly, and from there that converts to the x86 machine code to be executed. The whole CPU thing was just kind of a base; I never meant to say that how CPUs and GPUs are programmed are exactly the same in all ways, or that a program written in a HL language will magically work on every CPU in the world from just one compile.

There is no industry standard assembly representation that’d do justice to all targets. Everyone who tells you otherwise must have been smoking something hallucinogenic.

Nor did I say there were. Please stop saying I said things I did not and trying to put words in my mouth.

You know… I probably just assumed this, but I guess I should have stated that when I was talking about CPUs, I was mainly thinking about one single architecture. Like I said, that was just a base for what I was talking about; I never meant to cover everything about all CPUs all at once. I just wanted to show how a program goes from a HLL to an executable form on a CPU, to show why I like Cg’s way of doing a HLSL on a GPU and why I think that way is better in my opinion.

And again yes Korval is correct. He said my idea is nothing more than a modern instruction set. BINGO! This is what I was getting at. I’m glad someone understood all of that. Korval wins the gold star!

Ya, the instruction sets we have now, ARB_vp and ARB_fp, are not up to par yet for a standard modern instruction set, but it’s a good start, and obviously as GPUs get better, so will their instruction sets.

Now I would like to mention that the idea of a HLSL built into OpenGL is not stupid; it does have its good points and I understand these, it’s just not what I prefer.

-SirKnight

Originally posted by dorbie:
SirKnight, this shader extension is NOT in the core, it’s an optional ARB extension; it never got enough votes in the second vote to put it in the core.

This probably puts things in a different light for you.

Hm… you know, I could have SWORN I read it was put into the core. I see now, reading over it again, that it wasn’t. Ok then, nix that whole thing I said about it being in the core. Sorry about that, you’re right dorbie. Doh!

-SirKnight

Oh my. What have I done?
Code compatibility on x86 has historical reasons, I understand that. I wish to make the point that these reasons manifest in the form of ‘legacy’ code, code which has already been compiled down to the target.

I really like x86 a lot. It’s a wonderful, powerful and expressive ISA, though today it’s mainly some form of transparent code compression. But … you can’t deny that the required chip complexity to support this sort of legacy translation is overwhelming. Spot the execution units, if you can.

“Legacy code” and, along with it, the reasons for all this complexity can only be produced if the ISA is exposed and code is shipped precompiled. Because I prefer more execution units and flexibility over scheduling logic and rename buffers, I think the ISA should be tucked away somewhere.

This issue is all the more important, because in contrast to x86 implementations that have just two instruction paths (int and fp), graphics chips have already hit eight parallel execution pipes. Independent branch control logic for eight parallel, first class citizen OOO schedulers would surely be a major pita. By hiding the ISA, and shipping only high level code, this complexity can neatly be moved to software.

Just like I can create decent x86 and PowerPC code from a single C++ source. I don’t want to and I don’t need to know the ISA, if you follow me.

The x86 evolution is not necessarily a role model for programmable graphics hardware. I believe it shouldn’t be. That’s all.

This issue is all the more important, because in contrast to x86 implementations that have just two instruction paths (int and fp), graphics chips have already hit eight parallel execution pipes. Independent branch control logic for eight parallel, first class citizen OOO schedulers would surely be a major pita. By hiding the ISA, and shipping only high level code, this complexity can neatly be moved to software.

Remember, the equivalent of the microcode opcode translator on x86 chips is the driver’s compiler for the assembly language. So, the complexity for scheduling and so forth is in the software, not hardware. Also, while you may think that it is complex to write an optimizing assembler for the assembly language, it is more complex still to write an optimizing C compiler.

Oh, and, outside of discard actually stopping a pipe (which I seriously doubt will ever happen), why do the eight parallel pipes need to have independent branch control logic? They certainly don’t today, and they probably aren’t going to in the near future.

The x86 evolution is not necessarily a role model for programmable graphics hardware.

There are alternatives (glslang), but this kind of model is quite viable on its own. And, it gets the C-compiler out of the drivers.

Originally posted by zeckensack:
There is no industry standard assembly representation that’d do justice to all targets. Everyone who tells you otherwise must have been smoking something hallucinogenic.

GCC uses two passes. The front end compiles language X into “gcc assembly” and the back end compiles that into assembly for platform Y and does the low level optimizations. This industry standard assembly language works quite well.

When you develop your Java application you use a HL language which gets compiled into byte code “assembly” which the platform can either run, compile, or interpret.

MS’s C# also works this way. Visual Basic too?

Anyway GCC does a great job doing justice to all targets with its internal industry standard assembly.

This style allows you to have Cg, GLslang, RenderMan, the Stanford shading language, and even plain C (I believe the Codeplay guys took a look at Cg and its vertex programs and what they were doing for the PS2’s vertex shaders and asked why invent a whole new language; you can use C just fine for your vertex programs), which output an intermediate generic assembly which the driver can then optimize. Why tie ourselves to GLslang or Cg? What if Scheme is the ideal shader language?

I see no reason somebody can’t write Cg and GLslang front ends for GCC and a vertex/fragment_program backend and get rid of the need to have the compiler in the driver.

Originally posted by Korval:
Remember, the equivalent of the microcode opcode translator on x86 chips is the driver’s compiler for the assembly language.
The keyword here is “the assembly language”. There is no single agreed-upon internal instruction format in x86 land. Exposing the internal ROPs, µOPs, whatever of current processors would only create new backwards compatibility nightmares. An exposed assembly back end for high level compilers is comparable. As soon as you allow people to program to a low level ISA, you’re obliged to keep compatibility.

So, the complexity for scheduling and so forth is in the software, not hardware. Also, while you may think that it is complex to write an optimizing assembler for the assembly language, it is more complex still to write an optimizing C compiler.
Both are non-trivial tasks. Full blown high level compilers are more complex than cross-assemblers, I can agree with that. At the same time, they maintain more opportunities for hardware evolution. We’ll get to that in a second.

Oh, and, outside of discard actually stopping a pipe (which I seriously doubt will ever happen), why do the eight parallel pipes need to have independent branch control logic? They certainly don’t today, and they probably aren’t going to in the near future.
If dynamic branching ever becomes important, this will become interesting. Consider a fragment shader with a dynamic branch. Parallel pipes can go different ways through this branch, different loop iteration counts etc, so you either need to synch it all somewhere, or you need multiple control units (if you want to be efficient).

I strongly favor predication for graphics stuff, but there are different solutions to the issue (eg sorta like split f-buffers, suspending execution at the branch, spilling temporaries to two parts of the buffer; emptying both buffer regions applying their respective taken/not taken branch code). The truth is not out there yet.

However, if any one of these mechanisms is chosen, it affects the ISA definition, gets exposed, and creates the compatibility issue. If ‘the industry’ goes predication, it’ll make sense to expose predicate registers and predicated execution (similar to x86 flags and CMOV but more sophisticated) in the assembly interface. Otherwise a layered compiler couldn’t optimize for the hardware, or even couldn’t support branches at all.

If we go with ‘real’ branches, there needs to be a JMP instruction, a conditional jump and condition flags (btw, how many of them?).
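At the source level the two models might be sketched like this in C; t, texA and texB are made-up stand-ins for shader values:

    /* "Real" branch: each pipe would need its own control flow, since
       neighbouring fragments can take different sides. */
    float branched(float t, float texA, float texB)
    {
        if (t > 0.5f)
            return texA;
        else
            return texB;
    }

    /* Predicated form: every pipe executes both sides in lockstep and a
       predicate selects the result, so no per-pipe branch logic is needed. */
    float predicated(float t, float texA, float texB)
    {
        float p = (float)(t > 0.5f);        /* predicate: 1.0 or 0.0 */
        return p * texA + (1.0f - p) * texB;
    }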

Either way, hardware implementations must somehow support whatever the standard middle layer is. If there’s a new idea (or simply more resources) in any one IHV’s hardware, it cannot go into the middle layer for compatibility reasons. Just like you can’t fully use a GeForce 3’s fragment processing with PS 1.3. This, I think, is a major drawback that should be avoided. It’s still possible to avoid it.

Extending the middle layer will only work if all IHVs agree upon the improvement. This is DX Graphics turf, and it simply can’t do justice to everyone’s hardware simultaneously.

Originally posted by titan:
Anyway GCC does a great job doing justice to all targets with its internal industry standard assembly.
I love GCC, mostly because it occasionally beats the crap out of MSVC6, but this is simply not true.

If you want optimum performance on an Intel processor, you get ICC, period. GCC can’t compete, and I even think I’ve read on the mailing list that this internal “everyone’s equal here” is the root cause.

What we’re seeing with GCC vs ICC is an example of an IHV taking the responsibility to show off their own product. They know it best. They can optimize best for it. And most importantly: they’re the only ones with a real motivation.

But one further question: do you know the internal GCC representation? Can you code directly in this representation? Will it affect your code if that internal representation gets changed? Three times no, probably. GCC’s internals are free to evolve as needed because they are not exposed to users.

I believe the Codeplay guys took a look at Cg and its vertex programs and what they were doing for the PS2’s vertex shaders and asked why invent a whole new language; you can use C just fine for your vertex programs

That’s because the PS2’s vector units are mini-CPUs. They have memory. They have branching. They have all the facilities that C expects to be present.

Vertex programs may never have the facilities that C expects. Remember, the PS2’s VUs also function as command processors (deciding not just what to do with the given vertex data, but actually walking the vertex data lists); they need these facilities to even function. Vertex programs don’t have to perform these operations and, as far as I’m concerned, never should.

In any case, C is a reasonable solution for VUs. It’s not for programmable graphics hardware.

The keyword here is “the assembly language”. There is no single agreed upon internal instruction format in x86 land.

You don’t seem to understand. The assembly extensions we are debating would be akin to the x86 instruction set itself. When the driver is given this assembly, it compiles it into native opcodes. As such, there is a “single agreed upon internal instruction format in x86 land,” it’s called x86 assembly.

Parallel pipes can go different ways through this branch, different loop iteration counts etc, so you either need to synch it all somewhere, or you need multiple control units (if you want to be efficient).

Ew. Given these fundamental hardware problems (which I had not realized until now), maybe we won’t be getting branches in fragment programs for a while. I had been expecting this generation, but now I won’t be upset to have this pushed back for a generation or 2.

However, if any one of these mechanisms is chosen, it affects the ISA definition, gets exposed, and creates the compatibility issue.

There’s the fundamental question: why?

What is it about C that allows for these optimizations transparently that an assembly language would not allow for? Also, why is it that these facilities that allow for the transparent optimizations cannot be given to the assembly as well as to a C-like system? Remember, the assembly doesn’t have to closely resemble the final hardware instructions; it can have facilities that don’t look much like common assembly.

If you want optimum performance on an Intel processor, you get ICC, period. GCC can’t compete, and I even think I’ve read on the mailing list that this internal “everyone’s equal here” is the root cause.

First of all, Intel is in the best possible position to optimize code for their processors; for all we know, they may be sitting on some documents that would help GCC and VC++ compile better for Intel chips.

Secondly, it is highly unlikely that GCC’s notion of a middle-layer is what is slowing GCC down, compared to Intel. More likely, it is a fundamental lack of detailed knowledge of the architecture of the Pentium processor required to produce extremely optimized code.

Thirdly, GCC does a pretty good job.

But one further question: do you know the internal GCC representation? Can you code directly in this representation? Will it affect your code if that internal representation gets changed? Three times no, probably. GCC’s internals are free to evolve as needed because they are not exposed to users.

I’m sure somebody knows GCC’s internal representation. Everybody doing ports of GCC has to know, so it must be documented somewhere.

The format we are proposing would not be modified in a destructive way. That is, it would never remove functionality. Nor would it create alternatives to existing opcodes. When the format needs to be changed, it will be modified by adding new opcodes that do something completely different. Otherwise, it is up to the assembler to decide what to do with a given bit of code.

That is why it is important to pick a good assembly representation. Your concern is that you think we can’t pick a good one. I believe that we can, if we consider the possibilities carefully. If they had been working on this rather than glslang for the time it has been around, they would have worked all of the bugs out of the system, and there would be no need to be concerned.

Let’s look at the benefits of an assembly-based approach:

  1. Freedom of high-level language. We aren’t bound to glslang. If we, for whatever reason, don’t like it, we can use alternatives.

  2. Ability to write in the assembly itself.

The only benefit that glslang has is a potential one. It guarantees that you get optimal compiling from the high-level language. However, the assembly approach does not preclude this either. So really, as long as the assembly approach produces optimal hardware instructions, it is fundamentally superior to the glslang approach.

Originally posted by Korval:
Let’s look at the benefits of an assembly-based approach:

  1. Freedom of high-level language. We aren’t bound to glslang. If we, for whatever reason, don’t like it, we can use alternatives.

  2. Ability to write in the assembly itself.

The only benefit that glslang has is a potential one. It guarantees that you get optimal compiling from the high-level language. However, the assembly approach does not preclude this either. So really, as long as the assembly approach produces optimal hardware instructions, it is fundamentally superior to the glslang approach.

If all of this was true we’d all be programming in assembly instead of high-level languages.

A low-level vendor/platform-specific assembly interface is more than enough for control and optimization freaks, and you could still use Cg or whatever.

If you’re thinking about a general ISA for GPUs and assuming it’ll be future-proof, then you must realize that for that to become a reality you wouldn’t call it an ISA but a high-level language with complicated syntax.
I believe this is the case with gcc/msvc intermediate code, and Java bytecode.

The main problem here is to create a common interface that will last. And the better way to achieve this is using high-level languages and letting the drivers do whatever they want with it.

If all of this was true we’d all be programming in assembly instead of high-level languages.

You, clearly, do not understand the purpose of this discussion.

I’m not suggesting that everyone be forced to program in assembly. What we are suggesting is that a single ISA-equivalent exist that off-line (i.e., not in drivers) compilers can compile to as a target. That way, if you don’t like the glslang language, for whatever reason, you may freely use Cg, or something you create yourself.

If you’re thinking about a general ISA for GPUs and assuming it’ll be future-proof, then you must realize that for that to become a reality you wouldn’t call it an ISA but a high-level language with complicated syntax.

Any evidence of this? It’s easy enough to make a claim like this; do you have any actual facts to back it up?

The main problem here is to create a common interface that will last.

Which is precisely what the ARB could have been doing instead of debating features for glslang.

And the better way to achieve this is using high-level languages and letting the drivers do whatever they want with it.

Once again, you make these claims without any actual facts backing them up. I could just as easily retort, “No, it isn’t. The best way is to have off-line compiler compile to a common assembly-esque language.” But, of course, that isn’t a real argument; it’s a shouting match.

You, clearly, do not understand the purpose of this discussion.

Yes I do… I just happen to have a different opinion. I may not have experience with shaders, but I do know what they are and how they work. Especially the number of extensions and languages that have cropped up these last few years.

I’m not suggesting that everyone be forced to program in assembly. What we are suggesting is that a single ISA-equivalent exist that off-line (i.e., not in drivers) compilers can compile to as a target. That way, if you don’t like the glslang language, for whatever reason, you may freely use Cg, or something you create yourself.

But you said this about it:

So really, as long as the assembly approach produces optimal hardware instructions, it is fundamentally superior to the glslang approach.

Which may be true in the current generation. Do you know exactly what the future will bring us? Isn’t it better to leave optimal hardware instructions to their own specific extensions? As long as we have a general-purpose language, I can’t see the problem in that.

If you’re thinking about a general ISA for GPUs and assuming it’ll be future-proof, then you must realize that for that to become a reality you wouldn’t call it an ISA but a high-level language with complicated syntax.

Any evidence of this? It’s easy enough to make a claim like this; do you have any actual facts to back it up?

We are discussing the future… Do you have facts from the future?
Just see how many “assembly languages” and extensions were created since DX8. They had many things in common, but a general assembly language was too difficult to achieve. Wasn’t Cg created, among other things, to overcome these problems? And this is just starting to evolve!

Once again, you make these claims without any actual facts backing them up. I could just as easily retort, “No, it isn’t. The best way is to have off-line compiler compile to a common assembly-esque language.” But, of course, that isn’t a real argument; it’s a shouting match.

I don’t agree with x86 as the standard ISA for PCs that someone posted before. Things have changed a lot since the 8086 and, as you know, most code from those days won’t run properly on today’s systems and vice versa. Only the paradigm survived. And high-level code! And we’re talking about processors, not GPUs… A GPU generation lasts one year or two!

If you want to code shaders that run on DX10 hardware only, then your solution will be fine for the time being. Next year some vendor comes up with a way to speed up some special case and another general-purpose-one-generation-only assembly extension will come up…

Things have changed a lot since the 8086 and, as you know, most code from those days won’t run properly on today’s systems and vice versa.

Not quite right - 8086 code will run fine on a Pentium 4; as well as, if not better than, on the 8086. It might not be the fastest code for the Pentium 4, but it doesn’t need to be - as long as it’s faster.
Not only that, but converting that code to a near-optimal format for the Pentium 4 requires far, far less work than writing an optimizing C compiler. Most of the changes are simple look-ups!

Next year some vendor comes up with a way to speed up some special case and another general-purpose-one-generation-only assembly extension will come up…

That’s exactly what the drivers do! Do you truly believe that NV30 and R300’s native assembly language is ARB_fp? Surely not! They could run Java bytecode for all you know.

My point is that it’s up to the driver to perform the conversion, and there isn’t necessarily a one-to-one mapping between ARB_fp (or whatever other fp assembly) and the hardware’s native language.

Edit: typos

[This message has been edited by al_bob (edited 07-27-2003).]

Originally posted by al_bob:
Not quite right - 8086 code will run fine on a Pentium 4; as well as, if not better than, on the 8086. It might not be the fastest code for the Pentium 4, but it doesn’t need to be - as long as it’s faster.
Kidding? You’ve just given the archetypal example where you can more than quadruple (float) throughput by not using assembly. In fact, I just wanted to construct a similar example as a rebuttal for Korval.
If you hand x87 assembly nicely scheduled for a 486DX to a P4, you’ll lose. If you use the same high level code you should have used ten years ago to begin with, on an up-to-date compiler, you win.
You don’t care about that?
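The kind of case being described, as a generic illustration rather than anyone's benchmark: the same high-level loop, recompiled today, can be turned into four-wide SSE code, while assembly hand-scheduled for a 486's x87 unit stays scalar:

    /* Recompiling this source with an SSE-aware compiler can process four
       floats per instruction; x87 assembly frozen ten years ago cannot be
       retargeted that way without, in effect, a second optimizing compiler. */
    void scale(float *dst, const float *src, float k, int n)
    {
        int i;
        for (i = 0; i < n; ++i)
            dst[i] = src[i] * k;
    }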

Not only that, but converting that code to a near-optimal format for the Pentium 4 requires far, far less work than writing an optimizing C compiler. Most of the changes are simple look-ups!
Uhm. You extract parallelism back out of non-SIMD assembly to make SIMD assembly, that’s what you’re suggesting? That’s an optimizing compiler. A second one. Well, maybe you could call it a second compiler pass, but no, you wish to expose this layer to users, don’t you?

This would be all fine and dandy if it were one monolithic thingy. Saves you a parsing step and redundant error checking at a minimum. I’ve already portrayed more serious issues; it would sure be nice if somebody would answer my concerns. Why does a middle layer need to be defined and exposed?

That’s exactly what the drivers do! Do you truly believe that NV30 and R300’s native assembly language is ARB_fp? Surely not! They could run Java bytecode for all you know.
You know, they could even be MIMD or VLIW … cough NV_rc cough.

My point is that it’s up to the driver to perform the conversion, and there isn’t necessarily a one-to-one mapping between ARB_fp (or whatever other fp assembly) and the hardware’s native language.
Exactly!
Convert the code to the hardware’s native language. The conversion to any intermediate “this isn’t the real thing anyway”-language is completely devoid of any merit.

Java may benefit from this approach because the size of distributed code is a concern. Java also pays a very real performance penalty for it. Just like 486 assembly code incurs a penalty on P4s (despite the P4 spending a whole lot of transistors for legacy support, mind you).

“Traditional” software is distributed precompiled because of several issues I don’t even want to enumerate here, because none of them apply to shader code.

In case anyone overlooked it:
Why do we need to define and expose any sort of middle interface and layer an external compiler on top of that? Where are the benefits vs a monolithic compiler straight from high level to the metal?