Programmability… the double-edged sword!!

OpenGL 1.4 will provide a unified instruction set and a standard way of calling it; right now neither exists. To exploit arbitrary stages, resources, and branching capabilities across multiple platforms, you need to hide some of the implementation details. All sorts of things might comprise an instruction in a low-level graphics library: a LUT might supply an arbitrary function in a single operation, and a blend, alpha test, or stencil op might be a branch. Do you want to implement that LUT yourself using whatever strange hardware is available on three or more platforms, or do you want to request the function or operation in a higher-level language and have the guys at NVIDIA, ATI, and 3Dlabs implement it as best they see fit on their hardware?

You will get the basics in OpenGL 1.4, but there is a growing need for something more powerful that hides the implementation details.

[This message has been edited by dorbie (edited 05-06-2002).]

Zed: I’m not sure exactly what’s going on in BenMark5. From experience on machines with broken AGP, I would conclude that it keeps vertices in AGP memory instead of system memory. This might be because of the way BenMark creates vertex buffers, because of a DX7 limitation, or because of something in NVIDIA’s DX7 drivers.

Assuming this is indeed the case, it would make sense that a GL conversion using VAR and video memory yields a 10-15% gain. Even without VAR I’m sure good numbers are possible, just at a higher CPU cost, but you’re probably not measuring that.

Any COM overhead or anything else CPU-related wouldn’t really matter in BenMark; it should be entirely GPU-limited. Of course it’s hard to say anything definite about those two without actually looking at both pieces of code… so, basically, I’m just rambling now.

As mentioned, DX runs mostly in kernel mode on 2k/XP, and GL in user mode. This should give GL an advantage when call overhead does become an issue. (Note, however, that DX will not do a mode switch on every single call; it queues up commands and flushes them after a while, though at a cost, of course.)
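The queuing being described can be sketched roughly like this; every name and size here is a made-up illustration, not an actual DX driver structure. Calls append fixed-size commands to a user-mode buffer, and only a flush pays the expensive kernel transition:

```c
#include <assert.h>
#include <stddef.h>

/* Rough sketch of user-mode command batching.  All names are
 * hypothetical, not real DirectX driver internals. */
enum { QUEUE_CAP = 64 };

typedef struct { int opcode; int arg; } Cmd;

typedef struct {
    Cmd    cmds[QUEUE_CAP];
    size_t count;
    int    flushes;      /* counts expensive user->kernel transitions */
} CmdQueue;

static void flush(CmdQueue *q)
{
    if (q->count == 0) return;
    /* a real driver would cross into kernel mode once for the whole batch */
    q->flushes++;
    q->count = 0;
}

static void submit(CmdQueue *q, int opcode, int arg)
{
    if (q->count == QUEUE_CAP)   /* buffer full: pay the transition now */
        flush(q);
    q->cmds[q->count].opcode = opcode;
    q->cmds[q->count].arg    = arg;
    q->count++;
}
```

With a 64-entry buffer, 200 submits cost 4 transitions instead of 200, which is why the per-call overhead mostly disappears until flushes become frequent.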

Because of this expected cost I once converted a full game app from DX8 to GL using VAR.

Unfortunately I got pretty much exactly the same performance, even in cases dominated by thousands of small batches of polygons with a ton of render state changes. I’m still not quite sure why I didn’t get a nice performance boost. It seems like GL should be able to exploit the fact that it runs in user mode to produce at least a decent gain.

There is another, more abstract but possibly more powerful, reason. Everyone knows the biggest obstacle to selling fancy new hardware is the lack of new features used in games, and the reason games don’t use those features is the lack of market penetration by hardware that supports them. This is an obvious catch-22, and games like Doom 3 are too rare to break the pattern. Letting hardware developers implement new features under a higher-level language, so they can accelerate existing code of arbitrary complexity, helps break this impasse. Whether it works longer term depends on the design of the API.

So we have this language that is abstracted, except you already have me worrying about the memory issues between different types/sizes/speeds. I want to be sure that if I need, say, 129 bits, I’m on a 256-bit bus so I don’t end up with a two-read stall instead of one.

All I am trying to say is that it’s nice not to have to worry about these things. But unless every maker of 3D graphics chips (basically NVIDIA, ATI, and 3Dlabs) uses an almost identical core/memory bus/memory interface, I am really going to be concerned about the little timing nuances and the like.

As long as everything isn’t the same I will be concerned about the differences.

Why do the differences concern you so much? The days of cycle counting are coming to an end. If the same program runs in 4 cycles on one card and 5 on another, do you ultimately really care? Sure, maybe if you had access to (proprietary) documents on the hardware and the underlying microcode, you might be able to make the second one run in 3 cycles. Maybe. But then you have to hand-optimize for all of the possible hardware arrangements, not to mention deal with different APIs that expose similar features in different ways. I’m willing to sacrifice one cycle on one card (or on all of them) to avoid that, and to spend more time on the actual product.

In another case, maybe the implementation writers are actually competent at constructing their compilers and can optimize your code far better than you can. Maybe in the current release your code runs in 5 cycles, but with a new release it runs in 4 or 3. Or maybe, just maybe, even with the documentation and microcode access, you wouldn’t be able to beat their compiler, because their compiler (and the programmers who wrote it) are competent.

Your fears are grounded in the possibility that the driver writers suck, not in any actual facts. Don’t presume that you are superior to them. In all likelihood, their optimized compiled code will beat your attempt at hand-optimized microcode, if for no other reason than that they have direct access to the people who built the hardware.

I completely agree with you Koval…almost.

First, I definitely don’t think I am superior to the driver writers. I give NVIDIA a lot of credit for their drivers; in my mind it always has been, and probably always will be, one of the biggest selling points for their cards. At times I think they have sacrificed small amounts of speed here and there, but all in the name of stability and compliance. NVIDIA’s drivers conform to the OpenGL spec to the letter and are stable. That’s all I ever wanted in a driver.

As for caring that something takes 3 cycles on one card and 4 on another: I do care. Let’s remember what we optimize with hand-coded assembly. We optimize the inner loop, the big function that takes 65 percent of the entire time spent on one frame. That is the place where a minor 2-cycle change adds up really fast. I truly don’t know what the hardware will look like in the future, nor will I ever claim to. I’m just saying, for example: suppose some instruction, say a dot product whose result is used in a conditional, causes a stall of one cycle. No big deal, except that the card next to this one doesn’t have that stall. That shouldn’t matter, unless those couple of lines of code sit in the middle of the fragment shader. Now estimate how many times that one cycle gets wasted when we are trying to draw 250,000 triangles a frame.

Most stalls come from a branch causing memory thrashing of some kind. The memory system is going to have to be balanced with the chip extremely well; the most minor delay inside a function that is called two million times adds up really, really fast.
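To put a rough number on that, here is a back-of-the-envelope calculation (all figures are assumed purely for illustration: full-screen coverage at 1024x768, 60 frames per second, a 300 MHz core):

```c
#include <assert.h>

/* Back-of-the-envelope cost of a per-fragment stall.
 * All numbers are illustrative assumptions, not measurements. */
static double wasted_fraction(long fragments_per_frame, long frames_per_sec,
                              long core_hz, long stall_cycles)
{
    double wasted = (double)fragments_per_frame * frames_per_sec * stall_cycles;
    return wasted / (double)core_hz;  /* fraction of the core's cycle budget */
}
```

Under those assumptions a single wasted cycle per fragment burns around 47 million cycles a second, roughly 15% of the core’s budget, which is exactly the kind of difference being worried about here.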

Let’s just remember something: programmers write code the way they see it. The way the processor sees it is a completely different matter altogether.


P.S. I hope I didn’t get people too riled up with this post. I was just trying to get posts going in a rather slow week. ( :

Originally posted by davepermen:

and intelcompiler made my code slower than vc6! and vc7 is even faster than vc6… so who needs that crappy compiler?


This may be, but I had a different experience. I wrote a kind of learning vector quantization algorithm, which essentially takes a vector of dimension X (usually around 100), finds the distance to N other vectors of dimension X, and then moves the nearest vector a little bit towards the first.
I wrote this algorithm in standard C, and then I tested the VC6 3DNow!/SSE intrinsics. Depending on the CPU, different algorithms were benchmarked. And please take a seat: a 2000 MHz P4 took around 12 minutes for a benchmark (C code/SSE code), but a 650 MHz AMD took only 4 minutes (C code/3DNow!). On my 1333 MHz AMD I saw a difference of around 60% between C FPU code and 3DNow! code.
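Roughly, the inner loop being described is something like this plain-C sketch (all names are mine, not the original code; squared distance avoids the sqrt without changing which vector wins). This is the kind of loop the intrinsics and the Intel compiler were fighting over:

```c
#include <assert.h>
#include <float.h>
#include <stddef.h>

/* One learning-vector-quantization step, plain C: find the codebook
 * vector nearest to the input and pull it a little toward the input. */
static size_t lvq_step(const float *input, float *codebook,
                       size_t n_vectors, size_t dim, float rate)
{
    size_t best = 0;
    float  best_d2 = FLT_MAX;

    for (size_t i = 0; i < n_vectors; ++i) {
        float d2 = 0.0f;                        /* squared distance */
        const float *v = codebook + i * dim;
        for (size_t j = 0; j < dim; ++j) {
            float d = input[j] - v[j];
            d2 += d * d;
        }
        if (d2 < best_d2) { best_d2 = d2; best = i; }
    }

    float *w = codebook + best * dim;           /* move winner toward input */
    for (size_t j = 0; j < dim; ++j)
        w[j] += rate * (input[j] - w[j]);
    return best;
}
```

The distance loop is a textbook SIMD target, which is why the FPU/3DNow!/SSE gap was so large.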
But then I tested the Intel compiler’s evaluation version: it was able to bring the FPU code up to the speed of my 3DNow! and SSE code.

Perhaps my case is not representative, but I think you can see that a good compiler can deliver much better performance.


Devulon, with what you said in mind, do you also write different code for each type of CPU (i.e. Pentium2/3/4, Athlon, …)? They all have different architectures, you know, so they will make the same code run at different speeds.

Let’s face it: by the time you’ve finished writing optimal code for each CPU (or GPU), a new generation will have already appeared. Life’s too short to program at such a low level these days.

– Tom

The key difference is that most people aren’t running all the different architectures. By that I mean nobody is running a 386 or a Pentium or even a Pentium Pro; I write for the PIII/Athlon. The current level of x86 processors (PIII/P4/Athlon) have very similar memory latencies, and cache hits/misses are quite predictable. Besides, for 99 percent of your code you really don’t care anyway. The graphics cards/chips are all going to be quite different, or at least I imagine they will be. But the base vertex/shader programs are really important, since they get executed so many times. Hardware fixed-function T&L runs at pretty much the same speed on all graphics cards because it is fixed function. Programmability simply adds a layer of slight unpredictability.


no, it doesn’t…

and coding for amd/intel is not the problem… coding for 500mhz and 2gig is more of a problem…

and that is where gpu’s differ…


not speed on some instruction…
there is no card with a fast dp3 and one with a slow dp3,
cause they would lose all the benches otherwise…

they will be faster or slower than the other ones

oh, and developing for intel/amd means developing for different ram speeds, different cache sizes, different access speeds for hard drives, different agp speeds for reading/writing, and different gpu’s… say you want to optimize something to the last cycle, you’d have to code it anew for every pc, or you take some interface, like gl or dx for the gpu, to do the job for you…

you don’t get more abstraction… you get a nicer interface…

look at the specs: they define the set of “functions” the language will have… this set of functions is in fact the set of instructions such a processor has to support… each of those instructions should be very fast and, at best, equally fast…

but there is no difference if i write

ADD r0, r1, r2

or

r0 = r1 + r2

the second is just nicer

it’s not more abstraction… and the parser is open source, so you can even look at your “asm” output…

and hey… the whole design of the p10, for example, is to make it easy for the developer… the chip does all the parallelism for you, you don’t even really have to care… it will take all the resources it can find… no need to care…

and we don’t need 100% of the gpu to have good graphics (that isn’t even possible anymore on new hardware, as you can see if you take a look at VAR benches on nvidia hardware, for example…)

I agree there really is only one way to do a DP3 instruction, and I definitely expect it to be the same speed on all systems. I can tell you exactly how addition is done in a microprocessor; it’s all the same regardless of who makes it. Anyone with any understanding of processor design can show you the exact layout of transistors for just about every basic operation. What concerns me most is branching and conditionals, and the fact that when you put multiple instructions together, a dependency is formed. For example, say we are doing some simple cartoon shading: dot the normal with the light vector and use the result as a lookup into a 1D texture map. And let’s say I want to do two dots and two lookups. Forgive the pseudocode.

temp1 = normal1 . light;
lookup_texture1[ clamp(temp1) ];
temp2 = normal2 . light;
lookup_texture2[ clamp(temp2) ];

Let’s say the dot product takes the same time (number of cycles) as the clamp and the lookup. The first lookup gets stalled waiting for the first dot product; that’s unavoidable. What the “compiler”/processor must recognize is that the second dot product does not need to wait for the first clamp; the two are not dependent. What I don’t want is the second dot product waiting for the first clamp, which is waiting for the first dot product. A better way to write this would perhaps be:

temp1 = normal1 . light;
temp2 = normal2 . light;
lookup_texture1[ clamp(temp1) ];
lookup_texture2[ clamp(temp2) ];

In this case the use of temp1 doesn’t occur until after the second dot product, which helps avoid the stall. The first lookup will probably still stall slightly waiting for the first dot product, but by the time the first lookup is done, the second dot product should be ready to go, assuming the timings stated above.

Depending on a lot of things, the two should in the end run the same. But in reality they don’t have to. Using the second form helps the compiler see that the two dot/lookup pairs are completely independent.

Anyone who has ever tried to write a software renderer fully understands the importance of timing within groups of instructions. Everyone knows that division takes a very long time, so don’t do a division and then try to use the result right away; find something else to do (that isn’t dependent on the division) to fill the time. That is what parallelism is: being able to do other operations while the division is taking place.
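The division point can be illustrated like this; the cycle timings are hypothetical, and the two functions compute the same value, differing only in how much independent work sits between the divide and the first use of its result:

```c
#include <assert.h>

/* Illustrative only: same math, different dependency distance.  In the
 * second version, independent work is placed between the divide and the
 * first use of its result, giving a pipelined core room to hide the
 * divide's latency. */

static float naive(float a, float b, float c, float d)
{
    float q = a / b;      /* long-latency divide */
    float r = q * 2.0f;   /* uses q immediately: stalls behind the divide */
    float s = c + d;      /* independent work, done too late to help */
    return r + s;
}

static float scheduled(float a, float b, float c, float d)
{
    float q = a / b;      /* start the divide... */
    float s = c + d;      /* ...and do independent work while it runs */
    float r = q * 2.0f;   /* by now the divide has (hopefully) retired */
    return r + s;
}
```

On an in-order machine the second form wins; an out-of-order core can often do this reordering by itself, which is exactly the hope about the chip’s decode unit below.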

Sorry for the rather vague reply here; it’s kind of hard to explain all of this. But I am sure you get the idea, davepermen.

I don’t know if these kinds of things are going to be an issue or not. It’s something that, as a programmer, I may not have to worry about at all. I can only hope that the chip itself has a really nice decoding unit that fills the pipes in the best manner. Heck, this could just be a nightmare I am making for myself. Then again, maybe not.


you know what?
there are tons of parallel pipelines in the chips, and why are they there?
to take care of such call_it_“irregularities”

meaning, if a texture lookup takes longer than expected, only that one pipeline stalls till it gets its pixel; meanwhile all the other pixels keep being processed, and even if the whole thing then takes longer, because they are done in parallel the final output comes at the same speed (you can see this on geforces and radeons… no texture versus several big textures is not really a speed drop, except when you add more texture-env stages, which is another topic)

same for branches… even with branches, the pixels can be processed independently of each other, and so the parallelism stays the same…

just take the statement of my ex-girlfriend and true love:
(different sense, but it works…)

Originally posted by Devulon:
I don’t know if these kinds of things are going to be an issue or not. It’s something that, as a programmer, I may not have to worry about at all. I can only hope that the chip itself has a really nice decoding unit that fills the pipes in the best manner. Heck, this could just be a nightmare I am making for myself. Then again, maybe not.

if you read more about the p10, you’ll see that the whole power of the p10 is that IT DOES ALL THIS JOB FOR YOU

it fills the pipes as well as it can, and you don’t have to worry about it…

Devulon, most people ARE running different architectures; you have to look at the graphics card in the system, not the processor instruction set, when it comes to shader compilers. Even chips from the same manufacturer require different code paths.

Maybe I am just sad. I feel like I am losing a friend. I used to get to do all the fancy math and stuff myself, and now the damn video card is stealing that away from me. There was a time when games all had crappy graphics, and when a game looked good, it was because you programmed it that way. The more the video card does, the more I feel like the purity of the do-it-yourself attitude is being lost. I definitely think video cards are heading in the right direction. I just miss the old ways.

To hell with this I am going to write audio codecs from now on.

Have fun guys and don’t hurt yourselves.


There’s lots of hard stuff left to do, even in graphics; you could focus on implementing great detail-preserving simplification, for example :-). Using OpenGL has never been more math intensive, but then you never had to do the really low-level stuff with OpenGL. I think the key is to shift your attention to the next frontier, which constantly changes.

>>>and hey… the whole design of the p10 for example is to make it easy for the developer… the chip does the whole parallelism for you, you don’t even have to care about really… it will take all the resources it can find… no need to care…<<<

This isn’t something new; pretty much all modern general-purpose processors have dynamic issue capability. It makes the die larger due to the added complexity, but the result is that even crappy code can run nearly as fast as software-pipelined code.

With graphics cards, we don’t need to care about technical details of processors and that’s the way it should be. For educational purposes, I would say otherwise.

I certainly agree with Devulon that part of our task is being taken away by the GPU. That’s the way I felt when I first began using OpenGL, but for the sake of hardware acceleration and a nice clean abstraction layer… well, I love gl now.

What about this: we will all be coding pretty much similar special effects, per-pixel lighting, bump mapping, normal maps and so on. We will end up with similar features, so why have programmable hardware for them? My first concern was that it would be slower than a hardwired circuit. Might as well have those “basics” good and ready.


well… they are taken onto the gpu… so what?
rasterizing went away when it got into hardware, cause you can’t code it yourself anymore…
but all the stuff now moving onto the gpu you still have to code, as you did before! just on the gpu, not on the cpu…
it’s a second processor, nothing more and nothing less…

so what?

you don’t lose a friend; your friend changed place, and got stronger… much stronger…

i agree with dorbie (btw he’s a star, read his name in the news the other day) + davepermen. graphics programming hasn’t become any easier. yeah, getting a triangle onto the screen is easier than before, but now instead of a decal triangle, that triangle is displaced, with per-pixel lighting + shadows + whatnot else. the bar keeps getting higher

If you feel that you need control, you might write a software renderer, in which case you’re not limited by a gfx API. You might even use special 3D CPU instructions to speed up the app. Just because hardware 3D is mainstream doesn’t mean you can’t do 3D in software. It would be a good challenge for you, and I think you’re looking for one.

Actually, I’ve now switched from the DX built-in lighting model to per-pixel lights, and it’s much more math intensive since I have to do everything myself. I don’t think a higher-level interface to shaders takes anything away from the complexity, since the burden of creativity now lies on your shoulders. I like programmability in the GPU, as well as the higher abstraction level, which allows for higher complexity in our apps.

uhm, well… if you read carefully, this ALL got programmable on the p10 (about all… )

and you know what? i don’t care… as i’ll move to raytracing on the next hardware anyway… and never look back… (and for doing raytracing on rasterizing gpu’s, you have to know your gpu, you have to know rasterizing, and you have to know raytracing down to the lowest levels… so what? the experience is still there…)