I’m just wondering how GL hardware treats fragment values.

If we for instance have 8 bits per component, the OpenGL spec says that 0 = 0.0 and 255 = 1.0. This means that multiplications have to look like c = (a*b)/255, and NOT (a*b)/256, which is MUCH simpler to do in hardware. So my question is: how is it done in hardware?
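To see why the distinction matters, here is a quick sketch in Python (standing in for the 8-bit integer math; the sample component values are just an illustration):

```python
# 8-bit components: 255 represents 1.0 per the GL spec.
a, b = 255, 200                 # a is "1.0", b is roughly 0.784

exact = (a * b + 128) // 255    # divide by 255, rounding to nearest
cheap = (a * b) >> 8            # divide by 256: just drop the low 8 bits

print(exact, cheap)             # 200 199 -- the shortcut loses a step,
                                # so multiplying by "1.0" no longer
                                # returns b unchanged
```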

Originally posted by HS: I guess it's (a*b)/256, because you can do that with a cheap shift, or even hardwire it (then it would be for free).

Yes, I know (in hardware you simply “rewire” the bus, discarding the lower bits). That is exactly my concern.

I was thinking about situations where you would want to interpret the framebuffer values in a custom way (say, use the alpha channel as an exponent). Then it’s very important that 1*x = x. If you want to do it right, I imagine you need another 9-bit multiplication (per component) or something to do the 1/255 scaling, which sounds costly.

Doing it the simple way by scaling with 255/256 also means that some multipass operations will slightly darken the image…
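As a rough illustration of that darkening (a sketch, assuming a hypothetical 8-bit pipeline that scales with a plain >> 8): modulating a full-intensity pixel by white loses one step per pass, even though mathematically nothing should change:

```python
c = 255                  # start at full intensity, i.e. 1.0
for _ in range(8):       # eight passes, each modulating by "1.0" (255)
    c = (c * 255) >> 8   # the cheap /256 scaling
print(c)                 # 247 -- eight passes, eight steps darker
```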

Actually, it's more likely to be a = b*255.0f rather than a = b/255.

I think that you are guaranteed that a*1.0f = a.
It is a necessary condition for invariance that I've read in many specifications (for instance, vertex programs have some conditions about it).

For an RGBA color, each color component (which lies in [0, 1]) is converted (by rounding to nearest) to a fixed-point value with m bits. We assume that the fixed-point representation used represents each value k/(2^m - 1), where k belongs to {0, 1, …, 2^m - 1}, as k (e.g. 1.0 is represented in binary as a string of all ones).
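In code, the conversion the quoted passage describes looks roughly like this (a sketch, with m = 8):

```python
def float_to_fixed(c, m=8):
    """Map a component in [0, 1] to an m-bit value k, which
    represents k / (2**m - 1)."""
    return int(round(c * (2**m - 1)))

def fixed_to_float(k, m=8):
    return k / (2**m - 1)

print(float_to_fixed(1.0))   # 255 -- a string of all ones
print(fixed_to_float(255))   # 1.0 -- so 1.0 is represented exactly
```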

First, I have no comments on actual HW implementations, nor do I even know exactly how a multiply is implemented.

For an example of how to deal with a 1/255 term, I would suggest looking at one of Jim Blinn’s books. He has some interesting fixed point/repeating fraction tricks in one of them.

As far as the spec itself goes, it really leaves a lot of freedom on how to do the internal computations, that is why you see all these floating point shaders coming about. It is only specific on how to convert to/from fixed point and float. All the pixel ops are pretty much spec’d to occur logically in clamped floats.

Ok, so 1.0*a = a should hold true in most cases (phew!). I think many algorithms would screw up otherwise…

Originally posted by jwatte: First, I have no comments on actual HW implementations, nor do I even know exactly how a multiply is implemented.

I don't know about how it's done in gfx hw either. It's always a tradeoff. A division is out of the question. A multiply by 1/255 (represented in some suitable finite form) might be manageable (since it's a constant, you can usually do pretty decent HW optimizations). Actually, 65536/255 = 257.00392…, which is 100000001.00000001… ~= 100000001 binary. Multiplying by 100000001 only requires one adder in hardware.
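Spelled out: 100000001 binary is 257, so the multiply reduces to one shift plus one add, and 255 * 257 = 65535 is exactly what makes (x*257) >> 16 a good stand-in for x/255. A quick sketch:

```python
def times_257(x):
    # binary 1_0000_0001: two partial products, i.e. a single adder
    return (x << 8) + x

# identical to a real multiply for every 16-bit input
assert all(times_257(x) == x * 257 for x in range(65536))
print(times_257(255))   # 65535 -- 255 * 257 fills 16 bits exactly
```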

You can probably do it as a LUT too, but for a heavily pipelined and parallelized architecture such as a GL chip there would have to be a huge load of such LUTs, taking up too much silicon, I guess.

Jwatte, I am interested in reading that article. I would appreciate it if you could point me to it (he wrote a couple of books and had his own column, “Jim Blinn’s Corner”, in IEEE Computer Graphics and Applications).

And yes, I looked at the IEEE specification again, and x*1.0f = x. I am just cautious when it comes to floating-point precision.

To tell the truth, I assumed that GPUs use the (a*b)/256 trick, since that was what I used in my software renderer years ago…

[This message has been edited by HS (edited 04-18-2003).]

I couldn’t find any on-line Blinn docs using Google (didn’t try that hard, though).

Anyway, I think the “trick” of multiplying by 257 is quite promising. Here’s the approximation:

We want:
C = (A*B)/255 (with rounding)

We do (in integer math):
C’ = A*B;
C = (C’ + (C’ >> 8) + 128) >> 8

The latter is correct for 99.96% of all combinations of A and B; the +128 term is there for rounding purposes. The required hardware is not even a full 16-bit adder, and I’m sure you can tweak it down to a very limited number of transistors (compared to the 8x8->16 multiplier, anyway).
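Since there are only 256*256 input combinations, the claim is easy to check exhaustively. A sketch in Python (whose shifts on non-negative integers behave like the C-style ones above); the 99.96% figure is the poster's own, so this only checks that mismatches exist but are rare, and off by at most one step:

```python
def approx_div255(cp):
    # C = (C' + (C' >> 8) + 128) >> 8
    return (cp + (cp >> 8) + 128) >> 8

mismatches = 0
for a in range(256):
    for b in range(256):
        cp = a * b
        exact = round(cp / 255)        # the ideal rounded result
        got = approx_div255(cp)
        if got != exact:
            mismatches += 1
            assert got == exact - 1    # never off by more than one step
print(mismatches, "of", 256 * 256, "combinations differ")
```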

The algo is simple enough to be used in software too, in my opinion.

It would be fun to know how it’s done in actual hardware.