I couldn’t find any on-line Blinn docs using Google (didn’t try that hard though).
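Anyway, I think the “trick” of multiplying by 257 is quite promising. It works because 255*257 = 65535, so 1/255 is almost exactly 257/65536, and multiplying by 257 is just a shift and an add. Here’s the approximation: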
We want:
C = (A*B)/255 (with rounding)
We do (in integer math):
C’ = A*B;
C = (C’ + (C’ >> 8) + 128) >> 8
The latter is correct for 99.96% of all combinations of 8-bit A and B; when it misses, the result is off by one. The +128 is there for rounding purposes. The required hardware is not even a full 16-bit adder, and I’m sure you can tweak it down to a very limited number of transistors (compared to the 8x8->16 multiplier anyway).
The algo is simple enough to be used in software too, in my opinion.
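For anyone who wants to try it in software, here is a rough C sketch of the formula above, plus a brute-force check against plain rounded division. The function names (div255_approx, div255_exact, div255_ref, count_mismatches) are my own, not from any library. The second variant adds the 128 before the >> 8 term is taken; as far as I can tell that removes the remaining off-by-one cases, but treat it as something to verify with the counter rather than a given.

```c
#include <stdint.h>

/* The approximation as written in the post: 128 is added after
   the C' >> 8 term has been computed. */
static uint8_t div255_approx(uint8_t a, uint8_t b)
{
    uint32_t c = (uint32_t)a * b;              /* C' = A*B, at most 65025 */
    return (uint8_t)((c + (c >> 8) + 128) >> 8);
}

/* Assumed variant: add 128 first, then fold in the >> 8 term. */
static uint8_t div255_exact(uint8_t a, uint8_t b)
{
    uint32_t t = (uint32_t)a * b + 128;
    return (uint8_t)((t + (t >> 8)) >> 8);
}

/* Reference: A*B/255 rounded to nearest, using a real division.
   255 is odd, so there are no ties and +127 is enough. */
static uint8_t div255_ref(uint8_t a, uint8_t b)
{
    return (uint8_t)(((uint32_t)a * b + 127) / 255);
}

/* Count the (A, B) pairs, out of 65536, where f disagrees with the reference. */
static int count_mismatches(uint8_t (*f)(uint8_t, uint8_t))
{
    int n = 0;
    for (int a = 0; a < 256; a++)
        for (int b = 0; b < 256; b++)
            if (f((uint8_t)a, (uint8_t)b) != div255_ref((uint8_t)a, (uint8_t)b))
                n++;
    return n;
}
```

One concrete miss: A = 253, B = 191 gives A*B = 48323, which is 189.502 after dividing by 255 and should round to 190, but the approximation returns 189. Calling count_mismatches(div255_approx) shows how many such pairs there are in total.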
It would be fun to know how it’s done in actual hardware.