Screw NVidia!

hehe. Ok, I admit, probably not every CPU. But the cheap CPUs most people have in their home machines (P3, Athlon…)


“pretty superscalar and pipelined” is exactly how 3D chips are designed. In fact, 3D is probably the most powerful example of a massive pipeline at work. The entire 3D pixel back end is a pipeline that can be implemented over hundreds of pipeline stages.

  • Matt

yer… graphics chips aren’t the ONLY thing that’s pipelined, tho. what i was referring to was that the whole virtual OpenGL pipeline is pipelined… (as opposed to, say, reusing the ALU for different stages, which would cause resource stalls pretty freaking frequently)

hundreds of pipeline stages sounds slightly enthusiastic. the idealised DLX pipeline is five stages, the p3 is somewhere in the ten-to-fifteen range, the p4 in the low twenties… each pipeline stage needs latches to propagate its results to the next stage… i’m a bit dubious about “hundreds” of stages…!


Why can’t a GeForce have hundreds of pipeline stages? GPUs are fundamentally different from CPUs because they are essentially dataflow machines: data gets pushed in one end and the results pop out the other. High cycle latency (one penalty associated with having deep pipelines) isn’t a big problem because “feedback loops” (like a CPU’s writeback stage, where results are stored back into the register file) aren’t nearly as significant.

This makes sense. A CPU is designed so that its computational flexibility comes from programmable functional units. GPUs really don’t need to work that way; functional units there just do their job and pass their results along. GPU functional units aren’t programmable like those in a CPU; instead, a GPU’s computational flexibility comes from programmable datapaths.
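Here’s a toy sketch of that feed-forward idea in plain C++ (everything in it is invented for illustration; it isn’t anything like real driver or hardware code). A fragment enters one end of a fixed chain of stages, each stage transforms it and hands it on, and because nothing ever flows backwards, adding stages only adds latency rather than creating hazards:

    #include <cstdio>

    // Toy model of a feed-forward pixel pipeline: data goes in one end and
    // results pop out the other, with no write-back to earlier stages.
    // All names are invented for illustration.
    struct Fragment {
        float r, g, b, a;   // color
        float depth;        // interpolated depth
    };

    // Each "stage" takes a fragment and hands a modified one to the next.
    typedef Fragment (*Stage)(Fragment);

    Fragment fogBlend(Fragment f)  { f.r *= 0.9f; f.g *= 0.9f; f.b *= 0.9f; return f; }
    Fragment alphaTest(Fragment f) { /* a real stage could discard here */ return f; }
    Fragment dither(Fragment f)    { return f; }

    int main() {
        Stage pipeline[] = { fogBlend, alphaTest, dither };   // real HW: far more stages
        const int numStages = sizeof(pipeline) / sizeof(pipeline[0]);

        Fragment f = { 1.0f, 0.5f, 0.25f, 1.0f, 0.5f };

        // Push the fragment through; adding stages only adds latency,
        // it never creates a hazard, because nothing flows backwards.
        for (int i = 0; i < numStages; ++i)
            f = pipeline[i](f);

        std::printf("out: %.2f %.2f %.2f (depth %.2f)\n", f.r, f.g, f.b, f.depth);
        return 0;
    }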

Besides, I think Matt has a pretty good idea of what’s going on inside them NVidia GPUs. Hundreds of stages is higher than I would’ve guessed, but I don’t think anyone here has any reason to doubt what he says.

Well, you’re comparing to a CPU, and graphics chips are designed quite a bit differently than CPUs.

[disclaimer: I am not a HW expert, I consider myself to have a fairly introductory level of knowledge about VLSI design at best]

You don’t have to use hundreds of stages, but you can. Specifically, you can think of each block in the OpenGL pixel pipeline as being at minimum one stage, probably more. Maybe you want a single adder or multiplier to be one stage. Maybe you want to even pipeline your multipliers.

Remember also that there are nice big long memory latencies that 3D pipelines cover up. Each cycle of memory latency is a pipeline stage.

So when you consider all the steps a pixel has to go through (this would be for a fairly direct implementation of OpenGL in HW):

  • interpolation of texcoords, colors, fog, etc.
  • texture wrap modes
  • texture address computation
  • texture lookup (memory access)
  • texture filtering
  • texture environment math (possibly several times sequentially, for each texture unit)
  • color sum
  • fog blend
  • AA application
  • polygon stipple application
  • line stipple application
  • alpha test
  • read color, depth, stencil as necessary (each may be a memory access; you also would want interlocks with the writes)
  • depth/stencil tests
  • blend
  • dither
  • logic op
  • color mask
  • write color, depth, stencil as necessary

Some of those might be just one stage, some might be five or ten; it all depends on the implementation.
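Just as a back-of-the-envelope tally (the per-block stage counts below are pure invention, not numbers from any real chip): give each of those blocks a few stages, count memory latency cycles as stages as described above, and you’re past a hundred pretty quickly.

    #include <cstdio>

    // Invented stage budget for the blocks listed above. None of these
    // numbers come from real hardware; they only show how per-block
    // stages plus memory latency add up.
    struct Block { const char* name; int stages; };

    int main() {
        const Block blocks[] = {
            { "interpolation (texcoords, colors, fog)",   4 },
            { "texture wrap + address computation",       4 },
            { "texture lookup, 2 units (memory latency)",40 },
            { "texture filtering",                        8 },
            { "texenv math, 2 units",                     8 },
            { "color sum + fog blend",                    3 },
            { "alpha test + stipple",                     2 },
            { "framebuffer read (memory latency)",       20 },
            { "depth/stencil tests",                      2 },
            { "blend, dither, logic op, color mask",      4 },
            { "framebuffer write (memory latency)",      20 },
        };

        int total = 0;
        for (unsigned i = 0; i < sizeof(blocks) / sizeof(blocks[0]); ++i) {
            total += blocks[i].stages;
            std::printf("%-44s %3d\n", blocks[i].name, blocks[i].stages);
        }
        std::printf("%-44s %3d\n", "total pipeline depth (made-up numbers)", total);
        return 0;
    }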

  • Matt

yep. i concede your point. When you said pipeline i immediately thought of… well, CPUs. I should have thought some more. =) (incidentally, i’ve only seen the “things” in dataflow models described as nodes… since a pipeline is inherently linear and dataflow isn’t. although it would be CLOSE to linear in a graphics context =)

cheers,
John

Well, 3DNow!+, SSE, and AltiVec (the Motorola G4’s vector engine) are all pretty nice, and they do speed things up a lot, sometimes by as much as 10x.
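For what it’s worth, here’s roughly the kind of inner loop that speedup comes from: a minimal SSE sketch that transforms one vertex by a 4x4 column-major matrix. It’s untuned and purely illustrative (the function name and layout are my own assumptions), but it shows the four-floats-at-a-time idea.

    #include <xmmintrin.h>   // SSE intrinsics
    #include <cstdio>

    // Transform one vertex by a 4x4 column-major matrix using SSE:
    // four multiplies/adds per instruction instead of one.
    void transform(const float m[16], const float in[4], float out[4]) {
        __m128 col0 = _mm_loadu_ps(&m[0]);    // first column of the matrix
        __m128 col1 = _mm_loadu_ps(&m[4]);
        __m128 col2 = _mm_loadu_ps(&m[8]);
        __m128 col3 = _mm_loadu_ps(&m[12]);

        __m128 r = _mm_mul_ps(col0, _mm_set1_ps(in[0]));
        r = _mm_add_ps(r, _mm_mul_ps(col1, _mm_set1_ps(in[1])));
        r = _mm_add_ps(r, _mm_mul_ps(col2, _mm_set1_ps(in[2])));
        r = _mm_add_ps(r, _mm_mul_ps(col3, _mm_set1_ps(in[3])));

        _mm_storeu_ps(out, r);
    }

    int main() {
        const float identity[16] = { 1,0,0,0, 0,1,0,0, 0,0,1,0, 0,0,0,1 };
        const float v[4] = { 1.0f, 2.0f, 3.0f, 1.0f };
        float o[4];
        transform(identity, v, o);
        std::printf("%f %f %f %f\n", o[0], o[1], o[2], o[3]);
        return 0;
    }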

There are good and bad hardware T&L implementations (the Savage 2000, for instance, was broken crap). I haven’t used a GF2, but I assume it’s faster/better.

The problem is still fill rate: all the high-end processors are still waiting on the gfx card, and while they’re waiting they can (and do) handle software T&L faster than many gfx cards can, and do much more besides, like AI and all that other good stuff. I think I read that by the end of 2001 no more CPUs under 1 GHz will be sold.

So how exactly do you go about taking advantage of the GeForce’s T&L engine?
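For what it’s worth, with OpenGL you mostly get it for free: the driver feeds the hardware T&L unit when you set your matrices and lights through the normal OpenGL state and submit untransformed vertices in bulk (vertex arrays plus glDrawElements, or display lists), instead of transforming them on the CPU and using immediate mode. A rough sketch, with the function name and data pointers as placeholders and context/window setup omitted:

    #include <GL/gl.h>

    // Sketch: let the driver/hardware do transform & lighting.
    // Key points: set matrices and lights through OpenGL state, keep
    // vertex data in arrays, and draw with glDrawElements rather than
    // transforming vertices yourself in immediate mode.
    void drawMesh(const GLfloat* positions,   // xyz per vertex, object space
                  const GLfloat* normals,     // xyz per vertex
                  const GLuint*  indices,
                  GLsizei        indexCount,
                  const GLfloat  modelview[16])
    {
        // Transform state: applied by the T&L unit, not your CPU.
        glMatrixMode(GL_MODELVIEW);
        glLoadMatrixf(modelview);

        // Lighting state: also evaluated per vertex by the hardware.
        glEnable(GL_LIGHTING);
        glEnable(GL_LIGHT0);

        // Submit untransformed vertices in bulk.
        glEnableClientState(GL_VERTEX_ARRAY);
        glEnableClientState(GL_NORMAL_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, positions);
        glNormalPointer(GL_FLOAT, 0, normals);

        glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, indices);

        glDisableClientState(GL_NORMAL_ARRAY);
        glDisableClientState(GL_VERTEX_ARRAY);
    }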