OGL2.0: programmable texture filters

“Fragment shaders should be implemented as a basic subset of operations that an implementation must allow, followed by additional functionality or generality that an implementation may provide.”

That would repeat several bad things: the explosion of extensions, and the groovy situation in current DX8 shaders where functionality goes unused because the champ (the market-leading hardware) doesn’t have it.

There may be a rather cheap approach for hardware vendors to allow arbitrary functionality, and AFAIK it’s also mentioned in the drafts: automatically fall back to multi-passing. If the driver can figure out how to break the shader code up into multiple passes, everything would work seamlessly again: no mucking around with display lists or resending geometry. It could all be handled on the hardware or in the driver layer, and from the application’s perspective it wouldn’t matter how it’s handled.
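To make that division of labor concrete, here’s roughly what the driver would have to do behind a single glDrawElements call when it splits a shader in two. This is only a sketch in plain GL 1.x terms: the setup_pass*_state() helpers and the Mesh struct are made up, standing in for whatever per-pixel configuration and geometry each partial pass needs.

[code]
#include <GL/gl.h>

typedef struct { GLsizei num_indices; const GLuint *indices; } Mesh;

/* hypothetical helpers: texture bindings / combiner setup for each
 * half of the split-up shader */
void setup_pass0_state(void);
void setup_pass1_state(void);

/* What one application-level draw call turns into when the driver
 * falls back to two passes. */
void draw_multipassed(const Mesh *mesh)
{
    /* pass 1: first half of the fragment computation, depth written */
    setup_pass0_state();
    glDepthFunc(GL_LESS);
    glDisable(GL_BLEND);
    glDrawElements(GL_TRIANGLES, mesh->num_indices,
                   GL_UNSIGNED_INT, mesh->indices);

    /* pass 2: same geometry again, combined with pass 1 in the
     * framebuffer -- the geometry resend is where the cost hides */
    setup_pass1_state();
    glDepthFunc(GL_EQUAL);              /* only touch pixels pass 1 wrote */
    glEnable(GL_BLEND);
    glBlendFunc(GL_DST_COLOR, GL_ZERO); /* e.g. modulate pass 1's result */
    glDrawElements(GL_TRIANGLES, mesh->num_indices,
                   GL_UNSIGNED_INT, mesh->indices);
}
[/code]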

Automatically resorting to multipass smacks of “blindingly slow”. Not only does the CPU have to be synchronized with the GPU (so that it can send the same geometry multiple times with different sets of per-pixel functionality), but the driver must set the per-pixel functionality multiple times for each primitive.

At the very least, one should be told when uploading such a program that it may be “blindingly slow”.

There’s already a function you can call that will tell you whether a given shader will be multipass.
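I don’t have the draft in front of me, so the token name below is made up; only the usage pattern matters. After loading a shader, you’d ask the driver how many passes it compiled down to and warn if it went multipass:

[code]
#include <stdio.h>
#include <GL/gl.h>

/* hypothetical token -- the real name and value in the final API
 * will certainly differ */
#define GL_CURRENT_FRAGMENT_SHADER_PASSES_HYP 0x9999

/* Returns nonzero (and complains) if the currently bound fragment
 * shader compiled down to more than one hardware pass. */
int current_shader_is_multipass(void)
{
    GLint passes = 1;
    glGetIntegerv(GL_CURRENT_FRAGMENT_SHADER_PASSES_HYP, &passes);
    if (passes > 1)
        fprintf(stderr, "warning: shader needs %d passes, "
                        "expect it to be slow\n", (int)passes);
    return passes > 1;
}
[/code]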

j

Originally posted by Korval:
Automatically resorting to multipass smacks of “blindingly slow”. Not only does the CPU have to be synchronized with the GPU (so that it can send the same geometry multiple times with different sets of per-pixel functionality), but the driver must set the per-pixel functionality multiple times for each primitive.

I don’t think it’s that bad. Multipassing will always cause a performance hit, yes, but I don’t think it must always involve pipeline flushes if the system is laid out correctly. Maybe even next-gen cards can have a few little hooks that make this reasonably useful.

[groovy underinformed theory mode]
The driver detects that the current fragment program will exceed the instruction limit. No problem: it just daisy-chains the four available pixel pipelines together. Fill rate accordingly drops to 25%, but nothing has to be flushed, no state has to change between ‘passes’, and everything stays fully pipelined.
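In toy form, and with a made-up per-pipe instruction limit, the splitting looks like this: the too-long program is cut into consecutive segments, one per pipe, and every fragment flows through all of them in order.

[code]
#include <stdio.h>

#define NUM_PIPES  4
#define PIPE_LIMIT 8   /* made-up per-pipe instruction limit */

typedef float (*Instr)(float);  /* stand-in for one fragment op */

/* Run one pipe's segment of the program on a fragment. */
static float run_segment(const Instr *seg, int n, float frag)
{
    for (int i = 0; i < n; i++)
        frag = seg[i](frag);
    return frag;
}

/* Daisy-chain: cut the program into consecutive PIPE_LIMIT-sized
 * segments and send the fragment through up to NUM_PIPES of them.
 * Four pipes now produce one fragment stream instead of four, hence
 * the 25% fill rate, but no flush and no state change in between. */
static float run_chained(const Instr *prog, int len, float frag)
{
    for (int pipe = 0; pipe < NUM_PIPES; pipe++) {
        int start = pipe * PIPE_LIMIT;
        int n = len - start;
        if (n <= 0)
            break;
        if (n > PIPE_LIMIT)
            n = PIPE_LIMIT;
        frag = run_segment(prog + start, n, frag);
    }
    return frag;
}

static float add_half(float x) { return x + 0.5f; }
static float halve(float x)    { return x * 0.5f; }

int main(void)
{
    Instr prog[12];  /* 12 instructions > PIPE_LIMIT, so two pipes are used */
    for (int i = 0; i < 12; i++)
        prog[i] = (i & 1) ? halve : add_half;
    printf("result: %f\n", run_chained(prog, 12, 1.0f));
    return 0;
}
[/code]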

When it gets even uglier than that, a chunk of silicon that caches the vertex shader’s output would be nice. To reduce turnaround penalties for the fragment processor (the configurations for the multiple passes should of course be stored on the chip), it could also have an output FIFO, so that it could loop bursts of a few hundred pixels back to its inputs, without ever going to the framebuffer, after flipping its configuration. The flipping should work in waves, i.e. as soon as the first of the four fragment processors runs empty, it should reconfigure for the next pass and start working on it, taking data from the output FIFO. The performance drop at this point would be rather steep, but at least it’s handled without software intervention, and pipeline bubbles are somewhat kept under control.
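In the same underinformed spirit, a toy model of the loopback: a fragment that still needs more passes goes into an on-chip FIFO instead of the framebuffer, and gets picked up again under the next pass’s configuration. One unit instead of four, and no wave scheduling, but the data flow is the point.

[code]
#include <stdio.h>

#define NUM_PASSES 3
#define FIFO_SIZE  256

typedef struct { float value; int pass; } Fragment;

static Fragment fifo[FIFO_SIZE];
static int head = 0, tail = 0, count = 0;

static void push(Fragment f)
{
    fifo[tail] = f;
    tail = (tail + 1) % FIFO_SIZE;
    count++;
}

static Fragment pop(void)
{
    Fragment f = fifo[head];
    head = (head + 1) % FIFO_SIZE;
    count--;
    return f;
}

/* stand-in for "run pass n of the split shader on this fragment" */
static float run_pass(int pass, float v)
{
    return v * 0.5f + (float)pass;
}

int main(void)
{
    for (int i = 0; i < 4; i++)        /* four incoming fragments */
        push((Fragment){ (float)i, 0 });

    while (count > 0) {
        Fragment f = pop();
        f.value = run_pass(f.pass, f.value);
        if (++f.pass < NUM_PASSES)
            push(f);                   /* loop back, never touches the framebuffer */
        else
            printf("fragment done: %f\n", f.value);
    }
    return 0;
}
[/code]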

That may require a lot of silicon. Earlier I thought the best approach would be to just implement a massive SIMD machine with predication for conditionals and push eight or sixteen fragments through at a time; what I said above, however, breaks the viability of that approach. But heck, what do I know about pipelines and processor design anyway …
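For what predication means here, a minimal sketch: every lane executes both sides of the branch, and a per-lane mask picks the result, so no fragment ever actually branches.

[code]
#include <stdio.h>

#define WIDTH 8   /* eight fragments in lockstep */

int main(void)
{
    float x[WIDTH] = { -3.0f, 1.0f, -2.0f, 4.0f, 0.5f, -1.0f, 2.0f, -4.0f };
    float then_r[WIDTH], else_r[WIDTH], result[WIDTH];
    int   mask[WIDTH];

    /* "if (x < 0)": compute the predicate for all lanes */
    for (int i = 0; i < WIDTH; i++) mask[i] = (x[i] < 0.0f);

    /* both branches execute for every lane... */
    for (int i = 0; i < WIDTH; i++) then_r[i] = -x[i];        /* then: negate */
    for (int i = 0; i < WIDTH; i++) else_r[i] = x[i] * 2.0f;  /* else: double */

    /* ...and the mask selects per lane; no lane ever branched */
    for (int i = 0; i < WIDTH; i++) result[i] = mask[i] ? then_r[i] : else_r[i];

    for (int i = 0; i < WIDTH; i++) printf("%g ", result[i]);
    printf("\n");
    return 0;
}
[/code]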

At some crossover point, the silicon real estate of the functional units (mostly multipliers) will probably outweigh the overhead of separate control logic, and it would work again.
[/groovy underinformed theory mode]

I don’t think implementing a fast, general-purpose fragment processor is an easy task. It will require lots of transistors just to get started, and it should be clear that a complex shader will always come with a performance penalty. But with some clever hacks, we may get modestly performing 2.0 implementations quite early. Just don’t expect to be able to run a million lines of code on it …