DirectX Next

Current framebuffer pixels could easily be prefetched early, whenever they’re required inside a fragment program. They have to be fetched for fixed function blending, too, after all. You’d just need to do it a tad earlier to hide latencies.

Multiple instances of the same (x/y) fragment shouldn’t be in the pipe at the same time anyway. That’s an extreme ordering hazard and would probably just blow up in your face (again: same thing with fixed function blending).

I can’t find a reason why that should be hard to implement at reasonable performance levels. Fixed function blending costs bandwidth, programmable blending costs bandwidth and … what?

edit:
Int to float conversion. Is that a good reason?


Originally posted by zeckensack:
Current framebuffer pixels could easily be prefetched early, whenever they’re required inside a fragment program. They have to be fetched for fixed function blending, too, after all. You’d just need to do it a tad earlier to hide latencies.

Doesn’t work. Say you are rendering to the same pixel twice with two consecutive triangles (think overlapping, blended particle effects). If your shader execution is multithreaded, then you’re in for a heap of trouble if you start changing a pixel for the second triangle before the first is complete. That’s as much as I’ll say.

Multiple instances of the same (x/y) fragment shouldn’t be in the pipe at the same time anyway. That’s an extreme ordering hazard and would probably just blow up in your face (again: same thing with fixed function blending).

No, it’s not a problem at all if you don’t allow frame buffer access within the shader.

I can’t find a reason why that should be hard to implement at reasonable performance levels. Fixed function blending costs bandwidth, programmable blending costs bandwidth and … what?

It’s not just programmable blending. You’re moving a whole chunk of the pipeline (i.e. the blending unit) into the shading unit. Blending can normally be done independently of shading. That won’t be the case if you allow framebuffer access in the shader.

There is no example I can give that can satisfy Korval, because he will always be able to come up with some other multi-pass method of doing the same thing.

That’s not true.

If you were to argue for the inclusion of arbitrary texture access in a vertex program, as opposed to GeForce 1 technology (or even NV20-level stuff), you could make a compelling case for it. You could say that:

1: Without hardware support, implementing this is, at best, prohibitively expensive.

2: With hardware support, a vast number of possibilities present themselves. From good shadow mapping to EMBM to a wide variety of other, genuinely useful, visual effects.

The reasoning for the feature is both clear and convincing. Each effect might be doable in another way, but the sheer quantity of effects that this allows, coupled with the painful nature of the alternatives, makes this feature almost self-justifying.

Can you say the same for the arguments you have postulated here? So far, we have some fog (whether it is water or atmosphere, it is the same effect), and some nebulous “I can’t count on both hands how many times I’ve wished I had this feature” kinds of things, which can’t really be evaluated on a case-by-case basis.

Would I mind it if my R400 or R500 had this feature? Probably not, unless that made it slower overall than its nVidia counterpart. Would I care if it never saw the light of day? Probably not, assuming that programmable blending became a reality at some point (register combiner-level functionality would be sufficient). In the grand scheme of things, it just isn’t that important.

What is it about the design of a modern graphics card that would make reading from the destination buffer at the position of the current fragment impractical?

It isn’t the design of modern graphics cards that is a concern. It is the design of future graphics cards that would be limited by this decision. Effectively, it means that it is impossible to allow hardware to have multiple fragment programs “in flight” over the same pixel/sample, even though, from a performance standpoint, this might be a worthwhile idea. They could have the “blending” unit sort out which fragment gets written and blended as a process running asynchronously to the fragment programs.

Originally posted by OpenGL guy:
Doesn’t work. Say you are rendering to the same pixel twice with two consecutive triangles (think overlapping, blended particle effects). If your shader execution is multithreaded, then you’re in for a heap of trouble if you start changing a pixel for the second triangle before the first is complete. That’s as much as I’ll say.
This is what happens with fixed function blending, too.
If, of course, you’re talking about something akin to OOOE CPU designs, where you “retire” in order only at the end of the pipeline, all I’ll say for now is cough … and that I wasn’t aware of that.

No, it’s not a problem at all if you don’t allow frame buffer access within the shader.
Ditto, sort of.

It’s not just programmable blending. You’re moving a whole chunk of the pipeline (i.e. the blending unit) into the shading unit. Blending can normally be done independently of shading. That won’t be the case if you allow framebuffer access in the shader.
I see. It doesn’t quite sound like what I had in mind, which was:
Move the color buffer read to an earlier stage, but not “the blending unit” as a whole.
Whenever a 2x2 pixel quad (or whatever you happen to use) enters a fragment processor, and the current fragment program wants read access to target.color (or so), fetch that block from the target and pass it down the fragment processor along with the interpolator outputs. And spec it as read-only.

Assuming the quads are generated roughly in order, and multiple quads generated at the same time never overlap (?), this might just work. You then don’t even need to “retire” in order, because you took snapshots of the target contents at the right time. It doesn’t look like it could come for free, of course. The stuff obviously needs to be stored somewhere.

I am by no means a hardware designer. I’m just thinking out loud.

Korval,

!!whatever
# hypothetical language: target.color stands for the destination pixel's current color
PARAM rgb_to_luminance = { 0.3, 0.59, 0.11, 0.0 };
DP3 result.color.rgb, target.color, rgb_to_luminance;  # luminance of the framebuffer color
MOV result.color.a, fragment.color.a;                  # pass the incoming alpha through
END

Combine that with fixed function blending to yield the dreadful discoloration cloud, a weapon so evil, it must be wielded by a madman only Batman can hope to stop

Move the color buffer read to an earlier stage, but not “the blending unit” as a whole.

Fundamentally, that’s the same thing. If blending is off, you can still do the blending operation in the shader. And, since you’re taking it as an input, it must be assumed that the output will vary depending on this value. As such, it’s really no different than blending.

Whenever a 2x2 pixel quad (or whatever you happen to use) enters a fragment processor, and the current fragment program wants read access to target.color (or so), fetch that block from the target and pass it down the fragment processor along with the interpolator outputs.

Not good enough. A currently-in-execution “quad” could be about to write to this value. You don’t want it read until that quad has written to it. Which means that a synchronization event must occur in the middle of the pipeline.

Combine that with fixed function blending to yield the dreadful discoloration cloud, a weapon so evil, it must be wielded by a madman only Batman can hope to stop

Huh? I’m not sure what this is even in reference to.

this synchronisation issue should only cause a slowdown if you have to multipass individual triangles that are around the size of … 8 pixels. else, the hardware can schedule those pixels and continue with other, independent ones.

oh, and that scheduling can happen automatically by simply drawing the stuff in order.

i know there are issues. but they are NOT a problem in any normal situation. as long as you draw just one triangle, all its pixels are actually processed independently of the others. only from one triangle to the next can there be overlap. this, combined with backface culling, is not an event that happens often.

and there will be no need for the blending unit at all if we can access the “dst_color”. a MUL and a MAD can do anything then.
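for example, a plain constant-factor blend would come down to something like this minimal sketch (purely hypothetical: target.color does not exist in ARB_fragment_program, and the blend factors are just assumed to sit in program.env):

!!ARBfp1.0
# hypothetical sketch: target.color stands for the destination pixel,
# srcFactor/dstFactor for constant blend factors passed in program.env.
PARAM srcFactor = program.env[0];
PARAM dstFactor = program.env[1];
TEMP src;
MUL src, fragment.color, srcFactor;              # src * srcFactor
MAD result.color, target.color, dstFactor, src;  # dst * dstFactor + src * srcFactor
END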

this colourkilling example is just a funny idea to show the kind of effects you could build with it…

btw, there are tons of algos that are not possible with the flipflop multipass method… except by individually scheduling and flipflopping for each triangle. not practical. you look rather restricted, korval (not to say braindump… you’re not…), if you don’t see the uses of this.

I’m willing to take a performance hit for it if that’s necessary. It’s like with depth-writes from a shader, which cause a large performance reduction and aren’t nearly as useful.

First off, I’m glad that people here are mostly taking the “DirectNext hints at future hardware capabilities” approach rather than the tiresome “Microsoft sucks” rants. I, too, am curious about the whole SuperBuffer thing. It has been suggested here that they are probably adapting the SuperBuffer concept to a more VertexBufferObject style. They would need some more BufferObject targets (PIXEL_PACK and PIXEL_UNPACK have already been hinted at in the PBO section of NVIDIA’s VBO whitepaper), and some uniform way of swapping and copying buffers. That might be nice: you’d get the ability to double/triple buffer not only your frame buffer, but your vertex buffers, index buffers, textures, etc.

But while we’re all playing “OpenGL Hardware Designer” here:

Yes, it would be possible to schedule around potential fragment->pixel hazards, but do we really expect that to be in the next-generation implementations? Personally, I don’t (maybe next-next…). I want my GPU to be a lean mean stream processing machine. I want the transistors to be there to do computation, not scheduling, so such hazards should be enforced by the API (now, and in the near future). Personally, I’d rather see floating-point blends and programmable texture fetching/filtering. Both are useful and neither would break parallelism.

When it comes down to it, DirectNext has some great things, like the unified shader model (with integer instructions), programmable tessellation and the topology processor. Covers about 95% of the things on my wish list.

-Won

Originally posted by Korval:
Fundamentally, that’s the same thing. If blending is off, you can still do the blending operation in the shader.
No, it isn’t because no, you can’t. That’s what we’re talking about.

And, since you’re taking it as an input, it must be assumed that the output will vary depending on this value. As such, it’s really no different than blending.
It’s different from fixed function blending nonetheless (ie it’s more flexible, in fact an entirely different beast, as seen by the little example I gave).

Not good enough. A currently-in-execution “quad” could be about to write to this value. You don’t want it read until that quad has written to it. Which means that a synchronization event must occur in the middle of the pipeline.
Then so be it. Let the quads going down be scheduled so that there are no ordering hazards. I’ve outlined how I’d imagine that could be done:
1) take a snapshot of the target contents at the time the quad starts into fragment processing
2) block whenever two (or more) overlapping quads would be in flight at the same time

Read access to fragment.position.z isn’t free either. I never complained about that; it’s just to be expected.
(and as an aside, read access to target contents is a lot more interesting than fragment.position.z IMO)

I really wonder how often issue #2 crops up in reality. How bad is it, really? I honestly don’t know but I’d like to.

Huh? I’m not sure what this is even in reference to.
I’ve made it up. In addition to the shader, you’d set glBlendFunc(GL_SRC_ALPHA,GL_ONE_MINUS_SRC_ALPHA); and draw arbitrary geometry over the finished scene (say, a particle system). Only input alpha matters.
Where alpha==1.0, you’ll turn the framebuffer to pure intensity.
Where alpha==0, the color buffer is unchanged. For everything else, you get a linear blend between full color and intensity.

Stupid, fancy, unheard-of special effects, so to speak

(and you simply can’t do it with fixed function blending alone, unless, of course, you copy generous portions of your render target to a texture)
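For comparison, the copy-to-texture workaround would look roughly like this (a sketch only; it assumes the relevant region of the render target has already been copied into a rectangle texture on unit 0 and lines up with the viewport, so it can be fetched at the fragment’s own window position):

!!ARBfp1.0
# sketch of the workaround: the "destination color" is read from a copy of
# the render target bound as a RECT texture on unit 0, and the blend is done
# in the shader itself (fixed function blending off).
PARAM rgb_to_luminance = { 0.3, 0.59, 0.11, 0.0 };
TEMP dst, lum;
TEX dst, fragment.position, texture[0], RECT;  # co-located texel of the copy
DP3 lum, dst, rgb_to_luminance;                # its luminance
LRP result.color, fragment.color.a, lum, dst;  # a*lum + (1-a)*dst
END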

I’m willing to take a performance hit for it if that’s necessary. It’s like with depth-writes from a shader, which cause a large performance reduction and aren’t nearly as useful.

My concern is not that there will be a performance drop from using it. My concern is that the feature would require a restructuring of the entire back-end of the renderer, and that such restructuring would either prevent the use of performance-enhancing features (like having multiple quads in the pipe) or dramatically complicate the back-end logic, thus increasing the cost of the chip or costing us other, potentially useful, features.

do we really expect that to be in the next-generation implementations?

DX Next will come out with Longhorn in 2006. As such, the API is going to be something of an indicator of the expected functionality of cards of that era. Not of the cards of next year.

Personally, I’d rather see floating-point blends and programmable texture fetching/filtering. Both are useful and neither would break parallelism.

Programmable texture fetching sounds like it’d be really slow, but floating-point blending is clearly something that would be of great value in the (near) future.

No, it isn’t because no, you can’t. That’s what we’re talking about.

The point I was making is that if you “move the color buffer read to an earlier stage, but not ‘the blending unit’ as a whole”, then it is the same as what we are discussing. It isn’t an alternative to moving the blending into the fragment shader; it’s the exact same thing, because if you could read the framebuffer from the shader, you’d never use fixed-function blending again.

Read access to fragment.position.z isn’t free either.

Read access is free (or, at least, pretty cheap). Write access isn’t, since it screws up all the fast z-culling hardware.

and as an aside, read access to target contents is a lot more interesting than fragment.position.z

True. But read access to the framebuffer is much more difficult than simply giving the fragment program the computed z-depth.

I really wonder how often issue #2 crops up in reality. How bad is it, really? I honestly don’t know but I’d like to.

Well, it never happens on an ATi chip because the hardware isn’t designed to have multiple “quads” in-flight simultaneously. Apparently, this is not true for FX chips. I don’t imagine that it would come up too much, as you would have to have a pretty deep pipeline for it to happen, but the hardware designers would have to devote resources to preventing the problem in any case.

and you simply can’t do it with fixed function blending alone, unless, of course, you copy generous portions of your render target to a texture

Well, technically, you don’t have to “copy” it. With ATI_draw_buffers, you can write the color to the frame buffer and write the luminance to an AUX buffer. From there, assuming ARB_superbuffers, you just bind that buffer as a texture, and you can do regular blending as a post-process.
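As a rough sketch of the first pass (not something I’ve run; fragment.color just stands in for whatever the real shader computes), the fragment program could use ATI_draw_buffers like so:

!!ARBfp1.0
OPTION ATI_draw_buffers;
# sketch: draw buffer 0 is the normal color target, draw buffer 1 the AUX
# buffer that later gets bound as a texture for the blending post-process.
PARAM rgb_to_luminance = { 0.3, 0.59, 0.11, 0.0 };
TEMP lum;
DP3 lum, fragment.color, rgb_to_luminance;
MOV result.color[0], fragment.color;  # shaded color
MOV result.color[1], lum;             # its luminance, to the AUX buffer
END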

I’m not that sure that it’s that costly… former ATI drivers showed signs of gl_FBcolor in their beta of GL2 shaders (before the final spec was approved), and I know that nVidia allows binding a pbuffer as a texture at the same time you render to it… this seems to have pretty much the same requirements. Especially since the superbuffers seem to allow pretty much anything as a render target, the ‘real’ framebuffer and the other various render targets (pbuffers for now) seem to be handled much alike.

Originally posted by Mazy:
and i know that nVidia allows binding a pbuffer as a texture at the same time you render to it…

If I may add my two cents: I tried that, and it does not work. It is possible, though, to bind the pbuffer as a texture, render it in a different context and omit wglReleaseTexImageARB before rendering again to the pbuffer. That is not the same thing: there is a context switch involved, so the card can release the texture binding on its own.

Originally posted by Korval:
The point I was making is that if you “move the color buffer read to an earlier stage, but not ‘the blending unit’ as a whole”, then it is the same as what we are discussing. It isn’t an alternative to moving the blending into the fragment shader; it’s the exact same thing, because if you could read the framebuffer from the shader, you’d never use fixed-function blending again.
I wouldn’t say so, not with fixed function blending implemented in fast integer hardware. That would take an LRP and a MAD for fully general ‘emulation’, and I’d rather have it done at only the required precision, which is not necessarily floating point.

And indeed, my made-up example didn’t go there.

Originally posted by Korval:
Read access is free (or, at least, pretty cheap). Write access isn’t, since it screws up all the fast z-culling hardware.
In theory, yes. Last time I checked (which was with Cat 3.4 IIRC), reading from fragment.z alone caused a heavy performance drop in an otherwise simple shader.

I’ll repeat the test once I’m finished poking at my brand new 9200
(just to make sure I’m not talking nonsense)

Originally posted by Korval:
Well, technically, you don’t have to “copy” it. With ATI_draw_buffers, you can write the color to the frame buffer and write the luminance to an AUX buffer. From there, assuming ARB_superbuffers, you just bind that buffer as a texture, and you can do regular blending as a post-process.
MRTs?
Well, yes, not technically a copy, but at even higher bandwidth cost.
Yours:
a) write color buffer, write luminance buffer (whole viewport)
b) read color buffer, read luminance buffer, blend (region of effect)
c) write color buffer (region of effect)

two reads, three writes

Mine:
a) write color buffer (whole viewport)
b) read color buffer (forward this to fixed function blending, if possible), compute luminance (region of effect)
c) blend, write color buffer (region of effect)

one read, two writes

If the region covered by the effect is small(er than the viewport), it gets a lot worse quickly, because I have to pay the cost of writing the whole viewport to the luminance target.

ZBuffer: you may be right, I haven’t tested that myself, but the technique is described in http://developer.nvidia.com/docs/IO/8230/GDC2003_SummedAreaTables.pdf with the warning “results may be undefined”, just as the spec says about this. But they did show a demo of it, so at some point it had to work on their cards.

Korval – you stuck a quote from me in your reply to Zeck…

Programmable texture fetching/filtering doesn’t need to be slow. You’ll have to deal with some extra latency in the case of particularly complex fetches, but the implementation would really only need to make sure that the standard modes, when implemented in programmable form, still run at full speed. Probably harder than I make it sound, but there are no obvious reasons why it has to be slow.

Assuming you still have access coherency, you then only need to deal with the occasional stall, and you just need to make your texture cache lines big enough that you can mask that latency by having multiple fragment execution threads.

Aside from being able to define your own texture filter kernels, you can define your own texture formats, wrap modes etc. And maybe there are funky things you can do when you use it to address geometry image textures or something.
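To make that concrete, here is roughly what one “standard mode” looks like when you spell it out as a program today: a hand-rolled bilinear fetch from a 2D texture whose own filter is assumed to be GL_NEAREST (the program.local layout and the lack of wrap-mode handling are just simplifications for the sketch):

!!ARBfp1.0
# sketch: bilinear filtering done by hand from four nearest-neighbour fetches.
# program.local[0] is assumed to hold { width, height, 1/width, 1/height }
# of the texture on unit 0; borders and wrap modes are ignored.
PARAM size = program.local[0];
TEMP st, base, frac, uv00, uv11, c00, c10, c01, c11, top, bot;
MAD st, fragment.texcoord[0], size, { -0.5, -0.5, 0, 0 };  # texel space, centred
FLR base, st;                                              # lower-left texel index
FRC frac, st;                                              # bilinear weights
ADD uv00, base, { 0.5, 0.5, 0, 0 };
MUL uv00, uv00, size.zwzw;                                 # back to [0,1] coordinates
ADD uv11, base, { 1.5, 1.5, 0, 0 };
MUL uv11, uv11, size.zwzw;
TEX c00, uv00, texture[0], 2D;
TEX c11, uv11, texture[0], 2D;
MOV st, uv00;
MOV st.x, uv11.x;
TEX c10, st, texture[0], 2D;                               # texel at (x+1, y)
MOV st, uv00;
MOV st.y, uv11.y;
TEX c01, st, texture[0], 2D;                               # texel at (x, y+1)
LRP top, frac.x, c10, c00;
LRP bot, frac.x, c11, c01;
LRP result.color, frac.y, bot, top;
END

Swap the last three LRPs for something else and you have a custom kernel; the four fetches and their coherency stay exactly the same.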

-Won

Sorry, can’t edit the post without it blowing up …

Originally posted by zeckensack:
If the region covered by the effect is small(er than the viewport), it gets a lot worse quickly, because I have to pay the cost of writing the whole viewport to the luminance target.

What I should add is that in pure theory, “my” approach uses a lot less bandwidth even with the whole viewport covered. If only a small region is covered, the difference gets even bigger.

It is to be expected that allowing this sort of access at all comes at a cost (stalls to avoid race conditions; more on-chip storage is used). I’m pretty confident that the bandwidth savings can outweigh the initial performance costs. Now I’d need to figure out the cost in transistors, which I quite frankly just can’t.

There are probably many other issues (besides raw transistor count/area), like design validation times etc. I think it’s a good guess that, for the next few years, GPUs are going to converge on multi-threaded in-order stream processors, because that is the most easily scalable approach. Then again, we’re all basically talking out of our asses unless we’ve designed a GPU before.

-Won

My favourite is Mesh Instances on the general IO page. It looks like display lists with variables.

About the ability to access the FB within a fp…

Personally, I was expecting this feature to be thrown in soon. I don’t see what the big deal is. In fixed pipe mode, if you enable blending, then obviously the blending unit has to access the FB to do the blending, and eventually it will have to write back values.

So there are two ways to offer the solution: a programmable blending unit, where developers have to write a separate program for it, or simply extending the current fp language.

Geez! The name of the game is to offer programmability here. Why argue against it?

If you are talking about random FB readbacks, then it can get complicated, and you can get undefined results as fragments may have interdependencies.

Korval, reading your statement about NV40/R420 having higher blending precision…do you have a reference for this proposition? I enquired several times at ATI about this without success.

BTW, processing fragments at the same location but belonging to different triangles cannot, in general, be done concurrently, since order matters in the fixed function pipeline as well (unless you use min/max blending). So I don’t see how providing the fb color in the fragment shader adds new dependencies here (you have to take care of what you render first anyway, and further parallelisation could be achieved by shading fragments of objects which do not overlap).

Cheers!