Programmability for blending and testing

I propose to subsume the functionality of the following portions of the GL into a programmable unit: Stippling, Scissor Test, Alpha Test, Stencil Test, Depth Test, Blending, and Logic Op.

The program would have the following input registers:
texel - the color of the pixel as passed from the fragment program or ff-pipeline (R,G,B,A).
viewpos - the viewport coordinate of the pixel (x,y).
color - the value of the corresponding location in the frame buffer (R,G,B).
alpha - the value of the corresponding location in the alpha buffer (A).
stencil - the value of the corresponding location in the stencil buffer (S).

The program would have the following output registers:
color - the color of the rendered pixel (R,G,B).
alpha - the alpha value written to the alpha buffer (A).
stencil - the stencil value written to the stencil buffer (S).

The features for this programmability would necessarily include:

KILL - Test Units - prevents rendering of this pixel in every buffer, as if the program had never begun
CMP - Test Units - generates a GREATER, EQUAL, LESSER test result (probably a -1, 0, 1 comparison)

ADD - Blend, Stipple - addition
SUB - Blend, Stipple - subtraction
MUL - Blend, Stipple - multiplication
MIN - Blend - minimum of arguments
MAX - Blend - maximum of arguments

MOD - Stipple - remainder (slightly different from modulus)

NOT - Logic Op - logical inversion of bits
AND - Logic Op - logical and of bits
OR - Logic Op - logical inclusive or of bits
NAND - Logic Op - logical nand of bits
XOR - Logic Op - logical exclusive or of bits

Support instructions might include:
MAD - Multiply and Add
MOV - Move
SGE - Set on GreaterEqual
SLT - Set on LessThan
CLAMP - Clamp to Range (like [-1,1])
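
As a concrete illustration, here is a minimal C sketch (mine, not part of the proposal) of what one invocation of such a program might compute: a classic alpha test (reject if A <= 0.5) followed by src-alpha/one-minus-src-alpha blending, built from the registers and instructions listed above. The struct layout and helper names are invented for the example:

typedef struct { float r, g, b; } RGB;

typedef struct {
    RGB   texel;    /* input: fragment color (R,G,B) */
    float texel_a;  /* input: fragment alpha (A) */
    RGB   color;    /* in/out: frame buffer color (R,G,B) */
    float alpha;    /* in/out: alpha buffer (A) */
    float depth;    /* in/out: depth buffer (D) */
    int   stencil;  /* in/out: stencil buffer (S) */
} BufferRegs;

/* Returns 0 to KILL the pixel (nothing written), 1 to commit outputs. */
int buffer_program(BufferRegs *rp)
{
    if (rp->texel_a <= 0.5f)    /* SGE + KILL: classic alpha test */
        return 0;
    float sa = rp->texel_a;
    /* MUL + MAD: standard (src-alpha, one-minus-src-alpha) blend */
    rp->color.r = rp->texel.r * sa + rp->color.r * (1.0f - sa);
    rp->color.g = rp->texel.g * sa + rp->color.g * (1.0f - sa);
    rp->color.b = rp->texel.b * sa + rp->color.b * (1.0f - sa);
    rp->alpha   = sa + rp->alpha * (1.0f - sa);
    return 1;
}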

It would include Attributes and Environment, like the Fragment and Vertex programs (see GL Extensions GL_ARB_vertex_program SGI_ARB#26 and GL_ARB_fragment_program SGI_ARB#27), for additional values.

The program would also have input and output registers for the depth buffer; somehow they were deleted in transit.

depth (input) - the value of the corresponding location in the depth buffer (D).

depth (output) - the depth value written to the depth buffer (D).

Stippling, Scissor Test, Alpha Test, Stencil Test, Depth Test, Blending, and Logic Op

  • Line stippling is a little difficult (see this thread)
  • Polygon stippling can already be trivially done in a fragment program.
  • Scissor and Alpha test can already be done in a fragment program (test window coordinates, and kill the fragment)
  • Depth testing would require frame buffer access, unless you first render to texture and feed back the texture to the fragment program. That said, I do recall GLSL providing FB access at some point.

Line Stippling requires a simple modification to most line drawing algorithms. Given (say) Bresenham’s Line Algorithm, one adds a stipple buffer to the parameters and a stipple index variable. The stipple index accumulates length for each rendered pixel. The buffer is then indexed cyclically and its value allows or kills the pixel.

For drawing in Octant 0 (other octants are similar), the pseudocode is

void LineOctant0(coord x, coord y, uint dx, uint dy, bool sbuf[]) // sbuf added
{
    int dy2 = dy * 2;
    int dy2mdx2 = dy2 - (int)(dx * 2);
    int err = dy2 - (int)(dx);
    double sbufi = 0;                          // sbufi added
    double dsbufi = SQRT(dx*dx + dy*dy) / dx;  // stipple length per x step

    if(sbuf[(int)sbufi]) { PIXEL(x,y); }
    // old line was: PIXEL(x,y);
    while(dx--)
    {
        if(err >= 0) { y++; err += dy2mdx2; }
        else         { err += dy2; }
        x++;
        sbufi += dsbufi;                       // new line
        sbufi = FMOD(sbufi, sbuf.length);      // new line: wrap the index cyclically
        if(sbuf[(int)sbufi]) { PIXEL(x,y); }
        // old line was: PIXEL(x,y);
    }
}

Which is functionally equivalent to how my MATROX card does this. My NVIDIA card is functionally equivalent to accumulating the length of the infinity-norm (largest coordinate), as opposed to the 2-norm (Euclidean distance). In this octant dx > dy.

void LineOctant0(coord x, coord y, uint dx, uint dy, bool sbuf[])
{
    int dy2 = dy * 2;
    int dy2mdx2 = dy2 - (int)(dx * 2);
    int err = dy2 - (int)(dx);
    int sbufi = 0; // int, so the casts below are unneeded
    // dsbufi missing

    if(sbuf[sbufi]) { PIXEL(x,y); }
    while(dx--)
    {
        if(err >= 0) { y++; err += dy2mdx2; }
        else         { err += dy2; }
        x++;
        sbufi++;  // the major coordinate x increased, so shall the index
        sbufi %= sbuf.length;
        if(sbuf[sbufi]) { PIXEL(x,y); }
    }
}

I do not have other cards on which to test. So in software, this type of stippling is not difficult, and in hardware, it's done with different norms (measures of distance).

Other kinds of stippling exist besides pattern-based or bitmap-based stippling (so-called Line and Polygon stippling). In fact the Depth, Alpha, and Stencil tests are all other examples of stippling. This program would seek to subsume all stippling except that occurring elsewhere. (Neither this program nor a fragment program can perform true Polygon Stippling, since it requires raster information, the start location of the polygon, that is unavailable. Either program can perform window-based stippling, though this program can base that stippling on the extant buffers' values, while a fragment program cannot, since it has no read access to the buffers.)

The expressivity of the ff-pipeline tests is limited. Each cross-test predicate requires one (or more) rendering pass(es). "Render only when (depth pass and alpha > 0.5) or (depth fail and alpha < 0.25)" requires 2 passes. Placing the predicates into a program will allow rendering to occur in a single pass. The predicate (depth > alpha) is probably not renderable in the ff-pipeline without additional frame buffers.
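
For instance, here is a minimal sketch of that two-pass predicate as a single buffer program, in C-like pseudocode (depth_src for the incoming depth, depth_buf for the stored value, and the GL_LESS comparison are my illustrative choices):

int pass = (depth_src < depth_buf);   /* the ordinary GL_LESS depth test */
/* render only when (depth pass, alpha > 0.5) or (depth fail, alpha < 0.25) */
if (!((pass && alpha > 0.5f) || (!pass && alpha < 0.25f)))
    KILL;                             /* one pass instead of two */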

To clarify, the first 5 functional layers of GL undergo no change by this proposal. The changes occur in the 6th layer:

1. Evaluators And Vertex Arrays
2. Vertex Programs Or T(ransform) ff-pipeline
3. Clip Planes And Viewport Transformation
4. Fragment Programs Or L(ighting) ff-pipeline
5. Coverage And Ownership
6. This Proposal (buffer program? blend program?) Or B(Tests, Logic, and Blending) ff-pipeline

GLSL version 1.051 cannot access the framebuffer, see issue #7 of the GLSL spec. Thus, it and fragment shaders can only do alpha-like testing, see issue #6 of the GLSL spec.

/agree with chem. Fragment shaders should be able to access things like z buffer value, stencil etc.

[This message has been edited by santyhammer (edited 03-08-2004).]

Fragment shaders shouldn't have framebuffer access, but some other shader should. A fragment shader's output can actually be sent to two different screen locations for purposes such as "Wu Antialiasing" using the coverage values.

If a fragment shader has framebuffer access, a potential visible contribution to the screen could be lost, or a non-visible contribution could be added. Draw a line under the bottom of a square (something like (0,0)-(1,1) for the rect's opposite corners and (0,0)-(2,2) for the line's start and end points, with the z further away for the line). Apply rotations about the origin, and draw the scene under every such rotation. Every image should be the same (taking into account the rotation). If the fragment shader has access to the depth buffer, the corner where the line protrudes will be rendered incorrectly for some rotations. And this is when the fragment shader doesn't even appear to be doing anything different from the ff-pipeline (which would render this correctly for all rotations).

I also think that fragment shaders should have framebuffer access. There are some algorithms where you need information from stored pixels. So then you have to bind them as a texture, which is quite ugly.
Another way is to define an additional input for fragment shaders which is taken from a buffer (e.g. the framebuffer).

Framebuffer access could be implemented in the soft implementation since it’s already slow (for demo purposes, and perhaps for the future).

For hw, it is a problem since it removes all optimization potential (long story). That’s why it’s not added and it won’t be anytime soon.

For this, MRT (multiple render targets) could prove helpful for doing feedback. At least that way you aren't forced to do CopyTexSubImage (or anything like that) on the client side.
Also, the “multiple targets” could all be the same target so you would be able to write multiple times to the same buffer.

There are some things I wonder about like will it be possible to render to 6 cubemap faces in parallel. I had a paper somewhere on my HD about all this.

I agree with the fact that framebuffer access is slow and removes optimization possibilities. But nevertheless it would be a really nice feature for some algorithms (especially in combination with MRTs).

Also, the “multiple targets” could all be the same target so you would be able to write multiple times to the same buffer.

What is the idea behind doing that?

Originally posted by Corrail:
I agree with the fact that framebuffer access is slow and removes optimization possibilities.

This is not entirely true; it kills all possible optimizations (early z/stencil tests) just because the GL specifies that fragment operations must be done before those tests, which is bad IMHO. I am not sure why this policy was chosen, and I am pretty sure that it will be kept in OpenGL 2.0. If only the specification had put the fragment generation at the end of the per-fragment pipe, the problem would be solved. Though this would introduce a different problem: alpha testing, blending and logical operations would need to be done in the fragment shader. Not too bad after all, considering that this would provide some very nice programmability to those operations. Then again, if it was done this way there must be some reason (something about the implementation?).

[edit] I just thought that it wouldn't be necessary to make alpha testing, blending and logical ops programmable. It would simply be a matter of pushing the fragment generation after the depth and stencil tests, and alpha testing etc. after that. Though this would change the semantics of alpha testing when depth/stencil updates are enabled. ATM if a pixel fails the alpha test its depth/stencil values aren't written to the respective buffers; pushing alpha testing later in the pipe would change that, but would it be so bad?

[This message has been edited by crystall (edited 03-18-2004).]

This is not entirely true; it kills all possible optimizations (early z/stencil tests) just because the GL specifies that fragment operations must be done before those tests, which is bad IMHO.

Which is a part of the spec nobody cares about, as it turns out.

Both the GeForceFX and the R300+ cards provide not only large-scale early-Z checks (16-pixel-wide elements), but the regular depth tests as well. They can do so because there is no difference between doing the depth test before or after fragments, as long as the fragment program doesn’t change the depth. If it does, they have to push the depth test to the bottom of the pipe.

Reading from the current location of the frame buffer is meaningless in terms of early depth tests/alpha tests/etc. The only time that these tests are required to be after the fragment program is if the fragment program can change the outcome of the test. If fragment programs get the ability to change stencil values, then the stencil test will need to be able to happen after the fragment program.

Besides, if the spec were the other way, fragment programs couldn’t change the depth.

The problem with reading from the frame buffer is that you lack a possible source of parallelism and optimization. On some theoretical hardware, you might have two fragments, from two different triangles, running over the same pixel simultaneously. The lower part of the pixel pipe would be able to sort out the results (and tests) correctly, but the fragment program part of the pipe is allowed out-of-order execution. And if you have to prevent this case in your hardware, it becomes more difficult to fully parallelize fragment programs.

Also, what happens if you're dealing with antialiasing? The results from neighboring fragments can interfere with the value you read. Once again, on the theoretical hardware, if the fragment program beside you finishes early and writes something to the same pixel that the two of you are part of, and you then try to read from the framebuffer, you won't get the same value it did.

Originally posted by Korval:
Which is a part of the spec nobody cares about, as it turns out.

Both the GeForceFX and the R300+ cards provide not only large-scale early-Z checks (16-pixel-wide elements), but the regular depth tests as well. They can do so because there is no difference between doing the depth test before or after fragments, as long as the fragment program doesn't change the depth. If it does, they have to push the depth test to the bottom of the pipe.

That is exactly my point: there are situations (namely when you do not update the z/stencil buffer) in which you can push fragment generation and alpha-testing down the pipe, but you cannot always do it; this requires awkward solutions. Either kill early-z/stencil testing, or save the new depth/stencil values until after the alpha test and write them only if it passes.

Reading from the current location of the frame buffer is meaningless in terms of early depth tests/alpha tests/etc. The only time that these tests are required to be after the fragment program is if the fragment program can change the outcome of the test. If fragment programs get the ability to change stencil values, then the stencil test will need to be able to happen after the fragment program.

Besides, if the spec were the other way, fragment programs couldn't change the depth.

Good point, I hadn't thought of that. But being able to read from the depth buffer would solve this problem: you could just do another depth test in the fragment program and write to the color/depth buffer only if it passes.

The problem with reading from the frame buffer is that you lack a possible source of parallelism and optimization. On some theoretical hardware, you might have two fragments, from two different triangles, running over the same pixel simultaneously. The lower part of the pixel pipe would be able to sort out the results (and tests) correctly, but the fragment program part of the pipe is allowed out-of-order execution.

And if you have to prevent this case in your hardware, it becomes more difficult to fully parallelize fragment programs.

Got it.

Also, what happens if you’re dealing with antialiasing? The results from neighboring fragments can interfere with the value you read. Once again, on the theoretical hardware, if the fragment program beside you finishes early and writes something to the same pixel that the two of you are part of, then you try to read from the framebuffer, you won’t get the same value he did.

Ok, I see the problem with reading from the frame/depth-buffer. The only thing that strikes me is that all these details seem to have been tailored to future hypothetical hardware. Now the question that comes to my mind is: if that future hardware turns out to be more programmable than the current generation (which is very likely if we follow the current trends), it will be perfectly possible to program the depth/stencil tests and blending too. In that situation the order in which the GL performs those tests will be meaningless, since the programmer could change them at will. Isn't it a little bit short-sighted not to consider such a situation?

Early alpha test, stencil test and depth test are important if you have a pretty hefty shader; that way you don't have to run all those instructions.
It's also important to be able to read the framebuffer if you want to do custom fog, blending or other more exciting stuff.

Although there might be some drawbacks to reading from the framebuffer, it is only a small matter of making sure you do not do something weird.
Normally you wouldn't have to read from the framebuffer, so parallelism is not a problem since you can use the default depth/alpha/stencil testing; it's only in special cases that you need to read from the framebuffer, so if you can avoid doing anything silly like messing with antialiasing then you would be ok.

ATTENTION: Please stop posting about adding any of the tests to the "Fragment" shader. This is a thread to consider adding a "Buffer" shader to the pipeline. It will occur after the fragment processor, during the per-fragment processing. (Please read Chapter 4 of the OpenGL specification, not Chapter 3.) I will soon have a GL_CHEMDOG_BUFFER_PROGRAM proposal ready for RFC.

Korval: there is no difference between doing the depth test before or after fragments

In many cases not, but there are cases where there is a difference. I even posted an example where rendering errors WILL occur no matter what implementation is used when tests are included in the fragment processor.

For hw, it is a problem since it removes all optimization potential (long story).

all -> most. And it forces the fragment processor into FIFO-mode. The ff-test-blending unit is already FIFO, so making that unit programmable isn’t a loss.

Early depth-test kills can still be accomplished in some cases. The behaviour only needs to conform to the OpenGL specification; it doesn't have to be implemented exactly as the state-diagram indicates. On some implementations, there was a secondary low-resolution (pixel width/height resolution, not depth component bits) depth-buffer that was used for killing whole primitives, if the whole primitive fell behind the low-res depth-buffer. The low-res buffer maintained the furthest depth-value of the normal depth-buffer within the block it covered.
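
A minimal C sketch of that block-level rejection, assuming (my assumptions) an 8x8 block granularity and a GL_LESS depth test:

/* zfar_of_block: the furthest depth currently stored anywhere in the
   8x8 block, taken from the low-res buffer. prim_znear: the nearest
   depth the primitive can reach within that block. */
int block_may_be_visible(float prim_znear, float zfar_of_block)
{
    /* If even the primitive's nearest point is behind the block's
       furthest stored depth, no pixel in the block can pass GL_LESS,
       so all 64 pixels are killed without running any program. */
    return prim_znear < zfar_of_block;
}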

I agree with the fact that framebuffer access is slow

Framebuffer access is not necessarily slow. Extending the hardware such that framebuffer access occurs in two locations (the Fragment shader and the Per-Fragment operations) is what makes it relatively infeasible.

An example of MRT, with multiple targets all being the same target, is using the accumulation buffer, although that is a scalar convolution. It is usually used for motion-blur, or FSAA, or jitter.

Non-scalar convolutions should be done in separate passes. This is where multiple render targets are extremely useful.
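
For comparison, the accumulation-buffer form of that scalar convolution (motion blur over N sub-frames) looks roughly like this; draw_scene_at_time is a placeholder for the application's own rendering code:

glClear(GL_ACCUM_BUFFER_BIT);
for (int i = 0; i < N; i++) {
    draw_scene_at_time(t0 + i * dt);   /* hypothetical helper */
    glAccum(GL_ACCUM, 1.0f / N);       /* add 1/N of this frame's colors */
}
glAccum(GL_RETURN, 1.0f);              /* write the blended sum back */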

if you can avoid doing anything silly like messing with antialiasing

I can't access the coverage values in any implementation I've seen. (Though I think SUN has an extension that sets coverage to a specific value.) So therefore I can't mess with anti-aliasing.

future hypothetical hardware. Now

Hence the words, "Forward-Looking Suggestions To Future Versions Of OpenGL".

perfectly possible to program the depth/stencil tests and blending too. In that situation the order in which the GL performs those tests will be meaningless, since the programmer could change them at will. Isn't it a little bit short-sighted not to consider such a situation?

You are making my point better than I have. OpenGL specifies that these operations occur in the Per-Fragment Processor ("Buffer Program"), which happens after coverage, which happens after Fragment Generation ("Fragment Program"), which happens after Primitive Rasterization, which happens after geometry transforms ("Vertex Program"). Even if the Per-Fragment Operations become programmable, they still occur after the others. I was forward-looking (I have a software implementation of the "Buffer Program"), and wanted to specify the program before such hardware existed.

It has come to my attention that some implementations are moving to FLOATING-POINT buffer components for color (and possibly even for the stencil component).

Are the Logical Operations in need of redefinition for floats? (The current definition is bitwise.) That made perfect sense for fixed-point or bitfield components. As an example, 1 XOR 0 is 1, but 1.0 XOR 0.0 is either NaN or -Infinity. Does anyone know what happens on current hardware (the float_buffer extension)?

Originally posted by chemdog:
ATTENTION: Please stop posting about adding any of the tests to the "Fragment" shader. This is a thread to consider adding a "Buffer" shader to the pipeline.

From reading your first post, I thought you were talking about pulling the other stages into the programmable fragment stage.

Sorry but what is a buffer shader?

Having a more programmable GPU is the idea.

One basic question is how to organize the hw and the API. On the API side of things, we could pretend there is a secondary programmable unit and we could write a program for it (for the blending unit or logical op unit, for example).

This is the kind of issue we could discuss here, unless you don’t want to.

Programmable stipple? Do some people want to do fancy stipple patterns?

Are the Logical Operations in need of redefinition for floats? (The current definition is bitwise.) That made perfect sense for fixed-point or bitfield components. As an example, 1 XOR 0 is 1, but 1.0 XOR 0.0 is either NaN or -Infinity. Does anyone know what happens on current hardware (the float_buffer extension)?

One can assume that values between 0.0 and 1.0 (in float) map to some integer values, for example 0 to 255.
The hardware would do the mapping, perform the logic operation, convert back to float, and write back to the float buffer.
Since Blend operations are said to be expensive on float buffers, I guess the same applies to logic ops.
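
A sketch of that mapping for one component, assuming (my assumption) the [0,1] float range quantizes to 8 bits:

/* Bitwise XOR on float components via an 8-bit fixed-point detour.
   Inputs are assumed already clamped to [0,1]. */
float float_xor(float a, float b)
{
    unsigned ia = (unsigned)(a * 255.0f + 0.5f);  /* [0,1] -> 0..255 */
    unsigned ib = (unsigned)(b * 255.0f + 0.5f);
    return (float)(ia ^ ib) / 255.0f;             /* back to [0,1] */
}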

This is not a must-have. If I were writing a spec on this, I would leave it out with a friendly "this feature may be defined in a future extension".

OpenGL Specification
  • Chapter 2 : OpenGL Operation : The functionality subsumed by Vertex Programs is contained here.
  • Chapter 3 : Rasterization : The functionality subsumed by Fragment Programs is contained here.
  • Chapter 4 : Per-Fragment Operations And The Framebuffer : The functionality subsumed by Buffer Programs is contained here.

Some of the functionality in each chapter isn't subsumed by the programs; for example, clipping, primitive rasterization (conversion to scan-lines), coverage, pixel ownership, accumulation functionality, and selecting a render target, among many others.

Unless you subsume chapter 3 AND ADDITIONAL PARTS OF 3, not yet subsumed, AND ALL OF 4 into fragment programs, there is no point to adding PART of chapter 4 into fragment programs. This would cause awkward hardware and software evolution.

  • Enable programs, and antialiasing
  • Set Vertex Program to a standard transform (ARB_position_invariant or equivalent); make it apply a rotation based on a parameter
  • Set Fragment Program to a standard one
  • Add a depth test as the last instruction of the Fragment Program
  • Set Vertex Parameter to some angle theta_magic
  • Draw a rectangle (0,0)-(1,1) with some z coord
  • Draw a line (0,0)-(2,2) with some deeper z coord

The ff-pipeline does transformation, then lighting, then tests. So the above collection of programs seems correct. But for some theta_magic values the rendered image IS wrong (on any implementation). On every implementation there should be at least 8 values of theta_magic that are correct, but not necessarily more (you could brute-force every bit pattern for theta_magic, if you care to find out how many are right). Some implementations render only these 8 correctly. (I imagine all hw-based ones are in this set.)

Returning to FLOAT buffers, is there any reason that the buffers are defined to be clampf's? Otherwise the mapping from [0.0, 1.0] to [0, 2^bits] (bits in that color component) isn't useful.

I was thinking that the floats could be converted to BigDecimals (infinite-precision fixed-point) before the logical operations were performed. This doesn't have bad performance if you do some optimizations based on the exponent (even in hardware). It makes XOR and NOT difficult though.
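
A rough illustration of that exponent-based idea for AND (the 62-bit working width is an arbitrary choice of mine, standing in for the infinite-precision case); as noted above, NOT and XOR are harder because they would turn the unbounded run of leading zero bits into ones:

#include <math.h>
#include <stdint.h>

/* Bitwise AND of two non-negative finite doubles after aligning them
   on a common fixed-point grid chosen from the larger exponent. */
double fixed_and(double a, double b)
{
    int ea, eb;
    frexp(a, &ea);                    /* a = m * 2^ea with 0.5 <= m < 1 */
    frexp(b, &eb);
    int e = (ea > eb) ? ea : eb;      /* shared exponent */
    uint64_t fa = (uint64_t)ldexp(a, 62 - e);   /* scale into 63 bits */
    uint64_t fb = (uint64_t)ldexp(b, 62 - e);
    return ldexp((double)(fa & fb), e - 62);    /* scale back */
}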

I would leave this to a future extension, but since the float_buffer extensions already exist, it would be nice to define their interaction now, rather than leaving a future extension to define their combined behaviour.

[This message has been edited by chemdog (edited 03-19-2004).]