Geometry programs

No, I disagree. When we start with something extremely simple like a heightmap and extend from there, we’ll get a real mess. It’s better to start with as much flexibility as possible, but try to stay within reasonable limits.

Perhaps vertex programs should run for every vertex generated by a geometry program?

Of course. I think that’s one of the few things we all agree on in this thread :smiley:

I’m not sure it’s a good idea to actually store the vertices somewhere, because this can defeat parallelism. It’s the same problem that’s currently being discussed in the “map vs. no-map” thread in the advanced forum. Better to feed the output directly to the next stage (at least semantically; the driver may allocate a buffer if it really needs to).

I do not agree with the idea that enable/disable should affect vertices sent to the server.

Ahem, you do realize that we have this already? When you enable a vertex program, vertices are sent to this program; when you disable it, vertices are sent to the fixed-function pipe. It’s exactly the same, just one step earlier. Enabled: vertices are sent to the geometry shader; disabled: vertices are passed through (one could see pass-through as the “fixed-function” geometry processing).

After thinking about it, I think it would be better to not use glEnable/Disable, but a special program with ID 0 to disable the geometry shader. It’s semantically the same, but consistent with the vertex/fragment shader spec.
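
As a sketch, assuming a glUseGeometryProgram entry point analogous to glUseProgram (the name is made up):

glUseGeometryProgram(my_gp_id);   // vertices now pass through the geometry shader
...                               // draw with the GP active
glUseGeometryProgram(0);          // program 0 = pass-through, like glUseProgram(0)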

Ah, finally I think I understand the problem with display lists.

You’re talking about compiling a display list and then calling it once with the geometry shader enabled and once with it disabled… But I still think there is no real problem.

You forgot Option 6: The driver writer just has to cache the display list contents in normal RAM and execute the display list in software whenever a geometry shader is enabled.

No need to specify anything special; a display list should remain what it is now, that is, it should behave the same as copy/pasted code.

The problem is the same as e.g. display lists with vertex shaders emulated in software, and obviously the driver writers have solved that problem without any restriction in the spec. And in a few years we won’t have to worry about it anyway, because every card will be able to do geometry shaders in hardware; then every restriction put on the combined use of display lists and geometry shaders will just be an obstacle…

Btw, I never intended that display lists could be called from geometry shaders - only the other way round.

Regarding the use of arbitrary buffers: I think it would be better to use arbitrarily accessible typed arrays, similar to vertex arrays. Then the byte-order issues are taken care of. I assumed this was already clear, so I didn’t elaborate :wink:

Tamlin, read my posts :slight_smile: This is exactly the same as what you want…

Now, to the “mesh” definition. Well, we don’t care. We can say that the mesh is encoded within several vector arrays. It’s all in my first code sample. To define a NURBS patch, we set up a 3-component vector array with 16*16 = 256 entries. Of course, using textures would also be interesting - but in the end, textures are vector arrays.
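
As a sketch of that setup on the application side (fill_control_points() is a made-up helper; the GP is assumed to read the current vertex array as its control points):

GLfloat control[16*16*3];                 // 16x16 grid of 3-component control points
fill_control_points(control);             // made-up helper
glVertexPointer(3, GL_FLOAT, 0, control); // the GP fetches all 256 entries by index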

To performance: let’s say a vertex needs approximately 20 clocks; this leaves us with 20 million generated vertices/sec on a 400 MHz GPU. Is that enough? I think not… If one takes the parallel approach, this number will of course rise…

But anyway, it could be the next big step. Just as we abandoned high triangle counts for per-pixel lighting, we can abandon triangles for higher-order surfaces…

Tamlin: sorry if it looks like you’re being ignored - you and I seem to have about the same ideas about this functionality.

That said:

Perhaps vertex programs should run for every vertex generated by a vertex program?
Do you mean “VPs should run for every vertex generated by a GP”? Or do you really mean to create a loop?

I think most of us are thinking of placing GPs before VPs (also because VPs have more opportunities for parallelism), so if that’s what you meant, most of us agree.
Overmind was thinking of possibly placing the GP after the VP; in my opinion, that is more likely to result in incorrect interpolation if the VP skews the view.
In any case, we all seem to think that this would not replace VPs.

Frustum checks could be done in the GP, but they should be limited to the fixed-function transform (assuming the GP is processed before the VP).
Best not to use that frustum check if you’re also skinning in the VP.

Originally posted by Zengar:
Just as we abandoned high triangle counts for per-pixel lighting, we can abandon triangles for higher-order surfaces…
:wink:

What about programmable interpolators? That is, take a few vertices and output tons of fragments :smiley:

Seriously though: We’ll still need triangles. That’s what a geometry program will do: convert higher-order surfaces to triangles. This conversion is necessary as long as you don’t use raytracing.

You forgot Option 6:
Actually, that’s option 1 :smiley:

The problem is the same as e.g. display lists with vertex shaders emulated in software,
True. It’s all about adoption rates. Once every card has the hardware support it no longer matters.
But considering the complexity of a geometry shader, it might be more feasible to do in software now than vertex shaders ever were.
If you had vidmem vertex arrays (whatever it was that came before VBOs) but software vertex shaders you were screwed.
It’s essentially being done on the CPU already, and you could throttle the detail level. Hopefully, getting software fallback shouldn’t be much slower than a pure CPU implementation.
By having a software option now, people can get experience with it and give some real feedback.

Now, to the “mesh” definition. Well, we don’t care. We can say that the mesh is encoded within several vector arrays. It’s all in my first code sample. To define a NURBS patch, we set up a 3-component vector array with 16*16 = 256 entries. Of course, using textures would also be interesting - but in the end, textures are vector arrays.
How about being able to access the texture samplers as well as true vector arrays? The nice thing about the texture samplers is that they can do interpolation for you.
Of course, we’ll still need to know which vertices we’re dealing with, so we can match them with texture coordinates. But I think that would be easy enough: let the GP access all vertices by index (0 being the first vertex in the vertex array), and pass the width of the source array as an integer parameter if needed.
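
A quick sketch of how that might look in the hypothetical GP language - source_vertex() (indexed fetch), source_width and emit_vertex() are all made-up names, and the sampler syntax is just borrowed from Cg:

uniform sampler2D heightmap;   // the sampler does the interpolation for us
uniform int source_width;      // width of the source array, passed as integer parameter

void main()
{
    for(int i = 0; i < source_width; i++) {
        float4 pos = source_vertex(i).position;   // fetch by index, 0 = first array entry
        float2 tc  = source_vertex(i).texcoord;
        pos.y += tex2D(heightmap, tc).r;          // displace along the heightmap
        emit_vertex(pos, tc);                     // made-up emission call
    }
}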

To performance: let’s say a vertex needs approximately 20 clocks; this leaves us with 20 million generated vertices/sec on a 400 MHz GPU. Is that enough? I think not… If one takes the parallel approach, this number will of course rise…

That’s where LOD comes in, though, isn’t it?
You only add detail where it’s needed.

Anyway: I get the impression that we’re all agreed that option A2 is off the table, and that we’ll be generating triangles from arrays and possibly textures.
Can we reach a conclusion as to whether it’s better to place it before or after the vertex program? Somehow it doesn’t seem fair to call a vote, since it seems like Overmind is the only one strongly in favor of placing it after the VP.

My guess is that it’s better to place it near the source - specifically near a vertex array, for the following reasons:
-Allowing direct-mode vertices as input prohibits any meaningful matching by index with other source data.
-Software emulation of a GP placed after the VP would require VP software emulation as well, and that will discourage early adoption of GPs (creating a chicken-and-egg problem)

Overmind/V-man: can you guys come up with any critical reasons to place GP processing AFTER the VP?

Actually, that’s option 1 :smiley:
Whoops :rolleyes:

Overmind/V-man: can you guys come up with any critical reasons to place GP processing AFTER the VP?
No, you’ve already convinced me a few posts earlier.

OK. So let’s start summarising what we’d need and what concessions can be made to driver developers - I’d suggest just throwing anything in for now (basic math excepted), and later checking whether it’s already covered by existing functionality.

Concessions:

  • If temporary storage of vertices is needed before passing them on to the vertex transformation stage, for example when the geometry processor is emulated in software, the driver is free to interrupt the geometry program when such temporary storage is filling up, process the generated vertices, and continue the geometry program. More fine-grained scheduling is also acceptable, and may be advisable in order to keep the vertex pipeline from going idle.
    This guarantees that even with limited resources, the geometry program will work correctly.
    It does mean that the geometry program is not allowed to read back or replace vertices already written; if buffering of vertices is required, the geometry program has to do that itself.
  • Parameters are bound like uniforms are bound now. They don’t end up in the function call that starts the geometry program. This avoids the need for a variable argument list with variable types, and should therefore simplify implementation.

Functionality needed:

  • access source vertices as an array instead of first-in-first-out
  • texture sampler functionality (at least two samplers, to allow for example both a heightmap and a first derivative)
  • fixed function transform for frustum check
  • perspective division for frustum check
  • vector parameters
  • integer parameters
  • non-unrolled loops (probably)
  • primitive begin/end
  • emission of vertices - including attributes
  • interpolation between vertex values
  • a vertex structure containing all vertex attributes (alternatively just multiple arrays with the same index)

Originally posted by T101:
  • fixed function transform for frustum check
  • perspective division for frustum check

The normal frustum check happens after the vertex shader, so this is not really necessary. OK, a frustum check would be useful for preliminary culling or LOD, for example when culling a bounding box. But I don’t see why one would need the fixed-function transform for that.

Basically this was the reason why I initially wanted the geometry shader after the vertex shader, so it could access already transformed vertices. But nothing prevents the writer of the geometry shader from doing it manually. Access to uniforms is provided, so the modelview and projection matrices are available…
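
For example, a manual clip-space check in the GP might look roughly like this (a sketch only; the uniform name and mul() syntax are made up, and remember that a bounding box may only be culled when all of its corners fail against the same plane):

uniform float4x4 modelview_projection;   // bound by the application, as any other uniform

bool outside_frustum(float4 p)
{
    float4 clip = mul(modelview_projection, p);   // manual "ftransform"
    return clip.x < -clip.w || clip.x > clip.w ||
           clip.y < -clip.w || clip.y > clip.w ||
           clip.z < -clip.w || clip.z > clip.w;
}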


One thing I’d like to add:

  • an immutable vertex handle

This way a program can “compile” a vertex structure into a vertex handle and the implementation has the guarantee that this vertex is never altered. So it knows it has to be transformed only once, and every further appearance of the same handle can use the cached results.

With a vertex structure alone the implementation has no guarantee that the program won’t alter a vertex structure after it has been written to the output stream, so vertex caching would be impossible.

But I don’t see why one would need the fixed-function transform for that.

I suggested it since it will do the trick without having to get the matrix back from OGL in order to specify it. Also, I believe there’s an ftransform() function that already does it in the vertex shaders.
Anything more advanced requires duplicating the transforms the vertex program does in the GP.

This way a program can “compile” a vertex structure into a vertex handle and the implementation has the guarantee that this vertex is never altered. So it knows it has to be transformed only once, and every further appearance of the same handle can use the cached results

I kind of figured that was guaranteed by making the output vertices write-only (the first concession, to accommodate non-pipeline implementations).
But maybe you mean across multiple calls of the same GP with (roughly) the same source data - that would not be covered by that concession.
Still, you’re calling it a handle; wouldn’t a simple flag be enough? A flag at the individual vertex level seems to make more sense to me.
If you buffer all the results, I think that would qualify for using handles (and those might in fact be implemented as VBO handles).

Maybe we should add a hint and a builtin:
-hint: this GP benefits from caching
-builtin: “reissue cached results”. A call that does nothing if caching isn’t enabled, and exits the GP if caching is enabled (and just reissues the vertex cache buffer).

But we’d still need a way to compare the new parameters to the old parameters so a good comparison can be made. Preferably reading the old and new parameter values, and from within the GP itself, so you don’t have to recalculate if the camera has only moved by 0.000001 units…
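
As a sketch, with the hint left out and all names hypothetical:

uniform float3 old_camera_pos;   // stored by the caller between invocations
uniform float3 new_camera_pos;

void main()
{
    // only this GP knows how sensitive its output is to the camera
    if(length(new_camera_pos - old_camera_pos) < 0.000001)
        reissue_cached_results();   // made-up builtin: no-op if caching is off,
                                    // otherwise replays the cache and exits the GP

    // ...otherwise regenerate the geometry...
}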

No, what I meant is outputting the same vertex multiple times, within the same invocation of the geometry shader. If we output a vertex stream with one of the default OpenGL primitives, this may be necessary (think of a bezier patch, even when using triangle strips, most vertices are used at least twice).

Perhaps it’s time again for a concrete example:

struct my_vertex {
    // these are the vertex attributes as expected by the vertex program
    float4 position;
    float2 texcoord;
    ...
};

vertex calculate_my_vertex(int x, int y)
{
    my_vertex ret;

    // fill in structure

    // the vertex() constructor yields an immutable handle (see below)
    return vertex(ret);
}

void main()
{
    vertex row[32];   // previous row of vertices, shared with the next strip
    vertex temp;
    int x, y;

    // compute the first row of the patch
    for(x = 0; x < 32; x++) {
        row[x] = calculate_my_vertex(x, 0);
    }

    for(y = 1; y < 32; y++) {
        glBegin(GL_TRIANGLE_STRIP);
        for(x = 0; x < 32; x++) {
            temp = calculate_my_vertex(x, y);
            glVertex(row[x]);   // reissue the vertex from the previous row
            glVertex(temp);
            row[x] = temp;      // keep this vertex for the next strip
        }
        glEnd();
    }
}

This program outputs a simple rectangular patch, assuming the program submits vertices using a pseudo-immediate-mode syntax. There may be other possibilities; the point is that each vertex, except those in the first and last rows, is used twice.

If we used the real immediate-mode syntax, or passed the struct to the glVertex call, there would be no way for the implementation to know whether it’s the same vertex or whether it has been modified. This is solved by the transparent datatype “vertex”. Every call to its constructor creates a new vertex that can’t be altered and only has to be transformed once. It has no internal structure; the only things you can do with it are create it and use it for drawing (in my example, as parameter to glVertex).

Note that this has nothing to do with using a vertex function call vs. using a triangle function call vs. writing to an output array. The problem is the same in each of the proposed methods.

One other possibility to solve the problem would be to write output vertices to an array and then produce an index stream into this array. This is basically the same on the usage side, but much less flexible on the implementation side. With a transparent vertex type the implementation may still store every vertex in an array and convert the vertex type to an index into this array internally, but it may choose to use a different strategy…
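
For comparison, the index-stream variant of the patch example above might look like this (out_vertex[] and emit_index() are made-up names; calculate_my_vertex() would return the plain struct here):

// write every vertex of the 32x32 patch exactly once
for(y = 0; y < 32; y++)
    for(x = 0; x < 32; x++)
        out_vertex[y*32 + x] = calculate_my_vertex(x, y);

// then describe the strips purely as indices into that array
for(y = 1; y < 32; y++) {
    glBegin(GL_TRIANGLE_STRIP);
    for(x = 0; x < 32; x++) {
        emit_index((y-1)*32 + x);   // vertex from the previous row
        emit_index(y*32 + x);
    }
    glEnd();
}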

About caching through multiple invocations of the program: I don’t think there’s a chance this could ever work reliably. You can’t just compare values, as exact equality is very unlikely with float, and depending on what the vertex shader does with its input, an input error of 0.000001 may produce an output error of 10000 :wink:

think of a bezier patch, even when using triangle strips, most vertices are used at least twice
Clear illustration. I agree.

You can’t just compare values, as exact equality is very unlikely with float
That’s why I suggested placing that check within the GP. There, at least, you can say “the camera has moved by only a small amount, so recalculating wouldn’t result in any noticeable difference”. Making that determination requires intimate knowledge of what the GP does, which is why the check has to be performed inside the GP; the driver just doesn’t have enough information.

Anyway, if it’s cheap to make those old parameters available (they would have to be stored somewhere in between invocations with the same “object”), then it would save a lot of cycles. In addition, if the vertex program doesn’t do much, it might even be possible to cache the output from the vertex shader.
But it’s icing on the cake.
EDIT: come to think of it, those parameters could just be stored by the caller and manually supplied as extra uniforms, so it does seem cheap.

The problem with that is you have to assume communication between different invocations of the geometry program is possible. This would mean different invocations of a geometry program have to have a defined execution order, so they can’t execute in parallel…

Also, I’m not sure there are really that many savings. Imagine a bezier patch (again :wink: ). Either the control points of the patch are moving, in which case everything has to be recalculated anyway, or they are not moving - but then why use a geometry shader at all? It would be much faster to just precalculate everything and store it in a static VBO.

As I said, it’s icing on the cake.

But for static control points, you still have the small matter of needing more tessellation when you’re close by.
The check could be even simpler: the calling program can determine whether or not something important has changed (control points or camera position).
That makes two parameters you always need when calling a GP: a handle to identify the object, so the driver can retrieve the cached results, and a flag that says whether the cache is still valid. Indeed, you don’t even need uniforms in this case, or a hint.

I’d put it in a parameter because the driver might choose not to cache anything, and in that case you automatically want to rerun the GP.

To make that a little easier to understand, something like this:

GLuint id[6];                     // handles identifying the cached objects
glGetGeometryObjects(id, 6);      // allocate cache for 6 objects - may be a noop
...
glUseProgram(my_program_id);
glVertexPointer(...);
glActiveTexture(...);
glBindTexture(...);
glProcessGeometry(id[0], false);  // flag false: cache invalid, force recalculation

Slightly off-topic: DirectX Next was proposing something like this, and that’s why I decided to post it here - OpenGL needs a counter-weapon :stuck_out_tongue:
Still, I’m very eager to know how they’ll design it.

About caching: let’s look back at my last parallelised :slight_smile: code. What if one defines a vertex-yielding and a primitive-yielding function, where the primitive function invokes the vertex-issuing function? Everyone will be happy, as this can be cached :slight_smile: At the end there must be a main function which yields primitives (provides the interpolation rules) - that closes the circle.

Do you mean something like glBegin(GL_GEOMETRY)?

The only issue I see with that: if you have a source array instead of a source stream, you can more easily interpolate between the edges of one strip and the next.
Think of bending the edges of a source triangle when smoothing something: you need to do that to the neighbours’ edges too.

With a stream you need to supply all the necessary information as attributes or textures.
Which can of course be done, but it would be easier for content creation if all you needed was the source triangles.

Uhm… NVidia & ATI could license Transmeta CPUs (or something else) and integrate them into future GPUs as an add-on CPU. The driver would provide a C/C++ compiler producing native code for the add-on CPU, and implement a switch in the driver (glEnable/Disable(GL_ONBOARD_CPU)?) to choose who’s going to feed the GPU… the main CPU or the add-on CPU?

:smiley:

yooyo

Hmm… Interesting discussion. :slight_smile:

However, I somehow get the feeling that the upcoming multi-core CPUs will be able to do much of the processing that we’re yearning for here.

Anyway, it seems to me that the problem here is how to make the geometry processing general enough to be usable, but still constrained enough to be efficiently implemented on the GPU (in terms of parallelisation, etc.).

If one had a limited set of output vertices (say 16 or 32 or something) that one could reuse when issuing triangles, I think this could be made to go pretty fast. I definitely think we want to avoid writing vertex data to vram and reading it back later (that’s one of the main points of having it on-chip, right?).

Also, I don’t know enough about parallel geometry algorithms to say whether it is worthwhile to put this stuff on the GPU. To gain efficiency, we need to map this to the stream-processing, SIMD semantics used, and I don’t see that happening for the common processing tasks (especially if one wants to avoid redundant work) - but that might just be me being ignorant of the latest algorithmic advances.

The main problem that needs to be addressed is not computation power but bus transfer. A multi-core CPU doesn’t solve this, and neither does a geometry program that can only output 16 vertices.

The main idea is that you transfer as little data as possible over the bus and let the GPU calculate the rest. For example, transfer only a control mesh and let the GPU make a detailed patch out of it…
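
For example, sending a 4x4 control mesh and letting the GPU tessellate it into a 64x64 grid means transferring 16 vertices over the bus instead of 4096 - a factor of 256 less traffic.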