Geometry programs

Zengar, it’s not clear how many times something like that would be called and what would trigger the call. At least you didn’t mention it.

I’m assuming that in Overmind’s primitive assembly program (actually this isn’t good, because primitive assembly happens after vertex transform) it would happen for every triangle.

As for LOD, it’s important to know how far the object or portions of the object will be from the viewer.
Might as well unite the vertex transform stage with the tessellation/evaluation stage.

Oh, I see…

Well, one would set the data arrays up, for example with glVertexAttribute or such; then some command like glProcessGeometry(int first, int count) could be called, which invokes the geometry shader. It completely skips the OpenGL primitive setup, as primitives will be created in the shader. Every vertex created by the geometry shader would be passed down to the vertex shader just like it usually would be.
Of course, this is a bit ambitious, but I don’t think it would be very difficult to implement, and it would give the programmer the most control. The geometry could also be stored to a VBO via some internal method, or simply using the render-to-VBO method.
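To sketch the host side of that (glProcessGeometry is of course the hypothetical call here; the attribute setup is just the usual kind of thing):

// Hypothetical host-side sequence - glProcessGeometry does not exist in OpenGL
glBindBuffer(GL_ARRAY_BUFFER, paramVBO);            // arbitrary per-invocation parameters
glVertexAttribPointer(0, 4, GL_FLOAT, GL_FALSE, 0, 0);
glEnableVertexAttribArray(0);

glUseProgram(geometryProgram);  // the program containing the geometry shader
glProcessGeometry(0, count);    // proposed: run the shader over [0, count);
                                // it emits vertices straight into the pipeline,
                                // skipping the normal primitive setup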

Could we perhaps simplify this by taking something concrete we all likely know about? A heightmap for a terrain, on a card sitting on the same mobo we are running the CPU program on.

I’m absolutely not suggesting the vertex creation process should be limited to this, I only suggest it as a common start. If that can be worked out, then add other scenarios and have the previous steps to compare with to see if/how it fits.

Imagine a 2D texture that such a program could refer/read, and generate vertices and triangles from (incl. interpolating between heights and other op’s).
(it just struck me, this could completely replace the rigid and seemingly not too hardware-supported ATI npatch/subdivision extension)

One thing that could potentially be cool with this approach would be if the language had a function to decide whether a vertex, or a triangle for that matter, is even partially visible (within the current frustum). That way the GPU-running program could itself do culling (only that the culling in this case would be “don’t even generate the vertices”) for a whole range of x or y from the source heightmap texture.
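In completely made-up syntax (the visibility test, the emit call and the little helpers rowBounds/gridCoord are all invented), a terrain program could look like:

uniform sampler2D heightmap;   // the 2D source texture
uniform int width, depth;      // grid resolution

for (int y = 0; y < depth - 1; ++y) {
    // proposed built-in: is any part of this row even partially visible?
    if (!gl_RangeVisible(rowBounds(y)))
        continue;              // culled: don’t even generate the vertices
    for (int x = 0; x < width; ++x) {
        float h0 = texture2D(heightmap, gridCoord(x, y)).r;
        float h1 = texture2D(heightmap, gridCoord(x, y + 1)).r;
        emit(vertex(vec3(x, h0, y)));       // two vertices per column,
        emit(vertex(vec3(x, h1, y + 1)));   // strip-style
    }
}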

The program would however need other sources of input too - it seems very strange to send e.g. texture IDs as integers encoded in texture objects. Would it have to be varargs-aware (to borrow a C term), or could a vector of e.g. 16 ints and 16 floats solve the problem?

This also got me thinking, should this language be able to switch texture objects for texture units? On one hand I’d love it to be able to, but on the other hand I think it’d open a can of worms so large that the GPU would have to now take control of the CPU (just think of textures swapped out).

Perhaps skip the texture names for now, and assume that all textures to be used are bound to the TUs used by this program?

primitive assembly happens after vertex transform
Yes, I thought about that, this might be a problem. The program would replace the primitive assembly stage, and this would mean it creates some sort of loop in the pipeline, that is, it could create new vertices, which would be fed through the vertex transform again…

The other possibility would be to make a new pipeline stage before vertex transformation and leave primitive assembly fixed function. Then the program would not generate triangles, but vertices, and use the default assembly modes that we have with OpenGL now.

Or replace vertex generation, vertex transformation and primitive assembly with a single programmable stage that generates triangles :wink:

Either way, you can’t generate triangles before the vertex stage without completely redesigning the pipeline.

Originally posted by Overmind:

The other possibility would be to make a new pipeline stage before vertex transformation and leave primitive assembly fixed function. Then the program would not generate triangles, but vertices, and use the default assembly modes that we have with OpenGL now.

That’s exactly what I mean :slight_smile: I see no use for a custom assembly mode, but custom vertex generation would be very useful.

Not to mention that placing it “facing the CPU” means it’s easier to implement in software.

If this were placed after display list processing, it would COST a lot of performance if it had to be supported in software.
Better to place it parallel to (replacing) or before display list expansion.

Originally posted by Tamlin:

One thing that could potentially be cool with this approach, could be if the language had a function to decide if a vertex, or a triangle for that matter is even partially visible (within the current frustum). That way the GPU-running program could itself do culling (only that the culling in this case would be “don’t even generate the vertices”) for a whole range of x or y from the source heightmap texture.
I don’t think that would work. Remember, the view frustum is applied after the vertices are transformed into eye space. Which means the vertex shader has to be run first.
Besides, you need to generate the vertex before you can test it against the frustum.

But here we have a problem with dynamic LOD. To dynamically tessellate a surface with higher LOD nearer to the camera, you need information that’s only available after the vertex shader… Or you compute all you need yourself, that is, transform the vertices manually, and then use a null vertex shader that just forwards values.

That’s another reason why I wanted to put this program after vertex transformation, not before it…

Then again, to determine how much detail is needed, you only need to know the distance to the camera.
And that can be solved using just one vector parameter.

Besides, the one application of vertex shaders that has real influence on the distance to the camera - skinning on the GPU - doesn’t strike me as needing different LOD for one side than for the other. Determining the LOD for the entire mesh, then applying joint rotations, is good enough if you ask me.
If the user is close enough to notice the difference, you should probably be using the higher level of detail anyway.
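In made-up syntax, with the camera position handed in as that one vector parameter (the LOD formula itself is only an illustration, as are maxLod and baseDistance):

uniform vec3 cameraPosObj;   // camera position, pre-transformed into object space

int lodFor(vec3 patchCenter)
{
    float d = distance(patchCenter, cameraPosObj);
    // e.g. halve the tessellation density every time the distance doubles
    return max(0, maxLod - int(log2(d / baseDistance)));
}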

The other possibility would be to make a new pipeline stage before vertex transformation and leave primitive assembly fixed function. Then the program would not generate triangles, but vertices, and use the default assembly modes that we have with OpenGL now.
I’m sure there are some fine details in GPU design that we are not aware of.

I’m not 100% sure how GPUs work, but I think there are two implementations of the vertex and primitive assembly stage.
The first is to transform vertices in bulk and write them back to RAM. The assembly stage uses indices to read back the transformed vertices. This may be an old way to do it.
The second is to have a vertex cache and transform vertices as indicated by the indices and feed the primitive assembly stage quicker.

I think that when nvidia created NV_evaluator in silicon, they solved this stuff. They would know what to do.

It is somewhat important to avoid redundant vertex transforms.
Instead of writing floats to gl_Triangle[0].vertex and gl_Triangle[0].texcoord0[0], it might be better to have gl_Triangle[0].index1, gl_Triangle[0].index2, gl_Triangle[0].index3.
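Something like this (made-up syntax; base and rowStride are just illustrative indices):

// Emit a triangle by index instead of by value; a vertex shared by several
// triangles then only goes through the vertex transform once.
gl_Triangle[0].index1 = base;
gl_Triangle[0].index2 = base + 1;
gl_Triangle[0].index3 = base + rowStride;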


If this were placed after display list processing

There is no such thing as display list processing in GPUs. It’s a driver-side issue.

There is no such thing as display list processing in GPUs. It’s a driver-side issue.
If that is so (and I don’t know enough about GPU details to say that it is not), then why are display lists supposed to be comparable in performance to static VBOs? They aren’t only faster over a network.

It would be nice to hear about this from someone who does have intimate GPU knowledge (maybe Humus can say something about it?).

I can see three possibilities:

  1. all display list processing on modern hardware (hardware potentially suitable for geometry shading) is done in hardware - in this case, a geometry shader could safely be placed before the display list expansion, so as not to incur a performance hit if the geometry shader is handled in software.

  2. all display list processing on modern hardware is done in software. In this case, a geometry shader must be placed behind display list expansion, so it actually CAN be done in hardware.

  3. some hardware processes display lists in software, some hardware does it in hardware. Worst of both worlds, because there’s no way of choosing the correct spot without creating incompatibilities between the cards.

The safe way of going about it seems to be to not allow geometry programs to be called from display lists, and not allow geometry programs to call display lists. That would make it functionally equivalent to display lists, but with programmability.

Originally posted by V-man:
It is somewhat important to avoid redundant vertex transforms.
Instead of writing floats to gl_Triangle[0].vertex and gl_Triangle[0].texcoord0[0], it might be better to have gl_Triangle[0].index1, gl_Triangle[0].index2, gl_Triangle[0].index3.

Here we’re at the primitive assembly stage again (generating triangles, not vertices). But of course we could just put this before the vertex transform stage, give it a new name and disable the primitive assembly stage when using it :wink: . This would just have the disadvantage that this program has no access to the result of the vertex program, with some inconveniences for dynamic LOD…

As I understand it, this program should be able to output many vertices per call. A limitation would not make sense. That’s why I think a function call syntax would be better. Something like a primitive immutable data type ‘vertex’ that can be created somehow and used in a call to gl_Triangle(v1, v2, v3). Immutable because, semantically, as soon as it’s created it is sent to the vertex transform stage for later use in multiple triangles.
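For example (invented syntax, using the interpolate and gl_Triangle built-ins proposed in this thread):

vertex a = gl_Vertex[0];            // input vertices
vertex b = gl_Vertex[1];
vertex c = gl_Vertex[2];
vertex m = interpolate(a, b, 0.5);  // a new immutable vertex on edge ab

gl_Triangle(a, m, c);               // m and c are reused; each vertex only
gl_Triangle(m, b, c);               // goes through the transform stage once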

Originally posted by Overmind:
But here we have a problem with dynamic LOD. To dynamically tessellate a surface with higher LOD nearer to the camera, you need information that’s only available after the vertex shader… Or you compute all you need yourself, that is, transform the vertices manually, and then use a null vertex shader that just forwards values.

That’s another reason why I wanted to put this program after vertex transformation, not before it…
If you want it this way, you could pass the camera matrix to geometry shader and evaluate LOD on the fly, before the vertices enter the vertex shader. The problem with your approach is that it is counter-intuitive, the processing chain gets broken.

Originally posted by Zengar:
The problem with your approach is that it is counter-intuitive
Why? I don’t think it is. You have a program that generates triangles out of vertices. It can also create new vertices and use them in the triangles it generates. Calculations that are done per vertex (no matter if the vertex is generated by the program or not) go to the vertex shader, calculations that are per triangle go to the primitive shader. It’s as simple as that.

Anyway, what’s intuitive and what’s counter-intuitive is subjective. For example many people tend to see the transformation order of OpenGL counter-intuitive, but when you look at it the other way round, it’s completely logical :wink:

I think we should get back a few steps and discuss WHAT this program should do before deciding HOW it should do it. Perhaps it was wrong to post example code, because I didn’t intend to start a discussion about syntax. I know I’m not good at designing syntax :wink: . I just wanted to show what functionality needs to be covered by giving a concrete example.

Perhaps it’s time I try to make a little summary. As I can see from the previous posts, the major approaches differ in the input the shader gets and the output it produces.

Input:

  1. The program submits drawing commands just like the CPU does now. It has no input stream (just uniforms) and it outputs vertices and/or triangles.

  2. A program that takes vertices as input and produces more vertices and/or triangles as output (tessellation).

Output:

A) The program outputs a vertex stream and/or an index stream, using the default OpenGL primitive assembly.

B) The program outputs triangles, that is, it does the primitive assembly.

Option 2B would again come in two flavours, before vertex processing (untransformed vertices as input) and after vertex processing (transformed vertices as input, produces a feedback loop).

I have the impression that in this discussion the arguments for 1 vs. 2 got mixed up with A vs. B. I mixed it up a few times myself :wink: . Perhaps this summary helps, if I got it wrong, feel free to correct me.

My original proposal was clearly 2B, after vertex processing. After hearing some arguments against it, I’m tending towards 2A. But the point I was trying to make was that IMHO 2 is better than 1, independent of A vs. B.

Darn. You beat me to it.
Seriously though: the discussion has become confused, so I’ve been typing away trying to summarise what’s been discussed.
So here goes:

Let’s recap:

1. Possible uses:
  • Dynamic LOD - specifically thinking about terrain
  • Higher order surfaces
  • Shadow volume construction

2. Options for placement:
  • A. Before vertex transform - this is more flexible, and may allow software emulation. But possibly a problem with display lists - if either but not both are implemented in hardware.
  • B. After vertex transform - can probably only operate on triangles and possibly quads, but can benefit from parallelism, and because of its limited functionality, simple to implement. But cannot be done in software without also performing the vertex transforms in software.

3. Possible functionality:
  • A1. Replace only evaluator functionality. Generate vertices for a single primitive type (probably triangle list or quad list), using current texture bindings.
  • A2. Replace both display list and evaluator functionality. Pretty much anything that can be called from a display list, including recursion. Advantages: highly flexible, possibility to do preprocessing on the CPU, display list functionality can be implemented using a geometry program, clearly defined relation to display lists, suitable primitives can be chosen for certain types of geometry (e.g. automatic use of strips or fans). Disadvantages: complex to implement in hardware, possible byte-order issues between GPU and CPU, high resource requirements, possibly memory management issues when binding textures.
  • B. (After vertex transform) Custom interpolation/warping in eye-space.

4. Input:
  • A1. Vertex and attribute streams in object-space with attributes and parameters. Possibly one or more 1D/2D/3D/cubemap textures.
  • A2. Buffer objects - to be initialised on the CPU (both for custom structures and for compiled display lists, if the driver so chooses).
  • B. (After vertex transform) Vertex and attribute streams in eye-space with attributes and parameters. Possibly one or more 1D/2D/3D/cubemap textures.

5. Output:
  • A1. Vertex stream in object-space.
  • A2. Command and vertex stream in object-space.
  • B. (After vertex transform) Edited eye-space vertex stream - possible additions/deletions.

Korval: I’m not sure where writing to vertex attributes would come in, possibly both A1 and B.

6. New constants/functions:
  • GL_PRIMITIVE_SHADER - Overmind’s suggestion for a primitive type to send vertices through the geometry program instead of using a standard primitive
  • glProcessGeometry(int first, int count) - Zengar’s suggestion for calling the geometry program after using vertex pointers to set up the streams

7. Proposed GLSL built-in types/functions/variables. No doubt just a subset of what will actually be required, very preliminary, and both V-man’s and Overmind’s suggestions are in here.
  • vertex - the vector as well as all the attributes. Note: what would be the maximum number and type of the attributes? Also: could this be defined as a simple fixed struct?
  Variables:
  • vec4 gl_Vertex{123} (or a vec3) for sending the coordinates of one triangle (V-man’s suggestion - superseded by gl_Triangle[n], I think?)
  • gl_Triangle[n] - triangle structure, containing int index1, index2, index3: indices into the vertex array/VBO. Would that be the edited or the original array? If edited, how do you keep track? If not, how do you insert?
  • vertex gl_Vertex[n] to address the input vertex stream - with n=0 for the first vertex of the “current” triangle. Note: how do you advance the index?
  • int gl_TriangleCounter to keep track of the (total?) triangle count
  Functions:
  • glBegin()
  • glEnd()
  • glNormal()
  • glVertex()
  • vertex interpolate(vertex v1, vertex v2, float amount) - interpolate between vertex attributes
  • void gl_Triangle(vertex v1, vertex v2, vertex v3) - emit a triangle with these three vertices (incl. attribs)
  • ??? gl_ControlMesh(???) - some way of accessing parameters to the geometry program. V-man, if this is important, please elaborate.
Feel free to correct. Zengar, could we put this list in the first post or something?

Overmind, I dislike your approach because it screws up the pipeline. You are proposing - if I understand your idea correctly - a unit after the vertex processor that would be able to assemble the vertices into triangles or to create new vertices if needed. Basically, you create a feedback loop, changing the usual order of the pipeline. Also, I don’t really understand how you are going to do evaluators with it, as you’ll have to generate lots of vertices, resulting in a stall once more. One possible solution would be disabling the feedback and passing the vertices from your unit directly to the rasterizer. But then it would be limited to evaluator/tessellator-like stuff - simply spoken, the unit would just interpolate between processed vertices.
Actually, as I write this, I can see that your idea is indeed interesting as a performance/functionality trade-off (provided we allow no feedback to the vertex shader).

The other question was display lists - now, I don’t see how that’s an issue. Display lists are compiled execution buffers, that’s all. They don’t store all the state, but only static things, like vertex commands or enables.

Now back to my proposal - it is merely another way to set up geometry data, bypassing the usual beginning of the pipeline. In my system, the application doesn’t send in vertex data but - let’s say - mesh data, which will then be converted to vertex data by the geometry shader. While this is the most flexible system, it also has severe disadvantages, like:

  • slow performance (the geometry processor is not parallelised)
  • ineffective when we just need to interpolate (interpolation after the vertex shader would be more suitable)

The first issue could possibly be resolved by defining some sophisticated rules. A possibility would be to define a function that outputs vertex data based on some interpolated parameters (like a vertex id, or a mesh control point, etc.). If these parameters don’t depend on the output (and they shouldn’t), then we may start more than one instance of our function at the same time - packing every instruction inside a parallelised unit. The adaptation of my previous code would be something like:
  
// Hypothetical shader syntax, not real GLSL: ‘interpolated’ declares a
// per-instance counter, and vertex() is invoked once for each value of i.
attribute int start;
attribute int length;

interpolated int i from start to start+length step 1

vertex() {
    glNormal(...(i));   // look up/compute the normal for element i
    glVertex(...(i));   // emit vertex i
}

But of course, it gets a bit awkward :slight_smile:

OK. Here’s how I would probably speed up display lists if I were to write a driver and had no special display list hardware support:

  1. Allocate a vertex buffer (in fact a static VBO) and a command buffer
  2. Loop through the commands:
  • replacing all the glVertex commands with vertices in the vertex buffer
  • placing a DrawRange command for every glBegin/glEnd pair into the command buffer
  • placing all state commands into the command buffer
  3. When calling the list, bind the vertex buffer, loop through the command buffer and execute (so this would still be in software, but the vertices would already be on the GPU, saving cycles)

I hope this illustrates that even if you don’t process the whole display list in hardware, the driver still has options to speed up the processing of display lists by uploading them to the GPU.
In my opinion, that optimisation would be defeated if vertices were the input to the geometry program, and the geometry program was emulated on the CPU.
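In pseudo-C, the compile step could look roughly like this (all types and names are invented driver internals, of course):

typedef struct { float x, y, z; } Vertex;
typedef enum { REC_VERTEX, REC_END, REC_STATE } RecKind;
typedef struct { RecKind kind; Vertex v; int state; } RecordedCmd;
typedef enum { CMD_DRAW_RANGE, CMD_STATE } CmdOp;
typedef struct { CmdOp op; int first, count, state; } Cmd;

/* Compile a recorded display list into a static VBO plus a command buffer. */
static void compileList(const RecordedCmd *rec, int n,
                        Vertex *vbo, int *nVerts, Cmd *out, int *nCmds)
{
    int batchStart = 0;
    for (int i = 0; i < n; ++i) {
        switch (rec[i].kind) {
        case REC_VERTEX:                    /* glVertex & friends: upload once */
            vbo[(*nVerts)++] = rec[i].v;
            break;
        case REC_END: {                     /* glBegin/glEnd pair -> DrawRange */
            Cmd c = { CMD_DRAW_RANGE, batchStart, *nVerts - batchStart, 0 };
            out[(*nCmds)++] = c;
            batchStart = *nVerts;
            break;
        }
        default: {                          /* state commands pass through */
            Cmd c = { CMD_STATE, 0, 0, rec[i].state };
            out[(*nCmds)++] = c;
        }
        }
    }
}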

Now how to deal with it is another question.

  1. Documentation and fallback from VBOs to main memory arrays: you can just specify that if you want geometry programs, you potentially lose display list performance if display lists are used to supply the geometry. With the possible exception of a display list calling an array function.

  2. You can specify that direct mode geometry and geometry from display lists is not passed through to the geometry program; in other words, you use a function similar to glDrawArrays or glDrawElements to pass geometry to the GP.

  3. You can specify that geometry is generated by the GP instead of being passed through it. So no vertex arrays as input (at least not explicitly).

  4. You can place the GP behind the vertex program - but that means you can’t emulate it without also emulating the VP.

  5. Or you can go all the way and use GP as a replacement (including emulation) for display lists. (Difficult as stated before)

In an ideal world, I would prefer option 5, but that has enormous costs in silicon.
Option 1 would discourage but not prevent direct mode and display list use for passing geometry to the GP.
Options 2 and 3 would prevent direct mode and display list use for passing geometry to the GP,
but still allow the use of a display list to call the GP. Option 2 would still allow the use of vertex arrays as input.
Option 4 would discourage the use of GP on any GPU that doesn’t perform it in hardware (since the price you pay for emulation is software VP).

I’m not sure which I would prefer in a non-ideal world, but I’d have to say it’s 1,2 or 3.

BTW: it seems clear that any kind of buffer to be allocated by the driver for GP use, will have to be specified with that explicit purpose - so you get a main memory buffer if the GP is emulated, and GPU memory if it’s done in hardware.

PS: Zengar, that “severe performance penalty” is only relative to other GPU geometry program options. It’s still an improvement over the CPU option since there is less main bus traffic involved (provided you’re not getting software emulation of course).

Zengar, I like your last code sample :wink:

Perhaps something like a functional list processing language would be a good idea, because that would allow parallelism, too.

My original proposal would only work if we allowed the feedback loop to the vertex processing, but you’ve already convinced me earlier that this is not a good idea.

But it is possible to place the same program before the vertex shader. Just take away the ability to read varyings and add the ability to write attributes. The interpolate method would still be useful, but there would be a vertex constructor, too.

The basic idea is that you have a program that takes a “mesh” as input and writes some vertices (or triangles) as output. The program is called once per mesh. But what should this “mesh” be? Clearly it should have some attributes. But we already have such a primitive: a vertex. This is why I proposed using vertices as input for the program. It’s not necessary that the input vertices have anything to do with the output vertices; they could just contain arbitrary parameters used in the program.
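Roughly like this (invented syntax; emit stands for whatever mechanism hands a vertex to the fixed-function assembly):

uniform int steps;

// The input “vertices” here are really just mesh parameters - say, the two
// endpoints of a curve segment. The outputs are ordinary untransformed
// vertices that feed the normal vertex shader.
vertex a = gl_Vertex[0];
vertex b = gl_Vertex[1];

for (int i = 0; i <= steps; ++i)
    emit(interpolate(a, b, float(i) / float(steps)));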

About display lists: I don’t know what problem everybody has with display lists. They are just macros. If a display list doesn’t use geometry programs, there’s no difference. If geometry programs are hardware accelerated, so are display lists. And if geometry programs are in software and a display list uses them, there is no point in trying to execute the display list in hardware, as that’s not possible, no matter where in the pipeline geometry programs are located.

Finally some words about the list in the first post:

A1. Replace only evaluator functionality.
NO! Don’t confuse evaluation with tessellation. An evaluator is something that takes (u,v) as input and outputs (x,y,z). That’s the job of a vertex shader. The generation of a grid of (u,v) values in the range (0,0)-(1,1) is tessellation; this is what the geometry program should be able to do, and that’s what’s described in the second sentence of A1 (generating vertices).
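So the division of labour would be roughly (invented emit call; the patch evaluation is only an example):

// Geometry program (tessellation): generate the (u,v) parameter grid.
for (int i = 0; i <= n; ++i)
    for (int j = 0; j <= n; ++j)
        emit(vec2(float(i) / float(n), float(j) / float(n)));

// Vertex shader (evaluation): map each (u,v) to (x,y,z),
// e.g. vec3 p = bezierPatch(controlPoints, uv);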

A2. There are no byte-order issues.

One addition:
glEnable/Disable(GL_GEOMETRY_SHADER)
When enabled, the vertices that are submitted by the usual methods (immediate mode, vertex arrays) are fed to the geometry shader.
IMHO no need for an additional call…
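Usage would then simply be (with GL_GEOMETRY_SHADER as the hypothetical enable):

glEnable(GL_GEOMETRY_SHADER);
glDrawArrays(GL_TRIANGLES, 0, count);   // these vertices now go through the
                                        // geometry shader first
glDisable(GL_GEOMETRY_SHADER);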

vertex: It would be a struct that has exactly the members that the vertex program expects as input attributes.
gl_Vertex[n]: The index would advance with every invocation of the program; how far it advances could ideally be set somehow, with the simplest solution being to just increase it by the number of vertices that are used by the program (no overlap).

While I (obviously) agree with the idea that a geometry program should be able to create vertices (hell, AFAIK I wrote about it first! :slight_smile: (though I might be wrong) ), I do not agree with the idea that en/disable should affect vertices sent to the server.

In my idea, this would be a program that could run on the server and generate geometry within given matrices. As such, it would be straightforward for the hardware to answer the question “is this vertex within the frustum?”. I’m completely disregarding vertex programs here, which might actually put a vertex into the frustum again. Perhaps vertex programs should run for every vertex generated by a geometry program?

My idea has so far “only” been to be able to create vertices (and triangles, normals, s/t aka u/v coordinates, etc.) like we can today on the CPU, based on existing data already uploaded to the server anyhow.

To get back to what I proposed as a least common denominator, say we have a 16x16 terrain heightmap uploaded as a texture (object).

My idea is that this new program should be able to iterate over x and y of that texture (s & t) and freely generate geometry from it.

What I might have come to see is that the host sometimes needs to allocate memory. Imagine you want this program to generate vertices for a 64x64 patch - then you’d better have a 64x64 VBO allocated for it to use.

While it does put a burden on the caller, the caller is after all the one defining the program, and if it requests “tessellate like ATI npatch *3”, then you’d better have both the vertex and index space allocated, else it’d fail.

Still, I think this should go/be defined at the same level as a current glVertex call (even though I suggest it should have VBO write access).

I still think we should start with a simple x/y heightmap texture used just for geometry and work our way from there, just to see where or even if this fits in.

Don’t confuse evaluation with tessellation.
Fair enough. I just looked at the diagram and saw that evaluators are placed before the vertex transform, and after display lists.
I figured since evaluators are used for glu’s NURBS functionality (at least according to the redbook), that this functionality is similar to evaluators - but more flexible. The point is, though, that direct mode and display lists are placed before this functionality.

I don’t know what problem everybody has with display lists

Strangely enough, everybody does not have a problem with display lists. Only I seem to have one (and I realise that geometry in a display list is easily replaced with a vertex array).
However, if display list geometry is put into VBOs “under the hood”, that prevents software emulation of geometry programs if display list geometry is accepted. And display lists are “just macros” for direct mode - which you yourself suggested should be directable to the geometry program with a glEnable.
I described a number of ways to deal with it in my previous post though.
One is not to pass direct mode/display list vertices but to use an explicit call similar to DrawArrays for GPs.

A2. There are no byte-order issues.

How can you be so sure? If it’s a buffer with arbitrary data (which is probably necessary for such an enormous bit of functionality), that buffer would be filled by the programmer, not by the driver.
The point is kind of moot, because it’s prohibitively expensive to put that much functionality into hardware. But no harm in pointing out potential issues - just shows that option A2 was considered but deemed impractical.