NV_primitive_restart anyone?

Hi!

Has anyone tested the GL_NV_primitive_restart extension?

Linky

The possibility to effectively embed glDraw calls in the vertex data looks like a real improvement to me.
Unfortunately I'm not able to benchmark it (NV30 emulation); I just want to know whether it's worth spending some time on it.

I have used it, and it’s a thing of beauty, a simple, elegant solution.

Unfortunately, it doesn’t look like there’s going to be widespread support for this anytime soon, AFAIK. That’s the only problem I would see with being dependent on it.

Originally posted by Sean:
[b]I have used it, and it’s a thing of beauty, a simple, elegant solution.

Unfortunately, it doesn’t look like there’s going to be widespread support for this anytime soon, AFAIK. That’s the only problem I would see with being dependent on it.[/b]
Please, can you tell me how much faster it goes?
It doesn't need to be accurate; "somewhat faster" or "fast enough" would be enough.
I tried to use it, but it didn't fit my data path, so I decided to drop it. Still, I would like to hear those rumors.
Thank you in advance.

If your app is CPU-limited and makes hundreds or thousands of glDrawElements calls per frame, then you may benefit from this extension. It's difficult to quantify how much difference it makes, because it depends on many factors.

There is a small but significant CPU overhead each time you call glDrawElements (or similar). This extension allows you to reduce the number of those calls.

I haven't used this extension yet, but it looks extremely useful.
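For anyone wanting to try it, here is a minimal sketch of the usage (assuming the NV entry points have been loaded through your usual extension mechanism; 0xFFFF is just an arbitrary choice of restart value):

[code]
#define RESTART_INDEX 0xFFFF  /* arbitrary; any index not used by real vertices */

static const GLushort indices[] = {
    0, 1, 2, 3,        /* first strip                        */
    RESTART_INDEX,     /* "end this strip, start a new one"  */
    4, 5, 6, 7         /* second strip                       */
};

glEnableClientState(GL_PRIMITIVE_RESTART_NV);
glPrimitiveRestartIndexNV(RESTART_INDEX);
glDrawElements(GL_TRIANGLE_STRIP, 9, GL_UNSIGNED_SHORT, indices);
glDisableClientState(GL_PRIMITIVE_RESTART_NV);
[/code]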

@Obli
It was faster in the context of a LOD terrain. I’d like to be more specific, but I’ve since done away with the extension, growing weary of waiting for the ARB version. But I would use it just for the facility of primitive batching, without the performance perk.

You could get a big win for LOD terrains, though. Some algorithms require the insertion of degenerate triangles in order to form a single continuous tristrip (Lindstrom, et al). This is where it could really come in handy.

I’ve often wondered if it would be possible to extend this to accept a primitive type after the restart index.

End();
Begin(new_type);

This might make for some nifty batching possibilities, most notably in the context of generic meshes, which are rarely uniform wrt primitive sequencing.

The big chance I see is that triangle strips finally become effective, compared to a simple but cache-killing triangle list solution.

I’ve since done away with the extension, growing weary of waiting for the ARB version.
Why would you expect an ARB version? This is something that has to be built into the hardware specifically, and the choice for the restart value can easily be different depending on the hardware.

The big chance I see is that triangle strips finally become effective
I don’t know what strips you’re using, but they have almost always been effective. More effective than lists.

I don’t know what strips you’re using, but they have almost always been effective. More effective than lists.
Having multiple draw calls for every mesh can produce a "nice" overhead in the engine and driver.
Add some more draw calls because you have to change textures/shaders for a submesh, and you (i.e. me :wink: )
end up with too many calls and small batches.

Why would you expect an ARB version?
Most of the good ideas eventually make their way to the ARB. It just so happens that I think this extension is a particularly good one. I take it you do not agree?

This is something that has to be built into the hardware specifically, and the choice for the restart value can easily be different depending on the hardware.
I agree that each implementation will have to deal with the particulars, this is certainly the case for any extension. But I fail to see why this should dash my hopes for the extension making it to the ARB. Can you explain?

And as the restart index can be any user-defined 32-bit value (nVidia spec), I don't understand why this would hinder an implementation, since conformant implementations have to support uint indices anyway.

Having multiple draw calls for every mesh can produce a "nice" overhead in the engine and driver.
Add some more draw calls because you have to change textures/shaders for a submesh, and you (i.e. me)
end up with too many calls and small batches.
Unless, of course, you stitch all your strips together with degenerate triangles. And if you're doing material changes in the middle of your mesh, you already have problems that no primitive restart can solve.
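For reference, the stitching trick looks like this (no extension needed). Repeating the last index of one strip and the first index of the next yields zero-area triangles that the hardware culls; if the first strip had an odd vertex count, you would need one extra duplicate index to keep the winding right:

[code]
static const GLushort stitched[] = {
    0, 1, 2, 3,    /* strip A                                              */
    3, 4,          /* degenerate "bridge" (strip A has an even vertex count) */
    4, 5, 6, 7     /* strip B                                              */
};

glDrawElements(GL_TRIANGLE_STRIP, 10, GL_UNSIGNED_SHORT, stitched);
[/code]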

I take it you do not agree?
It’s a fine idea, but the only implementation that gives good performance is one where the hardware actively understands the restart command. It’s like VAR; it’s very specific to a particular hardware implementation.

And as the restart index can be any user-defined 32-bit value (nVidia spec)
Really? I haven’t read the spec in a while (I don’t use nVidia hardware at the moment, so NV extensions aren’t something I keep up with).

Even so, one would still need to make hardware to do this. To be honest, there are more important performance problems to be solved before taking this one on.

Well, primitive initialisation is quite expensive anyway… that's why in most cases the best configuration is one primitive = one strip, to get the best performance. Too bad that we can't specify/load a modelview matrix in conjunction with NV_primitive_restart; that would be a severe plus performance-wise, for instance for identical primitives at different locations. Maybe there is a trick for this kind of situation, but it doesn't seem possible according to the spec. BTW, why can't this kind of feature be added if this NV extension goes ARB? What are the real limitations behind this, when you only need to apply a new modelview matrix while the rest of the primitive description remains unchanged?

My engine internally works with polygons instead of triangles. I "triangulate" them when filling the buffer for a draw call, by simply sorting the indices in such a way that each polygon gets sent as several triangles.
This has some nice advantages: for collision detection etc. I can work with a lot fewer polygons than I would have to with triangles, and by reusing the vertices of a polygon for several triangles, I can make heavy use of the pre- and post-T&L caches.

Now I was thinking: could it speed up the engine if I used the primitive_restart extension (although at the moment I have a Radeon) and sent polygons instead of triangles? Most polygons are actually quads, and each quad gets split into 2 triangles, which is 6 vertices instead of 4.
I am not sure if I would get a speedup in this case, so what do you think?

On the other hand, the spec says:

Is it feasible to guarantee fast performance even in the non-VAR, non-CVA, non-DRE case??? Possibly not.

So? What about VBOs? I don't see why this should be restricted to VAR.

Jan.

Originally posted by Jan:
[b]My engine internally works with polygons instead of triangles. I "triangulate" them when filling the buffer for a draw call, by simply sorting the indices in such a way that each polygon gets sent as several triangles.
This has some nice advantages: for collision detection etc. I can work with a lot fewer polygons than I would have to with triangles, and by reusing the vertices of a polygon for several triangles, I can make heavy use of the pre- and post-T&L caches.

Now I was thinking: could it speed up the engine if I used the primitive_restart extension (although at the moment I have a Radeon) and sent polygons instead of triangles? Most polygons are actually quads, and each quad gets split into 2 triangles, which is 6 vertices instead of 4.[/b]
No, they are certainly 6 indices instead of 4, not vertices.
With NV_primitive_restart it would be 5 indices, and that’s it. Everything else (namely vertex traffic, index numeric range, post-transform cache hits) is the same. I really wouldn’t expect to gain anything in that special case.

It starts to make more sense for batches of larger polygons, and for strips and fans. For a convex polygon with n vertices, you need (n-2)*3 indices if you "tessellate" it into indexed triangles, but only n+1 (counting the restart index) if you send it as a fan. Still, the only thing you can really save by using this extension is index traffic.
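To put numbers on that, a quick sketch of the index counts (just the arithmetic from above, nothing measured):

[code]
#include <stdio.h>

/* Index traffic for a convex n-gon: indexed GL_TRIANGLES vs. a
   GL_TRIANGLE_FAN terminated by a restart index. */
int main(void)
{
    for (unsigned n = 3; n <= 8; ++n) {
        unsigned asTriangles = (n - 2) * 3;  /* "tessellated" into indexed triangles */
        unsigned asFan       = n + 1;        /* fan vertices + one restart index     */
        printf("n=%u: triangles need %u indices, fan+restart needs %u\n",
               n, asTriangles, asFan);
    }
    return 0;
}
[/code]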

Originally posted by zeckensack:

It starts to make more sense for batches of larger polygons, and for strips and fans. For a convex polygon with n vertices, you need (n-2)*3 indices if you "tessellate" it into indexed triangles, but only n+1 (counting the restart index) if you send it as a fan. Still, the only thing you can really save by using this extension is index traffic.

The idea is to reduce function call overhead, kind of like the multi_draw_arrays extension.

But instead of making multiple calls with the primitive being GL_TRIANGLE_STRIP, or making use of dead triangles, you can expand the index array and make a single call with GL_TRIANGLES being the primitive.
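For comparison, the multi_draw_arrays route keeps separate strips but collapses the call overhead into a single entry point (a sketch; the two little index arrays are made up):

[code]
static const GLsizei  counts[2] = { 4, 4 };
static const GLushort strip0[]  = { 0, 1, 2, 3 };
static const GLushort strip1[]  = { 4, 5, 6, 7 };
static const GLvoid  *lists[2]  = { strip0, strip1 };

/* One call, two strips; the driver may still iterate internally. */
glMultiDrawElementsEXT(GL_TRIANGLE_STRIP, counts, GL_UNSIGNED_SHORT, lists, 2);
[/code]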

How many glDrawElements (or whatever) calls are you making per model?

IIRC, using dead triangles doesn’t cost too much now on NVidia. Not sure about the others.

What is recommended in this area with the next gen hardware?

For the interested reader:
http://www.nvidia.com/dev_content/nvopenglspecs/GL_NV_primitive_restart.txt

To be honest, there are more important performance problems to be solved before taking this one on.
@Korval

I’d be willing to go along with that, if batching wasn’t such an important issue, and IHVs were unable to walk and chew gum at the same time (work on more than one pipe issue at a time) :slight_smile: .

But your point is well taken: The extension is more or less targeted at triangle strips at this point and, as such, is not a generic solution, and not likely to get the lion’s share of attention across the board.

Now I was thinking: could it speed up the engine if I used the primitive_restart extension (although at the moment I have a Radeon) and sent polygons instead of triangles? Most polygons are actually quads, and each quad gets split into 2 triangles, which is 6 vertices instead of 4.
I am not sure if I would get a speedup in this case, so what do you think?
@Jan
The best way to utilize this extension is to group your geometry into strips or fans, inserting a restart index wherever there's a break. It will work with triangles and quads too, but there's not much to gain from it there. Unfortunately, at this time, you can't change the primitive type mid-stream (that sure would be cool, IMHO).
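Something like this, say for a batch of terrain fans (a sketch; numFans, fanStart, fanLength and MAX_INDICES are made-up names for whatever your terrain code provides):

[code]
#define RESTART_INDEX 0xFFFFFFFFu
#define MAX_INDICES   65536              /* made-up capacity */

extern int numFans;                      /* made-up: number of fans            */
extern int fanStart[], fanLength[];      /* made-up: first index / vertex count */

GLuint  indices[MAX_INDICES];
GLsizei count = 0;

for (int f = 0; f < numFans; ++f) {
    if (f > 0)
        indices[count++] = RESTART_INDEX;       /* break between fans */
    for (int v = 0; v < fanLength[f]; ++v)
        indices[count++] = fanStart[f] + v;     /* indices of fan f   */
}

glEnableClientState(GL_PRIMITIVE_RESTART_NV);
glPrimitiveRestartIndexNV(RESTART_INDEX);
glDrawElements(GL_TRIANGLE_FAN, count, GL_UNSIGNED_INT, indices);
glDisableClientState(GL_PRIMITIVE_RESTART_NV);
[/code]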

The idea is to reduce function call overhead, kind of like the multi_draw_arrays extension.
@V-man
This is a big win indeed, if you have lots of geometry.

IIRC, using dead triangles doesn’t cost too much now on NVidia. Not sure about the others.
That's a good point. Today's hardware can handle degenerate triangles quite handily.

What is recommended in this area with the next gen hardware?
Well, batching is likely to be a huge issue for the foreseeable future; the question then is whether this extension, or its ilk, will be part of the solution. I wish I knew the answer to that one. Maybe there's something leaner lurking out there.

if batching wasn’t such an important issue, and IHVs were unable to walk and chew gum at the same time (work on more than one pipe issue at a time) .
Batching isn’t that important of an issue on OpenGL. You have some overhead for calling glDraw* 5 times rather than 1 (more than mere function call), but good VBO use is far more important than that. The marshalling of GPU commands is very good these days.

If hardware makers could get state changes to be less costly, that would go much more to rendering performance than any primitive restart.

Batching isn’t that important of an issue on OpenGL.
I disagree. I see batching as among the biggest problems in the future of graphics. As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.

You have some overhead for calling glDraw* 5 times rather than 1 (more than mere function call), but good VBO use is far more important than that.
We're talking about thousands of calls here, possibly many more. It depends on many factors.
But this extension is orthogonal to VBOs. It’s not enough to simply give the driver the vertex data; you have to tell the driver what to do with it. VBOs are a great way to manage data, but you still have to issue draw commands.

If hardware makers could get state changes to be less costly, that would go much more to rendering performance than any primitive restart.
I agree. But why can’t we have both? The batching issue has to be addressed by someone. I can do everything possible to optimize my side, but eventually, I have to tell the driver what to do. This extension simply makes that communication more efficient.

Alas, the point of this discussion is probably moot, as there doesn't seem to be any sign of ATi joining the throng, AFAIK.

BTW, I never meant to suggest that this was a cure-all. I just think it's a pretty darn good idea.

I see batching as among the biggest problems in the future of graphics. As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.
You seem to misunderstand my point.

Let’s say you have a program with plenty of CPU time to spare. So, you decide to change the stripping. For every 1 strip, wherever possible, you split it up into 5. Hence, you will need to call glDraw* 5x more than before.

Assuming that the program was vertex-transfer limited to begin with, the performance will drop primarily because of caching behavior dealing with vertex data. That is, the card works best when it reads a long, unbroken string of indices. You can mitigate this easily enough by putting the indices into a contiguous array. At this point, the performance penalty comes from only 3 potential places:

1: Function call overhead on glDraw*. In our case, we have plenty of CPU time, so this is negligible.

2: Driver marshalling of GPU commands. The boneheaded way of implementing glDraw* is to immediately put the commands into the GPU’s FIFO, which could require a switch to Ring0 on the CPU (a slow operation). Few GL drivers do it this way. Drivers marshal GPU commands pretty efficiently these days.

3: Some oddball GPU problem. For whatever reason, the GPU has some significant delay between primitive batches. I have no factual, or even speculative, reason why a significant delay would exist.

1 is trivial, 3 doesn’t exist, and drivers are pretty good at 2. Where’s the batching problem?

Now, you might have read a PDF on nVidia’s site about the importance of batching primitives. They suggest taking drastic measures to get large batches of primitives, because a 1GHz CPU only gets something like 10,000 batches. This PDF only refers to D3D, because D3D can’t do #2 well at all. It has to use the “boneheaded” method, because of how the D3D driver model works. GL drivers can, and do, perform appropriate marshalling of GL commands.

None of this is to say that you can send a mesh as a sequence of 1-triangle-sized glDraw* calls. While #3 may not be significant, it is still there, and when rendering large numbers of polygons it can add up quickly. But for realistic call counts, it is quite negligible.

Note that this assumes the use of VBO index buffers as well as ATi hardware. I'm not sure about FX hardware, but I do recall that nVidia hardware through the GeForce 4 definitely had issues with the concept of index buffers. While they support VBO index buffers well enough, it is clearly stated that the buffer object containing indices should be a different object from the one containing the actual mesh data, as this allows for implementations that can't handle indices in video/AGP memory. The general assumption about this level of nVidia hardware was that the driver, upon receiving a glDraw* command, was required to copy the given indices directly into the FIFO/marshal queue, which obviously doesn't work well if they are in video/AGP memory.

ATi hardware of R300 caliber or better (if not R200 hardware) doesn't have this limitation. As such, all it needs to do is copy a 16-32 byte instruction opcode sequence into the FIFO (telling the GPU where the index buffer is, how long it is, and its format) for each glDraw* operation.

It is likely that NV30 fixed the nVidia issue, since NV30 supports primitive restart, which presupposes a better command processor/primitive unit.
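For what it's worth, the "separate buffer object for indices" advice looks like this with ARB_vertex_buffer_object (a sketch; the vertex and index arrays are placeholder data):

[code]
GLuint   bufs[2];
GLfloat  vertices[] = { 0.0f,0.0f,0.0f,  1.0f,0.0f,0.0f,  0.0f,1.0f,0.0f };  /* placeholder */
GLushort indices[]  = { 0, 1, 2 };                                           /* placeholder */

glGenBuffersARB(2, bufs);

/* vertex data in its own buffer object... */
glBindBufferARB(GL_ARRAY_BUFFER_ARB, bufs[0]);
glBufferDataARB(GL_ARRAY_BUFFER_ARB, sizeof(vertices), vertices, GL_STATIC_DRAW_ARB);
glVertexPointer(3, GL_FLOAT, 0, (const GLvoid *)0);
glEnableClientState(GL_VERTEX_ARRAY);

/* ...and the indices in a separate one, so the driver can place each optimally */
glBindBufferARB(GL_ELEMENT_ARRAY_BUFFER_ARB, bufs[1]);
glBufferDataARB(GL_ELEMENT_ARRAY_BUFFER_ARB, sizeof(indices), indices, GL_STATIC_DRAW_ARB);

glDrawElements(GL_TRIANGLES, 3, GL_UNSIGNED_SHORT, (const GLvoid *)0);
[/code]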

The batching issue has to be addressed by someone.
The batching issue is, to my mind, resolved with degenerate triangles in strips, with one exception: triangle fans. I would dearly love to fan my terrain, but I can't, due to the performance impact.

As such, the only times I make multiple glDraw* calls are for either particles or for state changes. My batches tend to be broken up by state changes far more than by anything else.

Originally posted by Sean:
As worlds and characters get ever more complex, issuing all the draw commands will weigh heavily on performance.

As worlds and characters get more complex, the number of triangles obviously increases, which means more triangles per draw call, no?

Maybe I didn't understand your point. I agree that ways to decrease batching overhead, like varying vertex stream frequencies and instancing, would be useful for many reasons, but I really don't think increasing geometric complexity is one of them.

You shouldn't be comparing NV_primitive_restart-assisted rendering of fans, strips and polygons to "normal" rendering of the same.

Rendering large numbers of these primitive types may be prohibitively expensive because of whatever per-call overhead there is, but then, you shouldn't be doing that anyway. What you do is use GL_TRIANGLES as the primitive type, and use indices. This is far more efficient, and it is what you should use as a baseline of comparison when you try to figure out what NV_primitive_restart can do for you.

And if you use degenerates, there is no issue to begin with.

PS: the notes about indirect rendering seem very plausible. We’re talking about direct rendering here, right?