NVidia: Where has GL_EXT_vertex_weighting gone?

Originally posted by Korval:
True though it may be, is your CPU doing anything but T&L? A GeForce 256 can get around 4-8M lit tris, but it frees up the CPU significantly.

With respect to two-matrix blending, we found that if properly interleaved, CPU/SSE-based blending could fit neatly in parallel with the rendering of the blended verts. So for a series of characters, blend A, draw A, blend B, draw B, etc…, the “draw A” and “blend B” phases could be done more or less in parallel. This required large batches of VAR’d verts to be rendered with a single glDrawElements call, which isn’t hard considering the blending puts the verts in a single coordinate space and the textures were pre-combined.

Someone may point out that the CPU could be doing something else during that time too, so it’s not truly “free.” But in that single-threaded app, without a fine-grain list of schedulable tasks, blending N verts and drawing N verts in parallel was a nicely balanced pair of activities.
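To make the blending half of that pipeline concrete, here is a minimal scalar sketch of two-matrix vertex blending; the names and layout are hypothetical stand-ins, and the real code used SSE rather than plain loops:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical scalar sketch of two-matrix blending: each position is
 * transformed by both bone matrices and the results are mixed by a
 * per-vertex weight, leaving all verts in one coordinate space so a
 * single glDrawElements can draw the whole batch. */
typedef struct { float m[12]; } Mat34;   /* 3x4 row-major transform */

static void xform(const Mat34 *M, const float v[3], float out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = M->m[r*4+0]*v[0] + M->m[r*4+1]*v[1]
               + M->m[r*4+2]*v[2] + M->m[r*4+3];
}

/* out[i] = w[i]*M0*pos[i] + (1 - w[i])*M1*pos[i] for n vertices. */
static void blend_verts(const Mat34 *M0, const Mat34 *M1,
                        const float *pos, const float *weight,
                        float *out, int n)
{
    for (int i = 0; i < n; ++i) {
        float a[3], b[3];
        xform(M0, pos + 3*i, a);
        xform(M1, pos + 3*i, b);
        float w = weight[i];
        for (int k = 0; k < 3; ++k)
            out[3*i+k] = w*a[k] + (1.0f - w)*b[k];
    }
}
```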

Avi

Originally posted by Cyranose:
[b] With respect to two-matrix blending, we found that if properly interleaved, CPU/SSE-based blending could fit neatly in parallel with the rendering of the blended verts. So for a series of characters, blend A, draw A, blend B, draw B, etc…, the “draw A” and “blend B” phases could be done more or less in parallel. This required large batches of VAR’d verts to be rendered with a single glDrawElements call, which isn’t hard considering the blending puts the verts in a single coordinate space and the textures were pre-combined.

Someone may point out that the CPU could be doing something else during that time too, so it’s not truly “free.” But in that single-threaded app, without a fine-grain list of schedulable tasks, blending N verts and drawing N verts in parallel was a nicely balanced pair of activities.

Avi[/b]

I second this. Interleaving CPU and GFX can give great performance. I have had occasions where this paid off even for something as simple as a texgen.

Interleaving CPU and GFX can give great performance.

If you don’t care about anything other than rendering a scene, of course. If, however, you are interested in, say, running a physics simulation or some kind of game, the CPU time is critical. It is important to these applications that as little CPU time is spent on graphics tasks as possible.

Someone may point out that the CPU could be doing something else during that time too, so it’s not truly “free.” But in that single-threaded app, without a fine-grain list of schedulable tasks, blending N verts and drawing N verts in parallel was a nicely balanced pair of activities.

Except that it takes CPU time that could, if well written, be spent on other tasks. Even other rendering-based tasks (creating matrices for the next character, generating state, even compiling dynamic shaders).

Originally posted by Korval:
Except that it takes CPU time that could, if well written, be spent on other tasks. Even other rendering-based tasks (creating matrices for the next character, generating state, even compiling dynamic shaders).

I think you may need to define “well written” in this case. For example, in most systems I’ve seen or worked on, character matrices, IK, physics, state determination/sorting and dynamic shaders are all necessarily resolved well before rendering begins. I’m sure it’s possible to design versions of those that work in parallel, but it could have other implications, such as adding a frame of latency…

The core of it is that scheduling arbitrary (or even dynamically picked) tasks to fit between draw calls is extremely difficult to make work consistently in your favor. If you miscalculate, you can wind up gaining nothing or causing the GPU to wait, which is the worst case. Relying on the Windows scheduler (for those of us who don’t use a real-time OS) doesn’t add much when the time slices are very small. What seems to work best (IMO) is finding balanced, predictable CPU and GPU tasks and pipelining those together. These are very small time slices, so something like a 50-cycle-per-vertex times N-vertex task, where the verts are computed, written to VBOs, and rendered immediately, is very predictable and pipelinable.
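The shape of that pipelined loop can be sketched as follows; `blend` and `draw` are hypothetical stand-ins (here they just log their call order), and in the real app `draw` would be an asynchronous glDrawElements on a VAR/VBO range, which is what lets the next blend overlap it:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of the pipelined loop: while the GPU draws character i, the
 * CPU blends character i+1.  GPU work is modelled by a log of submitted
 * calls so the interleaving is visible. */
#define NCHARS 3
static char log_buf[128];

static void blend(int i) { sprintf(log_buf + strlen(log_buf), "B%d ", i); }
static void draw(int i)  { sprintf(log_buf + strlen(log_buf), "D%d ", i); }

static void render_frame(void)
{
    blend(0);                      /* prime the pipeline */
    for (int i = 0; i < NCHARS; ++i) {
        draw(i);                   /* submit: GPU consumes verts of i     */
        if (i + 1 < NCHARS)
            blend(i + 1);          /* CPU blends i+1 while GPU draws i    */
    }
}
```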

As for GPU/CPU tradeoffs, keep in mind that using a few more CPU cycles in parallel with the GPU may make the rendering faster (by using simpler, even NOP shaders on occasion) and therefore free up more big chunks of time at the beginning and ends of the frame for other less pipelinable tasks, such as physics.

Avi

[edit: read NOP in this case to mean minimal, not literally no instructions. A shader needs to at least set the output vertex fields and perhaps do lighting here if nowhere else.]

[This message has been edited by Cyranose (edited 10-24-2003).]

Originally posted by Cyranose:
[b] These are very small time-slices and so something like a 50-cycle per vertex times N vertex task, where the verts are computed, written to VBOs, and rendered immediately is very predictable and pipelinable.

[/b]

I did exactly the same. 50 CPU cycles per vertex allows for some dot products for texgen, or for lighting, or for softskinning. The results are then written into a STREAM_DRAW buffer and streamed alongside the static rest of the vertex data (vertex positions, for instance). This approach has given me the fastest overall throughput I could ever get: not only peak performance for untextured, unlit triangles during the Z-layout pass, but the same performance for any render pass.
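As an illustration of the kind of cheap per-vertex job that fits in that budget, here is a hedged sketch of a planar texgen (a couple of dot products per vertex); the function name and plane layout are assumptions, and the `st` buffer would then be uploaded with glBufferData using GL_STREAM_DRAW and drawn immediately:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical planar texgen: project each position onto an s-plane and
 * a t-plane (each given as plane coefficients [a,b,c,d]), writing the
 * (s,t) pairs into a CPU-side buffer destined for a STREAM_DRAW VBO. */
static void texgen_planar(const float *pos, int n,
                          const float sp[4], const float tp[4],
                          float *st /* 2*n floats */)
{
    for (int i = 0; i < n; ++i) {
        const float *v = pos + 3*i;
        st[2*i+0] = sp[0]*v[0] + sp[1]*v[1] + sp[2]*v[2] + sp[3];
        st[2*i+1] = tp[0]*v[0] + tp[1]*v[1] + tp[2]*v[2] + tp[3];
    }
}
```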

Regarding multiplexing CPU time, I have had this idea but not tried anything in this direction yet. What I had in mind is a kind of multiplexer that keeps a list of tasks to do (calculating physics, for instance), and the renderer yields cycles to the multiplexer whenever the CPU would otherwise have to wait for the GFX. Not coded as multithreading, just cooperative scheduling with timeouts.
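That cooperative-multiplexer idea might look something like the following sketch; all names are hypothetical, and a real version would want a finer-grained timer than clock():

```c
#include <assert.h>
#include <time.h>

/* Hypothetical cooperative multiplexer: a list of pending tasks, and a
 * yield function the renderer calls whenever it would otherwise stall
 * waiting on the GPU.  Runs tasks until none remain or a time budget
 * expires; no threads, purely cooperative. */
typedef void (*task_fn)(void *);
typedef struct { task_fn fn; void *arg; } Task;

static Task tasks[16];
static int  ntasks = 0;

static void mux_add(task_fn fn, void *arg)
{
    tasks[ntasks].fn = fn;
    tasks[ntasks].arg = arg;
    ++ntasks;
}

/* Called by the renderer instead of busy-waiting on the GFX. */
static void mux_yield(double budget_sec)
{
    clock_t deadline = clock() + (clock_t)(budget_sec * CLOCKS_PER_SEC);
    while (ntasks > 0 && clock() < deadline) {
        Task t = tasks[--ntasks];      /* LIFO for simplicity */
        t.fn(t.arg);
    }
}

/* Example task: stands in for a slice of physics work. */
static void count_task(void *arg) { ++*(int *)arg; }
```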