Acronym city: H/W T&L vs. S/W transforms

Okay, I really need to bounce some ideas around and nobody in the room here has the slightest idea of what I’m talking about - so I’m hoping you guys might know.

I want my program to take advantage of hardware T&L, so naturally the position of each object in the view frustum is calculated by glMultMatrix. My ‘problem’:

I’m using my own picking algorithm and really I’d like to test against the lowest detail mesh of a given object. Of course, these are not transformed outside of the OpenGL pipeline, so I have to do my own transformations on them. This is kinda contrary to the whole point of having hardware T&L in the first place. Are the glGet mechanisms and feedback buffer really that slow? I know I will have to use the CPU to transform my low detail meshes, so would using bounding boxes be better instead? Any tips to speed that up?

This got me worried about my ‘software culling’ too. What about glGetFloatv on the modelview and projection matrices? Is that potentially slow too? Should I be using my own math to figure out these matrices for my plane equations?


The 3D pipeline is a one-way street. Vertices, textures, and state go in, pixels come out. Selection and feedback directly contradict this paradigm. We implement selection and feedback in SW, and I presume that every other implementor except for a few old SGI machines does the same. We have no plans to ever accelerate these features.

If you’re doing your own picking, it’s better to implement your own algorithm anyway. That way, you can save time with bounding box/sphere tests, or better yet, hierarchical bounding box/sphere tests. You can also do direct intersection tests with certain types of objects (spheres, cylinders, etc.) rather than converting them into polygons. You also should be using the low-detail models for these.
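As a rough illustration of the bounding-sphere test mentioned above, here is a minimal sketch in C. The names (`Vec3`, `ray_hits_sphere`) are made up for this example, and it assumes the pick ray and the sphere center are already in the same coordinate space (e.g. world space).

```c
#include <assert.h>
#include <stdbool.h>

typedef struct { float x, y, z; } Vec3;   /* hypothetical helper type */

static float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

static Vec3 sub3(Vec3 a, Vec3 b)
{
    Vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

/* Returns true if the ray origin + t*dir (t >= 0) touches the sphere.
 * dir does not have to be normalized. */
bool ray_hits_sphere(Vec3 origin, Vec3 dir, Vec3 center, float radius)
{
    assert(radius >= 0.0f);

    Vec3  m  = sub3(center, origin);          /* vector from ray origin to center */
    float mm = dot3(m, m);
    float r2 = radius * radius;

    if (mm <= r2)
        return true;                          /* ray starts inside the sphere */

    float b = dot3(m, dir);
    if (b <= 0.0f)
        return false;                         /* sphere is behind the ray */

    /* Squared distance from the center to the ray's line is mm - b*b/dd.
     * Multiply through by dd to avoid the division. */
    float dd = dot3(dir, dir);
    return mm * dd - b * b <= r2 * dd;
}
```

Only objects whose sphere passes this test would need the more expensive per-triangle test against the low-detail mesh.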

I consider picking to be a subset of the general game problem of physics: collision detection, collision handling, intersection testing, simulation, …

As such, it is clearly handled best by the application and by the CPU.

glGetFloatv won’t be too slow for us, although you should be warned that it hits the world’s largest switch statement, which will kill your branch prediction. In general, you shouldn’t assume that these are fast – we might decide to not keep any cached state on the CPU at some point in the future, at which point we’d have to do readbacks of the HW state. That’d be very slow.

So I’d recommend recalculating the matrices in SW rather than getting them back from OpenGL.

- Matt

Okidokie Matt, as always - thank you very much.

[This message has been edited by Pauly (edited 11-28-2000).]

It’s funny (and off-topic), but I’ve thought of this before:

Doing a quick (probably wrong) calculation, Nvidia claims 30 million triangles/sec. Assuming this is geometry limited (which I’ve heard it isn’t) and triangle strips, that means 10 million floating point matrix-vector multiplies per second. In total, this is equivalent to 2.5 million 4x4 matrix multiplies per second. If there is HW support for 16 lights, that’s enough math for another 2.5 million matrix multiplies per second. So a conservative estimate is that the GPU can do 5 million 4x4 matrix multiplies per second. That’s over half a gigaflop (640 million flops), which is probably equivalent to the best Pentiums that run at four times the clock speed. There must also be a lot of leftover silicon for rasterization and such.
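The arithmetic in that estimate can be written out as a small sketch. It takes the post's figures as given and assumes a 4x4 matrix multiply is counted as 64 multiply-accumulates at 2 flops each (128 flops), which is the count that reproduces the 640 million figure. The function names are made up for the example.

```c
/* One 4x4 matrix-matrix multiply: 4*4*4 multiply-accumulates,
 * counted here as 2 flops each (assumption, see above). */
enum { FLOPS_PER_MAT4_MUL = 4 * 4 * 4 * 2 };

/* Four matrix-vector multiplies do the work of one 4x4 matrix-matrix multiply. */
double mat4_muls_per_sec(double matvec_per_sec)
{
    return matvec_per_sec / 4.0;
}

/* Transform work plus an equal amount assumed for lighting, as in the post. */
double estimated_flops(double matvec_per_sec)
{
    double transform = mat4_muls_per_sec(matvec_per_sec);
    double lighting  = transform;             /* "another 2.5 M" in the post */
    return (transform + lighting) * FLOPS_PER_MAT4_MUL;
}
```

With the post's 10 million matrix-vector multiplies per second, `estimated_flops(10e6)` gives 5 million matrix multiplies times 128 flops, i.e. 640 million flops.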

To make a Slashdot joke, imagine a Beowulf cluster of those! Seriously, putting 4 of these in a computer’s PCI slots would be cheaper than a 4-way SMP board, plus they come with at least 32 megs of RAM per node. Has Nvidia thought of selling a modified graphics card like this to the scientific community? The stuff I used to work on with lattice QCD basically just multiplied 3x3 complex matrices constantly. If I remember right, a 3x3 complex matrix has a real 8x8 representation, which could be split into 4x4 matrix multiplies. Just hardware accelerate that glGetFloatv() and you might have a huge audience in the supercomputer market.

Just a thought


Something for Matt and his work buddies to think about

Umm… I’m sure they know how many mults
their hardware does, already :)

The trick is, those are specialized mults.
You can’t take them out of their context.
There is limited to no control flow
capability. The memory subsystem is designed
ONLY for streaming in textures and vertex
data. Etc etc etc.

Outperforming Intel CPUs on floating point
at lower clock speeds has been done for years
and years; even the PPC does it (assuming
you can get the data in there quick enough,
which the Motorola chipsets typically can’t).

For impressive general-purpose floating-point
performance, look up the new TI C6xxx DSP
chips, or the AD TigerSHARC. Those are
gigaflop-class devices. Yummy! But I bet
they’d suck at rasterizing pixels :)