T&L

Just a thought, hope someone has an answer :
In Quake3, as in any 3D engine, transformations are done by the engine (i.e. not by OpenGL’s matrices) to walk through the BSP, depth-sort the transparent faces, etc. So the question is: how does Quake3 take advantage of hardware T&L?

Joker: if I leave the matrices as the identity, does OpenGL perform the transformations anyway, or does it skip this step, knowing the vertices would remain the same?

Originally posted by Antoche:
if I leave the matrices as the identity, does OpenGL perform the transformations anyway, or does it skip this step, knowing the vertices would remain the same?

Good OpenGL implementations do it this way. But I don’t know which implementations. :)

Kosta

The only way to take advantage of hardware transform and lighting in OpenGL is to use the OpenGL API. So Quake 3 just has some code that detects whether the card is a GeForce. If it is, then it uses the OpenGL API. If not, it uses SSE, 3DNow!, or “by hand” optimisation for the transformation.

If the implementation is clever enough, it won’t calculate the transformed vertex. Even if it does, it’s only going to be 4 multiplications, so that shouldn’t really bog down the system much. The FPU is still quite fast.

So, if Quake III uses the OpenGL API to take advantage of hardware T&L, how does it retrieve the transformed coordinates of the vertices? I believed it wasn’t possible. And don’t tell me it uses feedback mode.

I don’t understand why you want the transformed vertices. You don’t need them to walk through a bsp tree. Anyway the whole point of using a bsp tree is to have the smallest possible number of vertices TO BE transformed.

The thing you need to walk through a bsp is the position and direction of the viewer.
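To illustrate why only the viewer position is needed, here is a minimal sketch of a front-to-back BSP walk. The `BspNode` layout and the function names are made up for this example and are not Quake 3's actual structures; the only point is that each step is one plane test against the eye position, with no vertex being transformed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node layout, for illustration only. */
typedef struct BspNode {
    float normal[3];        /* splitting plane: dot(normal, p) == dist */
    float dist;
    struct BspNode *front;  /* child on the side the normal points to */
    struct BspNode *back;   /* child on the opposite side */
    int leaf_id;            /* >= 0 for a leaf, -1 for an interior node */
} BspNode;

/* Signed distance from point p to the node's splitting plane. */
static float plane_side(const BspNode *n, const float p[3])
{
    return n->normal[0] * p[0] + n->normal[1] * p[1]
         + n->normal[2] * p[2] - n->dist;
}

/* Appends leaf ids to out[] in front-to-back order as seen from eye.
   Returns the updated number of entries stored in out[]. */
static size_t bsp_front_to_back(const BspNode *n, const float eye[3],
                                int *out, size_t count, size_t cap)
{
    if (n == NULL)
        return count;
    if (n->leaf_id >= 0) {
        assert(count < cap);  /* caller must provide enough room */
        out[count++] = n->leaf_id;
        return count;
    }
    if (plane_side(n, eye) >= 0.0f) {
        /* Eye is on the front side: that subtree is nearer. */
        count = bsp_front_to_back(n->front, eye, out, count, cap);
        count = bsp_front_to_back(n->back, eye, out, count, cap);
    } else {
        count = bsp_front_to_back(n->back, eye, out, count, cap);
        count = bsp_front_to_back(n->front, eye, out, count, cap);
    }
    return count;
}
```

Swapping the two recursive calls in each branch gives back-to-front order, which is what you would want for depth-sorted transparent faces.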

Well, if I understand the principles of the BSP, the aim is to check whether the bounding box of each node is in the clipping volume, and if not, all its children are discarded. So to check whether a bounding box is in the volume, I must first transform it, mustn’t I?

The only way to take advantage of hardware transform and lighting in OpenGL is to use the OpenGL API. So Quake 3 just has some code that detects whether the card is a GeForce. If it is, then it uses the OpenGL API. If not, it uses SSE, 3DNow!, or “by hand” optimisation for the transformation.

The transformation is always done through the OpenGL API, and Q3 does not do it any other way … so if the hardware supports T&L it’s gonna be used. You can’t do your own transformation. The lighting, however, can be turned on and off, and Q3 doesn’t use OpenGL’s vertex lighting by default since it’s not as nice as lightmapping …

If the implementation is clever enough, it won’t calculate the transformed vertex. Even if it does, it’s only going to be 4 multiplications, so that shouldn’t really bog down the system much. The FPU is still quite fast.

Hmm … it’s more than 4 multiplications … a 4x4 matrix multiplied with a vector is gonna be 16 multiplications and 12 additions (and adds are even slower than mults with floats …)
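For reference, the count is easy to check by writing the product out. This is just a plain C sketch (the function name is mine); it uses the column-major layout that glLoadMatrixf expects. Each of the 4 output components costs 4 multiplications and 3 additions, hence 16 and 12 in total.

```c
#include <assert.h>

/* out = m * v, with m a column-major 4x4 matrix (element (row, col)
   is stored at m[col * 4 + row]). */
static void mat4_mul_vec4(const float m[16], const float v[4], float out[4])
{
    assert(out != v);  /* in-place use would overwrite inputs still needed */
    for (int row = 0; row < 4; ++row) {
        /* 4 multiplications and 3 additions per output component */
        out[row] = m[0 * 4 + row] * v[0]
                 + m[1 * 4 + row] * v[1]
                 + m[2 * 4 + row] * v[2]
                 + m[3 * 4 + row] * v[3];
    }
}
```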

You can’t do your own transformation

Yes you can! Simply multiply all your vertices by your own transformation matrix. This matrix can combine rotations around the 3 axes, scaling and translation in one, instead of using 5 OpenGL calls.

Then you just draw all your faces using the transformed vertices and the identity matrix as the modelview matrix.
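A rough sketch of what that could look like in C, assuming column-major matrices as OpenGL uses them. The helper names are made up for the example, and `transform_points` assumes an affine matrix (bottom row 0 0 0 1), which holds for any combination of rotation, scaling and translation.

```c
#include <assert.h>
#include <stddef.h>

/* out = a * b for column-major 4x4 matrices: applying out to a vertex
   is the same as applying b first, then a. */
static void mat4_mul(const float a[16], const float b[16], float out[16])
{
    assert(out != a && out != b);  /* result must not alias the inputs */
    for (int col = 0; col < 4; ++col) {
        for (int row = 0; row < 4; ++row) {
            float s = 0.0f;
            for (int k = 0; k < 4; ++k)
                s += a[k * 4 + row] * b[col * 4 + k];
            out[col * 4 + row] = s;
        }
    }
}

/* Transforms n packed xyz points in place by the affine matrix m.
   Assumes w == 1 for every point and a bottom row of 0 0 0 1. */
static void transform_points(const float m[16], float *xyz, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float x = xyz[3 * i], y = xyz[3 * i + 1], z = xyz[3 * i + 2];
        xyz[3 * i]     = m[0] * x + m[4] * y + m[8] * z  + m[12];
        xyz[3 * i + 1] = m[1] * x + m[5] * y + m[9] * z  + m[13];
        xyz[3 * i + 2] = m[2] * x + m[6] * y + m[10] * z + m[14];
    }
}
```

You would build the combined matrix once with `mat4_mul`, run `transform_points` over the vertex array, call glLoadIdentity() on the modelview matrix and then submit the already transformed vertices.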

Originally posted by Humus:
Hmm … it’s more than 4 multiplications … a 4x4 matrix multiplied with a vector is gonna be 16 multiplications and 12 additions (and adds are even slower than mults with floats …)

I’ll just be a little more precise. If you do your own translation and rotation, then your modelview matrix is the identity matrix, which gives only four multiplications that are added for the transformation. Of course you still have to multiply with the projection matrix, but that wasn’t included in what I was talking about. But you could still do your own projection calculation and use a projection matrix equal to the identity!

And yes, you don’t need OpenGL to do your transformations. You just rotate and translate using your own matrices (or quaternions!) and then send the transformed vertices to OpenGL.

[This message has been edited by Gorg (edited 07-02-2000).]

Originally posted by Antoche:
Well, if I understand the principles of the BSP, the aim is to check whether the bounding box of each node is in the clipping volume, and if not, all its children are discarded. So to check whether a bounding box is in the volume, I must first transform it, mustn’t I?

You can either transform all shapes and check them against a fixed clipping frustum. That would be slow and stupid.

Or you can transform the clipping frustum and check all untransformed shapes against that. That would be fast and smart.

Once that’s done, actually rendering the stuff uses OpenGL, and any transforms you apply in rendering will be accelerated by hardware T&L(&C) if available.

However, all the checks and stuff can’t be done in hardware for the foreseeable future, so hand-coding SSE and 3DNow! code seems to be a necessity for a little while yet.
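As a sketch of the "fast and smart" variant above: once the frustum planes are expressed in world space (computed once per frame from the camera position and orientation, not shown here), each untransformed bounding box needs only one dot product per plane. The `Plane` and `Aabb` types and the function name are assumptions for this example, not anything taken from Quake 3.

```c
#include <assert.h>
#include <stddef.h>

/* World-space plane; a point p is inside when dot(n, p) + d >= 0. */
typedef struct { float n[3]; float d; } Plane;

/* Axis-aligned bounding box in world space (never transformed). */
typedef struct { float min[3]; float max[3]; } Aabb;

/* Returns 0 if the box lies entirely outside at least one plane, so the
   node and all its children can be discarded. Returns 1 otherwise; this
   is conservative, so a box near a frustum corner may pass even though
   it is not actually visible. */
static int aabb_in_frustum(const Plane *planes, size_t count, const Aabb *box)
{
    assert(planes != NULL && box != NULL);
    for (size_t i = 0; i < count; ++i) {
        /* Evaluate the plane at the box corner farthest along its normal. */
        float dist = planes[i].d;
        for (int a = 0; a < 3; ++a) {
            float corner = planes[i].n[a] >= 0.0f ? box->max[a] : box->min[a];
            dist += planes[i].n[a] * corner;
        }
        if (dist < 0.0f)
            return 0;  /* even the best corner is behind this plane */
    }
    return 1;
}
```

With 6 planes that is at most 18 multiplications per box, against 8 full corner transforms if you moved the box into clip space instead.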

OK, that was a stupid question, I didn’t think.
But how can I use 3DNow! and all the other stuff? Could you point me to a paper, please?

Originally posted by Antoche:
OK, that was a stupid question, I didn’t think.
But how can I use 3DNow! and all the other stuff? Could you point me to a paper, please?

First, implement your engine.
Second, make your engine correct.
Third, make your engine as fast as it can be in C/C++, using a profiler and some judicious data structure design (remember: the cache can be your best friend – or worst enemy!)

THEN use the profiler to figure out which calculation-bound functions benefit from hand tweaking, and re-write those in assembly. First once for regular floating point for the P-II/Celeron core. Then once for the P-III/SSE core. Then once again for the AMD 3DNow! core.

Here’s some places to get you started:
http://developer.intel.com/software/idap/tools/perfopt/technical.htm
http://developer.intel.com/software/idap/resources/technical_collateral/pentiumiii/index.htm
http://www.amd.com/products/cpg/k623d/inside3d.html

Get hold of a copy of nasm to actually assemble the stuff, too. Nasm is not only extensible enough to macro anything you want, it is also portable to a variety of platforms (just like you want your software to be, right?)

to Humus:

and adds are even slower than mults with floats …

Wrong. Usually fmul and fadd take the same time (on P6/PII/PIII fmul is slower and has half the throughput).

( Latency / Throughput )

Pentium(MMX) : 3/1
P6, PII, PIII : 3/1 (fadd), 5/2 (fmul)
PIII, SSE : 4/2 (addps), 5/2 (mulps) - float[4]
K6(-x) : 2/2
K6-x, 3DNow! : 2/1 - float[2]
K7 : 4/1
K7, 3DNow! : 4/1 - float[2]