Physics

The price is around US$300.

If this PPU does its job well, you are supposed to get more than just rigid body simulation. It is supposed to do particle effects, realistic fog and smoke, water, wind, and everything can be destructible.

If it delivers on its promise, some will buy it.

Wouldn’t you buy it if it could give 10x the performance of your A64 4200?

Well, I wouldn’t measure the quality of a game by the number of objects flying around :stuck_out_tongue:

Concerning the performance: at only 733 MHz, does it have 10x the performance of an A64 4200 now? Maybe next year the performance will be better, you say, but the A64 will be faster/cheaper too.

Originally posted by V-man:
If it delivers on its promise, some will buy it.

If “some” makes for an interesting installed base, I’ll be happy with it, but for now I don’t see how this can merge easily into existing architectures. Sure, Epic is playing with it: they have so much manpower they can experiment with everything.

As for physics, I’m against it but a more general GPGPU environment could probably be a welcome feature for GL.

If this PPU does its job well, you are supposed to get more than just rigid body simulation. It is supposed to do particle effects, realistic fog and smoke, water, wind, and everything can be destructible.
Which can all be done on a second CPU.

In addition to that, you don’t have the communication overhead of a slow bus when using a second CPU. After all, everything has to go over the bus three times (once to feed the PPU with data, once to get the results back, once to upload the data to the GPU). When using a second CPU, you need only one of these three transfers.

That’s why I highly doubt that a PPU can be actually faster than a good multithreaded implementation on a dual-core CPU, and for the same reason I think that particle systems should be implemented on the GPU, not on the CPU or even PPU (to minimize bus traffic, which is the bottleneck in most use cases).

[b]
In addition to that, you don’t have the communication overhead of a slow bus when using a second CPU. After all, everything has to go over the bus three times (once to feed the PPU with data, once to get the results back, once to upload the data to the GPU). When using a second CPU, you need only one of these three transfers.

That’s why I highly doubt that a PPU can be actually faster than a good multithreaded implementation on a dual-core CPU
[/b]
This depends highly on how well the PPU is designed and how well its libraries are written. Since the PPU has its own memory, it may be able to keep the entire scene there, so that only changes need to be transferred over the bus, reducing the traffic required.

Additionally, when the physical complexity of the objects in the scene increases (e.g. many triangle meshes or convex hulls instead of spheres), a well-optimized PPU implementation may easily outperform a second CPU, simply because a general-purpose CPU may not have enough computational power to even saturate the bus bandwidth.

If this PPU does its job well, you are supposed to get more than just rigid body simulation. It is supposed to do particle effects, realistic fog and smoke, water, wind, and everything can be destructible.
I’d gladly pay $300 to see destructible wind :slight_smile:

Additionally, when the physical complexity of the objects in the scene increases (e.g. many triangle meshes or convex hulls instead of spheres), a well-optimized PPU implementation may easily outperform a second CPU, simply because a general-purpose CPU may not have enough computational power to even saturate the bus bandwidth.
Well, let’s think about that. A 2.0+GHz Conroe or Athlon64 core vs. a 300MHz something else.

See, the reason GPUs get to beat CPUs despite the CPU’s obvious clock-speed advantage is that their job is very specialized, and that their output goes directly to the display without CPU intervention. But back to the first point.

SIMD is a paradigm that matches very well with GPU operation. Outside of the most basic vector operations (dot/cross products, matrix ops, etc.), it is of no value for physics applications. Being able to do 4 dot products on 4 separate sets of inputs, writing to 4 separate sets of outputs, is not terribly important in physics. Gigantic parallelism is likewise of little help in physics, because any object can interact with any other.
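For concreteness, “4 dot products on 4 separate sets of inputs” looks like this in a structure-of-arrays layout. This is a sketch in plain C++; a real version would use SSE intrinsics, which a compiler can often generate from exactly this loop:

```cpp
// Four independent 3-component dot products at once, one per "SIMD lane".
// Inputs are laid out structure-of-arrays: all x components together, etc.
void dot3x4(const float ax[4], const float ay[4], const float az[4],
            const float bx[4], const float by[4], const float bz[4],
            float out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = ax[i] * bx[i] + ay[i] * by[i] + az[i] * bz[i];
}
```

The argument above is that a collision solver rarely has four identical, independent work items conveniently queued up like this.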

In short, all you have is a fairly specialized CPU with a bank of relatively fast memory. For a good CPU-based library, memory accesses aren’t going to be your biggest concern. So, maybe the PPU runs at the speed equivalent of a 1.0GHz CPU. That’s still not going to beat the other core in my Athlon.

Plus, because I own that core, I get to decide how things get done. Physics can actually be tailored to fit a specific idea or request, rather than having one physics model for everything.


SIMD is a paradigm that matches very well with GPU operation. Outside of the most basic vector operations (dot/cross products, matrix ops, etc.), it is of no value for physics applications. Being able to do 4 dot products on 4 separate sets of inputs, writing to 4 separate sets of outputs, is not terribly important in physics.

This is based on the assumption that SIMD is the only way to gain performance. The PPU is designed for physics operations, so if SIMD, GPU-like operation is not usable for that, its architecture will be different.

It is likely that its architecture is something more like the PS3’s Cell processor: many simplified processors with specialized instruction sets optimized for vector processing and other physics-relevant operations, controlled by one “normal” processor.

This is at least suggested by an analysis (it’s in Japanese, so you will probably need to use the Babelfish translator) based on patents filed by Ageia and other information from Ageia.

A CPU (the A64 and the older x86 chips) is a bulky chip because it tries to carry forward a lot of legacy. It has been said that it loses performance and wastes silicon because of this. Clock cycles don’t mean much; GPU clocks are low as well.

Why wouldn’t it be possible to have a specialized chip for physics? It’s a matter of putting the software implementation in silicon.
The data movement engine (DME) probably does just that. The PPU just calculates away; you query it once in a while, and once in a while you let it know that the user and the creatures applied this or that force.

This depends highly on how well the PPU is designed and how well its libraries are written. Since the PPU has its own memory, it may be able to keep the entire scene there, so that only changes need to be transferred over the bus, reducing the traffic required.
Yeah, well, I just want to simulate physics because I like simulating physics. I don’t care about the results :wink:

It’s a matter of having the software implementation in silicon.
And what would the software implementation be? There are tons of different physics algorithms, each suited better or worse for different applications.

Ok, you could make the PPU programmable, but if you make it programmable enough to really support different algorithms, not only slight variations, then you’ll end up with a chip that’s not too different from a general purpose CPU, so why bother in the first place?

Originally posted by Overmind:
[quote]This depends highly on how well the PPU is designed and how well its libraries are written. Since the PPU has its own memory, it may be able to keep the entire scene there, so that only changes need to be transferred over the bus, reducing the traffic required.
Yeah, well, I just want to simulate physics because I like simulating physics. I don’t care about the results
[/QUOTE]I do not see why there would be a problem with having the results while simultaneously limiting the amount of traffic by sending only changes, for example in the following way: the CPU sends information about new objects and changes in forces; the PPU sends position, orientation and collision info for objects that are currently moving, with the possibility to retrieve other parameters (e.g. velocities) when necessary.
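A minimal sketch of what such an incremental protocol could look like. All struct names and fields here are invented for illustration; nothing is based on Ageia’s actual interface:

```cpp
#include <cstdint>
#include <vector>

// CPU -> PPU: only new forces are sent, not the whole scene.
struct ForceChange { std::uint32_t objectId; float force[3]; float torque[3]; };

// PPU -> CPU: state only for bodies that actually moved this step;
// other parameters (e.g. velocities) would be fetched on demand.
struct MovedObject { std::uint32_t objectId; float position[3]; float quat[4]; };

struct CpuToPpuFrame { std::vector<ForceChange> forceChanges; };
struct PpuToCpuFrame { std::vector<MovedObject> moved; };
```

In a mostly static scene both message lists stay small, which is exactly the traffic reduction argued for above.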

Originally posted by Overmind:
And what would the software implementation be? There are tons of different physics algorithms, each suited better or worse for different applications.

The PPU is most likely tweaked towards use in games. If you have an algorithm that is better for your application (in precision, speed or other behaviour) than the one provided by the PPU, simply use the software implementation of your algorithm. Someone else may be satisfied with the performance/algorithm of the PPU and use the second CPU for something else.

EDIT: Replying to your first post, didn’t see the other post :wink:

Yes, but the point is, when I do the physics on a second CPU, I don’t have to send anything over the slow bus, because both CPUs have shared RAM.

I know it’s not that simple, but the worst-case scenario for multiple CPUs is an extra data copy into the physics thread and one back out. A RAM-to-RAM copy is faster than a bus transfer, and every “incremental update” strategy still applies. But I doubt that keeping every physics object in RAM twice and syncing each frame is the best synchronisation strategy.
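One common way to avoid keeping every physics object in RAM twice and copying it each frame is double buffering, where the per-frame sync is just a pointer swap. A minimal sketch with invented names:

```cpp
#include <mutex>
#include <utility>

struct SceneState { int frame = 0; /* positions, orientations, ... */ };

// The physics thread writes into back(); publish() makes that state
// visible to the render thread by swapping two pointers under a short
// lock, so no per-object copying happens at the sync point.
class DoubleBufferedScene {
    SceneState a_, b_;
    SceneState* front_ = &a_;   // read by the render thread
    SceneState* back_  = &b_;   // written by the physics thread
    std::mutex m_;
public:
    SceneState&       back()  { return *back_; }
    const SceneState& front() { return *front_; }
    void publish() {                       // once per physics step
        std::lock_guard<std::mutex> lock(m_);
        std::swap(front_, back_);
    }
};
```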

I don’t say it’s not possible to optimize a PPU to the point that it can perform the same as a second CPU. I just don’t see how it could ever perform significantly better.

So why should we accept the obvious disadvantage of being locked to a single implementation, when we can get every algorithm we can think of for free without a performance penalty by just adding another CPU?

[b]I don’t say it’s not possible to optimize a PPU to the point that it can perform the same as a second CPU. I just don’t see how it could ever perform significantly better.

So why should we accept the obvious disadvantage of being locked to a single implementation, when we can get every algorithm we can think of for free without a performance penalty by just adding another CPU?
[/b]
I do not believe that you can generally reach the performance of a specialized PPU in the situations it was designed for. Maybe you can do that with the first generation of PPUs (as happened with the S3 ViRGE “decelerator” cards), but it is cheaper to add additional units inside the PPU or increase its frequency than to add additional cores to an ordinary CPU. If a sufficient number of PPUs are sold, the price is likely to drop, so the price-to-performance ratio of the PPU may become much better than it is now.

You had a good point about the bus traffic, but that is a function of the number of objects in the scene and the complexity of the interactions between them. If a high number of simple objects is used, a powerful CPU may win; if the interaction is complex (complex collision geometries, many constraints), the brute force of the PPU is likely to prevail.

Originally posted by Overmind:
And what would the software implementation be? There are tons of different physics algorithms, each suited better or worse for different applications.

Ok, you could make the PPU programmable, but if you make it programmable enough to really support different algorithms, not only slight variations, then you’ll end up with a chip that’s not too different from a general purpose CPU, so why bother in the first place? [/QB]
I think there isn’t any choice. A “good enough for FPS” physics code needs to be chosen and fitted into a chip.

It is my understanding that physics for simulations doesn’t have much organization. For graphics there is Siggraph. Maybe this has caused people to take different approaches to solving physical simulations?

A few people say that there isn’t much parallelism possible in physics. Why? In nature, everything happens at the same time.

On the CPU, you can transform one vertex at a time with SSE. If one created a real vector processor and transformed many in parallel, as many as the registers would allow, it would win.

On another matter, people think that a CPU should do simple instructions: move, add, sub, div, cmp.
For example,
gl_Position = gl_ModelViewMatrix * gl_Vertex;

Why should that turn into 4 DP4s?
If your GPU can do 4 DP4s in parallel, have an instruction for mat4x4.
For years, we have been bombarded with the idea that 1 instruction, many data is better.
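For reference, here are the four DP4s hiding in that GLSL line, written out in plain C++ with a row-major matrix:

```cpp
// out = m * v for a 4x4 row-major matrix: each loop iteration is one
// 4-component dot product ("DP4"), one per matrix row.
void mat4MulVec4(const float m[4][4], const float v[4], float out[4]) {
    for (int row = 0; row < 4; ++row)
        out[row] = m[row][0] * v[0] + m[row][1] * v[1]
                 + m[row][2] * v[2] + m[row][3] * v[3];
}
```

A dedicated mat4x4 instruction would simply fuse those four dot products into one issue.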

A few people say that there isn’t much parallelism possible in physics
Of course it is possible. The finite element solver which I use every day to crash cars typically uses 16 CPUs. Actually it’s just an explicit time integration of the type x(t+dt) = M x(t), with x being a vector and M being a matrix. A rigid body system would be of that kind, too.

I agree, for these kinds of problems a physics accelerator could make sense. To solve a rigid body problem, it may be necessary to perform a thousand time steps per second (for stability reasons), while the GPU only needs to know about the solution every 30th step or so (if it runs at 30 frames/sec).
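The sub-stepping idea above in a minimal sketch: explicit integration of a single falling particle, with many small solver steps per rendered frame. All numbers are illustrative:

```cpp
struct Particle { double y; double vy; };

// One explicit Euler step, the simplest instance of the explicit
// time integration x(t+dt) = f(x(t)) mentioned above, here for a
// particle under gravity.
void stepPhysics(Particle& p, double dt) {
    const double g = -9.81;
    p.vy += g * dt;
    p.y  += p.vy * dt;
}

// One rendered frame: split 1/30 s into many small, stable solver
// steps; only the state after the last step reaches the renderer.
void simulateFrame(Particle& p, double frameDt, int substeps) {
    const double dt = frameDt / substeps;
    for (int i = 0; i < substeps; ++i)
        stepPhysics(p, dt);
}
```

With 33 substeps per 1/30 s frame, the solver runs roughly a thousand steps per second while the renderer only ever sees one state per frame.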

Everyone is saying to get a second CPU or a dual-core CPU. But for people like me, the cost of that rises significantly. I have an AMD XP 2600+; to do what you are all suggesting, I would have to buy a new motherboard and then the CPU(s). For me, buying the PPU would be cheaper, especially later on when it comes down in price; right now it has only just hit the market.

I would say that if the PPU were programmable with different APIs, not just Novodex, then it would have a better chance in the market. Then they could look into an OpenPL for it.

I would say that if the PPU were programmable with different APIs, not just Novodex…
But that’s exactly the problem. Not every physics API uses the same algorithms, so it may not be possible to support other APIs without changing the hardware…