This is going to be a long post. It's a more coherent (I hope) version of my idea for doing soft shadows on next-gen hardware. Of course people will come up with lots of different ways; this is just my idea. It's kinda abstract and general because I wrote it with the idea that I might want to post it with a demo someday.
The usual shadow mapping technique is to first render the scene from the point of view of the light source, then render the scene from the eye's point of view while projecting the image produced from the light's view onto the eye's view. By projection, I mean that the light-view image is mapped into the scene as if it had been projected from the light source. If a pixel in the eye's view matches the corresponding pixel from the light's view, then the light shines on it, because only pixels visible from the light's point of view actually receive light. If there is a mismatch, then that pixel is in shadow.
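Here is a rough sketch of that comparison in plain C, assuming the shadow map already holds the depth of the closest surface as seen from the light. The names and sizes are made up; think of it as pseudocode that happens to compile rather than real shader code.

```c
#include <stdbool.h>

/* Depth buffer rendered from the light's point of view. */
#define SHADOW_MAP_SIZE 512
static float shadow_map[SHADOW_MAP_SIZE][SHADOW_MAP_SIZE];

/* Small offset to keep a surface from shadowing itself. */
static const float DEPTH_BIAS = 0.002f;

/*
 * Given a pixel's position transformed into the light's view
 * (u and v in [0,1] across the map, depth measured from the light),
 * report whether the light can "see" that pixel.
 */
bool lit_by_light(float u, float v, float depth_from_light)
{
    int x = (int)(u * (SHADOW_MAP_SIZE - 1));
    int y = (int)(v * (SHADOW_MAP_SIZE - 1));

    /* Outside the light's view: call it unlit. */
    if (x < 0 || y < 0 || x >= SHADOW_MAP_SIZE || y >= SHADOW_MAP_SIZE)
        return false;

    /* If something closer to the light was recorded here, the pixel from
     * the eye's view does not match the light's view: it is in shadow. */
    return depth_from_light <= shadow_map[y][x] + DEPTH_BIAS;
}
```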
This will produce sharp shadows because the light source is modeled as an infinitely small point. This means that any surface will always block all or none of the light. Now you see it, now you don't. Soft shadows are produced by real light sources because real light sources can be partially blocked. If, from a point on a particular surface, only half of the light source is visible, then that point only receives half as much light.
One way to model this is to render clusters of point lights. These closely spaced lights will produce multiple shadows which blend together and give a soft effect. The problem is that the amount of work grows in proportion to the number of lights in the cluster: 12 lights in a cluster is 12 times the work.
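A sketch of that brute-force version, assuming one shadow map (and one light-space position for the surface point) per light in the cluster; the per-light test is hypothetical:

```c
#include <stdbool.h>

#define CLUSTER_SIZE 12

/* Hypothetical per-light test: is this surface point visible from the
 * i-th light's shadow map? (Same comparison as ordinary shadow mapping.) */
extern bool lit_by_cluster_light(int light_index, float u, float v, float depth);

/*
 * Average the hard shadow results over every light in the cluster.  The
 * blend softens the shadow edge, but the scene had to be rendered
 * CLUSTER_SIZE times to build the maps in the first place.
 */
float cluster_shadow(const float u[CLUSTER_SIZE],
                     const float v[CLUSTER_SIZE],
                     const float depth[CLUSTER_SIZE])
{
    int lit = 0;
    for (int i = 0; i < CLUSTER_SIZE; ++i)
        if (lit_by_cluster_light(i, u[i], v[i], depth[i]))
            ++lit;
    return (float)lit / CLUSTER_SIZE;   /* 1.0 = fully lit, 0.0 = fully shadowed */
}
```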
A cheaper way to simulate this effect would be to take the -area- of the light's view which maps to the pixel in the eye's view and have the shadow result be the percentage of that area which is in shadow when compared against that pixel. Taking 12 samples of an area should be cheaper than re-rendering the scene 12 times.
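The single-map version of the same loop might look like this, reusing the lit_by_light test from the first sketch and a table of sample offsets (both assumed):

```c
#include <stdbool.h>

#define NUM_SAMPLES 12

/* Single-texel shadow test from the first sketch. */
extern bool lit_by_light(float u, float v, float depth);

/* Offsets spreading the samples over the filter area (built later). */
extern float sample_offset[NUM_SAMPLES][2];

/*
 * Sample an area of the one light-view image instead of a single texel
 * and return the fraction of samples that are lit.  "radius" is how far
 * the area extends, which depends on how big the light looks from the
 * surface being shaded.
 */
float area_shadow(float u, float v, float depth, float radius)
{
    int lit = 0;
    for (int i = 0; i < NUM_SAMPLES; ++i) {
        float su = u + sample_offset[i][0] * radius;
        float sv = v + sample_offset[i][1] * radius;
        if (lit_by_light(su, sv, depth))
            ++lit;
    }
    return (float)lit / NUM_SAMPLES;
}
```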
The size of the area being sampled (and the number of samples needed to get a good result) depends on how big the light is from the point of view of the surface being rendered. If you look toward the light from the surface being rendered, the fraction of the light that is obscured is how much shadow you are in at that point. The closer you are to the light, the bigger the light will appear (and of course, the bigger the light, the bigger it will appear).
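In other words, the filter radius should follow the light's apparent angular size from the shaded point. Something like this, where light_radius is the radius of a spherical light and texels_per_radian is a made-up conversion into shadow-map units:

```c
#include <math.h>

/*
 * Apparent angular radius of a spherical light of radius light_radius,
 * seen from a point at distance d.  The closer the surface is to the
 * light, the larger the angle, so the wider the filter kernel needs to
 * be (in shadow-map texels).
 */
float filter_radius(float light_radius, float d, float texels_per_radian)
{
    if (d <= light_radius)
        return texels_per_radian * 1.5707963f;      /* inside the light: clamp to 90 degrees */
    return texels_per_radian * asinf(light_radius / d);
}
```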
The problem with this approach compared to rendering the scene 12 times is that rendering the scene 12 times uses 12 slightly different points of view. Shadows change shape depending on the angle they are cast from: an object's profile casts a very different shadow than its front does. However, this should be acceptable as long as the light source is not too big or too close to the shadows, because the difference between object silhouettes is not that dramatic.
Also, other approaches which re-render the shadows multiple times do not actually consider multiple viewpoints, but just skew the shadows. Such skewing does not change the shape of the shadow, only where it is projected, so those methods should be equivalent to this one.
What shape is the sample kernel? Its size is determined by the distance from the light to the surface, but what is its shape? The simple answer is circular, which is the projected shape of a spherical area light source. It is easiest because it does not change depending on the direction the light is being cast in. Simply distributing the sample points evenly over a circle should give good results. Other shapes would require projecting the 3D cluster of points which define the shape of the light source onto the light-view image; once those points were known, samples could be taken at them. Very unusually shaped lights could be modeled this way.
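A sketch of one way to build such a circular kernel, filling the sample_offset table used above (the spiral layout is just one arbitrary choice):

```c
#include <math.h>

#define NUM_SAMPLES 12

/* Unit-disc offsets, scaled by the filter radius at shading time. */
float sample_offset[NUM_SAMPLES][2];

/*
 * Lay the sample points out along a spiral so they cover the unit disc
 * roughly evenly.  More samples, several rings, or a per-pixel rotation
 * would all reduce banding, but this is the simplest starting point.
 */
void build_circular_kernel(void)
{
    for (int i = 0; i < NUM_SAMPLES; ++i) {
        float angle = 2.0f * 3.14159265f * (float)i / NUM_SAMPLES;
        float r = sqrtf((float)(i + 1) / NUM_SAMPLES); /* sqrt keeps area per sample even */
        sample_offset[i][0] = r * cosf(angle);
        sample_offset[i][1] = r * sinf(angle);
    }
}
```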
That was a pretty abstract explanation. Anyone familiar with shadow mapping might wonder why I did not mention depth values. The fact is that shadow mapping does not rely on depth values specifically, only on finding mismatches between the projected light view and the eye view; depth values are just the most obvious thing to compare. Other implementations use a single color for each polygon (a technique called index shadow mapping), and mismatches in color result in shadow.
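A sketch of that index variant, with an ID map standing in for the depth map (again, all names made up):

```c
#include <stdbool.h>

#define SHADOW_MAP_SIZE 512

/* Light-view image holding a unique ID (a flat color) per polygon or
 * object instead of a depth value. */
static unsigned char id_map[SHADOW_MAP_SIZE][SHADOW_MAP_SIZE];

/*
 * Index shadow mapping: the pixel is lit only if the ID it was drawn with
 * matches the ID the light recorded at the same spot.  A mismatch means
 * some other surface sits between this one and the light.
 */
bool lit_by_index(float u, float v, unsigned char my_id)
{
    int x = (int)(u * (SHADOW_MAP_SIZE - 1));
    int y = (int)(v * (SHADOW_MAP_SIZE - 1));
    if (x < 0 || y < 0 || x >= SHADOW_MAP_SIZE || y >= SHADOW_MAP_SIZE)
        return false;
    return id_map[y][x] == my_id;
}
```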
When mapping a spherical area light source, one only needs the distance from the surface to the light to do a good job of approximating the size of the light. The scale factor could be obtained by mapping a 1D texture so that its value at each distance corresponds to how much the filter kernel needs to be scaled.
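On the CPU that 1D texture is just a lookup table indexed by distance; here is a sketch of building and reading one, baking in the filter_radius function from the earlier sketch:

```c
#define LUT_SIZE 256

/* The apparent-size function from the earlier sketch. */
extern float filter_radius(float light_radius, float d, float texels_per_radian);

/* 1D table standing in for a 1D texture: distance to the light -> kernel scale. */
static float kernel_scale_lut[LUT_SIZE];

void build_kernel_scale_lut(float light_radius, float max_distance,
                            float texels_per_radian)
{
    for (int i = 0; i < LUT_SIZE; ++i) {
        float d = max_distance * (float)(i + 1) / LUT_SIZE;
        kernel_scale_lut[i] = filter_radius(light_radius, d, texels_per_radian);
    }
}

/* At shading time the hardware would do this as a single 1D texture fetch. */
float kernel_scale(float distance, float max_distance)
{
    int i = (int)(distance / max_distance * (LUT_SIZE - 1));
    if (i < 0) i = 0;
    if (i > LUT_SIZE - 1) i = LUT_SIZE - 1;
    return kernel_scale_lut[i];
}
```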
If one wants to project a cluster of light sources, then one needs to determine the 3D coordinates of each light relative to the surface being rendered and project those points onto the shadow map as sample points. It would probably be a fairly involved pixel shader.
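I have not written that shader, but here is one possible reading of the idea, sketched on the CPU: project each point of the light cluster into the light-view image, treat its offset from the cluster center's projection as a kernel offset, scale by the light's apparent size from the surface, and sample around the surface's own projection. All the helpers here are assumptions.

```c
#include <stdbool.h>

#define NUM_LIGHT_POINTS 12

/* Hypothetical helpers: project a world-space point into the light-view
 * image, plus the single-texel shadow test from the first sketch. */
extern void project_to_light_view(const float world_pos[3], float *u, float *v);
extern bool lit_by_light(float u, float v, float depth);

/*
 * Shade one surface point against a light described by a cluster of
 * points.  Each point's projected offset from the cluster center becomes
 * a kernel offset, scaled by how large the light appears from the
 * surface; the lit fraction approximates how much of the light the
 * surface can "see".
 */
float cluster_shadow_sample(const float light_points[NUM_LIGHT_POINTS][3],
                            const float cluster_center[3],
                            float surf_u, float surf_v, float surf_depth,
                            float apparent_size_scale)
{
    float cu, cv;
    project_to_light_view(cluster_center, &cu, &cv);

    int lit = 0;
    for (int i = 0; i < NUM_LIGHT_POINTS; ++i) {
        float pu, pv;
        project_to_light_view(light_points[i], &pu, &pv);
        float su = surf_u + (pu - cu) * apparent_size_scale;
        float sv = surf_v + (pv - cv) * apparent_size_scale;
        if (lit_by_light(su, sv, surf_depth))
            ++lit;
    }
    return (float)lit / NUM_LIGHT_POINTS;
}
```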
I would love to turn all this theory into an implementation. Right now one could implement this on a Radeon 9700 or using nVIDIA’s NV30 emulation. I thought about this theory months ago, but at the time I abandoned it as hopeless until more capable hardware came about.