That is why ATI often beats NVIDIA in second-generation game engines like Source (HL2), while NVIDIA owns third-generation game engines like Doom 3 and so on.
I think you have your engine generations backwards. Doom 3 is a weak engine compared to HL2.
And the real reason nVidia hardware runs Doom 3 fast is that the engine does things nVidia hardware likes (or, perhaps more accurately, nVidia made their hardware fast at what the Doom 3 engine does). ATi’s hardware is more agnostic when it comes to the performance of various features, so it doesn’t favor or disfavor certain applications.
Also, what about floating point blending, floating point mipmaps, and vertex texture fetch?
I imagine that the Xbox 360 has some of these, though that doesn’t help us.
In any case, ATi likes implementing features the right way. For example, in the aforementioned 360 hardware, the same processors that operate on vertices operate on fragments; that’s why it has vertex texturing that is virtually identical to fragment texturing.
nVidia, on the other hand, will be perfectly content with having two different kinds of texture units, one for vertices and one for fragments. A waste of transistors to be sure, but at least the feature is there.
Why can’t they implement PBOs as rendering from a texture to a vertex buffer for the “copy”?
Because a buffer object has a very specific memory format. And a render buffer may also have a very specific memory format. The two of them may well not be the same format. Indeed, I imagine that they aren’t.
Render buffers tend to have their rows aligned to specific boundaries, whereas buffer objects are linear in memory. You can’t render from one to the other unless the buffer is the correct size. And there’s no way for the user to know what the “correct” size is.
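To illustrate the mismatch, here is a minimal sketch in C. The 256-byte row alignment is purely an assumption for illustration; real hardware does not expose this value, which is exactly the problem. The helper names (align_up, linear_size, padded_size) are hypothetical and not part of any OpenGL API.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align.
   align must be a non-zero power of two. */
static size_t align_up(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Bytes needed by a tightly packed (linear) buffer object:
   rows follow each other with no padding. */
static size_t linear_size(size_t width, size_t height, size_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Bytes needed by a surface whose rows are each padded up to row_align.
   row_align is a hypothetical, driver-internal value. */
static size_t padded_size(size_t width, size_t height,
                          size_t bytes_per_pixel, size_t row_align)
{
    return align_up(width * bytes_per_pixel, row_align) * height;
}
```

With a 100x100 RGBA8 image, linear_size gives 40000 bytes while padded_size with the assumed 256-byte alignment gives 51200, so a straight copy would not line up. Only widths whose rows already land on the alignment (64 pixels in this sketch) produce matching layouts.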
However, antialiased float-16 buffers are supported in D3D, and we still don’t have any support for them in OpenGL.
That’s because neither ATi nor nVidia as of yet supports the FBO extensions for framebuffer blitting and multisample framebuffers. Once these are supported, you should have such things available, at least where the hardware can handle both these and floating-point framebuffers.