Recently I modified my application to use an FBO instead of a PBuffer, and I was very surprised to find that it ran slower with the FBO. I wrote a separate, simple test application and got exactly the same result: the FBO is still slower.
The CPU time required to render with the FBO is actually much lower than with the PBuffer, but the total rendering time is still higher with the FBO.
I ran several tests to track down the problem and found that when I disable depth writes, FBO rendering becomes faster than PBuffer rendering.
Does anybody understand this problem? I'm using a Radeon 9700 with Catalyst 5.7.
I tried again with Catalyst 5.8 and have the same problem. I also tried a GeForce 6600 and everything works fine there: FBOs are not slower than PBuffers on my 6600.
This is a guess almost out of the blue, but…
ATI hardware has something they call Hyper-Z (or something like that). It's allegedly a hierarchical depth buffer, but from what I've seen in action when mixing D3D and OpenGL in the same thread (though in different windows), it seems to be a unified depth buffer: my OpenGL output looks just fine, while the D3D output shows horrible 4x4 artifacts. What I've seen has led me to believe it's not hierarchical at all but just an old-school tiling engine, in this case with 4x4 pixels per tile.
Now, if that is the case, and they need this extra indirection for every pixel (or every four pixels), first looking up the tile and then looking up the depth value within that tile, then not only would cache behaviour suffer badly (locality of reference would be all over the place), but access times would too, since cache thrashing would be unavoidable.
As I understand it, nVidia (and possibly every other vendor) uses a linear depth buffer and would therefore never run into these problems.
Just an idea. It could very well be something completely different, and I could be completely wrong.
I'm not sure I understand how the HyperZ implementation would affect FBO rendering performance. I thought HyperZ was implemented to make rendering faster by discarding non-visible quads earlier in the pipeline. Do you see any reason why HyperZ would be disabled with FBOs but not with PBuffers?
What is your depth buffer (texture, renderbuffer, etc.)? What is its internal format?
PBuffers can be faster than FBOs because pbuffers are guaranteed to work in hardware, while FBOs require that the user know the right thing to do. The user selects the pixel formats of the color and depth buffers (rather than the driver, as with pbuffers), so the user can make choices that are suboptimal for the hardware.
I've been having similar performance problems with FBOs vs. PBuffers on both ATI and NVIDIA. I've got a Radeon 9800 Pro with Cat 5.7 and a GeForce 6800 Ultra with 77.77. My frame rates are:
FBO: ATI 3.7 fps, NVIDIA 4.8 fps
PBuffer: ATI 32 fps, NVIDIA 50 fps
I'm working on a deferred shading application, so I render into 3 offscreen buffers and then blend the lighting into the display buffer. I'm using RGBA32F buffers (or their WGL equivalents). I can roughly double my FBO performance by switching to RGBA16F, but my PBuffer performance is unaffected by switching from 32-bit to 16-bit floats.
I’m very interested in finding a solution to this problem.
My depth buffer is a renderbuffer. The depth buffer internal format is GL_DEPTH_COMPONENT24. My color buffer (a texture) is in GL_RGBA8.
I'm not sure we have the same problem. In my simple test application, the overhead of using the FBO is less than 1 ms (about 0.5 ms).
Would you mind contacting me offline (email@example.com) to discuss this? Our goal is for FBO performance to be no less than pbuffer performance. We should be able to get this issue resolved.
Quote: "I'm not sure I understand how the HyperZ implementation would affect FBO rendering performance."
I wouldn't think it's the actual rendering that takes the time, but rather the step(s) needed to convert the tile-based (or whatever) HyperZ (or whatever) representation into a linear representation, and then potentially back.
Again, I might be way off…
P.S. Since cass has now asked to be contacted, perhaps an ATI representative should do the same?
I wouldn't put too much faith in the speed of ATI's FBO implementation yet. It still has the divide-by-zero problem when calling glBindFramebufferEXT, and I doubt the driver splattering 1#NAN all over the place helps keep things fast.
I've determined the cause of the crappy NVIDIA FBO performance in my application: dual monitors. By disabling the second monitor I got identical performance from my FBO and PBuffer implementations. The gap was probably caused by the shared rendering context, which is used with FBOs but not with PBuffers.
But I only have a single monitor attached to my ATI card, so I’d still like to find an ATI-based solution.
Note also that this multi-monitor perf limitation with FBO is being fixed in upcoming drivers. It’s not something you should expect to have to live with.