I have an application that I wrote using FBOs and GLSL, and it runs at ~110 FPS on average. Now I am trying to port it to CUDA. I am using PBOs for fast transfer (ping-ponging between two PBOs) and then rendering the resulting image using glDrawPixels; however, it performs poorly (~25 FPS on average). I know that this slowdown is due to the interop between OpenGL and CUDA. I want to ask more experienced users: what is the best method to render the output generated by CUDA to the framebuffer?
From my research these are the choices:
1) PBO + glDrawPixels
2) PBO + glTexSubImage2D (render to a texture and then draw that texture on a screen-aligned quad)
I don't know of more; can you add any?
I have only tried 1). Should I try 2), or does anyone have a 3rd or 4th option?
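One possible third option (a hedged sketch, not something I have benchmarked here): register the GL texture itself with CUDA via `cudaGraphicsGLRegisterImage`, run the kernel into a plain device buffer, and copy device-to-device into the texture's array, skipping the PBO staging step entirely. This requires a CUDA version that supports image registration; the names `tex`, `res`, and `d_out` below are placeholders.

```cuda
#include <cuda_gl_interop.h>

cudaGraphicsResource* res = 0;

void initInterop(GLuint tex)
{
    // Register once at startup; the GL texture must already be allocated.
    cudaGraphicsGLRegisterImage(&res, tex, GL_TEXTURE_2D,
                                cudaGraphicsRegisterFlagsWriteDiscard);
}

void uploadFrame(const uchar4* d_out, int w, int h)
{
    // d_out is the device buffer the CUDA kernel already writes into.
    cudaArray* arr;
    cudaGraphicsMapResources(1, &res);                      // GL must not touch tex now
    cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);
    cudaMemcpy2DToArray(arr, 0, 0, d_out, w * sizeof(uchar4),
                        w * sizeof(uchar4), h, cudaMemcpyDeviceToDevice);
    cudaGraphicsUnmapResources(1, &res);                    // hand the texture back to GL
    // Then draw a screen-aligned textured quad as usual.
}
```

The appeal is that the frame never leaves device memory; whether it beats the PBO path would have to be measured.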
Adding another insight into this: I did a time comparison between the two methods for just the calculation part, not taking the transfers into account, and the timings are:
GLSL: ~0.001 ms (using a high-resolution timer).
CUDA kernel: ~25 ms (using the cudaEvent API).
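For reference, the cudaEvent timing pattern I would expect looks like the sketch below (`myKernel` and its launch configuration are placeholders). One caveat worth checking on the GLSL side: GL calls are asynchronous, so a high-resolution CPU timer around a draw call can end up measuring only command submission unless you force completion (e.g. with `glFinish()`) before stopping the timer, which might explain the suspiciously small ~0.001 ms figure.

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
myKernel<<<grid, block>>>(/* ... */);   // placeholder kernel and launch config
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);             // wait until the kernel has really finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```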
So does it mean that my CUDA kernel needs to be optimized further?
My CUDA code is an exact copy of the GLSL code; even then, there is a considerable difference in performance between them. I know that I might get some more juice out of my hardware using CUDA (shared memory and other tricks), but am I correct in saying that GLSL does a lot of background optimizations that are transparent to us, and that these only show up when you compare against another API like CUDA?
CUDA uses more precise floating-point operations, or rather, GLSL uses more relaxed math. I saw the same with OpenCL, but there you can tell the OpenCL compiler that it may use faster (less precise) math, which gave me a good boost (though still not the same performance as GLSL). Take a look at whether you can do the same for CUDA…
WOW. Thanks Chris Lux. CUDA's nvcc has two flags (-use_fast_math, and /Ox for full optimization). I just passed those in and VOILA (thanks ZBuffer), my performance is much better now:
CUDA: ~7.25 ms
GLSL: ~8.5 ms
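For anyone else trying this, the build invocation might look like the following sketch (file names are placeholders). One nuance: `/Ox` is an MSVC host-compiler flag rather than an nvcc flag, so on Windows it is typically forwarded through `-Xcompiler` instead of being passed to nvcc directly.

```
# --use_fast_math swaps in faster, less precise device math
# (fast intrinsics for division, sin, exp, etc.)
nvcc -O3 --use_fast_math -o app kernel.cu

# With the MSVC host compiler, forward /Ox via -Xcompiler (assumed setup):
nvcc --use_fast_math -Xcompiler "/Ox" -o app.exe kernel.cu
```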
Some more questions:
Is there a way to control floating-point precision in GLSL? I have already specified
precision highp float;
in all my shaders.
Do you think adding shared memory will further improve performance for CUDA?
Should I go ahead and replace glDrawPixels with a shader to render the output image?
Thanks for the VOILA hint; I have corrected the original post. Just updating stats here.
Now my CUDA code uses PBO + glTexSubImage2D to update the texture and then renders that texture instead of calling glDrawPixels. The new performance results are:
(All of these stats were generated on my NVIDIA Quadro FX 5800 GPU at an output resolution of 1024x1024, with the rendering continuously updating.)
GLSL (using FBO and multiple textures): ~108-109 FPS (MaxFPS: ~1175)
CUDA (using glTexSubImage2D to display): ~89-90 FPS (MaxFPS: ~1597)
CUDA (using glDrawPixels to display): ~58-59 FPS (MaxFPS: ~248)
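For completeness, the PBO + glTexSubImage2D display path measured above could be sketched roughly as follows (`pbo`, `pboRes`, and `tex` are placeholder names, error checking omitted, fixed-function quad for brevity):

```cuda
// After the CUDA kernel has written the frame into the registered PBO:
cudaGraphicsUnmapResources(1, &pboRes);            // give the PBO back to GL

glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);         // source pixels from the PBO
glBindTexture(GL_TEXTURE_2D, tex);
glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 1024, 1024,
                GL_RGBA, GL_UNSIGNED_BYTE, 0);     // offset 0 into the bound PBO
glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

// Draw a screen-aligned quad with the texture:
glEnable(GL_TEXTURE_2D);
glBegin(GL_QUADS);
glTexCoord2f(0, 0); glVertex2f(-1, -1);
glTexCoord2f(1, 0); glVertex2f( 1, -1);
glTexCoord2f(1, 1); glVertex2f( 1,  1);
glTexCoord2f(0, 1); glVertex2f(-1,  1);
glEnd();
```

Since the pixels never leave the GPU (PBO-to-texture is a device-side transfer), this avoids the slow path glDrawPixels can take, which is consistent with the FPS numbers above.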
Since my algorithm has two parts, update and render, I broke up the timing; here are the results.
This actually creates some complexities for both NVIDIA and users. On a technical level, Fermi's context switching is relatively fast for a GPU, but on an absolute level it's still slow. CPUs can context switch in a fraction of the time, giving the impression of concurrent thread execution even when we know that's not the case. Furthermore, for some reason context switching between rendering and compute on Fermi is particularly expensive, which means the number of context switches needs to be minimized in order to keep from wasting too much time just on switching.
What they were saying was that a single Quadro card experienced choppiness because of context switches between compute and graphics, whereas the Tesla + Quadro solution did not, because it never needed to context switch: each card was dedicated to one context. So my thought was that your single card may be experiencing the same context-switch delay. I suppose if you can optimize the CUDA solution so that it's much faster than the GLSL one, the added context-switch time would be acceptable.