Regarding CUDA+PBO slowdown

Hi all,
I have an application that i did using FBOs and GLSL and it runs at ~110 fps on avg. Now I am trying to port it to CUDA. I am using the PBO for fast transfer (I pinp pong btw 2 pbos) and then render the resulting image using DrawPixels however it performs poorly (~25 fps on avg). I know that this slowdown is due to interop btw OpenGL and CUDA. I want to ask other more experienced users on what is the best method to render the output generated from CUDA to framebuffer?
From my research these are the choices

  1. PBO + glDrawPixels
  2. PBO + glTexSubImage2D (update a texture and then draw it on a screen-aligned quad)
  3. Anything else you can suggest?

I have only tried 1). Should I try 2) (a rough sketch of what I mean is below), or does anyone have a 3rd or 4th option?
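To be concrete, here is a minimal sketch of what I mean by option 2, assuming an RGBA8 image of size w x h, a PBO and texture created elsewhere (with a GL loader already set up), and the PBO registered once with cudaGraphicsGLRegisterBuffer. All names below (fillKernel, updateAndDisplay, pboRes) are just placeholders, not my actual code:

    #include <cuda_gl_interop.h>

    // Placeholder kernel: writes a test pattern into the PBO-backed buffer.
    __global__ void fillKernel(uchar4* out, int w, int h)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < w && y < h)
            out[y * w + x] = make_uchar4(x & 0xff, y & 0xff, 0, 255);
    }

    // pbo/tex created elsewhere; pboRes comes from a one-time call to
    // cudaGraphicsGLRegisterBuffer(&pboRes, pbo, cudaGraphicsMapFlagsWriteDiscard);
    void updateAndDisplay(GLuint pbo, GLuint tex, cudaGraphicsResource_t pboRes, int w, int h)
    {
        // 1. Let the CUDA kernel write directly into the PBO (stays in device memory).
        uchar4* dptr  = 0;
        size_t  bytes = 0;
        cudaGraphicsMapResources(1, &pboRes, 0);
        cudaGraphicsResourceGetMappedPointer((void**)&dptr, &bytes, pboRes);
        dim3 block(16, 16);
        dim3 grid((w + block.x - 1) / block.x, (h + block.y - 1) / block.y);
        fillKernel<<<grid, block>>>(dptr, w, h);
        cudaGraphicsUnmapResources(1, &pboRes, 0);   // hand the buffer back to GL

        // 2. Copy PBO -> texture entirely on the GPU.
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
        glBindTexture(GL_TEXTURE_2D, tex);
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0);
        glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

        // 3. Draw a screen-aligned quad with tex bound (fixed function or a trivial shader).
    }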

Adding another insight into this: I did a time comparison between the two methods for just the calculation part, not taking the transfers into account, and the timings are:
GLSL: ~0.001 msec (using a hi-res timer).
CUDA kernel: ~25 msecs (using the cudaEvent API).
So does this mean that my CUDA kernel needs to be optimized further?
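For reference, this is roughly how I take the CUDA-side measurement with the cudaEvent API (a sketch; the kernel name and launch configuration are just placeholders, same as the sketch above):

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    fillKernel<<<grid, block>>>(dptr, w, h);   // the kernel being measured
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);                // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);    // elapsed GPU time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);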

What is the reason for porting the app to CUDA if it already works fast with GLSL?

That is probably not GPU time but CPU time (since you mentioned a hi-res timer). There is only one way to measure GPU time in OpenGL: timer_query.
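A minimal sketch of what that looks like, assuming ARB_timer_query (or GL 3.3+) is available:

    GLuint   query;
    GLuint64 elapsedNs = 0;

    glGenQueries(1, &query);
    glBeginQuery(GL_TIME_ELAPSED, query);
    // ... issue the GLSL/FBO work you want to measure ...
    glEndQuery(GL_TIME_ELAPSED);

    // Blocks until the GPU has produced the result.
    glGetQueryObjectui64v(query, GL_QUERY_RESULT, &elapsedNs);
    double elapsedMs = elapsedNs / 1.0e6;
    glDeleteQueries(1, &query);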

Good question, Aleksandar. This is for comparison's sake only, since nowadays people in academia ask: how does it compare against CUDA, and why do you want to do it in GLSL when CUDA is a better option?

Thanks for this, Aleksandar. With a timer query, the reported time for GLSL is ~8.5 msecs. Even then, GLSL is ~3 times faster.

My CUDA code is an exact copy of the GLSL code, and even then there is a considerable difference in performance between them. I know that I might get some more juice out of the hardware with CUDA (shared memory and other tricks), but am I correct in saying that GLSL does a lot of background optimizations that are transparent to us, and that these only show up when you compare against another API like CUDA?

CUDA uses more precise floating-point operations; or rather, GLSL uses more relaxed math. Using OpenCL I saw the same thing, but you can tell the OpenCL compiler that it may use faster (less precise) math, which gave me a good boost (though not the same performance as GLSL). Take a look at whether you can do the same for CUDA…

WOW. Thanks Chris Lux. CUDA's nvcc has two flags (-use_fast_math, and /Ox for full optimization). I just passed them in and VOILA (thanks ZBuffer), my performance is much better now:
CUDA: ~7.25 msecs
GLSL: ~8.5 msecs
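For anyone trying to reproduce this, the flags go on the nvcc command line roughly like this (assuming MSVC as the host compiler; /Ox is a host-compiler optimization flag forwarded via -Xcompiler, and the file names are placeholders):

    nvcc -use_fast_math -Xcompiler /Ox -o myapp kernel.cu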

Some more questions:
Is there a way to control the precision of floating point in GLSL? I have already specified

precision highp float;

in all my shaders.
Do you think adding shared memory will help further improve performance in CUDA?
Should I go ahead and replace glDrawPixels with a shader that renders the output image on a textured quad?

Interesting to see that CUDA and GLSL can have similar performance when tweaking the precision.

Totally OT, but I wanted to say: PLEASE, never write it this way! I know it is a frequent mistake made by English speakers, but it sounds very bad and means “was raping” in French…

The correct spelling is “voila” (pronounced somewhat like “wala”). The even more correct spelling is “voilà”, but the accent is really not that important.
Thanks a lot :)

Hi ZBuffer,
Thanks for the VOILA hint; I have corrected the original post. Just updating the stats here.
My CUDA code now uses PBO + glTexSubImage2D to update a texture and then renders that texture, rather than calling glDrawPixels, and the new performance results are:
(All of these stats were generated on my NVIDIA Quadro FX 5800 GPU at an output resolution of 1024x1024, with the rendering updating continuously.)

GLSL (using FBO and multiple textures): ~108-109 FPS (MaxFPS: ~1175)
CUDA (using glTexSubImage2D to display): ~89-90 FPS (MaxFPS: ~1597)
CUDA (using glDrawPixels to display): ~58-59 FPS (MaxFPS: ~248)
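For context, the glDrawPixels path that these numbers replace looks roughly like this (a sketch, assuming the same RGBA8 PBO as above and a window sized to the image):

    // Option 1 (the slower path above): draw straight from the PBO, no texture involved.
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, pbo);
    glWindowPos2i(0, 0);                               // raster position at the window's lower-left corner
    glDrawPixels(w, h, GL_RGBA, GL_UNSIGNED_BYTE, 0);  // 0 = byte offset into the bound PBO
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);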

Since my algorithm has two parts, update and render, I broke the timing up; here are the results:

GLSL: Update(6.920 msecs/frame), Render (0.315 msecs/frame)
CUDA: Update(7.814 msecs/frame), Render (0.316 msecs/frame)

Now the next step is to optimize the CUDA code further to reach the performance of GLSL.

NVIDIA mentioned, in their plug for their Maximus technology, that there is a context-switch delay when going from compute to graphics mode. Perhaps that’s factoring into the timing?

From an article on AnandTech:

This actually creates some complexities for both NVIDIA and users. On a technical level, Fermi’s context switching is relatively fast for a GPU, but on an absolute level it’s still slow. CPUs can context switch in a fraction of the time, giving the impression of a concurrent thread execution even when we know that’s not the case. Furthermore for some reason context switching between rendering and compute on Fermi is particularly expensive, which means the number of context switches needs to be minimized in order to keep from wasting too much time just on context switching.

Hi malexander,
I just read that article, but Maximus is for a dual-GPU setup: one Tesla and one Quadro. In my case I have a single card.

What they were saying was that a single Quadro graphics card experienced choppiness because of context switches between compute and graphics, whereas the Tesla/Quadro solution did not because it didn’t need to context switch as each card was dedicated to a context. So my thought was that your single card may be experiencing the same context switch delay. I suppose if you can optimize the CUDA solution so that it’s much faster than the GLSL one, the added context switch time would be acceptable.

Oh, that makes a lot of sense, and that is indeed the direction I am working in now. Thanks for the link and this insight.