Yeah, what was I thinking, global write should be fire and forget.
Second guess: you are writing out 3 vec4 outputs! Somehow I missed that. So each vert writes {XYZW XYZW XYZW} for transform feedback / CUDA. I think you are hitting bank conflicts on the global memory write, which will slow this down (something the ROP/output merger would handle for “free”).
Should be easy to test by comparing speed of 1 vec4 vs 3 vec4 outputs.
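To make the layout argument concrete, here is a rough sketch of the address arithmetic only, not a benchmark. It assumes the outputs are packed back to back per vertex as described above and that a float is 4 bytes; the function names are made up for illustration.

```cpp
#include <cstddef>

// Size of one vec4 output in bytes (assumes 4-byte floats, so 16).
constexpr std::size_t kVec4Bytes = 4 * sizeof(float);

// Byte offset where vertex `vertex` stores its output number `output`,
// assuming each vertex writes `outputsPerVertex` vec4s back to back.
std::size_t outputOffset(std::size_t vertex, std::size_t output,
                         std::size_t outputsPerVertex) {
    return (vertex * outputsPerVertex + output) * kVec4Bytes;
}

// Distance in bytes between the same output of two neighbouring vertices.
std::size_t neighbourStride(std::size_t outputsPerVertex) {
    return outputOffset(1, 0, outputsPerVertex) -
           outputOffset(0, 0, outputsPerVertex);
}
```

With one vec4 per vertex, neighbouring verts write 16 bytes apart, so a group of them covers one contiguous run of memory. With three vec4s, the same output of neighbouring verts is 48 bytes apart, so each individual write instruction leaves gaps. Whether that actually costs anything on a given chip is exactly what the 1 vs 3 output timing would show.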
Sorry if this is a little off-topic.
On G80 I’ve found that the common path (FBO->PBO->VBO) is about 10-20 percent slower than a simple render-to-texture with no PBO copy, followed by fetching from that texture in the vertex shader. Of course, you have to create a prefilled VBO with texcoords for fetching.
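In case it helps, a minimal sketch of how the prefilled texcoord VBO data could be generated on the CPU. This assumes one vertex per texel, row-major order, and normalized coordinates pointing at texel centres; `TexCoord` and `makeFetchTexCoords` are names I made up, not anything from the posts above.

```cpp
#include <cstddef>
#include <vector>

// One 2D texture coordinate per vertex (hypothetical helper type).
struct TexCoord {
    float s;
    float t;
};

// Build one texcoord per texel of a width x height texture, in row-major
// order. The +0.5 offset targets texel centres so that each vertex fetches
// exactly the texel it corresponds to.
std::vector<TexCoord> makeFetchTexCoords(std::size_t width, std::size_t height) {
    std::vector<TexCoord> coords;
    coords.reserve(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            coords.push_back({(static_cast<float>(x) + 0.5f) / static_cast<float>(width),
                              (static_cast<float>(y) + 0.5f) / static_cast<float>(height)});
        }
    }
    return coords;
}
```

The resulting array would be uploaded once, e.g. with `glBufferData(GL_ARRAY_BUFFER, coords.size() * sizeof(TexCoord), coords.data(), GL_STATIC_DRAW)`, and reused every frame. The vertex shader then uses the texcoord to look up the position in the rendered texture.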