render to vertex buffer

Yeah, what was I thinking? A global write should be fire-and-forget.

Second guess: you are writing out 3 vec4 outputs! Somehow I missed that. So each vertex writes {XYZW XYZW XYZW} for transform feedback / CUDA. I think you are hitting bank conflicts on the global memory write, which will slow this down (something the ROP/output merger would handle for “free”).

Should be easy to test by comparing speed of 1 vec4 vs 3 vec4 outputs.
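
To make the layout guess above a bit more concrete, here is a rough C++ sketch of where each write lands if the outputs are interleaved per vertex. The interleaved layout, the 16-byte vec4 size and the helper names (outputOffset, vertexStride) are my assumptions for illustration, not something taken from the driver or the original poster's code.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: each output is a vec4 of 4 floats, and all outputs of one
// vertex are stored back to back (interleaved) in the destination buffer.
constexpr std::size_t kVec4Bytes = 4 * sizeof(float);

// Byte offset of output `slot` of vertex `vertex` (hypothetical helper).
std::size_t outputOffset(std::size_t vertex, std::size_t slot,
                         std::size_t outputsPerVertex) {
    assert(slot < outputsPerVertex);
    return (vertex * outputsPerVertex + slot) * kVec4Bytes;
}

// Distance in bytes between the same output slot of two neighbouring vertices.
std::size_t vertexStride(std::size_t outputsPerVertex) {
    return outputOffset(1, 0, outputsPerVertex) -
           outputOffset(0, 0, outputsPerVertex);
}
```

With one output, neighbouring vertices write to adjacent vec4 slots; with three, the same slot of neighbouring vertices sits three vec4s apart. Whether that stride actually costs anything on the hardware is exactly what the 1 vec4 vs 3 vec4 timing would show.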

Writing out to one vec4 output only with transform feedback (2000x2000 points): 14 ms (~71.4 times/second)

Render-to-texture: ~1.88 ms (~531.9 times/second)

Sorry if this is a little off-topic.
On G80 I’ve found that the common way (FBO->PBO->VBO) is about 10-20 percent slower than simply rendering to a texture, skipping the PBO copy, and then fetching from that texture in the vertex shader. Of course, you have to create a prefilled VBO with texcoords for the fetch.
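
For the prefilled VBO, a minimal sketch of generating the texcoords on the CPU could look like the following. The struct and function names are made up for illustration, and sampling at texel centers is my assumption; the resulting array would then be uploaded once as a static vertex buffer (for example with glBufferData), which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One texcoord per point; the vertex shader uses it to fetch that point's
// data from the render-to-texture result.
struct TexCoord {
    float u;
    float v;
};

// Builds width*height texcoords in row-major order, each addressing the
// center of one texel (assumption: normalized coordinates in [0, 1]).
std::vector<TexCoord> makeFetchCoords(std::size_t width, std::size_t height) {
    assert(width > 0 && height > 0);
    std::vector<TexCoord> coords;
    coords.reserve(width * height);
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            coords.push_back({(x + 0.5f) / width, (y + 0.5f) / height});
        }
    }
    return coords;
}
```

For the 2000x2000 case mentioned earlier, makeFetchCoords(2000, 2000) would give 4 million entries, built once and reused every frame.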