I’m currently using glReadPixels to copy data from an FBO attachment texture into a VBO bound as GL_PIXEL_PACK_BUFFER (orphaning the previous contents each time). Is there any faster, copy-free way to do this with the latest extensions / hardware?
I’ve searched around the web for any recent references, but with no luck, so I’m wondering if the above is still the way to do it.
Doesn’t the transform feedback record vertex attributes only? My guess was that you wanted to reinterpret the rgba color data in the FBO as xyzw vertex components (or other attributes) in a VBO.
I have limited experience with transform feedback, so correct me if I’m wrong.
If I understand correctly, you’re currently rendering computational data into a framebuffer for use as vertex data? If this is the case, then you’ll definitely want to check out transform feedback - it’s awesome.
You are correct. The only problem with this approach is that I cannot render a quad and fetch from 2D samplers in the fragment shader, since any data to be captured must be written out from the vertex/geometry shader.
Rendering the quad’s area as points and fetching from a TBO could be a workaround, but I guess it would be slow for a big quad.
But I guess the good point is that there are sooo many options
I just ran a test: I render a VBO containing 100×100×100 vertices (each vertex being a vec2). I output one float per vertex into another VBO (2.0 * gl_Vertex.x). It runs at 273 fps - not bad!
It would be interesting to compare this against the render-to-framebuffer method.
EDIT: I use transform feedback for quad-like data and it works out well (and avoids the framebuffer mess). I do this for a vector field application - I first send the points through and compute a user-defined function on each point, and capture this back into a VBO. I then instance render arrow geometry, and use gl_InstanceID as a lookup into the texture buffer object to which the VBO was bound. However, now I’m curious as to what the speed would be if I rendered to a texture and performed texel fetches on this…
I did some tests of my own, and it seems that fetching from TBO & sending a varying is a bit slower than fetching from a texture2D, same formats, 1K by 1K, rendering some hundred points. But my timer seems to actually suck.
Does anybody know of a precise timer for measuring a specific pass, for example? I’m now using the NVIDIA one (timer queries), and although I call glFinish before the begin/end query, the results vary wildly.
I just got around to testing transform feedback & render-to-texture… unfortunately, render-to-texture was much faster. I don’t understand why this has to be the case, though. With transform feedback, the vertex shader runs n times. With render-to-texture, the fragment shader runs n times. Both use the same stream processors on my hardware, so what’s the deal?
I tried a 2000x2000 grid with 3 128-bit outputs per point. With transform feedback, I get ~35 fps, but with render-to-3-textures, I see about 232 fps. Hopefully a future HW revision fixes this… I ought to test CUDA, but I assume CUDA would hit around ~200 fps too just like render-to-texture.
…wrong again. I checked to make sure everything was being written properly. Here are my correct results:
My input is 1000x1000 evaluations, and I’m writing 4 component output, 32 bits per component, 3 outputs.
Transform feedback, with both interleaved and separate VBO modes: 7ms (~142.9 times/second).
Render to texture (using GL_RGBA32F_ARB): 1.09ms (~917.4 times/second).
For 2000x2000 evaluations:
Transform feedback, with both interleaved and separate VBO modes: 28ms (~35.7 times/second).
Render to texture (using GL_RGBA32F_ARB): 4.4ms (~227.3 times/second).
CUDA (without mucking with assembly): 41.6ms (~24 times/second)
Conclusion: transform feedback has a bit of room for improvement on my particular implementation (G80). I’m not sure what’s up with CUDA. I also tried stripping out all the VBO code and using cudaMalloc instead, but it didn’t help.
Render to texture uses a dedicated ROP/OM path in the hardware, whereas I would guess transform feedback just writes directly to global memory from the shader itself.
Just guessing here: in the CUDA / transform feedback case, the shaders stall on writes to global memory, while in the ROP write-to-texture case the outputs get queued in hardware at that point and the shaders continue with new work.
Would be an interesting test to try the same calculations on a pure bandwidth bound algorithm to see if the ROP path is still the fastest.
Awesome find, kind of makes me rethink my usage of transform feedback!!
BTW, is there an extra GPU<->GPU memory cost with render to vertex array on G80? Did you factor that kind of thing in your calculations?