render to vertex buffer

babis · May 19, 2008, 4:12am

Hello,

I’m currently using ReadPixels to copy some data from an FBO attachment texture to a VBO bound as PIXEL_PACK_BUFFER (nullifying the previous contents each time). Is there any faster way (with no copy) to do this with the latest extensions / HW?

I’ve searched around the web for any recent references, but with no luck, so I’m wondering if the above is still the way to do it.

Thanks,
babis

Zengar · May 19, 2008, 4:42am

Nvidia has the GL_EXT_transform_feedback extension

NiCo1 · May 19, 2008, 4:51am

AFAIK copying the FBO data to a VBO using a PBO like you said is still the best option for now.

babis · May 19, 2008, 4:54am

Thanks Zengar, I completely forgot that!

@Nico :
Shouldn’t transform feedback with GL_RASTERIZER_DISCARD_NV be way faster? Compatibility issues aside, that is.

NiCo1 · May 19, 2008, 5:16am

Doesn’t the transform feedback record vertex attributes only? My guess was that you wanted to reinterpret the rgba color data in the FBO as xyzw vertex components (or other attributes) in a VBO.
I have limited experience with transform feedback, so correct me if I’m wrong.
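
To make the reinterpretation NiCo1 describes concrete: a tightly packed RGBA readback with 32-bit float components stores each texel as four consecutive floats, which is the same memory layout as an xyzw vertex. The sketch below is only an illustration and not code from the thread. `texel_byte_offset` is a hypothetical helper, and it assumes default pixel-store settings (no row padding, no skipped pixels or rows).

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: RGBA texels with 32-bit float components, tightly packed. */
enum { COMPONENTS_PER_TEXEL = 4, BYTES_PER_COMPONENT = 4 };

/* Byte offset of texel (x, y) in a buffer filled by a readback of a
   `width`-texel-wide image. Row 0 is the first row written, which for
   glReadPixels is the bottom row of the read rectangle. */
size_t texel_byte_offset(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return (y * width + x) * COMPONENTS_PER_TEXEL * BYTES_PER_COMPONENT;
}
```

Under these assumptions consecutive texels are 16 bytes apart, so the buffer can be sourced as four-component float vertices without any repacking.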

babis · May 19, 2008, 7:54am

The spec says that it can also record varyings from vertex/geometry shaders, which seems almost as good. I haven’t tried it out yet, though.

Seth_Hoffert · May 19, 2008, 7:56am

If I understand correctly, you’re currently rendering computational data into a framebuffer for use as vertex data? If this is the case, then you’ll definitely want to check out transform feedback - it’s awesome.

babis · May 19, 2008, 8:04am

You are correct, the only problem with this approach is that I cannot use a quad for computing + fetching 2d samplers, since any to-be-returned data must be specified in the vertex/geometry shader.
Rendering the quad as points & fetching stuff from a TBO could be a workaround, but I guess it would be slow for a big quad.
But I guess the good point is that there are sooo many options.
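
For the points-based workaround mentioned above, each point has to know which element of the original grid it stands for. A minimal sketch of that index arithmetic follows, assuming a row-major grid. The names `Texel`, `index_to_texel` and `texel_center_coord` are hypothetical and do not come from the thread.

```c
#include <assert.h>

typedef struct {
    unsigned x;
    unsigned y;
} Texel;

/* Map a linear point index to its (x, y) position in a row-major grid
   that is `width` texels wide. */
Texel index_to_texel(unsigned index, unsigned width)
{
    assert(width > 0);
    Texel t = { index % width, index / width };
    return t;
}

/* Normalized coordinate of the centre of texel `i` along an axis with
   `size` texels, for use with a regular 2D texture lookup. */
float texel_center_coord(unsigned i, unsigned size)
{
    assert(size > 0 && i < size);
    return ((float)i + 0.5f) / (float)size;
}
```

A buffer-texture fetch would use the linear index directly, while a 2D sampler lookup would use the texel-centre coordinates of the corresponding (x, y).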

NiCo1 · May 19, 2008, 8:12am

Indeed, that was my point exactly. I’m also rendering my data to the FBO using a single full screen quad and calculating the output in the fragment shader…

Seth_Hoffert · May 19, 2008, 8:14am

I just ran a test: I render a VBO containing 100x100x100 vertices (each vertex being a vec2). I output one float per vertex (2.0 * gl_Vertex.x) into another VBO. It runs at 273 fps - not bad!

It would be interesting to compare this against the render-to-framebuffer method.

EDIT: I use transform feedback for quad-like data and it works out well (and avoids the framebuffer mess). I do this for a vector field application - I first send the points through and compute a user-defined function on each point, and capture this back into a VBO. I then instance render arrow geometry, and use gl_InstanceID as a lookup into the texture buffer object to which the VBO was bound. However, now I’m curious as to what the speed would be if I rendered to a texture and performed texel fetches on this…

Seth_Hoffert · May 19, 2008, 8:37am

Hmmm, are you wanting to compute data in a per-pixel correspondence kind of way (like for use in deferred shading), or are you computing a smaller-than-the-window quad for use with discrete objects?

I think the transform feedback method makes more sense for the discrete objects, but the FBO approach makes more sense for per-pixel data.

babis · May 19, 2008, 4:42pm

Actually my texels are particles, so logically the FBO is the way to go, but I’ll do some further tests and see.

"However, now I’m curious as to what the speed would be if I rendered to a texture and performed texel fetches on this…"

If you have any results of the comparison, feel free to post!

babis · May 19, 2008, 6:55pm

I did some tests of my own, and it seems that fetching from TBO & sending a varying is a bit slower than fetching from a texture2D, same formats, 1K by 1K, rendering some hundred points. But my timer seems to actually suck.

Does anybody know of a precise timer, for measuring a specific pass for example? I’m currently using the NVIDIA one (timer queries), and although I call glFinish before beginning and ending the query, the results vary wildly.
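
One generic way to tame noisy measurements, offered here as an assumption and not as something proposed in the thread, is to time the same pass many times and report the median instead of a single sample. Timer queries report elapsed time in nanoseconds, so the sketch below takes nanosecond samples and returns milliseconds. `median_elapsed_ms` is a hypothetical helper name.

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparator for unsigned 64-bit values. */
static int compare_u64(const void *a, const void *b)
{
    const unsigned long long x = *(const unsigned long long *)a;
    const unsigned long long y = *(const unsigned long long *)b;
    return (x > y) - (x < y);
}

/* Median of `count` elapsed-time samples given in nanoseconds, returned
   in milliseconds. The sample array is sorted in place. */
double median_elapsed_ms(unsigned long long *samples_ns, size_t count)
{
    assert(count > 0);
    qsort(samples_ns, count, sizeof samples_ns[0], compare_u64);

    size_t mid = count / 2;
    double median_ns = (count % 2 == 1)
        ? (double)samples_ns[mid]
        : ((double)samples_ns[mid - 1] + (double)samples_ns[mid]) / 2.0;

    return median_ns / 1.0e6;
}
```

The median ignores the occasional very slow sample, which a plain average would not.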

Seth_Hoffert · May 24, 2008, 7:59am

This is good to know, thanks for performing those tests.

Seth_Hoffert · June 11, 2008, 5:13am

I just got around to testing transform feedback & render-to-texture… unfortunately, render-to-texture was much faster. I don’t understand why this has to be the case, though. With transform feedback, the vertex shader runs n times. With render-to-texture, the fragment shader runs n times. Both use the same stream processors on my hardware, so what’s the deal?

I tried a 2000x2000 grid with 3 128-bit outputs per point. With transform feedback, I get ~35 fps, but with render-to-3-textures, I see about 232 fps. Hopefully a future HW revision fixes this… I ought to test CUDA, but I assume CUDA would hit around 200 fps too, just like render-to-texture.

Seth_Hoffert · June 11, 2008, 5:34am

I take that back. The render-to-texture version actually runs at 527fps. Ouch.

Seth_Hoffert · June 11, 2008, 1:39pm

…wrong again. I checked to make sure everything was being written properly. Here are my correct results:

My input is 1000x1000 evaluations, and I’m writing 3 outputs per evaluation, each with 4 components at 32 bits per component.

Transform feedback, with both interleaved and separate VBO modes: 7ms (~142.9 times/second).
Render to texture (using GL_RGBA32F_ARB): 1.09ms (~917.4 times/second).

For 2000x2000 evaluations:

Transform feedback, with both interleaved and separate VBO modes: 28ms (~35.7 times/second).
Render to texture (using GL_RGBA32F_ARB): 4.4ms (~227.3 times/second).
CUDA (without mucking with assembly): 41.6ms (~24 times/second)

Conclusion: Transform feedback has a bit of room for improvement on my particular implementation (G80). I’m not sure what’s up with CUDA. I also tried stripping out all VBO code and using cudaMalloc, but this helped none.
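
The timings above can be turned into an effective output write rate. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it counts only the bytes written (3 outputs of 16 bytes per evaluation), ignores input reads and any other overhead, and uses decimal gigabytes. The function names are hypothetical.

```c
#include <assert.h>

/* Bytes written in one pass: every evaluation produces `outputs` values of
   4 components x 4 bytes (128 bits) each. */
double bytes_per_pass(unsigned width, unsigned height, unsigned outputs)
{
    return (double)width * (double)height * (double)outputs * 16.0;
}

/* Effective write rate in GB/s (1 GB = 1e9 bytes) for a pass that takes
   `milliseconds` to write `bytes`. */
double gigabytes_per_second(double bytes, double milliseconds)
{
    assert(milliseconds > 0.0);
    return bytes / (milliseconds * 1.0e-3) / 1.0e9;
}
```

With the reported numbers, the 1000x1000 case writes 48 MB per pass, which works out to roughly 6.9 GB/s for transform feedback (7 ms) and roughly 44 GB/s for render to texture (1.09 ms). The 2000x2000 CUDA run (192 MB in 41.6 ms) comes to about 4.6 GB/s.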

Timothy_Farrar · June 11, 2008, 5:45pm

Render to texture uses a dedicated ROP/OM path in the hardware, whereas I would guess transform feedback just writes directly to global memory in the shader itself.

Just guessing here: in the CUDA/transform feedback case, the shaders stall on writes to global memory, while in the ROP write-to-texture case the outputs get queued by the hardware at that point and the shaders continue with new work.

It would be an interesting test to try the same calculations on a purely bandwidth-bound algorithm to see if the ROP path is still the fastest.

Awesome find, kind of makes me rethink my usage of transform feedback!!

BTW, is there an extra GPU<->GPU memory cost with render to vertex array on G80? Did you factor that kind of thing in your calculations?

NiCo1 · June 12, 2008, 12:28am

I seriously doubt that. I specifically remember someone from Nvidia saying that writing to global memory is fire-and-forget.

Seth_Hoffert · June 12, 2008, 5:06am

Hopefully this is one of those things that improves in a future hardware revision.