Fastest method of CPU to GPU vertex data transfer

What is the fastest - i.e. lowest CPU cost - method for transferring per-vertex data to the graphics hardware, when the per-vertex data is changing every frame?

For example, display lists presumably don’t help as the data is changing per frame. Should I use vertex arrays, or VBOs, or just stay with immediate mode?

When I say per-vertex data, I mean it in an abstract sense; I need to get about 12 floats per vertex across. I am trying to do as much work as possible on the GPU via GLSL vertex shaders, as I have a CPU-limited computation. The GPU can therefore read the data from the vertex position, or the normal, or texture coordinates, or even fog coordinates. Is it better to (for example) pack as much data into as few 4D texcoords as possible, or do vertex positions/normals offer a quicker route, or are they all the same?

For reference, I have about 40k vertices.

I realise that this is a complex issue, with dependency on graphics card, driver version and so on, but if anybody has some general pointers, that would be helpful. Even small amounts of performance improvement will help.

Thanks in advance,

David

VBOs should be much better than immediate mode, and probably a good bit better than plain vertex arrays as well. Make sure to use the proper parameters for BufferData (STREAM_DRAW or DYNAMIC_DRAW most likely, depending on how you use the data within a frame).

If you generate the new vertex data in such a way that you can write it in one pass, in order, and without reading the previous values, you probably want to write it directly to a mapped buffer. If you need to read the old values, or write out of order, you might be better off keeping a copy in RAM and copying it to the VBO with BufferData.
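To make the "one pass, in order, no reads" condition concrete, here is a minimal sketch of a producer that only ever writes forward through its destination. The function name, the frame parameter and the placeholder computation are made up for illustration; in a real program dst would be the pointer returned by a buffer mapping (for example glMapBuffer with GL_WRITE_ONLY), but here it can be any writable block of floats.

```c
#include <assert.h>
#include <stddef.h>

#define FLOATS_PER_VERTEX 12  /* from the question: about 12 floats per vertex */

/* Fill dst front to back, write-only. dst stands in for a mapped buffer
   pointer (assumption); frame is a hypothetical per-frame input. */
size_t fill_vertices(float *dst, size_t vertex_count, float frame)
{
    assert(dst != NULL);
    float *out = dst;
    for (size_t v = 0; v < vertex_count; ++v) {
        for (size_t k = 0; k < FLOATS_PER_VERTEX; ++k) {
            /* Placeholder computation: depends only on v, k and frame,
               never on what dst held before, so dst is never read. */
            *out++ = (float)v + 0.01f * (float)k + frame;
        }
    }
    return (size_t)(out - dst);  /* number of floats written */
}
```

If the producer cannot be written in this shape (it needs last frame's values, or scatters its writes), that is the case where keeping a copy in RAM and uploading it with BufferData is the safer bet.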

As for the vertex data layout, it shouldn’t matter too much unless you are building very complicated meshes out of the 40k verts, doing lots of render passes, or running on old hardware. I’d probably say to use generic vertex attribs in whatever way maps most naturally to your data, and not worry about packing them.
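As a rough illustration of what "maps most naturally" could look like for 12 floats per vertex, the sketch below assumes three 4-component generic attributes stored interleaved. The struct and function names are hypothetical; the stride and offsets are the values one would hand to glVertexAttribPointer for each attribute.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout: 12 floats per vertex as three 4-component attributes. */
typedef struct {
    float attr0[4];
    float attr1[4];
    float attr2[4];
} Vertex;

/* Byte stride between consecutive vertices in an interleaved array. */
size_t vertex_stride(void)
{
    return sizeof(Vertex);
}

/* Byte offset of attribute index (0..2) inside one vertex. */
size_t attrib_offset(int index)
{
    assert(index >= 0 && index < 3);
    switch (index) {
    case 0:  return offsetof(Vertex, attr0);
    case 1:  return offsetof(Vertex, attr1);
    default: return offsetof(Vertex, attr2);
    }
}

/* Bytes that have to reach the GPU each frame for vertex_count vertices. */
size_t bytes_per_frame(size_t vertex_count)
{
    return vertex_count * sizeof(Vertex);
}
```

With 4-byte floats, 40k vertices come to 1,920,000 bytes per frame, which is small enough that the choice of attribute slots is unlikely to dominate.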

For data that is only used once, I recommend straight system memory vertex arrays (no VBO).

You can’t, under any circumstances, copy data to a VBO and render it faster than you can just render it.

It might be beneficial to write to a mapped VBO because you avoid the copy. But this is pure theory. In practice it just won’t happen.

All the relevant drivers will either
a) give you a system memory block to fake the map, in which case you gained nothing and lost nothing. This has exactly the same system=>graphics bandwidth cost as using system memory vertex arrays right away. Overall it still doesn’t work out to the same performance, because the VBO occupies graphics memory and the data doesn’t get fed to the transform stage directly.

b) give you a true map (rare!), but the write performance of the mapped space will be significantly lower than that of system memory. Instead of your transfer being slow, your data producing code will run slow. No win.

If you’re very concerned about cache pollution when writing to system memory, you can either use MOVNTQ for writes, if it’s supported and you’re not afraid of assembler, or – and I’m not joking – stay with immediate mode.
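For anyone who wants cache-bypassing writes without hand-written assembler, here is a sketch using the SSE intrinsic _mm_stream_ps. Note that this is a swapped-in relative of MOVNTQ (it stores 16 bytes from an SSE register rather than 8 from an MMX register), and the function name stream_floats is made up. It assumes the destination is 16-byte aligned and the count is a multiple of 4, and it falls back to memcpy where SSE is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_STREAM 1
#else
#define HAVE_SSE_STREAM 0
#endif

/* Copy count floats from src to dst with non-temporal stores where possible.
   Assumptions: dst is 16-byte aligned, count is a multiple of 4. */
void stream_floats(float *dst, const float *src, size_t count)
{
    assert(count % 4 == 0);
    assert(((uintptr_t)dst & 15u) == 0);
#if HAVE_SSE_STREAM
    for (size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);  /* source may be unaligned */
        _mm_stream_ps(dst + i, v);         /* store that bypasses the cache */
    }
    _mm_sfence();  /* make the streamed stores visible before dst is used */
#else
    memcpy(dst, src, count * sizeof(float));  /* plain fallback without SSE */
#endif
}
```

Whether this beats ordinary stores depends on the hardware and on how soon the data is read again, so it belongs on the list of things to benchmark rather than to assume.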

In the end though every application is different. I want to encourage you to implement a number of methods and benchmark them on your target hardware. This is the single most reliable way to find out what works best for your app.

Suggestions for benchmarking:
1) System memory vertex arrays
2) Producer writes to mapped VBO
   It’s recommended practice to call BufferData(…,NULL) immediately before establishing the mapping to reduce driver headaches.
3) Immediate mode

Optional just for kicks:
4) Producer writes to system memory, then use BufferData to copy to VBO and render from there.
IMO not worth doing, but if you have the time …
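To compare the candidates above on equal terms, a small timing harness along these lines may help. It is only a sketch: frame_fn, average_cpu_seconds and count_calls are hypothetical names, and each submission method would be wrapped in its own callback. It uses clock(), which reports processor time used by the program and so lines up with the "lowest CPU cost" goal; since drivers can defer work, averaging over many frames is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* One frame's worth of vertex generation and submission (hypothetical). */
typedef void (*frame_fn)(void *user);

/* Average CPU seconds per call of submit, measured over frames calls. */
double average_cpu_seconds(frame_fn submit, void *user, int frames)
{
    assert(submit != NULL && frames > 0);
    clock_t start = clock();
    for (int i = 0; i < frames; ++i) {
        submit(user);
    }
    clock_t end = clock();
    return (double)(end - start) / (double)CLOCKS_PER_SEC / (double)frames;
}

/* Trivial stand-in callback: counts how often it was called. */
void count_calls(void *user)
{
    ++*(int *)user;
}
```

Running the same harness once per method, with the same vertex count and on the target hardware, gives numbers that can be compared directly.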

As far as I can tell, mapping a large buffer takes a bit of time. Doing so for every frame would slow down your program.

I suggest you stick with vertex arrays. I suspect your CPU is the bottleneck.

Originally posted by zeckensack:
[b]For data that is only used once, I recommend straight system memory vertex arrays (no VBO).

You can’t, under any circumstances, copy data to a VBO and render it faster than you can just render it.[/b]
But NVIDIA’s old VAR (vertex array range) extension demo was a mesh where all the vertices were dynamic, and it showed a large performance increase over standard vertex arrays.

I thought that writing to uncached (AGP) memory was as quick as writing to cached system memory. It’s just reading that’s slower. In which case, you should use a VBO in stream mode, which would probably guarantee you AGP memory, then map it to update it. This leaves the card free to DMA the data while you get on with something else.

I thought that writing to uncached (AGP) memory was as quick as writing to cached system memory. It’s just reading that’s slower.
That depends on the cache architecture: write-through or write-back. With write-back, the cache combines multiple writes into one larger write (i.e. it modifies the cache line, then writes it back all at once later).

Charles