While tuning my skinning code I realized that some time was lost taking the SIMD results and copying them into our vertex buffers (x, y, z only, no w). This made me wonder if it might be faster for the driver to deal with VBO data submitted on 128 bit boundaries as well. Would increasing my local storage of a vertex to 128 bits, with 32 bits of extra stride, be a good thing for vertex upload?
The only downside I see is the increase in storage space in my engine’s memory.
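To make the trade-off concrete, here is a minimal sketch of the two layouts being compared. The struct and function names are hypothetical, chosen only for illustration; they are not from my engine or from any API.

```cpp
#include <cstddef>

// Hypothetical layouts for illustration only.

// Tightly packed position: 3 floats = 12 bytes (96 bits).
struct PackedPosition {
    float x, y, z;
};

// Padded position: 4 floats = 16 bytes (128 bits), aligned so that a
// 128-bit SIMD result can be stored with a single aligned write.
struct alignas(16) PaddedPosition {
    float x, y, z;
    float pad;  // unused 32 bits; this is the "extra stride"
};

static_assert(sizeof(PackedPosition) == 12, "packed position should be 12 bytes");
static_assert(sizeof(PaddedPosition) == 16, "padded position should be 16 bytes");

// Extra memory, in bytes, that the padded layout costs for a given vertex count.
constexpr std::size_t paddingOverhead(std::size_t vertexCount) {
    return vertexCount * (sizeof(PaddedPosition) - sizeof(PackedPosition));
}
```

With this layout the attribute would still be declared with 3 components, and `sizeof(PaddedPosition)` would be passed as the stride to `glVertexAttribPointer`. The cost is 4 bytes per vertex, a third more than the packed layout.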
This made me wonder if it might be faster for the driver to deal with VBO data submitted on 128 bit boundaries as well.
None of the IHVs have suggested it. And it wouldn’t make sense anyhow. The CPU driver shouldn’t be touching the vertex data; and the GPU isn’t using SSE when reading from a buffer.
The Wiki mentions a multiple of 32 bytes, which is 256 bits. It was an ATI document that mentioned that info.
To answer your question, your VBO data doesn’t need to be ON a 128 bit boundary or a 256 bit boundary. The only thing it needs is for the size of the vertex structure to be a multiple of 32 bytes.
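As a rough sketch of what that looks like, assuming an interleaved vertex with position, normal, and one set of texture coordinates (the struct and helper names are made up for this example):

```cpp
#include <cstddef>

// Hypothetical interleaved vertex: 12 + 12 + 8 = 32 bytes.
struct Vertex32 {
    float position[3];  // bytes 0..11
    float normal[3];    // bytes 12..23
    float texCoord[2];  // bytes 24..31
};

// Fails to compile if someone adds a field without re-padding.
static_assert(sizeof(Vertex32) % 32 == 0,
              "vertex size should be a multiple of 32 bytes");

// Rounds a raw vertex size up to the next multiple of 32 bytes,
// which tells you how much padding a different layout would need.
constexpr std::size_t paddedTo32(std::size_t rawSize) {
    return (rawSize + 31) / 32 * 32;
}
```

For example, a position-only vertex of 12 bytes would round up to 32, and a 36-byte vertex would round up to 64, so whether the padding is worth it depends on how much memory you are willing to spend.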