Skinning on the GPU vs the CPU

mhagain · March 13, 2015, 10:39am

OK, some confusion here.

For skinning on the GPU you use an array of bone matrices (which are friendlier for GPUs than quaternions) - one per bone in the model. I’ll use standalone uniforms here rather than a UBO for the sake of code clarity. A 4x3 (or 3x4 depending on your chosen flavour of poison) matrix is sufficient.

uniform mat4x3 boneMatrices[MAX_BONE_MATRICES]; // define MAX_BONE_MATRICES as the maximum bone count you need to support

So if your model has 30 bones you only need to send 30 bone matrices; the per-vertex data is and will remain static.

Calculate the bone matrices on the CPU and send them.

Each vertex has bone indices as part of its attributes: one integer per index, and typically 4 indices. Each vertex also has a blend weight for each index (if you’re using them). This is totally static data and lives in a static VBO.
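
For illustration, here is one way that static per-vertex data might be laid out on the CPU side in C. The struct, its field names and the normalize_weights helper are made up for this sketch, and the 8-bit indices are an assumption (enough for up to 256 bones). Integer attributes like these would normally be set up with glVertexAttribIPointer so that they reach the shader as integers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical vertex layout; names are illustrative only. */
typedef struct SkinnedVertex {
    float   position[3];      /* model-space position */
    uint8_t bone_indices[4];  /* indices into the bone matrix array */
    float   blend_weights[4]; /* one weight per bone index */
} SkinnedVertex;

/* Rescale the four weights so they sum to 1. Do this once at load time,
   before the data goes into the static VBO. */
static void normalize_weights(SkinnedVertex *v)
{
    float sum = 0.0f;
    for (int i = 0; i < 4; ++i)
        sum += v->blend_weights[i];

    /* A vertex with no influence at all is treated as a content error here. */
    assert(sum > 0.0f);

    for (int i = 0; i < 4; ++i)
        v->blend_weights[i] /= sum;
}
```

Unused slots can simply carry index 0 with weight 0, so every vertex keeps the same fixed size.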

To run the skinning, in your vertex shader you do something like:

position = (boneMatrices[boneIndices.x] * vertexPosition) *  blendWeights.x +
    (boneMatrices[boneIndices.y] * vertexPosition) *  blendWeights.y +
    (boneMatrices[boneIndices.z] * vertexPosition) *  blendWeights.z +
    (boneMatrices[boneIndices.w] * vertexPosition) *  blendWeights.w;
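
If you want to sanity-check what the shader produces, the same blend can be written as a CPU reference in C. This is only a sketch: BoneMatrix, transform_point and skin_position are names invented here, the matrix is stored as 3 rows of 4 values purely for readability, and vertexPosition is assumed to have an implicit w of 1.

```c
#include <assert.h>
#include <stddef.h>

/* 3 rows x 4 columns, indexed m[row][col]; the last column is the translation. */
typedef struct BoneMatrix {
    float m[3][4];
} BoneMatrix;

/* out = bone * (p, 1) */
static void transform_point(const BoneMatrix *bone, const float p[3], float out[3])
{
    for (int r = 0; r < 3; ++r)
        out[r] = bone->m[r][0] * p[0] + bone->m[r][1] * p[1]
               + bone->m[r][2] * p[2] + bone->m[r][3];
}

/* Weighted sum of the position transformed by each of the four bones. */
static void skin_position(const BoneMatrix *bones, size_t bone_count,
                          const unsigned indices[4], const float weights[4],
                          const float p[3], float out[3])
{
    out[0] = out[1] = out[2] = 0.0f;
    for (int i = 0; i < 4; ++i) {
        float t[3];
        assert(indices[i] < bone_count); /* guard against bad index data */
        transform_point(&bones[indices[i]], p, t);
        for (int k = 0; k < 3; ++k)
            out[k] += t[k] * weights[i];
    }
}
```

Running a handful of vertices through this and comparing against the rendered result is a cheap way to catch a transposed matrix or a wrong index attribute.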

So the only data you’re sending each frame is the bone matrices, and you only need to send as many as the current model has bones, up to your pre-defined maximum. Everything else is static data. The glUniformMatrix*fv calls take a count parameter (glUniformMatrix4x3fv for a 4x3 array), so you can send all the matrices in one go rather than one at a time, which your GPU and driver will also thank you for.
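
As a rough sketch of the per-frame upload, assuming your animation code produces column-major 4x4 matrices in one flat float array: the helper below (pack_bones_4x3, a made-up name) drops the bottom row of each matrix so the result matches a mat4x3 array, and the value 64 for MAX_BONE_MATRICES is just a placeholder. The actual GL call is left in a comment because it needs a live context and a uniform location.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_BONE_MATRICES 64 /* placeholder cap; use whatever your shader declares */

/* Copy the top 3 rows of each column-major 4x4 matrix in src into dst as a
   column-major 4x3 matrix (4 columns of 3 floats, i.e. 12 floats per bone).
   Returns the number of floats written. */
static size_t pack_bones_4x3(const float *src, size_t bone_count, float *dst)
{
    assert(bone_count <= MAX_BONE_MATRICES);

    for (size_t b = 0; b < bone_count; ++b)
        for (int col = 0; col < 4; ++col)
            for (int row = 0; row < 3; ++row)
                dst[b * 12 + col * 3 + row] = src[b * 16 + col * 4 + row];

    return bone_count * 12;
}

/* With a GL context bound, the whole array can then go up in a single call:
 *
 *   glUniformMatrix4x3fv(location, (GLsizei)bone_count, GL_FALSE, dst);
 */
```

For a 30-bone model that is 360 floats per frame, which is tiny next to re-uploading skinned vertices.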

Besides measuring performance by how much data you send (which is not always a reliable metric), you should also look at how you distribute work between the CPU and the GPU. Different workloads are differently suited to each processor, and most machines should easily be able to run skinning on the GPU - even a relatively weak one like yours (Intel graphics?) - much faster than on the CPU, because (1) most data can remain static, and (2) the GPU is just faster for this kind of calculation.