I have some questions before I go try to implement GPU skinning.

When doing GPU skinning, where should I calculate all of the matrix/quaternion interpolation?

If I do the calculation on the CPU and then upload the computed matrices to the GPU as uniform variables each frame, along with the weights and indices for each vertex as attribute variables, will it still be faster than CPU skinning?

I ask because doing all the matrix/quaternion interpolation on the GPU seems pretty hard, and I still have to use the computed transformation matrices for my collision model on the CPU anyway.

The joint/bone count of my animated model is around 200, and each vertex is influenced by 1–3 joints.

You should calculate the matrices on the CPU. Updating the matrices is much faster than uploading the full skinned model.

But remember the hardware limitations: a minimum implementation is only required to provide 256 vec4 uniforms (= 64 mat4), and a few of those are needed for things like the modelview and projection matrices. That means only 60 to 63 bones will work on all cards. A possible solution is to transpose the bone matrices and store them as 3×4 matrices, so that only 3 vec4 per bone are needed (the last row of an affine matrix is always (0, 0, 0, 1)).
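To illustrate the transposed 3×4 packing, here is a minimal sketch (not from the thread — names and the row-major layout are my assumptions): the constant bottom row is dropped, and the shader reconstructs the transform with three dot products.

```cpp
#include <array>

// Sketch: pack a 4x4 affine bone matrix (row-major, m[row][col]) into
// 3 vec4 rows, dropping the constant last row (0, 0, 0, 1).
using Vec4 = std::array<float, 4>;
using Mat4 = std::array<Vec4, 4>;

std::array<Vec4, 3> packBone(const Mat4& m) {
    return { m[0], m[1], m[2] };   // rows 0..2 hold rotation + translation
}

// What the vertex shader would do with the packed rows:
// p' = (dot(row0, p), dot(row1, p), dot(row2, p)), with p.w == 1.
float dot4(const Vec4& a, const Vec4& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
}
```

With 3 vec4 per bone instead of 4, the same 256-vec4 budget fits roughly a third more bones.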

The bone weights and indices will be static. One of the best formats for both is GL_UNSIGNED_BYTE with 4 components. The weight attribute should be normalized (range 0…1 instead of 0…255).

The matrix xform followed by interpolation is often done on the GPU, with a matrix plus weights for each active bone available to the vertex shader. Some things have undermined this; for example, the need to efficiently extrude silhouette shadow volumes in Doom forced a different tradeoff: in the end, after adding the degenerate quads for the shadow volumes, it was a wash and still wasn't an optimized beam tree. But things have moved on a bit since then.

The actual matrix concatenation is typically done on the CPU, or avoided/preprocessed, but you can break this down many ways. Remember that GPU work is per vertex, so the benefit of something like full concatenation (sequential xform on the GPU) might depend on how many vertices per bone you have. This usually boils down to whether you have individual bone matrices or each bone concatenated with the modelview. E.g. you could do the bone-to-object-space xform first, then do the weighted skin interpolation, and transform that result through the modelview; that offloads a lot of the model-object concatenation without excessively burdening the GPU, i.e. just one more matrix per vertex rather than per bone.
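The "skin in object space, then one modelview xform" idea can be sketched like this (a minimal CPU-side model of the shader math, not the poster's actual code — names, the row-major 3×4 layout, and linear-blend skinning are my assumptions):

```cpp
#include <array>
#include <cstddef>

using Vec3  = std::array<float, 3>;
using Mat34 = std::array<std::array<float, 4>, 3>;  // affine xform, m[row][col]

Vec3 transformPoint(const Mat34& m, const Vec3& p) {
    Vec3 r;
    for (std::size_t i = 0; i < 3; ++i)
        r[i] = m[i][0]*p[0] + m[i][1]*p[1] + m[i][2]*p[2] + m[i][3];
    return r;
}

// Linear-blend skinning: xform the bind-pose position by each influencing
// bone's bone-to-object-space matrix, blend by weight, then push the result
// through the modelview once.
Vec3 skinVertex(const Vec3& bindPos,
                const Mat34* boneXforms,                 // per-bone, object space
                const int* indices, const float* weights, int count,
                const Mat34& modelview) {
    Vec3 skinned = {0.0f, 0.0f, 0.0f};
    for (int i = 0; i < count; ++i) {
        Vec3 t = transformPoint(boneXforms[indices[i]], bindPos);
        for (int c = 0; c < 3; ++c)
            skinned[c] += weights[i] * t[c];
    }
    return transformPoint(modelview, skinned);           // one extra matrix per vertex
}
```

The modelview is applied once per vertex instead of being concatenated into every bone matrix on the CPU, which is exactly the tradeoff described above.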

The collision bounds can be simple and, if you're smart, might just involve the bones themselves. For example, a single axis vector through a bone can be the centerline of a cylinder used for collision detection; if that bone is axis-aligned in the "Jesus pose" (the bind pose), you can pluck this right out of a matrix row and translate it. That cylinder would be in object space unless the matrices were concatenated, and you then have the option of doing collision in object space without any full xform. Collision should be heavily optimized to minimize the lowest-level tests, and you can pare the potential collisions down to a small set, which might be more amenable to xforming into character object space with the bones in object space. Even that's more detailed than a lot of collision implementations, although things are changing.
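One way to sketch the bone-as-centerline idea (my own illustration, not from the thread — for simplicity this tests a point against a capsule around the bone segment rather than a true finite cylinder, which would also need flat end-cap tests):

```cpp
#include <array>

using Vec3 = std::array<float, 3>;

static float dot(const Vec3& a, const Vec3& b) {
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

// Treat the bone as a segment from a to b; a point collides if it lies
// within `radius` of the closest point on that segment.
bool pointHitsBoneCapsule(const Vec3& p, const Vec3& a, const Vec3& b,
                          float radius) {
    Vec3 ab = { b[0]-a[0], b[1]-a[1], b[2]-a[2] };
    Vec3 ap = { p[0]-a[0], p[1]-a[1], p[2]-a[2] };
    float t = dot(ap, ab) / dot(ab, ab);   // project onto the bone axis
    t = t < 0.0f ? 0.0f : (t > 1.0f ? 1.0f : t);  // clamp to the segment
    Vec3 closest = { a[0]+t*ab[0], a[1]+t*ab[1], a[2]+t*ab[2] };
    Vec3 d = { p[0]-closest[0], p[1]-closest[1], p[2]-closest[2] };
    return dot(d, d) <= radius * radius;   // compare squared distances
}
```

If the bones are kept in object space as suggested above, the candidate points can be xformed into that space once and tested here cheaply.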

So, think about the most efficient space to represent stuff and do the xforms in. GPUs are not used for full bone-model xform concatenation, and with a bit of deftness you might be able to avoid this on the CPU too (by splitting the model and bone matrices). Bones aren't hierarchical at this level; someone might have some funky dynamic IK system that does it differently, but bone animations are usually flat (with interpolation between flat poses of the same bone), although you CAN do it differently (which solves some origin slerp issues).

Just keep in mind that if you use multipass rendering and hardware skinning, the GPU may have to skin the same model several times (once per pass). In that case, think about skinning on the CPU and then uploading the skinned mesh, or using some render-to-vertex-buffer technique.

CPU skinning can also be very fast using an SSE-optimized math library (like Intel IPP).