I have an interesting dilemma to solve. I work on a sophisticated animation system. Our characters have about 112 bones in their skeletons. Each model has a large number of meshes, and each mesh has its own shader (written in Cg, in our case). Our shaders are generated procedurally based on the material attributes of each mesh.
We skin on the GPU, which in theory is nice and fast, but we have big scalability issues. The problem is that we need to upload the skinning data (3x4 matrices) for each mesh. Our rendering engine sorts by shader (so that we only incur a single setup/teardown per batch of meshes that share a shader), which reorders the meshes. That, combined with the facts that (a) there's probably too much skinning data for a single Cg program to store anyway and (b) you can't share uniform data between two shaders, means that we are continually pushing matrix buffers up to the GPU. Profiling has shown that this quickly becomes the bottleneck as characters are added to our scene. A sketch of the hot path is below.
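To make the pattern concrete, here is roughly what our inner loop looks like. This is a simplified sketch, not our actual code: `Mesh`, `drawSorted`, and the `boneMatrices` parameter name are placeholders, and I'm assuming the standard Cg runtime entry points here.

```cpp
#include <Cg/cg.h>
#include <Cg/cgGL.h>
#include <vector>

struct Mesh {
    CGprogram shader;               // procedurally generated per material
    std::vector<float> bonePalette; // boneCount * 12 floats (3x4, row-major)
    int boneCount;
};

void drawSorted(std::vector<Mesh*>& meshes) {
    // Meshes arrive sorted by shader, so each program binds once per batch.
    CGprogram bound = 0;
    for (Mesh* m : meshes) {
        if (m->shader != bound) {
            cgGLBindProgram(m->shader); // single setup/teardown per batch
            bound = m->shader;
        }
        // The bottleneck: every mesh re-uploads its whole matrix palette,
        // because Cg uniforms belong to a program and can't be shared.
        // "boneMatrices" is a placeholder parameter name.
        CGparameter bones = cgGetNamedParameter(m->shader, "boneMatrices");
        cgGLSetMatrixParameterArrayfr(bones, 0, m->boneCount,
                                      m->bonePalette.data());
        // ... issue the draw call for m ...
    }
}
```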
Ideally - naively - we would upload all the skinning data at once, and then use integer index buffers to select the relevant matrices for each submesh. But that won't work, because the submeshes can, and generally do, use different shaders. So I'm kind of stumped as to how to proceed here. Help would be appreciated.
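For concreteness, here is a minimal sketch of the naive plan: concatenate every character's 3x4 palette into one array and give each submesh a base offset, so per-vertex integer indices select the right matrices. The names (`Character`, `GlobalPalette`, `build`) are hypothetical, not our engine's.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kFloatsPerMatrix = 12; // 3x4, row-major

struct Character {
    std::vector<float> palette; // boneCount * 12 floats
};

struct GlobalPalette {
    std::vector<float> matrices;    // all palettes, back to back
    std::vector<std::size_t> bases; // matrix index of character i's bone 0
};

GlobalPalette build(const std::vector<Character>& chars) {
    GlobalPalette g;
    for (const Character& c : chars) {
        // Record where this character's bone 0 lands in the big array.
        g.bases.push_back(g.matrices.size() / kFloatsPerMatrix);
        g.matrices.insert(g.matrices.end(),
                          c.palette.begin(), c.palette.end());
    }
    return g;
}

// A vertex's matrix index would then be bases[character] + localBoneIndex.
// The catch: uniforms are per-program, so every procedurally generated
// shader would still need its own full copy of `matrices`, and a
// scene-sized float3x4 array likely exceeds one program's limits anyway.
```

The comments at the end are exactly where this plan falls apart for us.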