glMultiDrawElementsIndirect: slow vertex shader when calculating joint weights

I asked the same question on Stack Exchange, but got no answers there. (I can’t include links?)

I’m trying to implement animation in my engine.
I’m at the first stage: rendering the default pose of skinned meshes.
It works as expected, but it is very slow.

With the calculation below, the shader takes 6 ms to run.

struct MeshUniform {
    mat4 transform;
    mat4 normalMatrix;
    vec4 baseColorFactor;
    vec4 roughnessMetallicNormal;
    vec4 hasColorMetallicNormalTexture;
    mat4[30] jointMatrices;
}; 

layout (std430, binding = 4) buffer meshUniformSSBO { MeshUniform[] meshUniforms; };

layout (location = 0) in vec3 position;
layout (location = 1) in vec3 normal;
layout (location = 2) in vec2 uv;
layout (location = 3) in vec4 tangent;
layout (location = 4) in uvec4 joints;
layout (location = 5) in vec4 weights;
layout (location = 6) in int drawId;

void main() {
   MeshUniform meshUniform = meshUniforms[drawId];
   mat4 model = meshUniform.transform;

   mat4 skinMat =
     meshUniform.jointMatrices[joints[0]] * weights[0] +
     meshUniform.jointMatrices[joints[1]] * weights[1] +
     meshUniform.jointMatrices[joints[2]] * weights[2] +
     meshUniform.jointMatrices[joints[3]] * weights[3];

   vec4 positionVec4 = skinMat * model * vec4(position, 1.0);
   
   ...
}

If I remove the skinMat calculation and multiplication, the same shader takes less than 1 ms.

// mat4 skinMat =
//  meshUniform.jointMatrices[joints[0]] * weights[0] +
//  meshUniform.jointMatrices[joints[1]] * weights[1] +
//  meshUniform.jointMatrices[joints[2]] * weights[2] +
//  meshUniform.jointMatrices[joints[3]] * weights[3];

// vec4 positionVec4 = skinMat * model * vec4(position, 1.0);

vec4 positionVec4 = model * vec4(position, 1.0);

  • The scene has 87706 vertices, as shown in Blender’s statistics.
  • I’m using glMultiDrawElementsIndirect with a single VAO.
  • Joint matrices for non-skinned meshes are identity matrices.
  • The MeshUniform SSBO is a persistent, coherent mapping and is only updated when needed.
  • I’m using the same calculation for shadow maps, so that takes another 6 ms.
  • The GPU is a 1080 Ti.

I tried adding a “jointCount” field to the MeshUniform struct and doing the joint calculation only if jointCount > 0, but it still took 6 ms.
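For reference, the guard described above could be written roughly as follows. This is a sketch, not the code that was actually tested: the position of jointCount inside the struct, the resulting padding, and the #version line are assumptions, and everything after the position calculation is omitted.

```glsl
#version 430 core

// Struct from the question with an assumed jointCount member appended.
struct MeshUniform {
    mat4 transform;
    mat4 normalMatrix;
    vec4 baseColorFactor;
    vec4 roughnessMetallicNormal;
    vec4 hasColorMetallicNormalTexture;
    mat4[30] jointMatrices;
    int jointCount; // assumed field; std430 rounds the struct size up to a
                    // multiple of 16 bytes, so the CPU-side stride grows by 16
};

layout (std430, binding = 4) buffer meshUniformSSBO { MeshUniform[] meshUniforms; };

layout (location = 0) in vec3 position;
layout (location = 4) in uvec4 joints;
layout (location = 5) in vec4 weights;
layout (location = 6) in int drawId;

void main() {
    MeshUniform meshUniform = meshUniforms[drawId];
    vec4 positionVec4 = meshUniform.transform * vec4(position, 1.0);

    // Only blend joint matrices for meshes that actually have joints.
    if (meshUniform.jointCount > 0) {
        mat4 skinMat =
            meshUniform.jointMatrices[joints[0]] * weights[0] +
            meshUniform.jointMatrices[joints[1]] * weights[1] +
            meshUniform.jointMatrices[joints[2]] * weights[2] +
            meshUniform.jointMatrices[joints[3]] * weights[3];
        positionVec4 = skinMat * positionVec4; // same as skinMat * model * position
    }

    // View/projection and the remaining outputs are omitted here.
    gl_Position = positionVec4;
}
```

Note that this variant still copies the whole struct into a local variable before the branch is evaluated.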

Is this to be expected, and what can I do to improve it?


Following DMGregory’s suggestions on Stack Exchange, I tried:

  • Multiplying each joint matrix by the position vector, then summing the results.
  • Pre-multiplying the model matrix with the joint matrices on the CPU.

It looks like this now:

vec4 positionVec4 = vec4(position, 1.0);

vec4 sum =
  meshUniform.jointMatrices[joints[0]] * weights[0] * positionVec4 +
  meshUniform.jointMatrices[joints[1]] * weights[1] * positionVec4 +
  meshUniform.jointMatrices[joints[2]] * weights[2] * positionVec4 +
  meshUniform.jointMatrices[joints[3]] * weights[3] * positionVec4;

positionVec4 = sum;

It still takes 5-6 ms to run.
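Both changes rely on the linearity of matrix products, so they should not alter the result. Writing the joint matrices as J, the weights as w, the joint indices as j, the model matrix as M and the position as p (symbols introduced here only for illustration), the original expression and the rearranged one are equal:

```latex
\[
\Big(\sum_{k=0}^{3} w_k\, J_{j_k}\Big)\, M\, p
\;=\;
\sum_{k=0}^{3} w_k\, \big(J_{j_k} M\big)\, p
\]
```

Here each product J M is what gets pre-multiplied on the CPU. One detail in the shader code: GLSL evaluates `*` left to right, so `matrix * weight * position` scales the whole matrix first and then multiplies by the vector, whereas `matrix * position * weight` does the matrix-vector product first and scales only the resulting vec4.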


Someone on the LWJGL forums posted a question similar to mine in 2012.
(Again, I can’t include links.)

In his last message he said:

using a constant as the array index while accessing boneMatrixes
brings performance up

Sure enough, if I remove the joints array lookup from the code above, like this:

vec4 positionVec4 = vec4(position, 1.0);

vec4 sum =
  meshUniform.jointMatrices[0] * weights[0] * positionVec4 +
  meshUniform.jointMatrices[1] * weights[1] * positionVec4 +
  meshUniform.jointMatrices[2] * weights[2] * positionVec4 +
  meshUniform.jointMatrices[3] * weights[3] * positionVec4;

positionVec4 = sum;

It renders in 1 ms. But of course the resulting image is not correct.

Maybe this will give some ideas to people more experienced with OpenGL.

I would benchmark different storage options for your joint matrices. In the past I’ve had good performance from a simple array of standalone uniforms, updating for each mesh with glUniformMatrix.

I would need a two-dimensional mat4 array, since I’m using glMultiDrawElementsIndirect with a single VAO.
It would look like this:

layout (location = 4) in uvec4 joints;
layout (location = 5) in vec4 weights;
layout (location = 6) in int drawId;

uniform mat4[MAX_OBJECTS][MAX_BONES] jointMatrices;

void main() {
  ... 
   
  vec4 sum =
     jointMatrices[drawId][joints[0]] * weights[0] * positionVec4 +
     jointMatrices[drawId][joints[1]] * weights[1] * positionVec4 +
     jointMatrices[drawId][joints[2]] * weights[2] * positionVec4 +
     jointMatrices[drawId][joints[3]] * weights[3] * positionVec4;

  ...
}

Wouldn’t that be too much to upload to the GPU per frame?

I’ll try it and report back.

I couldn’t make it work with uniforms.
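One possible reason, offered as an assumption since the thread does not say what failed: default-block uniforms are limited by GL_MAX_VERTEX_UNIFORM_COMPONENTS, for which the OpenGL 4.x specification only guarantees 1024 components. A mat4 occupies at least 16 components, so with MAX_BONES = 30 (taken from the mat4[30] array used earlier; MAX_OBJECTS is never given) the budget works out as:

```latex
\begin{align*}
\text{components per object} &= 30 \times 16 = 480 \\
\text{objects within the guaranteed minimum} &= \left\lfloor \tfrac{1024}{480} \right\rfloor = 2 \\
\text{bytes per object} &= 30 \times 64\,\text{B} = 1920\,\text{B}
\end{align*}
```

The actual limit is implementation-dependent and can be queried with glGetIntegerv(GL_MAX_VERTEX_UNIFORM_COMPONENTS). Buffer-backed storage such as an SSBO is not subject to this particular limit.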

Then I tried an SSBO that only holds the jointMatrices data and it worked!

layout (location = 4) in ivec4 joints;
layout (location = 5) in vec4 weights;
layout (location = 6) in int drawId;

struct JM {
    mat4[30] jointMatrices;
};

layout (std430, binding = 7) buffer jmSSBO { JM[] jm; };

void main() {
  ... 
    vec4 positionVec4 = vec4(position, 1.0);
    vec4 sum =
      jm[drawId].jointMatrices[joints.x] * positionVec4 * weights.x +
      jm[drawId].jointMatrices[joints.y] * positionVec4 * weights.y +
      jm[drawId].jointMatrices[joints.z] * positionVec4 * weights.z +
      jm[drawId].jointMatrices[joints.w] * positionVec4 * weights.w;

  ...
}

With this, render time is under 1 ms.
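One difference between this version and the original is not isolated in the thread: here the buffer is indexed directly, whereas the original first copies the entire MeshUniform, including all 30 joint matrices, into a local variable and then indexes that copy with per-vertex values. It is an assumption, not something measured above, that the local copy is what made the dynamic lookup expensive. A minimal sketch for testing that against the original combined buffer, assuming GLSL 4.30 and joint matrices already pre-multiplied with the model matrix as in the earlier update:

```glsl
#version 430 core

// Same struct as in the question; the layout must match the CPU side.
struct MeshUniform {
    mat4 transform;
    mat4 normalMatrix;
    vec4 baseColorFactor;
    vec4 roughnessMetallicNormal;
    vec4 hasColorMetallicNormalTexture;
    mat4[30] jointMatrices;
};

layout (std430, binding = 4) buffer meshUniformSSBO { MeshUniform[] meshUniforms; };

layout (location = 0) in vec3 position;
layout (location = 4) in uvec4 joints;
layout (location = 5) in vec4 weights;
layout (location = 6) in int drawId;

void main() {
    vec4 positionVec4 = vec4(position, 1.0);

    // No "MeshUniform meshUniform = meshUniforms[drawId];" copy here:
    // each matrix is read straight from the buffer.
    vec4 sum =
        meshUniforms[drawId].jointMatrices[joints.x] * positionVec4 * weights.x +
        meshUniforms[drawId].jointMatrices[joints.y] * positionVec4 * weights.y +
        meshUniforms[drawId].jointMatrices[joints.z] * positionVec4 * weights.z +
        meshUniforms[drawId].jointMatrices[joints.w] * positionVec4 * weights.w;

    // View/projection and the remaining outputs are omitted, as in the thread.
    gl_Position = sum;
}
```

If this runs as fast as the separate-SSBO version, the storage split itself may not be the deciding factor; if it stays slow, the smaller per-object stride of the dedicated buffer is the more likely cause.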

Thanks @mhagain for the idea of trying different storage options.
