Shader performance improvement?

Hi All,
I have a fragment shader for volume rendering using ray casting:

  1. Include early ray termination.
  2. Gradient estimation + gradient filtered texture.
  3. In/Out texture (using bounding volumes of the active nodes)

Currently the code runs fine on a G80.
Can any of you suggest how to improve this GLSL code:
(There are some disabled paths (jitter) that I want to enable in the future)



#define EPSILON (0.01)

// Uniform samplers block
uniform sampler2D rayLeaveTex;
uniform sampler2D rayEnterTex;
uniform sampler1D transferTex;
uniform sampler3D volumeTex;
uniform sampler3D gradTex;
uniform sampler2D jitterTex;


// Interpolated frag position
varying vec4  pos;

// Interpolated light position: (gl_ModelViewProjectionMatrixInverse * vert position);
varying vec3 lightPos;


vec3 CalcGrad(vec3 rayPos)
{ 	
    vec3 gradient;
    
    float x1 = texture3D(volumeTex, rayPos-vec3(EPSILON,0.0,0.0)).a;
    float x2 = texture3D(volumeTex, rayPos+vec3(EPSILON,0.0,0.0)).a;
    float y1 = texture3D(volumeTex, rayPos-vec3(0.0,EPSILON,0.0)).a;
    float y2 = texture3D(volumeTex, rayPos+vec3(0.0,EPSILON,0.0)).a;
    float z1 = texture3D(volumeTex, rayPos-vec3(0.0,0.0,EPSILON)).a;
    float z2 = texture3D(volumeTex, rayPos+vec3(0.0,0.0,EPSILON)).a;
    
    gradient.x = x2-x1;
    gradient.y = y2-y1;
    gradient.z = z2-z1;	

    return normalize(gradient);
}


void main (void)
{  
   // Sample the textures to get the leave/enter positions
   vec2 texturePos = ((pos.xy / pos.w) + 1.0) / 2.0;   
   vec4 rayLeave = texture2D(rayLeaveTex, texturePos);   
   vec4 rayEntry = texture2D(rayEnterTex, texturePos);
   
   
   // Calculate ray and segment length (xyz only - including w would skew the distance)
   float segmentLength = length(rayLeave.xyz - rayEntry.xyz);
   
   vec3 rayDir = normalize(rayLeave.xyz - rayEntry.xyz);
   
   float walkLength = 0.0;
   
   // Calculate ray step
   vec3 ray = rayEntry.xyz;
   
   // Jitter texture lowers grain artifacts
   //vec2 rayJitter = texture2D(jitterTex, texturePos).xy;
   //ray.xy += rayJitter;
   
   vec3 rayStep = rayDir*0.002;
   float rayStepLength = length(rayStep);
   
   int numSteps = int((segmentLength / rayStepLength) + 1);
   
   vec4 blendedColor = vec4(0.0,0.0,0.0,0.0);
   
   for(int i = 0; i < numSteps; i++)   
   {
      // Get the intensity
      vec4 intensity = texture3D(volumeTex, ray.xyz);

      // Look up the color for this sample in the transfer function
      vec4 color = texture1D(transferTex, intensity.a);

      // todo - 0.3 should be a non-constant threshold value
      if (color.a > 0.3)
      {
         // Calculate the gradient on the fly
         //vec3 gradient = CalcGrad(ray.xyz);

         // Get the gradient from the filtered gradient texture
         // (stored in [0,1], expanded back to [-1,1])
         vec3 gradient = texture3D(gradTex, ray.xyz).xyz;
         gradient = gradient * 2.0 - 1.0;

         vec3 vecLight = normalize(normalize(lightPos) - ray.xyz);
         float diffuseTerm = abs(dot(vecLight, gradient));

         color.rgb *= 0.25;
         color.rgb += vec3(diffuseTerm) * color.a;
      }
      
      // Blend (FTB)
      blendedColor.rgb += (1.0 - blendedColor.a) * color.rgb;
      blendedColor.a += (1.0 - blendedColor.a) * color.a;
          
      // Advance ray and accumulated distance
      ray += rayStep;
      walkLength += rayStepLength;
            
      // Break if out of bounding volume or opaque 
      if(walkLength >= segmentLength || color.a >= 1.0)
      {
           break;
      }
   }
   
   gl_FragColor = blendedColor * vec4(0.5,0.5,0.0,1.0);
}


Thanks,
Ido

  1. Ray tracing algorithm improvements
    Basically it's about making 'rayStep' dynamic.
  • if your volume data is not animated, you can use a distance function stored in a separate texture
  • if your data is animated, then you can create downsampled volume data and ray trace through it with a bigger step - when you hit a pixel, you trace with a small step through the original volume data in that area (so you actually have two nested loops)
  2. General improvements
  • if lighting is static - precompute it in a texture
  • everything that is static should be kicked outside the loop. This also applies to values that are partially dynamic:
    final_value = sum(a * 0.5) => final_value = sum(a) * 0.5
    Now 0.5 can be moved outside the loop.
  • final optimizations should include instruction reduction (see the sketch after the examples below):

blendedColor.rgb += (1.0 - blendedColor.a) * color.rgb;
blendedColor.a += (1.0 - blendedColor.a) * color.a;

blendedColor.rgba += (1.0 - blendedColor.a) * color.rgba;

ray += rayStep; (vec3)
walkLength += rayStepLength; (float)

rayPosAndLength += rayPosAndLengthDelta; (vec4)

instead of ray you use rayPosAndLength.xyz
instead of walkLength you use rayPosAndLength.w
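
A minimal sketch of what the fused version could look like around the loop (names follow the shader above; the single vec4 add replaces a vec3 add plus a float add):

// Set up once, before the loop:
vec4 rayPosAndLength = vec4(rayEntry.xyz, 0.0);
vec4 rayPosAndLengthDelta = vec4(rayStep, rayStepLength);

// Inside the loop:
blendedColor += (1.0 - blendedColor.a) * color;   // fused rgb + alpha blend
rayPosAndLength += rayPosAndLengthDelta;          // fused position + distance
// sample at rayPosAndLength.xyz, test rayPosAndLength.w against segmentLength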

-another thing that you can optimize is the number of steps:
You don't have to check whether you're inside the bounding box inside the loop. You can calculate the number of steps you have to take before the loop.

-if a loop does not have a known number of steps at compile time, the driver is unlikely to unroll it (it's possible to unroll such a loop partially) and it will use an actual loop instruction, which is expensive. Instead of one n-step loop you can use nested loops: n/16 steps * 16 steps. The driver will unroll the inner loop in this case, making the shader work much faster.
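
A sketch of the n/16 * 16 split, reusing numSteps and the loop body from the shader above. The termination test moves to the outer loop, at the cost of taking up to 15 samples past the exit point; testing accumulated opacity here is an assumption - the original tests the sample's alpha:

int numOuter = (numSteps + 15) / 16;        // ceil(numSteps / 16)
for (int j = 0; j < numOuter; j++)
{
    for (int i = 0; i < 16; i++)            // constant trip count - unrollable
    {
        // ... sample, classify, shade and blend exactly as before ...
        ray += rayStep;
    }
    walkLength += rayStepLength * 16.0;
    if (walkLength >= segmentLength || blendedColor.a >= 0.99)
        break;
}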

Hi,
Thanks for the tips.
Some questions please:

blendedColor.rgb += (1.0 - blendedColor.a) * color.rgb;
blendedColor.a += (1.0 - blendedColor.a) * color.a;

blendedColor.rgba += (1.0 - blendedColor.a) * color.rgba;

ray += rayStep; (vec3)
walkLength += rayStepLength; (float)

rayPosAndLength += rayPosAndLengthDelta; (vec4)

I understand that it can be vectorized but shouldn’t the compiler optimize it?

-another thing that you can optimize is the number of steps:
You don’t have to check if you’re inside bounding box inside the loop. You can calculate number of steps that you have to take before the loop.

You are correct - my for loop condition is on the number of steps, so I can remove the walk-length condition.
I don't know in advance how many steps are needed. If I make a nested, smaller unrolled loop (a loop with a constant count), will it still save runtime, given that I will have to break on the walk length inside the nested loop?

  • if lighting is static - precompute it in a texture

My lightPos is an interpolated varying at the fragment. Could it be made static?

Basically it’s about making ‘rayStep’ dynamic.

  • if your volume data is not animated, you can use distance function stored in separate texture
  • if your data is animated, then you can create downsampled volume data and raytrace through it with bigger step - when you hit a pixel, you trace with low step through original volume data in that area (so you haave two nested loops actually)

Is the distance texture view dependent?
If I use a downsampled volume, what is the correlation between the position of the ray in the downsampled volume and in the original volume?
Is it a volume of blocks based on the transfer function?

Great tips.
Thanks for the help.
Ido

  1. Depends on the compiler. If you do it yourself, then you at least know it's optimized :)
    About a year ago I ran into a problem on ATI drivers - I ran out of varying variables. So I combined two vec2's into one vec4 varying variable and it worked. Maybe current drivers are smarter, maybe they're not.

  2. I mentioned making 'rayStep' dynamic at the beginning. This is what you should do first - it will give a bigger performance boost than the other optimizations. One solution is to have two nested loops - the outer one takes a bigger step, and when it's inside a region where visible pixels may exist, you enter the inner loop, which samples with a smaller step.

  3. Depends:

  • light does not move relative to the volume data and you use diffuse lighting only - precompute the lighting (and shadows?) into a texture
  • moving directional light - a uniform variable transformed and normalized by the CPU or by the vertex shader
  • moving spotlight - a uniform variable transformed by the vertex shader into a varying variable - normalization and per-pixel lighting in the fragment shader - so no optimization here, but you can at least do all the normalization and some other computations outside the loop.
  4. Also depends
    There are distance functions that depend on the view direction, but for volume rendering I think you can use the most basic distance function. Each texel of a 3D texture holds the distance (in texels) to the nearest opaque texel. In other words - how far you can go from that texel in any direction without hitting anything (see the sketch below).
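
A minimal sketch of how such a distance texture could be used inside the ray loop. The distTex sampler, the texelSize constant and the 8-bit storage are illustration-only assumptions, not something from the original shader:

// Hypothetical distTex: each texel's alpha stores the distance (in texels)
// to the nearest opaque texel, written as a normalized byte.
// texelSize converts texels to texture coordinates (assumes a cubic volume).
float dist = texture3D(distTex, ray).a * 255.0;
if (dist > 1.0)
{
    // Empty region: take one big jump across the known-empty area
    ray        += rayDir * (dist * texelSize);
    walkLength += dist * texelSize;
}
else
{
    // Near data: fall back to the small fixed step and shade as usual
    ray        += rayStep;
    walkLength += rayStepLength;
}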

The other solution I mentioned (downsampled volume) works in the following way:

  • let’s assume we have 512x512x512 volume
  • create 64x64x64 volume texture, filtering = GL_NEAREST (no filtering)
  • for each 8x8x8 block in the original texture compute max(value) and store in the downsampled texture

This way you can sample the smaller texture at the same coordinates at which you would sample the original texture. What you get is the answer to: is this block empty? If it is, then you skip the "detailed" sampling and move on to the next one.
Note that you must test every block you cut across. It sounds difficult, but it's not. I'm not going to explain it right now, because I don't even know whether you're interested in implementing this approach or another one.
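
A minimal sketch of that coarse test, assuming a hypothetical lowResTex sampler holding the per-block maxima (with GL_NEAREST filtering) and a hypothetical emptyThreshold uniform:

// lowResTex: 64x64x64 texture of per-8x8x8-block max values, GL_NEAREST
float blockMax = texture3D(lowResTex, ray).a;
if (blockMax <= emptyThreshold)
{
    // the whole block is empty - skip it with big steps
}
else
{
    // the block may contain data - sample volumeTex at the fine step rate
}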

Hi k_szczech,

Again many thanks for the help.
I still have some more questions:

I have two scenarios which use two different shaders:

  1. Iso-surface rendering - I use a bigger step size + hit-point refinement. This works well, but it loses data in small regions where the step just passes over the data, so the hit-point refinement never kicks in.
  2. Regular volume rendering using transfer function and the above shader.

When using iso-surface rendering it seems that I need to supersample the area of the data so I don't skip small details when using hit-point refinement.
The solution you described (downsampled volume) seems to solve that problem, and it could also help a lot in scenario 2 when I need to render a highly distributed volume.

I can easily create the downsampled texture and traverse it, but how do I know what the step size should be so that I visit every node? (It depends on the view and on the intersection of the ray with the bounding surface.)
And when I traverse the regular volume, how do I know that I'm out of the node and should go back to sampling the downsampled volume? If this were done on the CPU I would use a simple octree, but every way I can think of to implement this in a fragment shader seems difficult or too costly. Can you elaborate please :)

I like your idea about the simple distance texture.

In my implementation the enter/leave textures are not simply the coordinates of the bounding box of the volume (back/front). I subdivide the volume into 8x8x8 blocks and create a tight bounding surface around the "active" blocks, which means my rays start almost right at the data - that was a huge performance boost. If the dynamic ray step using the downsampled volume improves on that, it would be great.

Thanks,
Ido

Well, since you already cut your volume into small blocks, using the downsampled texture will not be an improvement, because it does exactly the same thing, except directly in the fragment shader.
It's more or less the same with the distance function.
Both methods serve the same purpose your "active blocks" approach does: skipping empty space.
So, if you already start near the visible data, it may not be of much use to implement either of them.

But if you want to further increase sampling quality within these blocks - I would suggest the distance function.

Using the downsampled texture is more difficult to implement (but not that difficult after all).
Basically you need to compute how many steps along the ray it takes to jump to the next block in the X axis, then the same for the Y and Z axes.
It's quite simple actually:
float stepsBetweenXblocks = blockSize.x / rayStep.x;
float stepsBetweenYblocks = blockSize.y / rayStep.y;

Or simply:
vec3 stepsBetweenBlocks = blockSize / rayStep;
Of course you can't do it exactly this way (division by 0 is possible), but the idea is exactly the same.

Now you need another vec3 with information about how far away (how many steps) you are from hitting the next X, Y or Z edge between blocks.

Now, how many steps do we have to take to hit the nearest edge between blocks?
float steps = min(min(stepsToNearestEdge.x, stepsToNearestEdge.y), stepsToNearestEdge.z);
This finds how far the nearest X, Y or Z edge is.

Now we travel (written out as GLSL; the convention here is that stepsToNearestEdge counts steps, so it is decremented by the number of steps actually taken):
if (thereIsDataInTheBlockWeAreAboutToEnter)
{
    float ii = steps + lastFract;
    int i = int(floor(ii));
    lastFract = fract(ii);
    /* lastFract is a float declared outside the loop - it's there so we don't
       miss any sample. If it turns out we have to take 2.4 steps each time,
       we'll take 2 steps the first time and 3 steps the next time. */
    stepsToNearestEdge -= vec3(float(i));
    for (; i > 0; i--)
    {
        rayPos += rayStep;
        performPerSampleOperation();
    }
}
else
{
    // Empty block - jump straight across it
    rayPos += rayStep * steps;
    stepsToNearestEdge -= vec3(steps);
}

Now we reload the smallest component of stepsToNearestEdge. That means if we hit an X edge, then stepsToNearestEdge.x will be smaller than stepsToNearestEdge.y and stepsToNearestEdge.z, so we reload:
stepsToNearestEdge.x += stepsBetweenBlocks.x;

To find out whether there's data in the block we're about to enter, we sample the downsampled texture at the position:
rayPos + rayStep * (steps * 0.5);
(with steps recomputed from the reloaded stepsToNearestEdge). This means we sample halfway between this edge and the next one, which will rarely be anywhere near the block's center, but is surely somewhere inside the block. That's why the texture with the downsampled volume must have GL_NEAREST filtering, so that the information within a block is constant.

The interesting thing about this idea is that if you put different data into the volume, then you only need to recreate the downsampled version of the volume, which is obviously less costly than updating the volume data itself. That means it's good for animated volumes.

Your current approach requires re-creating geometry, and a distance-function-based approach would require a time-costly distance texture update.

Again - the approaches I described are nothing more than empty-space skipping implemented in the fragment shader. Since you already have it in geometry, you can work on other optimizations.

Thanks k_szczech,
You gave me many points for refining and improving my shaders. I will test them and see how it goes.
Ido
