How to find the bottleneck in volume rendering

I am really confused about how to find the bottleneck, whether it is the GPU or the CPU, in my program.
I have written a volume rendering program which loads many slices as a 3D texture and then generates a 3D object.

When I load a 3D texture (about 30 MB), my program runs so slowly that when I use the mouse to drag and rotate the 3D object, it feels very, very sluggish.
Please give me some kind advice :slight_smile:
My graphics card is an NV GeForce 6800 GT.

It is really hard to say what the problem is with so little information. A 6800GT is able to render a 32MB texture interactively without any optimizations, i.e. brute-force.

There are several possibilities why your renderer is slow:

  • Fillrate bound: 1. too many slices in slice-based volume rendering, 2. too many samples along rays in raycasting, 3. an extremely large viewport, or 4. a complex fragment program (e.g. on-the-fly gradients with shading)
  • AGP/PCIe bound: The volume is uploaded for each frame (for example using glTexImage3D).
  • Context switch bound: You have a lot of pbuffer context switches per frame
  • CPU bound: You have some complex computation on the CPU for each frame
  • …

BTW: NVIDIA and ATI have several tools and presentations on their developer web sites that can help you find bottlenecks.

Hi med3d,

could you give us some more details, please? If you are looking for the bottleneck, you first have to identify the candidates; the main ones are listed by Klaus.

Volume rendering is usually memory-access bound, that is, limited by the number of memory reads and writes. These are mainly texture lookups (shader) and framebuffer reads and writes (alpha blending, stencil, z-buffer). E.g., an NVIDIA 6800U graphics card has a memory bandwidth of approximately 30 GB/s, so a 30 MB texture could be read about 1000 times per second from graphics memory. So if your frame rate is far below that, you are probably doing something wrong.

In your case, it's also possible that you accidentally chose a slow path in the driver (for example a texture format that is not supported by your graphics card and is emulated by the driver).


Thank you, Klaus and mlb :slight_smile:
About fillrate bound: 1. Certainly I have used many slices for my volume rendering.
My software is designed to reconstruct the human body from a sequence of medical image slices, such as CT or MR, so I have no other choice :slight_smile:
2. I use hardware-accelerated 3D textures for volume rendering instead of raycasting.
I use a proxy geometry to manage view-aligned slices for the 3D texture mapping.
3. The viewport in my program is set like this: glViewport(0, 0, m_iWndWidth, m_iWndHeight);
In the above code snippet, m_iWndWidth is the window width, which is about 500 pixels, and m_iWndHeight is the corresponding window height. 4. The fragment shader isn't too complex, at least it doesn't seem so to me :slight_smile: I just do a texture lookup and then calculate Phong lighting in the shader.
The gradients are pre-calculated on the CPU during the data loading and preparation phase. I even tried a simplified fragment shader like this:

void main (void)
{
	vec4 index = texture3D(my3DTexture, gl_TexCoord[0].stp);
	gl_FragColor = vec4(index.z);
}

But to my disappointment, the frame rate remained low.

About AGP/PCIe bound: I do use glTexImage3D during the data loading and preparation phase, but I don't call it every frame.


	case 8:
		glTexImage3D(GL_TEXTURE_3D, 0, GL_LUMINANCE8, width, height, depth, 0, GL_LUMINANCE, GL_UNSIGNED_BYTE, data);
		break;
	case 32:
		glTexImage3D(GL_TEXTURE_3D, 0, GL_RGBA, width, height, depth, 0, GL_RGBA, GL_UNSIGNED_BYTE, data);
		break;

But I still don't know how this code could harm the frame rate. Please explain in more detail.

About context switch bound: There is no pbuffer in my current version.

About CPU bound: I just calculate the proxy geometry from the new mouse position,
then send uniforms to the shader and redraw the whole scene.

To mlb:
I load many slices and treat them as a 3D texture. The slices are 512×512 in dimension, and each voxel takes up one byte. Is there anything wrong with that?

Originally posted by med3d:

But I still don't know how this code could harm the frame rate. Please explain in more detail.

Try using mipmapping on your 3D textures; it should be supported by the GF6800. It is possible that you are thrashing the texture cache on the card.

Hi med3d,

You can use gDEBugger to help you find the bottleneck. (A 30-day trial version is available for download.)

gDEBugger's profiling views contain performance counter graphs for Win32, gDEBugger, and vendor-specific graphics boards and drivers (NVIDIA and 3Dlabs), including: CPU/GPU idle, graphics memory consumption, vertex and fragment processor utilization, shader waits for texture, number of function calls per frame, frames per second, amount of loaded textures and texels, etc.

The Performance Analysis Toolbar enables you to pinpoint application performance bottlenecks quickly and easily. There are commands that let you disable stages of the graphics pipeline one by one. Commands include: eliminate all OpenGL draw commands, force a single-pixel viewport, render with no lights, force 2×2 stub textures, and force a stub fragment shader.

If the performance metrics in the profiling views improve while a certain stage is turned off, you have found a graphics pipeline bottleneck!

Let us know if you need any further assistance,
The gDEBugger team

Mipmapping will help with minification only.

Do you use power-of-two textures? Non-power-of-two textures can have very bad performance characteristics due to cache thrashing.


Originally posted by Klaus:
Mipmapping will help with minification only.

Do you use power-of-two textures? Non-power-of-two textures can have very bad performance characteristics due to cache thrashing.

thank you, Klaus

I ran a small test today: when I load a 3D texture of size 256×256×140, the
fps is only 2~3; but when the texture is enlarged to 256×256×256, my application
runs interactively at about 8-10 fps. Maybe that is where the rub is.

I used to believe that 2D textures had to be power-of-two, but that the depth of a 3D texture
could be arbitrary. That is completely wrong! :frowning:

med3d, just out of curiosity, if you use the next lower 2^n (i.e. 128) for the third dimension, do you also see the speedup?

It indeed seems strange to me too that a non-power-of-two third dimension would hurt performance (then again, I don't know how 3D textures are actually stored and/or organized in memory).

hi, tamlin

Perhaps you mistook my meaning :slight_smile: In my test, the texture of size 256×256×140 actually
contains a human head, and I can't omit any part of it, so I appended a 256×256×116 block that is
full of zeros to the original one.

If I used the next lower 2^N for the depth, just as you said, the performance would no doubt
be better than with the current 256×256×256 one. But this procedure would cut away part of
the patient's head in my texture, and nobody would allow me to do that :slight_smile:

If my original texture were 256×256×120, surely I would use 256×256×128 instead,
not 256×256×256. Don't you think so?

Using texture bricks is a better approach to this cache-thrashing problem. Cutting the original texture
into bricks that are 2^N in every dimension would save precious video memory while
keeping the performance. But the introduction of bricks would require a new proxy-geometry class,
which must be troublesome. If you are experienced in 3D-texture-based volume rendering, please
give me some hints. Thanks in advance.

Hi med3d,

you should probably forget all the “conspiracy theories” about cache thrashing.

Thrashing is the term for constant data swapping (hence the name "thrashing", which describes the sound of a hard disk's head), i.e. when your memory isn't large enough and pages are swapped constantly. Cache thrashing describes the same effect, but for paging between cache and main memory.

In your situation, cache thrashing could occur between main memory and graphics memory (which is not the case here, since you are using only 30 MB of textures). The other possible thrashing could be between the texture cache and graphics memory, which I consider rather unlikely. AFAIK it works very well for NPOT 2D textures. The same should in theory be true for 3D textures, but apparently not in practice.

I think it is very likely that the current NVIDIA NPOT implementation for 3D textures is not very efficient. Maybe it does not work with mipmaps, or it could even be a software fallback. So, at the moment, you'd better stick with POT textures.

You should program some simple tests to find out, and tell us about your results.

P.S. Did you check if you have a current driver?


Hi mlb,

You are right, the current NVIDIA driver supports NPOT 3D textures, but with poor performance.
As I said above, a small change to the 3D texture's depth changed the
FPS dramatically; the result is described in my earlier post.

My current driver is the Instrumented Driver for NVPerfKit, version 79.70. Oh, Coolbits is also installed.

BTW, could anybody kindly tell me how to break a 3D texture into bricks and then display them
as a whole? I don't know the details.

Originally posted by med3d:
Perhaps you mistook my meaning :slight_smile: In my test, the texture of size 256×256×140 actually contains a human head, and I can't omit any part of it
I understand that's not possible in the final result. I was more curious whether you had tested throwing away the extra 12 planes to see if 256×256×128 made a performance difference large enough to say "Yep, it's an NPOT problem".