NV40 branching behavior

I am trying to take advantage of the new branching capabilities of the NV_fragment_program2 extension on my GeForce 6800 GT to speed up my program, but I get very strange results…
It’s a volume rendering application, and currently I use the fragment program just to do a texture lookup that maps color onto the 3D scalar field. To test the IF statement, I simply kill the fragment if density > 0.0, so every fragment is killed. The fragments are indeed killed (nothing is displayed), but my framerate drops from ~40 fps to ~7 fps! If I comment out the IF statement and the KIL, the framerate stays at ~40 fps, so the KIL also seems to be inefficient…
Here is the FP code I use:


OPTION NV_fragment_program2;

ATTRIB tex0=fragment.texcoord[0];
OUTPUT oColor=result.color;

#light color
PARAM lcol={0.8,0.8,0.8};

TEMP index;
#texture fetch, alpha = density
TEXC index, tex0, texture[0], 3D;
IF GT.a;				#every fragment is killed
	KIL {-1,-1,-1,-1};	
ELSE;					#Texture lookup
	MOV index.r, 0.0;
	TEX oColor, index.argb, texture[2], 2D;
ENDIF;


If anybody has tried to use branching and could help me…

Hi TheAlbert,

I think the main problem is that you replaced a really simple 3 instruction program with one that can now have divergent threads of execution.

To use branching efficiently, you need to have a) coherence and b) significant work inside the branch conditions.

I can’t really tell about (a) from the shader, but (b) doesn’t look too good.

Thanks -

KIL doesn’t cause execution to terminate; you need to use RET instead.

Try this program:

TEXC index, tex0, texture[0], 3D;
MOV oColor, 0.0;
RET (GT.a);
MOV index.r, 0.0;
TEX oColor, index.argb, texture[2], 2D;

And set alpha test to discard fragments with srcAlpha=0.

Thanks cass and gking for your advice!
I successfully got a performance increase with RET, since much more computation is skipped when the density (alpha) equals 0. With the following code, which computes an approximate gradient for basic lighting plus a simple filter, my framerate rises from ~7.5 fps without the RET statement to ~14 fps with RET enabled:


#texture fetch, alpha = density
TEXC index, tex0, texture[0], 3D;
#IF EQ.a;				#every fragment are killed
#	KIL {-1,-1,-1,-1};	
#ELSE;					#Texture lookup
	MOV oColor, 0.0;
	RET (EQ.a);
	MOV index.r, 0.0;
	#get neighbours
	MOV coords, tex0;
	ADD coords.x, tex0, offset;
	TEX d0, coords, texture[0], 3D;
	ADD coords.x, tex0, -offset;
	TEX d1, coords, texture[0], 3D;
	MOV coords.x, tex0.x;
	ADD coords.y, tex0, offset;
	TEX d2, coords, texture[0], 3D;
	ADD coords.y, tex0.y, -offset;
	TEX d3, coords, texture[0], 3D;
	MOV coords.y, tex0.y;
	ADD coords.z, tex0, offset;
	TEX d4, coords, texture[0], 3D;
	ADD coords.z, tex0, -offset;
	TEX d5, coords, texture[0], 3D;
	#comp normal/gradient
	ADD normal.x, d0.a, -d1.a;
	ADD normal.y, d2.a, -d3.a;
	ADD normal.z, d4.a, -d5.a;
	#comp lighting
	DP3 temp, lightPos, normal;
	MUL temp.xyz, lcol, temp;
	ADD index.a, index.a, d0;
	ADD index.a, index.a, d1;
	ADD index.a, index.a, d2;
	ADD index.a, index.a, d3;
	ADD index.a, index.a, d4;
	ADD index.a, index.a, d5;
	MUL index.a, index.a, 0.1428571429;
	#get the color
	TEX dens, index.argb, texture[2], 2D;
	ADD oColor.rgb, dens, temp;
	MOV oColor.a, dens.a;


But the performance depends heavily on the number of opaque voxels present: when more than approximately a third (or a half) of the voxels in the 3D map are not fully transparent (density != 0), the version with the RET statement seems to be much more expensive :frowning:
So I am still a bit disappointed by the cost of this kind of branching. Maybe I need to put more computation in the fragment program and decrease the number of rasterized primitives (screen-parallel slices; currently I am testing with 1000 quads), for example, to see a larger performance increase…
Hey cass, I don’t quite understand the notion of coherence you mentioned. What is it exactly?
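
To make the break-even observation above concrete, here is a rough cost model in Python. It is only a sketch: the cost numbers (`fetch`, `branch`, `work`) are made-up units, not measured NV40 figures, and `p_slow` stands for the effective fraction of fragments that end up executing the expensive path.

```python
# Toy cost model for an early-out branch (all costs are hypothetical units,
# NOT measured NV40 numbers).

def cost_with_early_out(p_slow, fetch=1.0, branch=5.0, work=20.0):
    """Average per-fragment cost when a RET-style early out is used.

    p_slow: effective fraction of fragments that run the expensive path.
    branch: assumed fixed overhead paid by every fragment for branching.
    """
    return fetch + branch + p_slow * work

def cost_without_branch(fetch=1.0, work=20.0):
    """Average per-fragment cost when every fragment does all the work."""
    return fetch + work

def break_even_fraction(branch=5.0, work=20.0):
    """Fraction of slow-path fragments at which both versions cost the same."""
    return 1.0 - branch / work
```

With these assumed numbers the early out only pays off while fewer than 75% of the fragments take the slow path. The larger the branch overhead is relative to the skipped work, the lower that threshold gets, which would be consistent with the need for significant work inside the branch.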

Hi TheAlbert,

The coherence issue has to do with how graphics processors are designed to work. If every fragment takes a different path through a fragment program, then lots of things (like pipeline efficiency and cache coherency) just go away.

GPUs take advantage of an “economy of scale” by having deep pipelines. Branching can be disruptive to the pipeline if it happens too

Does that help?

Thanks -

Hi cass,
If I understand correctly, that means the spatial distribution of the data has an important impact on performance. If non-transparent voxels are grouped together, it will be better than if they are spatially scattered. I heard something about the notion of quads; it may be linked to that.
Another strange thing is the behavior of the IF statement: I get half the performance of the RET version, even though it seems it should logically do the same thing in my program…
Thank you for your help, it helps me a lot :slight_smile:
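
The grouping idea discussed above can be illustrated with a small Python toy model. It assumes, purely for illustration, that fragments are processed in square tiles and that a whole tile pays for the expensive path as soon as one of its fragments takes it. The 4x4 tile size and the function name are hypothetical; the thread does not state the real hardware granularity.

```python
# Toy model of branch coherence. ASSUMPTION: fragments are processed in
# tile x tile blocks, and a block runs the slow path if ANY fragment in it
# needs it. The tile size here is hypothetical, not an NV40 specification.

def tiles_taking_slow_path(mask, tile=4):
    """Count tile x tile blocks that contain at least one opaque fragment."""
    height, width = len(mask), len(mask[0])
    count = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            if any(mask[y][x]
                   for y in range(ty, min(ty + tile, height))
                   for x in range(tx, min(tx + tile, width))):
                count += 1
    return count

SIZE = 16
# Grouped: the opaque fragments fill the four leftmost columns.
grouped = [[x < 4 for x in range(SIZE)] for y in range(SIZE)]
# Scattered: the same number of opaque fragments, one per 2x2 block.
scattered = [[x % 2 == 0 and y % 2 == 0 for x in range(SIZE)]
             for y in range(SIZE)]

grouped_tiles = tiles_taking_slow_path(grouped)
scattered_tiles = tiles_taking_slow_path(scattered)
```

Both masks contain 64 opaque fragments out of 256, but under this assumption the grouped layout sends only 4 of the 16 tiles down the slow path, while the scattered layout sends all 16, so the early out saves nothing in the scattered case.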