Confusion about SM3 branch execution

Hi there,

I am currently writing a terrain splatting shader in GLSL where I would like to allow blends between 8-12 textures. I have several ideas for how I could reduce the number of textures a shader has to deal with simultaneously (generating specialized shaders for smaller areas) to about 4.

However, this means I would still have 4-8 texture reads per pixel, although I would effectively only need 2.
I therefore thought branching would be a good way to reduce texture reads, but to be honest I do not really understand how branches are executed and whether they bring any advantage for me.
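To make this concrete, here is roughly the kind of shader I have in mind (just a sketch - the sampler names and the blend-map layout are made up for illustration):

uniform sampler2D blendMap; // hypothetical: RGBA = weights of the 4 layers
uniform sampler2D layer0;
uniform sampler2D layer1;
uniform sampler2D layer2;
uniform sampler2D layer3;

void main()
{
    // blend weights come from a second texcoord set
    vec4 w = texture2D(blendMap, gl_TexCoord[1].st);

    // naive version: always 4 detail fetches per pixel,
    // even where only 2 of the weights are non-zero
    vec4 c = w.x * texture2D(layer0, gl_TexCoord[0].st)
           + w.y * texture2D(layer1, gl_TexCoord[0].st)
           + w.z * texture2D(layer2, gl_TexCoord[0].st)
           + w.w * texture2D(layer3, gl_TexCoord[0].st);

    gl_FragColor = c;
}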

I would be really happy if somebody could take the time to answer my questions:

  • Is memory bandwidth consumed as soon as I allow the shader to access a texture (the declaration of a “uniform samplerXD” in the shader), or at the time I access a texel using textureXD?

  • I read that under some conditions (at least on NVIDIA NV4x chips) both sides of a branch have to be executed even under SM3, as was the case with SM2. For a texture splatting shader I guess branching won't help at all under these circumstances (e.g. neighboring pixels would quite often choose different branches).
    Have there been enhancements to branching in the latest generation of GPUs (the NV7x and X1x00 series)?
    When is branching worth the trouble, and when not?

Thank you in advance, lg Clemens

Memory bandwidth is consumed on fetch, that is, on your texture2D() calls.
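As a minimal illustration (the sampler name is made up):

uniform sampler2D detailTex; // declaring the sampler alone costs no bandwidth

void main()
{
    vec4 c = vec4(0.0);
    if (gl_TexCoord[0].s > 0.5)
        c = texture2D(detailTex, gl_TexCoord[0].st); // bandwidth is consumed here
    gl_FragColor = c;
}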

On all hardware, branches are executed at a certain granularity. On the X1800 series it's 16 pixels; on the X1900 it's 48 pixels. On nVidia it's much larger - I don't know exactly how large, but I've heard “about 1000 pixels” in the past. If the shader takes different paths within a block of the branching granularity, all pixels in that block are dragged along both paths. This means that if you branch on a variable that varies wildly between pixels, you won't see any performance increase. If it varies more smoothly, you'll have better luck. 16 or 48 pixels is not a large part of the screen, so if small neighborhoods can usually take the same branch, you should be able to take advantage of dynamic branching.

For dynamic branching to really improve things, you need to skip more than just a single instruction. The exception to that is texture fetches, especially on the X1900, which is a beast on ALU work but more average on texture fetches. You shouldn't have to skip a massive number of instructions to see a difference, though. If you can skip perhaps 4 instructions, that may be enough to see an increase. The only way to know for sure is to test it yourself in your particular shader.
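Applied to your splatting case, the fetch-skipping pattern could look like the sketch below (reusing the made-up names from your first post; whether the branches actually pay off depends on how coherently the blend weights vary across the screen):

uniform sampler2D blendMap;
uniform sampler2D layer0;
uniform sampler2D layer1;
uniform sampler2D layer2;
uniform sampler2D layer3;

void main()
{
    vec4 w = texture2D(blendMap, gl_TexCoord[1].st);
    vec4 c = vec4(0.0);

    // each fetch is skipped only where the whole branching block
    // agrees that the corresponding weight is zero
    if (w.x > 0.0) c += w.x * texture2D(layer0, gl_TexCoord[0].st);
    if (w.y > 0.0) c += w.y * texture2D(layer1, gl_TexCoord[0].st);
    if (w.z > 0.0) c += w.z * texture2D(layer2, gl_TexCoord[0].st);
    if (w.w > 0.0) c += w.w * texture2D(layer3, gl_TexCoord[0].st);

    gl_FragColor = c;
}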

Thanks a lot, Humus 🙂

1.) Really good to know about the bandwidth thing - I guess this allows me to tune my shaders effectively and also do some things to keep them runnable on older cards.

2.) Ah well, 48 px is still OK - but 1000 px for NV chips is … a bit ugly. I think I'll let NVIDIA bleed for this design decision ^^

Again thanks a lot, lg Clemens

As far as I know, the NVidia chips operate in lock-step on a per-quad basis. By quad, I'm referring to the 2x2 pixel quad in which things are rasterized. You can see this by creating a texture with a different solid color filled into each mip level. Then use nearest-mipmap-nearest filtering, and you'll see that the mip level is chosen on a 2x2 quad basis due to the derivative calculation. Now, all 4 pixels in the 2x2 block will take the same paths. Take this case, for instance:

if (flag)
    gl_FragColor = doFunc1();
else
    gl_FragColor = doFunc2();

If ‘flag’ evaluates to true for all 4 pixels in the 2x2 pixel quad, then only the code for doFunc1() will be executed. If ‘flag’ is false for all 4 pixels, then only the code for doFunc2() will be executed. However, if ‘flag’ is not the same for all of the pixels in the 2x2 quad, then both doFunc1() AND doFunc2() will be executed for all of the pixels in the quad, meaning that you gain nothing by using branching in this particular case.

Kevin B

It's true that fragments are processed by quad processors working on 2x2 fragments (both NVIDIA and ATI cards do that). However, this doesn't mean that only these four fragments have to take the same branch. The quad processors just share some resources like the L2 cache, but they are still part of a SIMD machine, so each of the quad processors has to process the same instruction at the same time…

To take advantage of dynamic branching, the whole group of quads has to take the same branch. The number depends on many things (such as the number of quad processors, where a lower number is better - that's why, for example, the 6600 is better at dynamic branching than the 6800), but somewhere I read that on NVIDIA it's usually necessary for 256 quads to take the same branch (so it's really about 1000 fragments).

but somewhere I read that on NVIDIA it's usually necessary for 256 quads to take the same branch
Link, please?

Also, what if your triangles don’t generate 256 quads? What if they only generate, like, 6? Or has the notion of triangles been obliterated by this point in the pipeline?

It's been some time since I studied this, so I don't know the exact place where I read it (and I don't think it's official). However, after a short googling session I found some links about this:
Here is a link to a page about NVIDIA cards (7800 vs. 6800) which shows the performance gain depending on the fragment batch size:
http://www.behardware.com/articles/574-3/nvidia-geforce-7800-gtx.html

Here is the same graph for the X1800 vs. the 7800:
http://www.behardware.com/articles/592-3/ati-radeon-x1800-xt-xl.html

As I understand it, all fragment processors (i.e. 24 fragment processors or 4×6 quad processors, it doesn't matter) execute the same instruction for all fragments in one batch of fragments (which appears to be around 1000 on the 7xxx line), then they load the next instruction and do the same for the whole batch… and so on.
If you process only 6 fragments with one shader, you will be penalized not only on dynamic branching (since no branch can be skipped if fewer than ~1000 fragments are processed on NVIDIA) but also by the fact that most of the fragment processors will sit unused during that time.
And yes, triangles have no meaning in fragment processing, because fragments are produced during rasterization, which is the step before fragment processing in the GPU pipeline.

There’s a beautiful overview of the 6800 architecture in GPU Gems 2 (not to shill or anything). They talk about this stuff and a good deal more. Even I can understand it.

Edit:

Some more good stuff on Nvidia GPUs:
http://developer.nvidia.com/page/documentation.html
