shader efficiency

minidrive · September 16, 2014, 8:02pm

Hi

I have read a few posts online and a PowerPoint presentation by NVIDIA which state that using if-statement branches in a shader slows things down.
In particular, using an if statement on something like the thread ID would be a no-no.

Since a thread ID would be composed of things like gl_WorkGroupID and gl_LocalInvocationID, I assume the same restriction on if statements applies to these. Am I right to make such an assumption?

So to implement something akin to the pseudocode below, how would one go about doing it efficiently in a compute shader?

//if in a particular spot in the 3d grid (using gl_WorkGroupID and gl_LocalInvocationID)
//do something
//else
//do nothing

I have also read some contradictory posts online saying that modern GPUs can handle if statements because they effectively execute both branches (with masking, so that each thread only keeps the results of the branch relevant to it). So branching is not supposed to be very costly.

I have a GTX 570 (GF114 architecture). Is this considered to be a modern GPU?
Thanks

Agent_D · September 17, 2014, 8:18am

The problem with branches on GPUs is that GPU architectures typically have groups of processing cores with one shared program memory and instruction decoder per group.

An instruction is loaded from memory and decoded, and the decoder lines are connected to a lot of cores that each have an ALU, a register set and local data memory. One instruction is executed by a lot of cores in parallel, each working on different data, which is really good for triangle rasterization and other data-parallel tasks.

When a branch comes along in the code that depends on local data (or anything that could potentially be different for different cores), the program flow somehow has to split if the branch condition evaluates differently for the individual cores. I don’t really know how graphics cards handle that (split the program into two groups perhaps?), but it costs a lot of performance.

What you can use those ID values for is to fetch data depending on the ID of the core, so every core fetches different data from a buffer while executing the same fetch instruction in parallel.
So your “if in particular spot in grid” is basically “compute grid position from IDs; compute buffer offset from grid position; fetch; do something interesting”.
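To make that recipe concrete, here is a minimal GLSL sketch of "compute grid position from IDs; compute buffer offset; fetch; do something". The buffer names (inValue, outValue), the binding points, the 8x8x8 work group size and the "multiply by two" operation are all assumptions made up for illustration. It also assumes the grid dimensions are exact multiples of the work group size, so that the dispatch covers the grid exactly and no bounds check is needed.

```glsl
#version 430

// Hypothetical work group size, chosen only for illustration.
layout(local_size_x = 8, local_size_y = 8, local_size_z = 8) in;

// Hypothetical buffers: one input and one output value per grid cell.
layout(std430, binding = 0) readonly buffer InputGrid { float inValue[]; };
layout(std430, binding = 1) writeonly buffer OutputGrid { float outValue[]; };

void main()
{
    // Grid position from the IDs. gl_GlobalInvocationID is defined as
    // gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
    uvec3 cell = gl_GlobalInvocationID;

    // Assumption: the dispatch covers the grid exactly, so the grid size
    // is the number of work groups times the work group size.
    uvec3 gridSize = gl_NumWorkGroups * gl_WorkGroupSize;

    // Buffer offset from the grid position (x varies fastest).
    uint offset = cell.x + gridSize.x * (cell.y + gridSize.y * cell.z);

    // Every invocation runs the same fetch and store instructions,
    // each on its own element.
    float v = inValue[offset];
    outValue[offset] = 2.0 * v; // placeholder for "do something interesting"
}
```

Every invocation follows the same instruction sequence here; only the data it touches differs, which is the access pattern described above.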

If your problem cannot be solved by an algorithm that reads elements from a stream of data and generates elements in an output stream of data (where elements can be grouped in a way that they can be processed independently), then current GPU architectures probably won’t help you make it faster.

GClements · September 19, 2014, 1:37pm

What makes branching “costly” is that the GPU typically executes both branches.

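Modern GPUs can actually skip over one of the branches if the condition is the same for all invocations within a workgroup (strictly, within each group of invocations that the hardware executes in lockstep, such as a 32-wide warp on NVIDIA). But that isn’t likely to be the case if you’re using gl_LocalInvocationID in the condition.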
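As a rough illustration of the difference, the sketch below puts the two kinds of condition side by side. The buffer name (value), the binding point, the work group size of 64 and the arithmetic inside the branches are assumptions chosen only for the example, and the buffer is assumed to hold one element per invocation in the dispatch.

```glsl
#version 430

// Hypothetical work group size, chosen only for illustration.
layout(local_size_x = 64) in;

// Hypothetical buffer with one element per invocation.
layout(std430, binding = 0) buffer Data { float value[]; };

void main()
{
    uint i = gl_GlobalInvocationID.x;

    // Uniform condition: gl_WorkGroupID is identical for every invocation
    // in a work group, so the whole group takes the same path and the
    // untaken side can be skipped.
    if (gl_WorkGroupID.x == 0u) {
        value[i] *= 2.0;
    }

    // Potentially divergent condition: neighbouring invocations disagree,
    // so invocations that fail the test are masked off while the others
    // execute the body.
    if ((gl_LocalInvocationID.x & 1u) == 0u) {
        value[i] += 1.0;
    }
}
```

Note that in the "do something, else do nothing" case from the original question there is no second branch to execute, so the masked-off invocations simply wait while the body runs; the extra cost is roughly that of the body itself rather than of two code paths.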