If dynamic branching isn't supported, you'll have to unroll the loop. The loop should cover a small, fixed range of the total list (say, 1/16th of it for a 4x4 pixel kernel). Each pixel then determines its base index from its location and adds that to each loop offset. That way each pixel computes its own chunk of the list.
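To make the indexing concrete, here is a rough CPU-side sketch in Python that simulates what each pixel would do. It is not shader code. The names (`process_pixel`, `run_kernel`, `CHUNK`), the row-major pixel ordering, and the choice of summing as the per-element work are all assumptions for illustration. The chunk size is fixed at 4, so the "loop" is written out by hand the way an unrolled shader loop would be.

```python
# Hypothetical CPU simulation of a 4x4 pixel kernel splitting a list
# into 16 chunks. All names and sizes here are illustrative assumptions.
KERNEL_W = 4
KERNEL_H = 4
CHUNK = 4  # elements per pixel; a fixed constant so the loop can be unrolled


def process_pixel(data, x, y):
    """Simulate one pixel: sum the chunk of the list owned by pixel (x, y)."""
    # Base index from the pixel's location (row-major order is an assumption).
    base = (y * KERNEL_W + x) * CHUNK
    # Unrolled loop: CHUNK is 4, so there are four explicit reads
    # and no data-dependent branching.
    acc = data[base + 0]
    acc += data[base + 1]
    acc += data[base + 2]
    acc += data[base + 3]
    return acc


def run_kernel(data):
    """Run every pixel of the kernel and return a 4x4 grid of partial sums."""
    if len(data) != KERNEL_W * KERNEL_H * CHUNK:
        raise ValueError("list length must equal 16 * CHUNK")
    return [[process_pixel(data, x, y) for x in range(KERNEL_W)]
            for y in range(KERNEL_H)]


data = list(range(64))
partial = run_kernel(data)
total = sum(sum(row) for row in partial)
```

Each pixel only ever touches `CHUNK` elements starting at its own base index, so the 16 pixels together cover the whole list exactly once. On real hardware you'd replace the list reads with texture fetches and pick the chunk size to fit your list length.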
Check out http://gpgpu.org . They have some tutorials on list processing there.