I see what you’re saying. It's not super-simple to just toss onto the CPU. Given the shared locality, it sounds like a job for OpenCL/CUDA, or possibly ARB_shader_image_load_store. (BTW, I think I might have helped review your chapter. If not, let me just plug the OpenGL Insights book: worth buying a copy when it comes out.)