Adding up on the GPU.

I have a chunk of data in video memory (from glReadPixels) and I want to sum all the ubyte values, preferably taking advantage of the GPU’s SIMD capabilities. How can I do that?

Is it something that’s possible in Cg?

Right now I return all the data to system memory and sum it on the CPU, but this is a bit slow. It makes sense to do the summing on the GPU, since glReadPixels to video memory is about 9 times faster (1.5Gb/sec for me) than to system memory.

I’ve looked at register combiners and texture shaders and I don’t think it’s possible with those. Maybe I could do it with a vertex program (pretend each rgba is an xyzw) and add up all the vectors?

I will do it in Cg if I have to, but I would prefer a solution that works in hardware on existing cards.

You don’t have to use Cg. It does not provide additional features, just a nicer (??) way to do it, sometimes…

As for your problem: do you really want the sum, or the average? For a sum you need fairly new GPUs with floating-point buffers, because otherwise you have nowhere to accumulate the result with unsigned bytes…

If you want to average:
Render to texture (or use glCopyTexSubImage2D), then draw that texture as a quad on screen at half the width and half the height (width/2, height/2) with GL_LINEAR filtering… If the texcoords are correct, each output pixel is the average of 4 source pixels.

Do that recursively until width or height == 1… then you have the average over all pixels…

Thanks Davepermen. I really need the sum or an exact average. In most cases the average would be a low number (<10), so precision issues would be significant with the averaging method you describe.
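To make the precision concern concrete, here is a minimal CPU-side sketch in C of the recursive 2x2 averaging described above. It is not GL code: it assumes each pass rounds the averaged value to the nearest 8-bit integer, which is only a model of what an 8-bit framebuffer would store (real hardware filtering and rounding may differ). The function name downsample_average_u8 is made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Repeatedly halve a square 8-bit image by averaging 2x2 blocks, rounding
 * to a byte after every pass (ASSUMPTION: round-to-nearest, as a stand-in
 * for an 8-bit render target). `size` must be a power of two.
 * Returns the single value left in the final 1x1 image. */
uint8_t downsample_average_u8(const uint8_t *pixels, int size)
{
    assert(size > 0 && (size & (size - 1)) == 0);

    uint8_t *buf = malloc((size_t)size * (size_t)size);
    assert(buf != NULL);
    memcpy(buf, pixels, (size_t)size * (size_t)size);

    for (int n = size; n > 1; n /= 2) {
        int half = n / 2;
        for (int y = 0; y < half; ++y) {
            for (int x = 0; x < half; ++x) {
                unsigned s = buf[(2 * y) * n + 2 * x]
                           + buf[(2 * y) * n + 2 * x + 1]
                           + buf[(2 * y + 1) * n + 2 * x]
                           + buf[(2 * y + 1) * n + 2 * x + 1];
                /* In-place is safe: the write index y*half+x never exceeds
                 * any source index that still has to be read. */
                buf[y * half + x] = (uint8_t)((s + 2) / 4);
            }
        }
    }

    uint8_t result = buf[0];
    free(buf);
    return result;
}
```

Under this model, a 4x4 image in which each 2x2 block holds a single pixel of value 1 (true sum 4, true mean 0.25) reduces to 0, so small values can vanish entirely, while a uniform image of 7s stays at 7.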

For the average, using the auto mipmap generator is one solution (used by some people). Just query the 1x1 mipmap and there you go.

For the sum, I don’t know, because if you keep adding up, then clamping becomes an issue whether you have float or ubyte. Something like the accum buffer is what is needed here: a buffer where you render onto the same pixel with additive blending enabled.

If you have an idea for performing the sum, post it!


The mipmap idea won’t give me enough precision. I do some more calculations with the final number and a single byte will give me artifacts. I could compromise and just reduce to 16x16 or something, but I would prefer just to be able to sum and divide by the number of pixels. There is also an overhead in calling things like glCopyTexSubImage2D, and since I want to call it 10,000 times per second it becomes significant. Also, glCopyTexSubImage2D is crashing right now when I have GL_READ_PIXEL_DATA_RANGE_NV enabled.

It’s frustrating when I can do it on the CPU in one simple line of C code (using IPP, SIMD-optimised). Is this kind of flexibility on the GPU planned, or is it seen as too rare a requirement to be necessary? The fact that you are also interested in a ‘sum’ function suggests it may be more widely useful.

My vertex program idea was a non-starter after I did some more investigation.

[This message has been edited by Adrian (edited 12-22-2002).]

You can do it in Cg using FP30, which has f4tex2D(sampler, texcoord);
if you are able to iterate over the whole texture surface, maybe you can do something…

Here is something that I played with on a GeForce2 a couple of years ago. I wanted to do something similar to you, i.e. sum up the values in the framebuffer without having to do a glReadPixels on them all and use the CPU. What I did was fold the framebuffer with additive blending enabled. For example, if my area of interest was 64x64 pixels, I copied half of the area on top of the other half, in effect summing the two. I then continued, copying half of the resulting half onto its other half, and so on.

12 or so copies later, you have the sum of the 64x64 pixels in one pixel in the top left-hand corner. This one value can then be read back using glReadPixels and processed on the CPU.

Now, as previously mentioned, a byte will quickly overflow, but INTENSITY16 might give you enough headroom. The downside to this for me, if I remember correctly, was that it didn’t turn out to be any faster than reading the entire framebuffer and then using the CPU.

Perhaps it was my implementation, I suppose you could try it.
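The folding idea can be sketched on the CPU to check the pass count and the overflow behaviour. This is a simulation in plain C, not the poster's GL implementation: additive blending is modelled as a saturating add against a hypothetical max_value (255 for a byte channel, 65535 for a 16-bit one), and the names blend_add and fold_sum are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Saturating add, standing in for additive blending into a fixed-point
 * buffer whose largest storable value is max_value (ASSUMPTION).
 * Inputs are expected to be <= max_value, with max_value well below 2^31. */
static uint32_t blend_add(uint32_t dst, uint32_t src, uint32_t max_value)
{
    uint32_t s = dst + src;
    return s > max_value ? max_value : s;
}

/* Fold a row-major width x height image onto its top-left pixel by
 * repeatedly adding one half of the active region onto the other half.
 * Both dimensions must be powers of two. The image is modified in place;
 * img[0] holds the (possibly clamped) total afterwards.
 * Returns the number of fold passes performed. */
int fold_sum(uint32_t *img, int width, int height, uint32_t max_value)
{
    assert(width > 0 && (width & (width - 1)) == 0);
    assert(height > 0 && (height & (height - 1)) == 0);

    int w = width, h = height, passes = 0;
    while (w > 1 || h > 1) {
        if (w >= h) {            /* add the right half onto the left half */
            w /= 2;
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x)
                    img[y * width + x] = blend_add(img[y * width + x],
                                                   img[y * width + x + w],
                                                   max_value);
        } else {                 /* add the bottom half onto the top half */
            h /= 2;
            for (int y = 0; y < h; ++y)
                for (int x = 0; x < w; ++x)
                    img[y * width + x] = blend_add(img[y * width + x],
                                                   img[(y + h) * width + x],
                                                   max_value);
        }
        ++passes;
    }
    return passes;
}
```

For a 64x64 region this takes 12 passes, matching the count above. With every input equal to 1, a 16-bit limit yields 4096 in img[0], while an 8-bit limit clamps at 255. A 16-bit channel tops out at 65,535, which is only 257 full-scale bytes, so whether it gives enough headroom depends on how small the values are.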

It doesn’t surprise me that it turned out slower like that. That’s quite a few extra copies and renders. I need a very fast and accurate solution. ARB_READ_PIXEL_SUM plz

What is the range of the numbers you want to add? How many? How are they laid out?


I want to add about 128x128 RGB pixels = ~49,000 unsigned chars (0-255). The result would need to be returned as an integer. It would be useful to have the sum for each colour channel. The pixel data is contiguous in video memory.
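For reference, the CPU-side version of this requirement is short. The sketch below is a plain scalar loop in C, not the IPP call mentioned earlier; it assumes tightly packed 8-bit RGB with no row padding, and the names ChannelSums and sum_rgb8 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Per-channel totals for an image of 8-bit RGB pixels. */
typedef struct {
    uint32_t r, g, b;
} ChannelSums;

/* Sum each colour channel of `pixel_count` tightly packed RGB pixels
 * (ASSUMPTION: 3 bytes per pixel, no padding between rows). */
ChannelSums sum_rgb8(const uint8_t *pixels, size_t pixel_count)
{
    /* Guard: the worst-case total (all bytes 255) must fit in 32 bits. */
    assert(pixel_count <= UINT32_MAX / 255u);

    ChannelSums s = {0, 0, 0};
    for (size_t i = 0; i < pixel_count; ++i) {
        s.r += pixels[3 * i + 0];
        s.g += pixels[3 * i + 1];
        s.b += pixels[3 * i + 2];
    }
    return s;
}
```

For 128x128 pixels the worst case per channel is 16,384 x 255 = 4,177,920. That fits easily in a 32-bit integer and is also below 2^24 = 16,777,216, so a 32-bit float accumulator could represent every partial sum exactly at this image size.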

I’ve decided the extension I suggested may not be a good idea, since it would be better if it could do more than just sum. There are other image functions that might be useful, such as spatial moments. Intel’s performance primitives library contains lots of useful image processing functions, but of course they are all performed on the CPU.