fastest wide-width box filter

What is currently the fastest box filter technique?
I’m looking to implement blurs that work well (fast)
for radii of 50 pixels and beyond.

I’m familiar with the separable 2-pass methods but
the number of texture samples the shader needs to
take (e.g. >50) is not feasible here.

I’ve also tried using multi-pass mip-mapping: using
LOD bias and auto mip-map generation, and doing it
several times. But the result doesn’t look good enough.

On the CPU, it is possible to have a linear time
box filter (independent of width) by a 2-pass
method, where a scanline ‘accumulator’ scans through
the X, and then the Y axis, adding a new sample from the
right and subtracting the old sample from the left. Is
it possible to implement this on the GPU? A naive
implementation would be to have 2 FBOs: one w x 1
and one 1 x h. That’d require rendering w * h quads
which I think is going to be too slow.
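
For reference, the CPU accumulator described above can be sketched roughly as follows. This is only a minimal illustration, not the code from the linked page: the function names and the clamp-to-edge border handling are assumptions made for the example.

```python
def box_blur_1d(row, radius):
    """Box-filter one scanline in O(len(row)) time, independent of radius.

    Border handling (clamping to the nearest valid sample) is an
    assumption of this sketch.
    """
    n = len(row)
    width = 2 * radius + 1

    def clamp(i):
        return min(max(i, 0), n - 1)

    # Prime the accumulator with the window centred on pixel 0.
    acc = sum(row[clamp(i)] for i in range(-radius, radius + 1))
    out = [0.0] * n
    for x in range(n):
        out[x] = acc / width
        # Slide the window one pixel: add the sample entering on the
        # right and subtract the sample leaving on the left.
        acc += row[clamp(x + radius + 1)] - row[clamp(x - radius)]
    return out


def box_blur_2d(image, radius):
    """Separable box blur: filter every row, then every column."""
    rows = [box_blur_1d(r, radius) for r in image]
    cols = [box_blur_1d(list(c), radius) for c in zip(*rows)]
    # Transpose back to row-major order.
    return [list(r) for r in zip(*cols)]
```

Each output pixel costs one add and one subtract per pass regardless of the radius. The catch for a GPU port is that every output depends on the accumulator value of its neighbour, which is a serial dependency along the scanline.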

For an example of the linear-time scanline accumulator method:
http://incubator.quasimondo.com/processing/superfast_blur.php

It’s not possible to do that, since you can’t be sure of the rendering order of each fragment.

I would say that the separable 2-pass one can be made to look great as well as being really fast, since you don’t have to sample every pixel; in fact, a more random and weighted pattern can give a better result.

If you don’t want to do two passes, then I would suggest a rotating random pattern sampling method. It’s pretty decent: it uses a different sample pattern for each neighboring pixel to get away with fewer samples overall. Even as few as one sample still gives a blurry (though noisy) image, and subsequent samples reduce the noise.
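
One possible reading of the rotating-pattern idea, sketched on the CPU for illustration only. The function name, the disk-shaped pattern, the per-pixel random rotation, and the clamp-to-edge lookup are all assumptions of this example rather than details given above, and the result is a noisy approximation, not an exact box filter.

```python
import math
import random


def rotated_pattern_blur(image, radius, samples=4, seed=0):
    """Blur by averaging a few samples from a pattern rotated per pixel."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    # One shared pattern of (distance, angle) offsets inside a disk.
    # sqrt() spreads the points evenly over the disk's area.
    pattern = [(radius * math.sqrt(rng.random()), rng.uniform(0.0, 2.0 * math.pi))
               for _ in range(samples)]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Each pixel rotates the shared pattern by its own angle, so
            # neighbouring pixels end up reading different texels.
            rot = rng.uniform(0.0, 2.0 * math.pi)
            total = 0.0
            for dist, ang in pattern:
                sx = int(round(x + dist * math.cos(ang + rot)))
                sy = int(round(y + dist * math.sin(ang + rot)))
                # Clamp lookups to the image border.
                sx = min(max(sx, 0), w - 1)
                sy = min(max(sy, 0), h - 1)
                total += image[sy][sx]
            out[y][x] = total / samples
    return out
```

Raising `samples` trades speed for less noise; the cost per pixel depends on the sample count, not on the radius.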

Offhand, I don’t see any way to do a fast implementation of that algorithm in OpenGL.

A separable convolution that also exploits bilinear filtering would actually only require radius/2 samples total per pixel (hence radius/4 in the “shader”). For very large radii you may consider multiple passes. When also using bilinear filtering, additional passes will be cheaper and reduce the number of samples exponentially. The nice thing is that you can still do at least approximate Gaussian (or whatever) blurs.
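
To illustrate the bilinear trick in one dimension: a linearly filtered fetch placed midway between two texels returns their average, so an equal-weight window of 2n texels can be covered by n fetches. The sketch below emulates this on the CPU; `fetch_linear` and `box_average_paired` are hypothetical names, and texel-space coordinates with clamp-to-edge addressing are assumed.

```python
def fetch_linear(row, u):
    """Emulate a linearly filtered 1D texture fetch.

    Texel i is centred at u == i (texel-space coordinates, an assumption
    of this sketch). Requires len(row) >= 2.
    """
    u = min(max(u, 0.0), len(row) - 1.0)
    i = min(int(u), len(row) - 2)
    f = u - i
    return row[i] * (1.0 - f) + row[i + 1] * f


def box_average_paired(row, start, pairs):
    """Average of 2*pairs texels from `start`, using only `pairs` fetches."""
    total = 0.0
    for k in range(pairs):
        # A fetch midway between texels start+2k and start+2k+1
        # returns the mean of those two texels.
        total += fetch_linear(row, start + 2 * k + 0.5)
    return total / pairs
```

For example, `box_average_paired(row, 2, 3)` averages texels 2 through 7 with three fetches. An odd-width window needs one extra unpaired fetch at a texel centre, weighted at half of a paired fetch.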

You can also improve texture cache efficiency by splitting your rendering into several smaller quads. The optimal dimensions (x and y) will depend on your filter, so you should probably implement a brute-force test to identify them.

I think what you’re describing is stochastic sampling.
The blur radius is controlled by how far apart the texels
are. Unfortunately what I need is an accurate box filter
so this method doesn’t work for me in this case. But thanks
for the tip!

Apparently there was a presentation at GDC03 that
described this N + M pass method, which, as you said,
is not fast at all. Here’s the link:
http://developer.nvidia.com/docs/IO/8230/GDC2003_SummedAreaTables.pdf

I also found a paper from Eurographics 2005 that
does Summed Area Tables for linear time box filtering:
http://www.shaderwrangler.com/publications/sat/

I’ve actually implemented it and it looks decent.
The time taken is independent of the blur radius,
which is great. Unfortunately it requires 32f FBOs,
which take up a lot of bandwidth on most GPUs.
A 640x480 image takes up to 40ms per frame on my
8600M.
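
For anyone following along, the summed-area-table lookup itself can be sketched on the CPU like this. It is a simplified illustration, not the paper's GPU construction; the function names, the extra zero row and column, and cropping the window at the image border are assumptions of the example.

```python
def build_sat(image):
    """Summed area table padded with a zero row and column.

    sat[y][x] holds the sum of all pixels image[j][i] with j < y and i < x.
    """
    h, w = len(image), len(image[0])
    sat = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += image[y][x]
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum
    return sat


def box_filter_sat(image, radius):
    """Box filter using four table lookups per pixel, for any radius."""
    h, w = len(image), len(image[0])
    sat = build_sat(image)
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        # Crop the window to the image (an assumption of this sketch).
        y0, y1 = max(y - radius, 0), min(y + radius + 1, h)
        for x in range(w):
            x0, x1 = max(x - radius, 0), min(x + radius + 1, w)
            total = sat[y1][x1] - sat[y0][x1] - sat[y1][x0] + sat[y0][x0]
            out[y][x] = total / ((y1 - y0) * (x1 - x0))
    return out
```

The table entries grow with the image area, which is why a high-precision format is needed: the bottom-right entry holds the sum of every pixel in the image.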

Thanks for the tips on filtering and cache coherency.
These are great secondary techniques, but I was first
checking to see if there are any fast primary blur
techniques.