Buffer vs Image


I am working on image-processing operations with filters that are about 1/4 to 1/6 of the image size. I have just tested my code with image2d and buffer objects, and I noticed that buffers are twice as fast as images in my case (1.2k² image, 200² filter). Do any of you know whether I would benefit from reading from images and writing to buffers? Or is image writing fast enough?


My expectation is that you would benefit from using images:
- Pixel coordinates normally have to be converted from a 2D coordinate to a linear address in the buffer. With texture lookups, you avoid that conversion.
- Type casts from your image format to 32-bit float are free with texture lookups (not an issue if your pixels are already float). The reverse is also true: the cast from floating-point values to your output image format is free with texture writes.
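To illustrate the first point: the conversion a buffer kernel has to do per access is a row-major multiply-add, which a texture lookup performs for you. A minimal sketch of that arithmetic (Python for illustration, not OpenCL; the function name is hypothetical):

```python
# Illustrative sketch: the per-pixel address arithmetic a buffer kernel
# performs, which an image (texture) lookup does in hardware.

def linear_index(x, y, width):
    """Row-major 2D coordinate -> linear buffer index (one mul + one add)."""
    return y * width + x

# A 4x3 image stored row-major in a flat buffer:
width, height = 4, 3
buffer = list(range(width * height))  # pixel value == its linear index

# Reading pixel (x=2, y=1) from the flat buffer:
assert buffer[linear_index(2, 1, width)] == 1 * 4 + 2  # index 6
```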
I am a little surprised that you found using buffers to be twice as fast as using images. What data type are you using for your pixels?

As a side note, convolving images with a 200x200 filter is going to be slow. While I haven’t used it myself, I know that if you convert your filter and image to the frequency domain (2D FFT), multiply the resulting matrices element-wise, and then apply the inverse FFT, you have computed the convolution of the image and the filter. For large filters, this can be faster.
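A minimal numpy sketch of that FFT route (illustrative only, not OpenCL; sizes are arbitrary). Zero-padding both operands to the full output size gives linear rather than circular convolution, and the cost becomes O(N log N) instead of O(N·K) per pixel:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64))
kernel = rng.random((9, 9))

# "Full" linear convolution output size:
out_h = image.shape[0] + kernel.shape[0] - 1
out_w = image.shape[1] + kernel.shape[1] - 1

# FFT-based convolution: zero-pad both to the output size, multiply the
# spectra, inverse-transform.
F_image = np.fft.rfft2(image, s=(out_h, out_w))
F_kernel = np.fft.rfft2(kernel, s=(out_h, out_w))
fft_conv = np.fft.irfft2(F_image * F_kernel, s=(out_h, out_w))

# Cross-check one output pixel against the direct convolution definition:
def direct_pixel(img, ker, oy, ox):
    acc = 0.0
    for ky in range(ker.shape[0]):
        for kx in range(ker.shape[1]):
            iy, ix = oy - ky, ox - kx
            if 0 <= iy < img.shape[0] and 0 <= ix < img.shape[1]:
                acc += img[iy, ix] * ker[ky, kx]
    return acc

assert np.isclose(fft_conv[20, 30], direct_pixel(image, kernel, 20, 30))
```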

If you have regular grid access that can be fully coalesced at every access, and the data is stored as floats (requiring no conversion), and you don’t need the image features (interpolation, etc), using arrays can be significantly faster.

Images are best for image data used in an image way. i.e. 8-bit-per-channel type stuff.

But for the filter sizes you’re talking about, I would think using an FFT would be measurably quicker as chippies suggests.

Hello and thanks for your answers.

I don’t think I can use an FFT for my convolution operation, because it isn’t a normal image convolution but a maxima search. So I don’t think I would profit from that.

My pixel type is float32 because I need full float accuracy (maybe even more) for my calculations. So would there be a speedup in the lookup of image data if I use normalized texture coordinates?


@notzed: I have been working under the impression that using images and texture lookups would be faster, even when working with single-channel data, because it avoids the maths needed to transform get_global_id(0) and get_global_id(1) into a linear index. My data is normally 2D greyscale images.

Can you explain why arrays are faster under the conditions you listed? I would like to correct my misconceptions:

Coalesced memory is read in 128-byte chunks. If your data access is correctly aligned, there is only one read operation from global memory for the whole warp to get its data. Even if this memory access is expensive, the whole warp is only paused once until the access is granted. This can give you a speedup in some cases. But shouldn’t it be the same for image data, or is image data not read in 128-byte chunks as well?
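The coalescing arithmetic here can be sketched numerically. This is an illustration only, using NVIDIA-style assumptions (32-thread warps, 4-byte floats, 128-byte memory transactions); the helper function is hypothetical:

```python
# Illustrative: how many 128-byte transactions one warp's accesses need.
WARP_SIZE = 32
TRANSACTION_BYTES = 128
FLOAT_BYTES = 4

def transactions_needed(thread_addresses):
    """Count the distinct 128-byte segments touched by a warp's byte addresses."""
    segments = {addr // TRANSACTION_BYTES for addr in thread_addresses}
    return len(segments)

# Fully coalesced: 32 consecutive, aligned 4-byte floats = 128 bytes
# = exactly one transaction for the whole warp.
coalesced = [i * FLOAT_BYTES for i in range(WARP_SIZE)]
assert transactions_needed(coalesced) == 1

# Strided access (every 2nd float) spans 256 bytes = two transactions.
strided = [i * 2 * FLOAT_BYTES for i in range(WARP_SIZE)]
assert transactions_needed(strided) == 2
```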

Note I specifically said: “If you have regular grid access that can be fully coalesced at every access”

In the worst case, a single multiply is cheap compared to a non-fully-coalesced memory read, so unless you’re doing a lot of address calculations (compared to the amount of real work) they will not be a problem. A typical cache-miss memory fetch might cost hundreds of cycles; a multiply might cost 6.

“Coalesced” texture reads are harder to judge - none of the vendors document the internal format they use for 2d textures, and therefore you cannot write a kernel with guaranteed-best-access pattern. All we know is that it is optimised for localised 2d access - i.e. texture lookups.

Apart from that, when I last worked on NVIDIA hardware (a GTX 480), I saw that the ISA had special-case code for each possible image format - implemented by reading a structure on the GPU at run time and going through a case statement. This blew me away at the time, but it seems to be a consequence of being able to pass any image type to a kernel, and of not putting the format into the sampler, which would allow compile-time optimisations. From memory, AMD cards seem to put more of this into the hardware, but it’s been a while since I looked that closely at ISA dumps.

I’ve also seen code on some cards to do things like normalise coordinates (i.e. to 0-1), as some texture fetch units only work with those - which requires some multiplies anyway.

So, although we all know the hardware is there, the way “read_imagef()” is converted into code may not be as simple or as cheap as we would all like.

I tend to use images for byte-oriented data because of the ‘free’ float conversion (and byte access of arrays isn’t very efficient to start with), and it’s just easier to write some algorithms that way. But if I am working in floats I tend to just use arrays (an array access is also easier to write than calling a function that always returns float4). Often, even if you do have 2D data, you can just process it as a 1D array, and that can be both simpler and more efficient (no 2D address calculations, only one ‘end’ case rather than one per line, the ability to use persistent kernels easily, etc.).
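The "process 2D data as a 1D array" point can be sketched like this (Python standing in for what would be a 1D NDRange kernel in OpenCL; names are illustrative):

```python
# Illustrative sketch: treating a 2D greyscale image as one flat array for
# a pointwise operation. No per-pixel 2D address calculation is needed,
# and there is a single "end" bounds check instead of one per row.

def scale_pixels_1d(pixels, total, factor):
    """Pointwise op over a flat buffer; 'total' is the only edge case."""
    for gid in range(total):      # in OpenCL: gid = get_global_id(0)
        if gid < total:           # single bounds check for the whole image
            pixels[gid] *= factor

width, height = 5, 3
image = [float(i) for i in range(width * height)]
scale_pixels_1d(image, width * height, 2.0)
assert image[7] == 14.0  # pixel (x=2, y=1) in row-major order
```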

In the past I did a lot of benchmarking and almost exhaustive testing of variations of various algorithms, but now I usually don’t bother, as I have a good idea of what will work fairly well based on all that previous work. Together with more complex code and better hardware and drivers, one tends to just choose the easiest approach to the problem at hand - images for image data, arrays for float data - and only look more closely if the performance is well outside expectations.


Thanks for the explanation, notzed. I was not aware that texture lookups were sometimes implemented as you described. I had assumed (incorrectly, making an ass out of me) that NVIDIA and AMD would have built-in instructions in their GPUs to handle all of that. Thanks for correcting me.