OpenCL Image Rotate/Scale/Translate, Affine Transform, ...?

Does anybody have sample source code for an OpenCL program for efficiently doing an affine transform to rotate, scale and translate a 2D array/image?

I want to see if I can write a GPU-accelerated gpuRot() function to use in place of the IDL rot() function for 2D FITS images, and maybe a few 3D FITS images. The pixel types for these FITS images vary: 8-bit ints, 16-bit ints, 32-bit ints, 32-bit floats, and 64-bit floats.
And since it is for scientific analysis, I want to preserve the resolution.

I’m sure that I could write something for this in OpenGL (define frame buffer object, set up texture, set up xform matrix, render, read frame buffer object), but am worried about data value integrity and am curious to see how it could/should be done efficiently in OpenCL.

Sample OpenCL code?
Sample OpenGL code (for doing it with a frame buffer object and handling the different data/texture types)?

I can’t give you a thorough or tested sample, but I can suggest the “proper” way to develop the algorithm using a transformation matrix.

The parameters would be an input image and an output image, as well as a single transformation matrix (rotationMatrix * scaleMatrix * etc…)

__constant sampler_t sampler =	CLK_NORMALIZED_COORDS_FALSE |
						CLK_ADDRESS_CLAMP         |
						CLK_FILTER_NEAREST;

__kernel void
post_process( 	__read_only image2d_t inputImage, 
			__write_only image2d_t transformedImage,
			__global const float3* transformationMatrix )
{
	// One work item per input pixel.
	int2 coords = (int2)( (int)get_global_id(0), (int)get_global_id(1) );
	uint4 inputPixel = read_imageui( inputImage, sampler, coords );

	// Rows 0 and 1 of the matrix produce the transformed x and y.
	float2 p = convert_float2( coords );
	float3 r0 = transformationMatrix[0];
	float3 r1 = transformationMatrix[1];
	int2 transCoords = convert_int2_rte( (float2)( p.x*r0.x + p.y*r0.y + r0.z,
	                                               p.x*r1.x + p.y*r1.y + r1.z ) );

	// Writing outside the image is undefined, so skip those pixels.
	int2 dim = get_image_dim( transformedImage );
	if( transCoords.x >= 0 && transCoords.x < dim.x &&
	    transCoords.y >= 0 && transCoords.y < dim.y )
		write_imageui( transformedImage, transCoords, inputPixel );
}

Note that in order to transform a 2D coordinate, you need a 3x3 matrix, as well as a third homogeneous value of 1. With the row layout used in the kernel above, that is:

|tx0 ty0 tz0|   |x|
|tx1 ty1 tz1| * |y|
|tx2 ty2 tz2|   |1|

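You can create this transformation matrix using OpenGL functions or write custom functions (there is plenty of documentation on transformation matrices online), but you may have to create separate kernels for the different data types you mentioned: read_imageui only handles unsigned-integer image formats, while read_imagei and read_imagef cover signed-integer and floating-point formats.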

As an aside: float3x3 is a reserved data type but is still unrecognized. Does this mean it WILL be implemented? Is there a better way to use a matrix right now than a pointer to float3?

Furthermore, this matrix operation could really benefit from intrinsic matrix types on the GPU, so if those become available I would expect faster performance.

Thanks for the reply.
It is unfortunate that OpenCL exposes no functions for matrix math, since hardware acceleration is what I am mostly interested in. I would have expected this operation to be heavily optimized, given the huge number of calls it must receive in OpenGL.
I was thinking I would go the opposite direction and use the inverse matrix on the destination/transformed pixel location to compute the source pixel position and then copy.
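A minimal CPU sketch of that inverse-mapping idea in plain C, assuming a row-major affine matrix whose last row is 0 0 1. The Mat3 type and the names affine_invert and warp_nearest are made up for this example. Each iteration of the inner loop is what a single work item would do in a kernel, and because every destination pixel is written exactly once there are no write conflicts and no holes.

```c
/* Row-major 3x3 matrix, m[row][col]. Hypothetical helper type for this sketch. */
typedef struct { float m[3][3]; } Mat3;

/* Invert an affine matrix (last row assumed to be 0 0 1).
   Returns 0 if the 2x2 linear part is singular, 1 on success. */
int affine_invert(const Mat3 *fwd, Mat3 *inv)
{
    float a = fwd->m[0][0], b = fwd->m[0][1], tx = fwd->m[0][2];
    float c = fwd->m[1][0], d = fwd->m[1][1], ty = fwd->m[1][2];
    float det = a * d - b * c;
    if (det == 0.0f)
        return 0;
    float id = 1.0f / det;
    inv->m[0][0] =  d * id;  inv->m[0][1] = -b * id;
    inv->m[1][0] = -c * id;  inv->m[1][1] =  a * id;
    /* The inverse translation is -(A^-1 * t). */
    inv->m[0][2] = -(inv->m[0][0] * tx + inv->m[0][1] * ty);
    inv->m[1][2] = -(inv->m[1][0] * tx + inv->m[1][1] * ty);
    inv->m[2][0] = 0.0f; inv->m[2][1] = 0.0f; inv->m[2][2] = 1.0f;
    return 1;
}

/* For every destination pixel, find the source position with the inverse
   matrix and copy the nearest source pixel. Pixels that map outside the
   source receive `fill`. Values are copied, never blended, so nearest
   neighbor preserves the original data values exactly. */
void warp_nearest(const float *src, float *dst, int w, int h,
                  const Mat3 *inv, float fill)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float sx = inv->m[0][0] * x + inv->m[0][1] * y + inv->m[0][2];
            float sy = inv->m[1][0] * x + inv->m[1][1] * y + inv->m[1][2];
            float out = fill;
            if (sx >= -0.5f && sy >= -0.5f) {
                int ix = (int)(sx + 0.5f);   /* round to nearest */
                int iy = (int)(sy + 0.5f);
                if (ix < w && iy < h)
                    out = src[iy * w + ix];
            }
            dst[y * w + x] = out;
        }
    }
}
```

The float pixel type is only a placeholder here. The other FITS types would need their own variants, much as the kernels would.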

I’m not worried about building the matrices, …
I am more worried about preserving pixel/value integrity.

I’ll have a go at coding something up and see how it goes…

As far as resolution preservation goes, your biggest problem will be the transformed pixel coordinates. Suppose the point (0,0) becomes (0.23,5.6) after the transformation: you are then no longer writing to a pixel in the output image but to a floating-point 2D position between pixels. In that case you could use an “anti-aliasing” kind of method, where you write a weighted average to a neighborhood of, say, 9 pixels. The problem is that a naive implementation will have read/write conflicts: many work items write to the same pixels, so how do you blend rather than overwrite?

You could also up-sample the output image and play with methods of avoiding overwrites and later down-sampling to an appropriate resolution.

There are many other ways to approach this issue, just playing with some ideas, hope it helps!

rot() in IDL uses cubic interpolation.
I thought I would do some sort of weighted neighbor averaging.
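The simplest form of weighted neighbor averaging is bilinear interpolation over the four pixels surrounding the source position. The sketch below is plain C with a made-up function name (sample_bilinear) and an assumed clamp-to-edge policy. It accumulates in double, which can represent every 32-bit integer exactly, so the integer FITS types do not lose precision before the final conversion back. Cubic convolution follows the same pattern with a 4x4 neighborhood and different weights.

```c
/* Sample a row-major w x h image at the fractional position (sx, sy)
   by weighting the four surrounding pixels. Positions outside the image
   are clamped to the nearest edge (an assumption for this sketch). */
double sample_bilinear(const double *src, int w, int h, double sx, double sy)
{
    if (sx < 0.0) sx = 0.0;
    if (sy < 0.0) sy = 0.0;
    if (sx > w - 1) sx = w - 1;
    if (sy > h - 1) sy = h - 1;

    int x0 = (int)sx;                      /* top-left neighbor */
    int y0 = (int)sy;
    int x1 = (x0 + 1 < w) ? x0 + 1 : x0;   /* stay inside on the last column */
    int y1 = (y0 + 1 < h) ? y0 + 1 : y0;   /* stay inside on the last row */

    double fx = sx - x0;                   /* fractional offsets in [0, 1) */
    double fy = sy - y0;

    double top = src[y0 * w + x0] * (1.0 - fx) + src[y0 * w + x1] * fx;
    double bot = src[y1 * w + x0] * (1.0 - fx) + src[y1 * w + x1] * fx;
    return top * (1.0 - fy) + bot * fy;
}
```

When the source position lands exactly on a pixel center, both fractional offsets are zero and the original value comes back unchanged. Any interpolated position is necessarily a blend, which is the trade-off against nearest-neighbor copying.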

The hardware math acceleration comes in the form of SIMD vector operations which are exposed as the vector types in OpenCL C (e.g. float4) and many built-in math functions and operators on those. You can build very efficient matrix math from these.

I guess I was thinking that such a heavily used operation (matrix coordinate transformations) might have its own optimized primitive rather than being unrolled into a series of other primitive operations.