Element-by-element multiplication of complex matrix

[b]I have a question about element-by-element multiplication of complex matrices larger than 8000x8000. On my GPU (Tesla C2075), a simple implementation takes approximately 200 ms, which is 60% of the total run time. The multiplication is part of a Fourier-based convolution for image filtering using clFFT.

If anyone knows an efficient method for this problem (element-by-element multiplication of complex matrices), please help me.[/b]

This is how I did it:

#define nx              get_global_id(0)
#define ny              get_global_id(1)
#define nz              get_global_id(2)
#define Nx              get_global_size(0)
#define Ny              get_global_size(1)
#define Nz              get_global_size(2)

// Complex product (a.x + i*a.y) * (b.x + i*b.y); x is the real part, y the imaginary part
float2 mul(float2 a, float2 b)
{
    return (float2)(mad(a.x, b.x, -a.y * b.y), mad(a.x, b.y, a.y * b.x));
}

__kernel void multiply_and_add(__global float2* fx,
                               __global float2* fy,
                               __global float2* fz,
                               __global float2* dataIn,
                               __global float2* dataOut)
{
    int N = Nx * Ny * Nz;                   // number of elements per field component
    int pos = nx + Nx * ny + Nx * Ny * nz;  // linear index of this work-item
    // dataIn holds the x, y and z components back to back, N elements each
    dataOut[pos] = mul(fx[pos], dataIn[pos]) + mul(fy[pos], dataIn[pos + N]) + mul(fz[pos], dataIn[pos + 2 * N]);
}

I work with 3D data (vector fields). In the last line of the kernel, each component (i.e. x, y and z) of the vector field stored in dataIn is multiplied by the corresponding factor taken from fx, fy or fz, and the three products are summed. You can easily simplify this to your problem.
The complex numbers are stored as the float2 data type, with the first component being the real part and the second the imaginary part.
If you don’t need the original data, you can store the result directly in the source array.
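
For checking the kernel output on small inputs, a CPU reference of the simplified case (one filter array, result written back into the source array) could look roughly like this. It is only a sketch in plain C: the names cfloat, cmul and cmul_inplace are made up for illustration and are not part of OpenCL or clFFT, and the struct is assumed to mirror the float2 layout (x = real, y = imaginary).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float2: x = real part, y = imaginary part. */
typedef struct { float x, y; } cfloat;

/* Complex product (a.x + i*a.y) * (b.x + i*b.y). */
static cfloat cmul(cfloat a, cfloat b)
{
    cfloat r;
    r.x = a.x * b.x - a.y * b.y;  /* real part */
    r.y = a.x * b.y + a.y * b.x;  /* imaginary part */
    return r;
}

/* In-place element-wise product: data[i] = data[i] * filter[i] for i in [0, n). */
static void cmul_inplace(cfloat *data, const cfloat *filter, size_t n)
{
    assert(data != NULL && filter != NULL);
    for (size_t i = 0; i < n; ++i)
        data[i] = cmul(data[i], filter[i]);
}
```

Each element is read once and written once, so the original value is fully consumed before it is overwritten. For an image of width W and height H, n would be W * H. Note that the kernel uses mad, so GPU results may differ from this reference in the last bits.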

Thanks for the answer, MaximS.