How can I efficiently average vectors?

As a first exercise in OpenCL, I tried to write an efficient vector averaging kernel.
For reference, the CPU version below takes 0.06 seconds for 5 vectors and 2 million elements:

	float* result = new float[totalNum];

	for (int i = 0; i < totalNum; i++)
	{
		float average = 0;
		for (int n = 0; n < num; n++)
		{
			average += vectors[n][i];
		}
		average /= num;
		result[i] = average;
	}

	return result;

Afterwards I wrote this kernel:

__kernel void avg_vector(__global const float* input, int num, __global float* output, int vectorSize)
{
    // One work-item per element index.
    int idx = get_global_id(0);

    float result = 0;

    // input holds the num vectors back to back, each vectorSize elements long.
    for(int i = 0; i < num; i++)
    {
        result += input[(i * vectorSize) + idx];
    }

    output[idx] = result / num;
}

but it takes 1 second to run.
Am I doing something wrong?

Your typical GPU has hundreds to thousands of cores. To see a speedup, you need an algorithm that uses those cores efficiently, which ideally means every core processes its own data independently of all the others. Summing values is not an ideal fit out of the box, because to add the n-th element to the sum you need the sum of the previous n-1 elements.
There are ways to make this type of computation fit the GPU execution model better; the general term for it is a ‘reduction’ or ‘reduce’ operation. There is an example in the OpenCL SDK.
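
To illustrate the idea, here is a minimal sketch of the pairwise (tree) pattern that a reduction is usually built on, emulated on the CPU in plain C++. It is not the OpenCL SDK example, and the function names tree_reduce_sum and tree_reduce_average are made up for this post. It assumes the input can be padded with zeros to a power-of-two length. Each pass of the outer loop stands for one barrier-separated step on the GPU, and each iteration of the inner loop stands for one work-item, so the additions within a pass do not depend on each other.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: sums the values with a pairwise (tree) pattern.
// Takes the vector by value because the reduction overwrites its working copy.
float tree_reduce_sum(std::vector<float> data)
{
    if (data.empty())
    {
        return 0.0f;
    }

    // Assumption: pad with zeros up to the next power of two so that
    // the range can be halved cleanly on every pass.
    std::size_t n = 1;
    while (n < data.size())
    {
        n *= 2;
    }
    data.resize(n, 0.0f);

    // Each pass halves the number of active elements.
    // On a GPU, a barrier would separate one pass from the next.
    for (std::size_t stride = n / 2; stride > 0; stride /= 2)
    {
        // Each i plays the role of one work-item; the iterations of
        // this inner loop are independent of each other.
        for (std::size_t i = 0; i < stride; ++i)
        {
            data[i] += data[i + stride];
        }
    }

    return data[0];
}

// Hypothetical helper: average built on top of the tree sum.
float tree_reduce_average(const std::vector<float>& data)
{
    if (data.empty())
    {
        return 0.0f;
    }
    return tree_reduce_sum(data) / static_cast<float>(data.size());
}
```

With n values this takes about log2(n) passes instead of n dependent additions, which is what lets many cores work at once. In a real kernel each work-group would typically do this in local memory and write one partial sum, with a second step combining the partial sums.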
