As a first exercise in OpenCL, I tried to write an efficient vector-averaging kernel.

For reference, the CPU version takes 0.06 seconds for 5 vectors and 2 million elements:

```
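// vectors: num input arrays, each holding totalNum floats
float* result = new float[totalNum];
for (int i = 0; i < totalNum; i++)
{
    // Sum element i across all input vectors, then divide by their count
    float average = 0;
    for (int n = 0; n < num; n++)
    {
        average += vectors[n][i];
    }
    average /= num;
    result[i] = average;
}
return result;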
```
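In case it matters what the 0.06 seconds covers, here is a minimal, self-contained sketch of how a CPU figure like this could be measured. It is not the exact harness behind the number above: `average_vectors` and `time_average_seconds` are hypothetical names, the inputs are assumed to be `std::vector`s of equal length, and the timing uses `std::chrono::steady_clock` around the loop only (no allocation of the inputs, no I/O).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: element-wise average of equally sized vectors.
std::vector<float> average_vectors(const std::vector<std::vector<float>>& vectors)
{
    assert(!vectors.empty());
    const std::size_t num = vectors.size();
    const std::size_t totalNum = vectors[0].size();

    std::vector<float> result(totalNum);
    for (std::size_t i = 0; i < totalNum; i++)
    {
        float average = 0;
        for (std::size_t n = 0; n < num; n++)
        {
            assert(vectors[n].size() == totalNum); // all inputs must match in length
            average += vectors[n][i];
        }
        result[i] = average / num;
    }
    return result;
}

// Hypothetical helper: wall-clock seconds spent in one call to average_vectors.
double time_average_seconds(const std::vector<std::vector<float>>& vectors)
{
    const auto start = std::chrono::steady_clock::now();
    const std::vector<float> result = average_vectors(vectors);
    const auto stop = std::chrono::steady_clock::now();

    // Read the result so the computation is observably used.
    volatile float sink = result.empty() ? 0.0f : result[0];
    (void)sink;

    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_average_seconds` on 5 vectors of the size in question gives a number that covers only the averaging loop, which is worth keeping in mind when comparing it against a GPU measurement that may include more than the kernel itself.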

Afterwards I wrote this kernel:

```
// input holds num vectors of vectorSize floats each, stored back to back
__kernel void avg_vector(__global const float* input, int num, __global float* output, int vectorSize)
{
    // One work-item per output element
    int idx = get_global_id(0);
    float result = 0;
    for(int i = 0; i < num; i++)
    {
        result += input[(i * vectorSize) + idx];
    }
    output[idx] = result / num;
}
```
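The kernel reads `input[(i * vectorSize) + idx]`, so it expects all vectors packed into one contiguous buffer, vector after vector. The sketch below shows that layout on the host side and emulates the kernel on the CPU to check the indexing. `flatten_vectors` and `emulate_avg_vector` are hypothetical helper names introduced for illustration; no OpenCL calls are involved, and each iteration of the outer loop simply stands in for one work-item.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: pack the vectors back to back, so that element idx of
// vector i ends up at flat[i * vectorSize + idx], as the kernel assumes.
std::vector<float> flatten_vectors(const std::vector<std::vector<float>>& vectors)
{
    assert(!vectors.empty());
    const std::size_t vectorSize = vectors[0].size();

    std::vector<float> flat;
    flat.reserve(vectors.size() * vectorSize);
    for (const std::vector<float>& v : vectors)
    {
        assert(v.size() == vectorSize); // all inputs must match in length
        flat.insert(flat.end(), v.begin(), v.end());
    }
    return flat;
}

// CPU emulation of avg_vector: the idx loop plays the role of get_global_id(0).
std::vector<float> emulate_avg_vector(const std::vector<float>& input, int num, int vectorSize)
{
    assert(num > 0 && vectorSize >= 0);
    assert(input.size() == static_cast<std::size_t>(num) * vectorSize);

    std::vector<float> output(vectorSize);
    for (int idx = 0; idx < vectorSize; idx++)
    {
        float result = 0;
        for (int i = 0; i < num; i++)
        {
            result += input[static_cast<std::size_t>(i) * vectorSize + idx];
        }
        output[idx] = result / num;
    }
    return output;
}
```

With this layout, neighbouring values of `idx` read neighbouring floats within each vector, and the emulated output can be compared element by element against what the device returns.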

However, the kernel takes 1 second to run.

Am I doing anything wrong?