I recently started to use OpenCL to develop software. As a first program, I took a simple source code example from a book and altered it. It compiles without problems or warnings, but I still do not understand the result.

The program’s kernel should do the following: It gets an array with, for example, 128 elements and should compute the mean and standard deviation for every 64 elements and store the standard deviation in the output array in a certain position. So when computing an array with 128 elements, there should be two standard deviations in the output array.
Unfortunately, when I compile and execute the program there are four values in my output array and I do not understand why.

The globalWorkSize is 128 and the localWorkSize is 64, so the complete array with 128 elements is divided into two workgroups with 64 work-items each, right?

You say that you are expecting 2 workgroups with 64 work-items in each, and that you
pass in an array of 128 numbers. You then get the global id of each work-item
int gid = get_global_id(0);

which will have a range of 0 - 127 and you then use it to generate an array offset

const uint start = gid * 64; // has range 0 - 8128

which you then use like

sum = sum + temp[start + i];
sumPow = sumPow + temp[start + i] * temp[start + i];

so is temp really that large? You don’t say how large it is, but I assume it is only the same size as
src, which I took to be 128.

Also, why did you do this
float mean[1] = {0.0}, stdDev[1] = {0.0};
rather than simply
float mean = 0.0, stdDev = 0.0;
?

Hello Jasno,
thank you for your reply! The kernel I posted first is terribly confusing, and I have altered it a lot in the last two days. I had a lot to do last week, so I was not able to check the thread for replies.

This is the current version of the kernel:

__kernel void hello_kernel(__global const float *src,
                           __global float *sigma)
{
    float sum = 0.0;
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);
    int group_id = get_group_id(0);
    int local_size = get_local_size(0);
    int global_offset = get_global_offset(0);

    // Copy 64 elements from src into local memory
    __local float mem[64];
    mem[local_id] = src[global_id];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Calculate the sum of mem and write it into the global array sigma
    for (int i = 0; i < local_size; i++)
    {
        sum = sum + mem[group_id * local_size + i + global_offset];
        sigma[(int)sum] = sum;
    }
}

The problem is that the program calculates the sum of the first 64 elements correctly but the second 64 elements are not summed up or at least the sum is not stored in sigma. I am sure that sigma’s index is wrong, but I have no idea how to correct it.

I’m not clear on what you are trying to achieve, but I can tell you why the first 64 values
appear to work (assuming your workgroup size is 64).

you have

sum = sum + mem[group_id * local_size + i + global_offset];

where for workgroup1 “group_id = 0”, so “group_id * local_size = 0” and “global_offset = 0”;
round the loop, the only thing having any effect is the increase of i from 0 to 63.

In workgroup2, “group_id = 1”, so “group_id * local_size = 64” and the index into the mem array
goes from 64 to 127, BUT you only allocated mem to be of size 64, so you are reading random values
from memory to form your sum.

Now to the real problem you have. The kernel is executed by every thread in your workgroup. You said that you launched with 64 threads per workgroup, so this

__local float mem[64];

mem[local_id] = src[global_id];

barrier(CLK_LOCAL_MEM_FENCE);

is all OK, each thread is copying one value from global memory to local memory (into a different element of the mem array). However, you then go on to do

for(int i = 0; i < local_size; i++)
{
sum = sum + mem[group_id * local_size + i + global_offset];
sigma[(int)sum] = sum;
}

Here, EVERY thread is doing exactly the same calculation, so you have no parallelism happening. They
all then write the same value to the same location in sigma. While you get the correct answer, you are not going to get any real performance gain, since you are effectively doing the work in serial.

The kernel should take the src array [size: 128 elements] and create two workgroups [size: 64 elements each]. Within each workgroup, all elements should be summed up and the result stored
in sigma. After the execution of the kernel, sigma should contain 2 results: the sum of workgroup1 and the sum of workgroup2.
I know this does not sound very useful, but it is part of a bigger algorithm which I am trying to implement in OpenCL. To get a simple start with OpenCL, I broke the “big problem” down into smaller “problems”.

I am already reading books about OpenCL and the OpenCL spec, but I nevertheless have a lot of problems with it. I guess one of my biggest problems is that I cannot use a debugger to analyze the kernel’s behaviour.

Is it possible to achieve parallelism with if-else statements or is there another, perhaps better, way?

So please know that I really, really appreciate your help!

You say “After the execution of the kernel sigma should contain 2 results, the sum of workgroup1 and workgroup2” and in your code you have

sigma[(int)sum] = sum;

But how do you know where the results have gone, or that the array sigma is large
enough, since you use the accumulated sum value as the index into the array? I
don’t know if there is some significance to this for your larger problem; if not, then you should probably have something like

sigma[group_id] = sum;

so that the sum from workgroup0 ends up at position 0 in the sigma array, the sum from workgroup1 ends up at position 1, and so on.

As to your more general question of how to do the summation in parallel, it’s a fairly standard problem and is referred to as reduction. Try googling for it in OpenCL; there are many references to it. Basically, the approach is: in a loop, get each thread to sum a pair of numbers and store the result in a local array. Each time round the loop, halve the number of threads adding pairs of numbers, until you have a single result; then put that sum into a known location in an output array (as above). You are then left with a set of partial sums (the number of sums being equal to the number of workgroups), which can be summed serially.

My kernel works now! After the execution, sigma contains the results of both work-groups (sigma[0] = sum of work_group1 and sigma[1] = sum of work_group2).

__kernel void hello_kernel(__global const float *src,
                           __global float *temp,
                           __global float *sigma)
{
    float sum = 0.0, sum2 = 0.0;
    int counter = 0;
    int global_id = get_global_id(0);
    int local_id = get_local_id(0);
    int group_id = get_group_id(0);
    int local_size = get_local_size(0);
    int global_size = get_global_size(0);
    int first_workitem_in_new_group = ((get_local_id(0) == 0) && group_id);

    __local float mem[128];
    mem[global_id] = src[global_id];
    barrier(CLK_LOCAL_MEM_FENCE);

    if (get_local_id(0) == 0)
    {
        if (first_workitem_in_new_group) counter = 1;
        for (int i = 0; i < local_size; i++)
        {
            sum = sum + mem[group_id * local_size + i];
            if (i == 63)
            {
                sigma[counter] = sum;
                sum = 0.0;
            }
        }
    }
}

Before I read your reply, I tried to parallelize the summation with “if”, but I am not really sure whether the summation is completed in parallel. I will google “reduction” later, but can you please tell me if my approach to parallelizing the summation works?

I’m afraid that you still aren’t going to have this running in parallel. Think about what your “if (get_local_id(0) == 0)” statement is achieving: it says that, for each work-item in the workgroup (of 64 work-items), if you are work-item 0 then come in here and do all of the computation; if not, sit idle until work-item 0 has finished, then exit the kernel. The only parallelism you are getting is the fact that you have 2 workgroups.