Problem with frequency algorithm

Hey guys,

I have a problem with an algorithm.

I have point lists with x- and y-coordinates. I calculate a result for each combination of points, with a lot of operations in each OpenCL work-item.
Then I want to count how often the different results occur (the frequency of each result).
At the moment the calculation of the results is done in a kernel, and the results are saved to a list.
In the C++ part I count the frequencies in an array…
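Roughly like this (a minimal sketch with illustrative names; “results” stands for the buffer read back from the device and “numBins” for the number of possible result values):

#include <vector>

// host side: count how often each result value occurs
std::vector<int> histogram(numBins, 0);
for (size_t i = 0; i < results.size(); ++i)
    histogram[results[i]]++;  // results[i] is assumed to be an index in [0, numBins)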

Is it possible to do this in OpenCL? I tried it, but I ran into problems. I think it can’t work because it isn’t synchronized.

Can anybody help me? I need information about this, or at least pointers to where I can find more.

Thanks!

Basically what you want to do is compute a histogram in parallel. I believe both NVIDIA and AMD include that in their lists of examples.

This is right. Thanks!

In the current version I merge the results in C++. So I have ~5,000,000 global threads (~3,000 points). My “histogram” memory has 200*200 bins.
Now I have to use local threads (work-groups).
What would be the best distribution?

I tried the following setup:

Kernel arguments:

// set the argument list for the kernel command
clSetKernelArg(kernel_houghaccu, 0, sizeof(cl_mem), &inputX);
clSetKernelArg(kernel_houghaccu, 1, sizeof(cl_mem), &inputY);
clSetKernelArg(kernel_houghaccu, 2, sizeof(cl_mem), &inputZ);
clSetKernelArg(kernel_houghaccu, 3, groupSize * 201*201 * sizeof(cl_int), NULL);
clSetKernelArg(kernel_houghaccu, 4, sizeof(cl_mem), &completeArray);

The kernel is started with:


size_t globalThreads = 1024 * 1024;
size_t localThreads = 256;

clEnqueueNDRangeKernel(command_queue, kernel_houghaccu, 1, NULL, &globalThreads,
		&localThreads, 0, NULL, NULL);

For a test I tried the following kernel code:

__kernel void houghaccu
		(__global float *inputX,
		 __global float *inputY,
		 __global float *inputZ,
		 __local  int *sharedArray,
		 __global int *completeArray)
{
	size_t id = get_global_id(0);
	size_t localId = get_local_id(0);
	size_t groupSize = get_local_size(0);

	/* initialize this work-item's slice of the shared array to 0 */
	for (int i = 0; i < 201*201; i++)
		sharedArray[localId * 201*201 + i] = 0;

	barrier(CLK_LOCAL_MEM_FENCE);

	/* fixed test value; the real result computation would go here */
	int value = 1;

	/* calculate per-work-item histograms */
	for (int i = 0; i < 201*201; ++i)
	{
		sharedArray[localId * 201*201 + value]++;
	}

	barrier(CLK_LOCAL_MEM_FENCE);

	/* merge all per-work-item histograms into the block histogram */
	for (int i = 0; i < 201*201; i++)
	{
		uint binCount = 0;
		for (int j = 0; j < groupSize; j++)
			binCount += sharedArray[j * 201*201 + i];

		completeArray[i] = binCount;
	}
}

I don’t understand what mistake I have made. completeArray[i] is always empty, but at one position it should be filled.

Did you get any error codes from any of the API calls you made, such as clEnqueueNDRangeKernel()? Did you pass a pfn_notify function to clCreateContext()?
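For example, something like this (a sketch; “device” and the callback name are placeholders, not from your code):

#include <stdio.h>
#include <CL/cl.h>

void CL_CALLBACK notify(const char *errinfo, const void *private_info,
                        size_t cb, void *user_data)
{
    fprintf(stderr, "OpenCL context error: %s\n", errinfo);
}

/* ... */
cl_int err;
cl_context context = clCreateContext(NULL, 1, &device, notify, NULL, &err);
/* check err here */

err = clEnqueueNDRangeKernel(command_queue, kernel_houghaccu, 1, NULL,
                             &globalThreads, &localThreads, 0, NULL, NULL);
if (err != CL_SUCCESS)
    fprintf(stderr, "clEnqueueNDRangeKernel failed with error %d\n", err);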

I doubt that this line of code will work:

clSetKernelArg(kernel_houghaccu, 3, groupSize * 201*201 * sizeof(cl_int), NULL);

It is requesting about 40 MB of local memory (256 work-items × 201 × 201 × 4 bytes ≈ 41 MB), and your hardware almost certainly cannot support that much.

I tried it now with a group size of 8.
Now about 1.25 MB is used (8 × 201 × 201 × 4 bytes).

This isn’t the “main” problem…

Could you answer the other questions about error codes, etc?

As for the usage of local memory, 1.25MB is still a lot more than what your hardware probably supports. You can query the amount of available local memory with clGetDeviceInfo(…, CL_DEVICE_LOCAL_MEM_SIZE, …).

As long as clEnqueueNDRangeKernel() requires more local memory than is available on your system the program will not work.
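You can query it like this (a sketch; “device” is your cl_device_id):

cl_ulong localMemSize = 0;
clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                sizeof(localMemSize), &localMemSize, NULL);
printf("local memory per compute unit: %llu bytes\n",
       (unsigned long long)localMemSize);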

Sorry, I mistook CL_DEVICE_MAX_MEM_ALLOC_SIZE for this value…

CL_DEVICE_LOCAL_MEM_SIZE is only 16 kB.

Now I have a problem.

In my old version (without groups) I merge the results in C++, which is slow. Is it possible to do it in one “histogram” if it is synchronized?

Now I tried counting the results in one “histogram” without synchronizing.

~53000 (of ~5000000) results were not counted in this version…

You can compute partial histograms within a work-group using barriers and finally add together the partial histograms using atomic operations like atom_add().
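Something along these lines (only a sketch, assuming the bin index for each result is precomputed into a global buffer and that the bin count fits into local memory; it needs both the global and local atomics extensions):

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable

#define NUM_BINS 201  /* must fit in local memory, so not 201*201 at once */

__kernel void histogram_atomics(__global const int *binIndex, const int n,
                                __local int *localHist,    /* NUM_BINS ints */
                                __global int *globalHist)  /* NUM_BINS ints, zeroed */
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    /* zero the per-group histogram */
    for (int i = lid; i < NUM_BINS; i += get_local_size(0))
        localHist[i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* each work-item counts its result into the shared histogram */
    if (gid < n)
        atom_inc(&localHist[binIndex[gid]]);
    barrier(CLK_LOCAL_MEM_FENCE);

    /* merge this group's partial histogram into the global one */
    for (int i = lid; i < NUM_BINS; i += get_local_size(0))
        atom_add(&globalHist[i], localHist[i]);
}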

Yes, I see the option with the atomic operations.

Work-groups won’t work because one instance of the memory needs ~160 kB (201 * 201 * sizeof(int))… Do you see another way with work-groups?

I will try to increment the histogram memory with atom_add(). But for that I need OpenCL 1.1, which I don’t have at the moment.

Atomics are also available on OpenCL 1.0 through an extension. All you need to do is add this at the beginning of your kernel source code:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable

Although perhaps you are saying that your device doesn’t support that extension? In that case you can compute partial histograms, store them in global memory and then launch one more NDRange kernel with a single work-group to add together the partial histograms.
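The final merge could look roughly like this (a sketch; “numPartials” is the number of work-groups from the first pass, and the buffer layout is an assumption for illustration):

#define NUM_BINS 201

/* launched with a single work-group, so no atomics are needed */
__kernel void merge_partials(__global const int *partials,  /* numPartials * NUM_BINS ints */
                             const int numPartials,
                             __global int *histogram)       /* NUM_BINS ints */
{
    for (int bin = get_local_id(0); bin < NUM_BINS; bin += get_local_size(0))
    {
        int sum = 0;
        for (int p = 0; p < numPartials; ++p)
            sum += partials[p * NUM_BINS + bin];
        histogram[bin] = sum;
    }
}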

Now I updated my system to OpenCL 1.1. I had to change some things. Now I get this error message when compiling the kernel:

Try to compile the program... Error: Failed to build program executable!
Error: Code selection failed to select: 0xf764e38: i32,ch = AtomicLoadAdd 0xf7641e8, 0xf76ce78, 0xf764d98 <0xeff5eb8:0> <volatile> alignment=4

This happens when I use:

atom_inc(&result[position]);

I got the following information about CL_DEVICE_EXTENSIONS from the NVIDIA OpenCL Device Query:
CL_DEVICE_EXTENSIONS: cl_khr_byte_addressable_store
cl_khr_icd
cl_khr_gl_sharing
cl_nv_d3d9_sharing
cl_nv_d3d10_sharing
cl_khr_d3d10_sharing
cl_nv_d3d11_sharing
cl_nv_compiler_options
cl_nv_device_attribute_query
cl_nv_pragma_unroll

So I can’t use atomic operations on my machine?!

The problem with the partial-histogram solution is that I need 201 * 201 * sizeof(int) bytes for each partial histogram. That can’t work because the maximum I can allocate for local variables is 16 kB…

The problem with the partial-histogram solution is that I need 201 * 201 * sizeof(int) bytes for each partial histogram.

You can do it in multiple phases. Instead of computing a histogram with 201x201 bins, you can first compute partial histograms with (for example) only 201 bins each and then refine each of them in another step. Think of it as a multiresolution approach. That’s how I would do it.
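A sketch of the first (coarse) pass, assuming each result is precomputed as a bin index in [0, 201*201): with only 201 bins per work-item, a group size of 16 needs 16 * 201 * 4 bytes ≈ 12.6 kB of local memory, which fits into your 16 kB. The per-group partials can then be summed by a single-work-group kernel as shown above, without atomics.

#define ROWS 201
#define COLS 201

__kernel void coarse_pass(__global const int *binIndex, const int n,
                          __local int *sharedRows,     /* groupSize * ROWS ints */
                          __global int *partialRows)   /* ROWS ints per group */
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t gsz = get_local_size(0);
    size_t grp = get_group_id(0);

    /* zero this work-item's slice of local memory */
    for (int i = 0; i < ROWS; i++)
        sharedRows[lid * ROWS + i] = 0;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* bin by the coarse "row" only: binIndex / COLS */
    if (gid < (size_t)n)
        sharedRows[lid * ROWS + binIndex[gid] / COLS]++;
    barrier(CLK_LOCAL_MEM_FENCE);

    /* merge the per-item rows into this group's partial histogram */
    for (int i = (int)lid; i < ROWS; i += (int)gsz)
    {
        int count = 0;
        for (size_t j = 0; j < gsz; j++)
            count += sharedRows[j * ROWS + i];
        partialRows[grp * ROWS + i] = count;
    }
}

The second pass then runs per coarse row (or for a few rows at a time), counting binIndex % COLS only for the results with binIndex / COLS == row.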

If I were you I would go to citeseer.ist.psu.edu and try to find out whether there are papers on that topic. It must be a well-researched area.