# kd-tree nearest neighbour

Hello again, I’m using OpenCL to find the nearest neighbour between two set of 3D points.

This time I’m using kd-tree for the model.

First I build the kd-tree and then I pass it to the GPU.

I’m representing the tree as an implicit data structure (array) so I don’t need to use pointer (left and right child) during the search on the kd-tree.

I’m using this properties to represent the tree as an array:

root: 0
left: index2+1
right: index
2+2 The main problem is that kd-tree nearest neighbour is a recursive algorithm, so I’ve converted it to a non-recursive form based on a stack.

Here my non-optimized kernel:

``````
float distanza(float4 a, float4 b)
{
float dx, dy, dz;

dx = a.x-b.x;
dx *= dx;

dy = a.y-b.y;
dy *= dy;

dz = a.z-b.z;
dz *= dz;

return dx+dy+dz;
}

bool isNull(float4 p)
{
if(p.w==-1)
return true;

return false;
}

int findLeaf(__global float4 *model, float4 qPoint, int model_size, int cur)
{
int best_id = -1;
int asse;

while(cur <= model_size)
{
asse = model[cur].w;

if(qPoint[asse] < model[cur][asse])
{
if(cur*2+1>=model_size || isNull(model[cur*2+1]))
{
if(cur*2+2>=model_size || isNull(model[cur*2+2]))
{
best_id = cur;
break;
}
else
{
cur = cur*2+2;
}
}
else
{
cur = cur*2+1;
}
}
else
{
if(cur*2+2>=model_size || isNull(model[cur*2+2]))
{
if(cur*2+1>=model_size || isNull(model[cur*2+1]))
{
best_id = cur;
break;
}
else
{
cur = cur*2+1;
}
}
else
{
cur = cur*2+2;
}
}
}

return best_id;
}

__kernel void
nearest_neighbour(__global float4 *model,
__global float4 *dataset,
__global int *nearest,
const int model_size)
{
int g_dataset_id = get_global_id(0);

float4 qPoint = dataset[g_dataset_id];

int stack; // 7 is enough for the number of points in my model
int top = 0;

stack[top] = -1;

int node = findLeaf(model, qPoint, model_size, 0);

int nn = node;

int lastChild, asse, otherSide;

float bestDist = distanza(qPoint, model[node]);

while(node != 0)
{
lastChild = node;

node = (node - 1) / 2;

if(stack[top] == node)
{
--top;
}
else
{
float parentDist = distanza(qPoint, model[node]);

if(parentDist < bestDist)
{
bestDist = parentDist;
nn = node;
}

asse = model[node].w;

float testDist = model[node][asse] - qPoint[asse];
testDist = testDist * testDist;

if(testDist < bestDist)
{
if (node*2+1 == lastChild)
{
otherSide = node*2+2;
}
else
{
otherSide = node*2+1;
}

if(otherSide < model_size && !isNull(model[otherSide]))
{
int testNode = findLeaf(model, qPoint, model_size, otherSide);

testDist = distanza(qPoint, model[testNode]);

if(testDist < bestDist)
{
bestDist = testDist;
nn = testNode;
}

++top;
stack[top] = node;

node = testNode;
}
}
}
}

nearest[g_dataset_id] = nn;
}

``````

The code return the correct nearest neighbour (except for few points but this is not a problem right now). The real problem is that is slower on the GPU then on the CPU.

Results (model 100.000 points and 100.000 query points): Also it crash on ATI 5850.

What can I do to make it faster? I was thinking about storing the model in local memory, but I don’t know if I can store 100.000 float4 on it, also I’m not sure if it will improve performance.

Any suggestion?

Thanks

Stefano

Stefano,

I’ll be completely honest here. It seems like you writing a Master’s Thesis and it feels wrong to answer your questions. If you had some specific problem where you were having problems with OpenCL I would be happy to help, but what it looks like is that you have not done much if any basic research and are expecting others to do the difficult parts of the thesis for you. That’s not the purpose of a Master’s Thesis.

All I’m going to say is: finding the nearest neighbour(s) of a set of points using a GPU has already been researched and there are published papers talking about it. The problems you are seeing today with GPU performance are rather simple and you will find the answer in either NVidia’s or AMD’s OpenCL programming guide.

Sorry to be so blunt but this post struck a nerve.

Alas, I do wish you had answered his question, as it would help me too! Stefano is the original author of the StackOverflow article whose kernel I have been using for my own edification.

While there’s tons of papers out there on the topic, there’s no specific “use this kernel with this input and get this output” the way there is with calculating inverse distance trees with Python: http://stackoverflow.com/questions/3104 … ith-python I mention this answer specifically because I am using the class mentioned here to do K-nearest-neighbor calculations for one of my projects, and while it works it’s just too slow – and I’m hoping something using PyOpenCL will do the trick.

Are there any implementations of KD trees and/or K-nearest-neighbor routines for OpenCL out there for folks to use? The closest I’ve found is http://www.i3s.unice.fr/~creative/KNN/ which is written in CUDA. I will try to port it to OpenCL but I know very very little about this stuff and I’d hate to reinvent the wheel poorly.

And I promise, there’s no master’s thesis coming out of my work. Jack.

The KNN work you’re citing is my PhD thesis’ work! I’m famous, youhou! I’ve spent days writing this code so I could help you porting this code in OpenCL.

What I can say about KNN is that the purpose of the kd-tree is in fact to make the algorithm absolutely not suitable for a GPU implementation (lot of random accesses, bank conflict, etc.). In fact, the point of my work was to show that when you think GPU, you have to change the way you think the solution to the problem. In CPU, kd-tree is nearly optimal, in GPU the exhaustive approach is way more optimal!

Some guys did write a CUDA implementation of the kd-tree but the result was not that fast and I think my implementation was faster in a general case. But of course it depends on your data.
For instance, in CPU, an exhaustive implementation is faster if you have 100 points in comparison to a kd-tree implementation.

So, mathuin, if you plan to code the KNN in OpenCL, I can really help you, at least to explain the different parts of my code (which should be quite clean but some parts are tricky).

Cheers,

Vincent Garcia

Thank you for responding, Vincent! I imagine if anyone could help, it would be you. My purpose for this code is to interpolate values from landcover data to help create superficially realistic Minecraft worlds. The USGS has files which have one datapoint every thirty meters. I want to interpolate to once every meter. Straight nearest neighbor is very ugly, but is supported by the GDAL tools that I am already using. Instead I am using a majority (mode) algorithm which provides much smoother results, but is much much slower. Another GDAL tool has the majority algorithm, but it’s not exactly something I can unplug from here and plug into there. One of my users suggested OpenCL. After all, most folks who generate Minecraft worlds have computers with nice video hardware, so maybe I could leverage that hardware to improve performance. So I started poking around and found your code! What little I could understand appeared to be size-limited to a total of 65536 and I couldn’t really tell why, so I started from scratch with that bit of code Stefano posted to StackOverflow.

The exhaustive approach was easy enough, and it can generate a 1024x1024 array of results from an identically-sized array in about five seconds. Increasing the size to 2048x2048 increases the time to about eighty-five seconds. Not exactly scalable. I am thinking of breaking the large (not unreasonable to see 8192x8192 total size) array into smaller five-second-sized pieces, but if you can suggest any tricks – or if you think your code can be adjusted to suit this sort of dataset, that would be grand.

Thank you again for responding, and for what you’ve written which has gotten me excited about making my video card pull its own weight. Jack.

Hi,

Don’t use the stack! Please check my paper implementing the octree traversal in OpenCL and IL for AMD GPUs. http://dx.doi.org/10.1016/j.jocs.2011.01.006

I put the kernel code on the paper. It is straitforward to modify it for searching the nearest neighbor. I’ve already modified the kernel for SPH and general neighbor search problems. I found it works on any OpenCL devices and actually the performance with a GPU is much better than that of a multi core CPU.

nn