What performance increase can we expect for a problem, e.g. generating NCn combinations?


Firstly, I would like to say that I am new to this forum, and I would like to say “Hi” to any and everyone here! I hope everyone is having a good day, night, or whatever they are actually doing right now :).

Now, secondly, since I am new to this forum, I am not absolutely sure whether this is the right place for asking this; if it happens not to be, I would be very happy if you could let me know and, possibly, refer me to a forum that is the correct place for a question such as this one :).

And third, and finally, here comes my actual question, as I do THINK, at the moment, that this is the correct place for asking questions such as this one.


My team and I are currently working on an application that will hopefully, some day, be used for statistical analyses while, and after, creating statistical models, in order to develop, prototype, and then, hopefully, verify these models.

This, as you might imagine, involves a lot of floating-point operations, or FLOPs. As most of you probably know, anything that involves a lot of floating-point operations can take a lot of time once the number of these operations becomes VERY large. That is exactly my situation.

Specifically, I am working on a “sub-exercise”, if you will, of this application that generates a lot of combinations of numbers. As many of you may know, for a given set of numbers 1, 2, 3, …, N, the number of combinations of length n, where 1 <= n <= N, is expressed as N Choose n, or, sometimes, as NCn. This is the notation I am going to use from now on in this post. So, as I said, this part of the statistical application is concerned with generating ALL the NCn combinations for any N and n with the properties I just gave. This means that if my application is supposed to create all combinations in, as an example, 20C10, then it is generating a total of 184756 combinations from the set {1, 2, …, 20}.
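For concreteness, here is a minimal sketch in Python (illustrative only, not the poster’s actual code) that enumerates all NCn combinations with the standard library and cross-checks the count against N choose n:

```python
# Enumerate all n-combinations of {1, ..., N} and check the count
# against the binomial coefficient C(N, n).
from itertools import combinations
from math import comb

N, n = 20, 10
count = sum(1 for _ in combinations(range(1, N + 1), n))

print(count)        # → 184756
print(comb(N, n))   # → 184756
```

The point of the cross-check is that however the combinations are generated, the output size itself is fixed at C(N, n), which for 20C10 is indeed 184756.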

Now, this is still a rather “small” example of what this application is supposed to do. Hopefully it can, at some point, deal with numbers N and n that are at least a bit larger than this.

The way the application generates these combinations at the moment, the algorithm runs in time of order (N-n)!, i.e. O(factorialOf(N-n)).

Of course, any subsequent operations that work on this entire set are then of the same order.

As you have probably recognized, this is a “factorial” rate of growth, which unfortunately is very large; but, at this point in time, none of us in the team has been able to find or implement an algorithm that generates these NCn combinations at a faster rate, or that, in other words, has a lower rate of growth in N and/or n.
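One observation that may help here: since the output itself has C(N, n) = N!/(n!·(N-n)!) elements, no generator can beat that bound, but the standard lexicographic-successor technique does reach it (up to a factor of n per combination). A hedged Python sketch of that technique (a well-known algorithm, not the poster’s code):

```python
# Iterative lexicographic generation of all n-combinations of {1, ..., N}.
# Each successor step touches at most n positions, so total work is
# O(n * C(N, n)) -- proportional to the output, not factorial in N - n.
def all_combinations(N, n):
    c = list(range(1, n + 1))              # first combination: [1, 2, ..., n]
    while True:
        yield tuple(c)
        # find the rightmost position that can still be incremented
        i = n - 1
        while i >= 0 and c[i] == N - (n - 1 - i):
            i -= 1
        if i < 0:
            return                         # last combination reached
        c[i] += 1
        for j in range(i + 1, n):          # reset the tail to the minimum
            c[j] = c[j - 1] + 1
```

Used as `list(all_combinations(4, 2))`, this yields (1,2), (1,3), (1,4), (2,3), (2,4), (3,4) in lexicographic order.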

So, to finally ( ;) ) get to my actual question: what we are wondering is if, and what, benefit we can get from “frameworks” such as OpenCL, CUDA, and other parallelization frameworks, in order to benefit from the additional processing power that is inherent in many end users’ additional hardware resources, such as GPUs, DSPs, and the like.

In this case, we are looking at OpenCL, since most of our customers seem to be working on machines whose chipsets, graphics cards, processors, etc. are manufactured by companies such as Intel, AMD, and Nvidia, all of which, at least to our knowledge, have adopted support for OpenCL.

So, again, the “main” question we have is: how, and with what performance benefit, can we implement OpenCL for a problem such as the given example of creating NCn combinations, as well as, likely, other problems/algorithms in the future?

We will be thankful for any and all input, including input that can point us to resources on how, why, when, and where implementations of OpenCL can improve overall end-user execution time.

We would also be interested in “conservative” estimates of what performance benefits COULD be possible, if such estimates are possible with the information given here.

Thanks already for any and all input, and I do sincerely hope that I, or my team members, will be visiting this and other Khronos forums more often!



Hello Hans-Albert,

welcome to the forum. First of all, it is very hard to say things like “you will be n times faster than a CPU”, because in many cases algorithms that work perfectly on a CPU architecture run very slowly on a GPU. The main question you have to ask yourself is: “How much of my computation is independent of other data?” If all of your data is independent, you will get lots of performance. If not, you won’t.
You should definitely have a look at this Udacity course: https://www.udacity.com/course/cs344
Don’t worry that it’s CUDA; most of the things you have to keep in mind for parallel algorithms apply to OpenCL (GPU) computations as well.
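To connect this to the combination problem above: one standard way to make the work fully independent is “unranking”, i.e. computing the k-th combination directly from its index k, so that each work-item can produce its own slice of the output without any communication. A sketch in Python for illustration (a real OpenCL version would express the same logic in a C kernel, with one index k per work-item):

```python
# "Unranking": compute the k-th (0-based, lexicographic) n-combination
# of {1, ..., N} directly from k, with no dependence on combination k-1.
# Because every index is independent, the indices 0 .. C(N, n) - 1 can be
# distributed across parallel work-items.
from math import comb

def unrank(N, n, k):
    combo = []
    x = 1
    while n > 0:
        # number of combinations whose smallest remaining element is x
        block = comb(N - x, n - 1)
        if k < block:
            combo.append(x)
            n -= 1
        else:
            k -= block
        x += 1
    return combo
```

For example, `unrank(4, 2, 0)` gives [1, 2] and `unrank(4, 2, 5)` gives [3, 4], matching lexicographic order. Whether this wins on a GPU still depends on memory bandwidth and on what is done with each combination afterwards, so treat it as a starting point for experiments rather than a guaranteed speedup.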