Generating heatmaps in OpenCL

I have a piece of software that’s doing many manipulations of a vector of 4D data points, using OpenCL to do the heavy lifting. I have a default way of visualizing the data, but in the process of doing the development, I’ve found it helpful to create heatmaps as well. These help me see where the data is clustered and how it evolves over time. Obviously I can’t visualize all four dimensions at once, but just looking two dimensions at a time is helpful.

Unfortunately, I haven’t figured out any decent way to create the heatmap on the GPU; it seems that any two pixels can potentially have data dependencies between each other. Here’s the basics of what I’m trying to do, using C syntax:

        typedef struct {
                uint_t x, y, z, w;
        } vect_t;

         * A[] is an array of vectors. Each component of each vector is
         * guaranteed to have a value less than VECMAX.
         * M[][] is the matrix of heatmap data that this generates.
         * If M[x][y] has the value N, then there are N vectors where
         * v.x == x and v.y == y.
         * This also calculates the largest value of any cell in the
         * heatmap, so the values can be normalized.
        int heatmap(const vect_t A[VECSIZE], uint_t M[VECMAX][VECMAX])
                int max = 0;

                for (int y = 0; y < VECMAX; y++) {
                        for (int x = 0; x < VECMAX; x++) {
                                M[x][y] = 0;
                for (int p = 0; p < VECSIZE; p++) {
                        int x = A[p].x;
                        int y = A[p].y;
                        assert(x < VECMAX && y < VECMAX);
                        int nval = ++M[x][y];
                        max = MAX(max, nval);

                return (max);

The image that this generates looks like it’s made of a mesh of dots, and that’s much better for my purposes than the more “blotchy” heatmaps that I’ve seen more often. The 4D object that I’m visualizing can be extremely complex, and I don’t think that a heatmap which blurred pixels together would be very useful.

Can anyone suggest an approach for doing this that would run well on a GPU?

This looks like a simple histogram. Use stable sort to group together vectors of the same value. Than you do something along these lines:

kernel void histogram(__global uint4 A*, __global int*M, int width){
int id = get_global_id(0);
int lid = get_local_id(0);
int4 vec = A[id]

__local uint4 left_most;
if (lid == 0){
    left_most = vec;
int equals_to_left_most = (vec.x == left_most.x && vec.y == left_most.y) ? 1 : 0;

int sum = workgroup_reduce(equals_to_left_most);
if (lid == 0){
    atomic_add(&M[left_most.x * width + left_most.y], sum);

if (lid >= sum){
    atomic_add(&M[vec.x * width + vec.y], 1);


I don’t know if there are more efficient ways to do this, but it should work fine as long as average number of vectors of the same value >> workgroup size.

[QUOTE=Salabar;42898]This looks like a simple histogram. […]

I don’t know if there are more efficient ways to do this, but it should work fine as long as average number of vectors of the same value >> workgroup size.[/QUOTE]

Yeah, it is basically a histogram. Unfortunately I don’t think my workload is amenable to this algorithm, though. Looking at a few sample CPU-generated heatmap frames, over half the data points in the map are empty, and another substantial fraction of the points have value 1 (indicating that there’s only one vector in the source data with that value). Most heatmap frames don’t have even a single value that are as large as 1/4th of the workgroup size.

Well, you can’t know until you’ve tried it, right? It is safe to assume that sorting will take a substantial portion of runtime compared to even atomics overhead. This algorithm is optimal in a sense that it performs zero atomic operations on an empty heatmap pixel and strictly one compare-and-swap when it comes to pixels of value 1. In these cases there are no actual data races to speak of, therefore atomics should be less impactful than usual. There can be more state of the art algos, but those may turn out to be a premature optimization. Also, experiment with a work group size, i.e. make it equal to the width of your GPU’s compute units.