Kernel Memory Error (illegal access inside kernel) - why?

The code inside my kernel has been tested in a single threaded environment (it was part of a doubly nested for loop).

The ‘object’ that the kernel should be operating on is a triple pointer to a struct I made (cell). It is defined as such:

cell ***cellGrid

I access it like:

cellGrid[height][width]->value = 10;

That all works fine in the single threaded (CPU only) version.

So now with the parallel version, the kernel is giving me memory errors; the runtime reports the following: “Error info: CL_OUT_OF_RESOURCES error executing CL_COMMAND_READ_BUFFER on GeForce GTX 280 (Device 0).” - which isn’t exactly illuminating.

I commented out the contents of my kernel and it did not crash. Then I added the following line back into the kernel, and it crashed:

cellGrid[0][0]->value = 5;

Here is how I’m copying the data onto the GPU, reading it from the GPU, and calling the NDRange kernel:

clEnqueueWriteBuffer(queue, gpu_memory, CL_TRUE, 0, GRID_DATA_SIZE, **cellGrid, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global_workgroup_length, local_workgroup_length, 0, NULL, &event);
clEnqueueReadBuffer(queue, gpu_memory, CL_TRUE, 0, GRID_DATA_SIZE, **cellGrid, 0, NULL, NULL);

global_workgroup_length is a 1D, 2 element array of size_t which has the values 600 and 600.
local_workgroup_length is a 1D, 2 element array of size_t which has the values 20 and 20.
GRID_DATA_SIZE = 600 * 600 * sizeof(cell)
cellGrid is a properly allocated triple pointer to a cell.
gpu_memory is a valid cl_mem object (status of CL_SUCCESS), created with the following line:

gpu_memory = clCreateBuffer(context, CL_MEM_READ_WRITE, GRID_DATA_SIZE, NULL, &status);

I feel the issue has something to do with how I’m copying the data over to the GPU (or reading it back), but I don’t see how that would matter. Though it’s a triple pointer, if I dereference twice, then it should be copying the actual cell structs, which is what I want. This comes into the kernel as a pointer to a cell struct, but I immediately cast it back to a triple pointer to a cell struct, via:

cell ***cellGrid = (cell***)_cellGrid;

I’m still questioning the legality of how I’m copying to and retrieving from the GPU. I feel that as long as I stay within the bounds of the height and width (both 600), and the memory is contiguous, I should be fine… though I still feel this could be a problematic issue. Is what I’m doing legit? Are there other problems you see?
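In case the allocation matters, here’s roughly how I set up the grid on the host (a simplified sketch - the struct body is a placeholder and error checking is stripped):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int value; } cell;   /* stand-in for my real struct */

enum { HEIGHT = 600, WIDTH = 600 };

/* One contiguous block holds every cell; the pointer tables just
 * index into it, so cellGrid[h][w] ends up pointing into the block. */
cell ***alloc_grid(void) {
    cell *data   = malloc(HEIGHT * WIDTH * sizeof *data);
    cell ***grid = malloc(HEIGHT * sizeof *grid);
    for (int h = 0; h < HEIGHT; h++) {
        grid[h] = malloc(WIDTH * sizeof *grid[h]);
        for (int w = 0; w < WIDTH; w++)
            grid[h][w] = &data[h * WIDTH + w];
    }
    return grid;
}
```

So **cellGrid (i.e. cellGrid[0][0]) does point at the start of the contiguous data block, which is why I assumed the write/read calls were fine.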

On one hand I feel that since it’s contiguous it should work… on the other hand I feel as though the “intermediate” pointers may have been lost in translation. Newb C fail?

Any help is appreciated.

I forgot to mention: the max workgroup size is 512, and my local work-group size in both dimensions is 20, so 400 < 512.

I’m hoping to access my 2D grid inside the kernel via:

int x = get_global_id(0);
int y = get_global_id(1);
cellGrid[x][y]->value = 500;

Though right now it won’t even work with 0’s for both indices…

Just another note (it won’t let me edit that post again :():

cell a;
***cellGrid = a;

inside the CL code works.

But this crashes:

cell a;
cellGrid[0][0]->value = 10;

Now sure, they’re not quite the same, but I thought they’d have the same effect…

This also works:

cell a;
cellGrid[0][0][0] = a;

But this crashes:

cell a;
cellGrid[0][0][0].b = a.b;

Something to do with accessing members of the struct, maybe? That doesn’t make sense; the CL code includes the same file where cell is defined that the host code does.

As a general rule: avoid dense 2D arrays in C. C only directly supports 1D arrays, and 2D ones are usually implemented as arrays of pointers to arrays of the element type: that leads to slow-running code, and to problems like this, because the array data isn’t stored in a single contiguous location (I have no idea about C++, if that is what you’re using).

Personally I would just use a 1D array and calculate the indices myself (x + y*width), and then everything becomes obvious. (Actually I wouldn’t even use structs unless they were small, because sparse access is slow.)

You shouldn’t even be able to create a triple pointer argument, so if you have that then it’s just a compiler bug (check for warnings). You should need to tell it the array dimensions, e.g. “global Cell foo[600][600]” (I think). Notice that is an array of items not pointers, and that the size is hard-coded making it pretty limited in usefulness (beyond fixed-size things like affine matrices).

And you definitely can’t pass an array of pointers to a GPU or any structure containing pointers, as the memory is physically separate. All you’re writing to the GPU with that triple-deref is the first slice of the array and who knows what else.

Simplest solution would be a flat array:

Cell *foo = malloc(600 * 600 * sizeof(Cell));
// or, statically: Cell foo[600 * 600];

// note: not a pointer dereference
foo[blahblah].value = 10;

And the OpenCL side:

kernel void …(…, global Cell *foo, …) {
int x = get_global_id(0);
int y = get_global_id(1);

// note: not a pointer dereference
foo[x + y*600].value = 500;
// or if you like using ‘->’
foo += x + y*600;
foo->value = 500;
}

For various other reasons the way you’re indexing things will be inefficient too. You’ll have the first 20 work items grab nearby memory, then the next 20 will start a whole grid row away, etc.

I ended up fixing the bug exactly as you described: I created a function in the host code to naively convert it to 1D, and these particular issues went away.
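For anyone searching later, the conversion was basically this (a simplified sketch - flatten_grid and the placeholder struct are my own names, not any API):

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int value; } cell;   /* stand-in for the real struct */

/* Copy a cell*** grid into one flat, contiguous array suitable for
 * clEnqueueWriteBuffer. Row-major layout: index = x + y * width. */
cell *flatten_grid(cell ***grid, size_t height, size_t width) {
    cell *flat = malloc(height * width * sizeof *flat);
    for (size_t y = 0; y < height; y++)
        for (size_t x = 0; x < width; x++)
            flat[x + y * width] = *grid[y][x];
    return flat;
}
```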

Yeah, I guess the pointer thing makes sense, it’s just confusing because it’s all contiguous, so even though the pointers are bad, it should still work. I mean, it shouldn’t, but it’s contiguous :-.

I didn’t create the argument, I just passed the pointer in and then accessed it normally as if it was an array. The problem there, though, as you pointed out, is that the other 2 pointers are in host memory, which is useless. Bad casting habits :(.

Yeah, wow, looking at the code and your description I can’t believe I did what I was doing. There’s no way it’ll work with the other 2 pointers in host mem. I guess I just thought since it was still guaranteed as a contiguous block, when I copied the whole thing in, by the size parameter, it should be the whole block, then I can treat it however I want in the GPU side, assuming the size is known. You get what I mean? I feel like it shouldn’t matter because it’s contiguous… Though I guess as long as they’re dynamically allocated in separate chunks (malloc only guarantees contiguous allocation for the single block) there’s no telling where the rest of it is.

Any way you can explain in short why accessing is inefficient? I’m very new (obviously) to parallel designs, I thought taking it chunk by chunk would be the most efficient. I gotta stick all 360,000 cells in there one way or another. So since the local workgroup is of size 20, I thought it’d take a chunk of 20, operate on them, take the next chunk of 20, operate on them, etc. I don’t see how it could be more efficient.


Your local worksize is 20x20 - so it’s actually doing it in groups of 400, well sort of. Each CU will run those 400 work items together but there isn’t enough hardware to physically run it at the same time, which means it breaks it into wavefronts - a fixed size which matches the hardware width - until they’re all done. Using AMD as an example it has a wavefront of 64 items (it’s actually implemented as 4 cycles over a 16-wide SIMD processor, but that detail is not important).

So it will sequentially run 64 work items, then another 64, until it gets to 400.

So in pseudo-parallel-c it’s doing something like this:
width = 20; // local work size
height = 20;
for (waveid = 0; waveid < width*height; waveid += 64) {
parallel int threadid = [0 … 63] {
localid[0] = (waveid + threadid) % width;
localid[1] = (waveid + threadid) / width;

// suppress any threads out of range
enable[threadid] = (localid[0] < width && localid[1] < height);
run work item;
}
}

(Actually it could be doing something else; that’s just how I imagine one would do it to most efficiently fit the hardware.)

So your memory pattern is actually:

wavefront 0:
0,0 … 19,0
0,1 … 19,1 - this row is 600 * sizeof(Cell) away from the line above
0,2 … 19,2 - etc.
0,3 … 3,3

wavefront 1:
4,3 … (64 more of those)
So you’re really going to get 1/4 of your potential memory bandwidth here, as bits of the work-group are accessing 4 widely separated chunks of memory (and it’s worse, since they are structs - see below).

The details of the hardware might make it more complicated than that, but I think that sums up the high-level situation. For example, there are normally multiple CUs working concurrently, but one can’t really consider what they’re doing in terms of algorithm development, other than that they might be there to make your total job run faster.

For array data, a local work-size of 64x1 (or some small higher multiple) works pretty well. Just round-up the global work size to this increment, and have a range-check in the kernel - the range check will be a lot cheaper than wasting SIMD lanes or memory fetches. This can be used for 2d or 1d data, but of course for 2d data it helps if the width is a multiple of 64 to start with or at least close.
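The round-up itself is just integer arithmetic - e.g. (round_up is a made-up helper name):

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to the next multiple of m (m > 0). */
size_t round_up(size_t x, size_t m) {
    return ((x + m - 1) / m) * m;
}

/* e.g. a 600-wide row with a 64x1 local size needs a global size of
 * round_up(600, 64) == 640; the kernel then range-checks:
 * if (get_global_id(0) >= 600) return; */
```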

For a 64x1 case the work groups will access memory like this:
wavefront 0:
0,0, … 63, 0
that’s it.

(Depending on the algorithm/data access/implementation, 64x1 might not be ideal; it’s just a starting point. Other things to try for arrays are 64xN where N = 1 … 8, or (64*M)x1 where M = 1 … 8, or some combination thereof. For images, (8*N)x(8*M) seems better.)

As a rule of thumb, unless it’s a particularly 2D algorithm (and/or can benefit from LDS in a 2D work topology), or unless I’m using images to store the data, I don’t use 2D work groups.

e.g. rather than using a 2D global work size of (roundup(width, 64), height) with a local size of (64, 1), just use a 1D global size of roundup(width*height, 64) with a local size of 64: the data is already flattened to a contiguous array, so it simplifies the addressing and the range checks.
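Inside the kernel the 2D coordinates then fall out of a divide and a modulo; in plain C terms (a sketch, with my own function name standing in for the kernel body):

```c
#include <assert.h>
#include <stddef.h>

/* What the kernel does with its 1D id:
 *   size_t gid = get_global_id(0);
 *   if (gid >= width * height) return;   // range check
 * Here modelled as a plain C function. */
void recover_xy(size_t gid, size_t width, size_t *x, size_t *y) {
    *x = gid % width;   /* column */
    *y = gid / width;   /* row    */
}
```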

Using structs …

If you have a struct:

struct foo {
int a, b, c, d;
float x, y, z, w;
};

If you access foo[workid].a you’re accessing only 4 bytes out of every 32, and that just doesn’t mesh with the way GPUs access memory. Again you’re down to 1/8th of your potential bandwidth (and quite likely the cache won’t help here, because by the time you’re back to local work item 0 you’ve iterated through 12k of foos … with a typical 8k cache).

If you’re going to access all the struct members anyway it won’t be so bad - hopefully the compiler will at least put the memory accesses together where it can, and the above should just be 2 x type4 accesses (an int4 and a float4). The async copy functions can be used to copy a chunk of structs to LDS as well (all threads will work in a coalesced way until done). But I normally just use primitive (vector) arrays or images.

If you’re only reading memory once, and doing a lot of work anyway, it won’t really matter; but if you have a near 1:1 fetch/write-to-ALU ratio it adds up fast. Your memory access pattern isn’t disastrously bad (assuming you’re not accessing a few fields of some giant struct), so it isn’t the end of the world - but it’s easy to make it better too.

Look up “struct of arrays vs array of structs” for further background - it’s a general memory-optimisation technique for vector CPUs (Cell B.E., or even SSE) and GPUs.
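A minimal C illustration of the two layouts (field names are arbitrary; assumes the usual 4-byte int/float with no padding):

```c
#include <assert.h>
#include <stddef.h>

enum { N = 4 };

/* Array of structs: consecutive work items reading .a touch
 * addresses 32 bytes apart - a strided, uncoalesced pattern. */
struct aos { int a, b, c, d; float x, y, z, w; };
struct aos aos_items[N];

/* Struct of arrays: consecutive work items reading a[i] touch
 * adjacent 4-byte words - the coalesced pattern GPUs want. */
struct soa {
    int   a[N], b[N], c[N], d[N];
    float x[N], y[N], z[N], w[N];
};
struct soa soa_items;
```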