"random" kernel crash after running for minutes.... HEP!?

Photovore · July 11, 2011, 2:25pm

I have a thoroughly complex kernel processing audio input data. It will run for a couple of minutes, 60 times a second, and then hang. That’s on the GPU; on the CPU it will run for hours. The input data are constantly changing, but each variable is always within prescribed ranges. I have inserted test code before uploading the inputs to the kernel each frame; in this test code, I can force these inputs to be well below their valid input range, but it still will eventually crash. (Say the valid range for a particular input is 0->400; I can force it to 0->1 and it will STILL eventually crash. I can force it to be below 0.1 and it will still ultimately bite the dust.) However, if I force the input variables to zero, the GPU will happily dance for hours. Of course, that input-free dance is not so particularly interesting.

I’m at a loss so far, though I have clues. I can make it crash much faster than 2 minutes if an input variable is high in its approved range. I can make it crash in less than 10 seconds under the right circumstances. BUT, I can’t seem to back off of those certain circumstances such that they go away. As said above, I can force the input vars into ridiculously small portions of their valid range, and the kernel (let’s call him Harlan Sanders) will eventually go belly-up. BUT, if they’re forced to actual zero, no problems puppy, we can run all day long.

To repeat, I’m a bit at a loss - although I have things that seem like clues, I have not yet figured out what they are hinting at, though I’ve been trying for a few days. Frankly, I do not expect to find a real solution by asking here; whenever I stumble over a problem in opencl it seems that my fate is to be the first to articulate that particular problem. I guess this is part of the fun of being in on a technology during its infancy!!! BUT, I want to do some serious, sustainable work with this “baby” (or, maybe, “toddler”).

Op details: MacBook Pro 2010, OS 10.6.8, nv 330M GPU, xcode 3.2.5, shorts, teeshirt.

bonus P.S. for those who’ve read this far, including a related question:
My laptop, soldier that it has proved to be, is not powerful enough for the next stage. I must sell some stocks/bonds and purchase a Mac Pro. I’m looking at the ATI 5870. So, PERHAPS my problem will simply go away when I compile the .cl for the ATI??? Maybe I have run into a bug in the nV implementation. Maybe my kernel is so complex that I’m running into undetected resource limits (it’s 1300 lines of code). So, SINCE I run fine on the CPU, perhaps I’ll have no bugs, or different bugs, on the ATI card???

Any thoughts?

Thanks, guys & dolls –
Dave

Maxim_Milakov · July 12, 2011, 9:30am

50/50

I had a kernel which ran well on an AMD GPU and crashed immediately when running on the CPU. It was a bug in the kernel.

david.garcia · July 13, 2011, 2:52pm

First you say it’s a hang. Then you describe it as a crash. Which way is it?

If it’s a crash and it depends on the values of the inputs, have you looked into your kernel to see if there is any pointer arithmetic that depends on the value of the inputs? I.e., if the inputs are large, might your code be indexing into the latter portions of an array? Is it possible that you are reading data out of bounds?

Photovore · July 13, 2011, 8:53pm

Apologies for the imprecision, David, and thanks for the suggestions. It has properties of both – the kernel seems to hang, output values stop updating, system unresponsive to keystrokes or mouse clicks for usually 5 but up to 15 seconds (though iTunes still playing in bg and mouse cursor still moves), then clFinish yields Invalid Command Queue. Because of the 5 seconds I was thinking hang + watchdog. (But, I guess since I get an error message it isn’t really a “crash”, though, just “erroring out”, right?)

I’m up-writing a struct full of read-only parameters and passing its address as the only kernel arg. The struct contains, e.g., “float fvar[20]”. The error does in fact seem not to occur if the host does not write changes into this array within the struct before enqueuing the write. However, the kernel will often run for minutes before failing, and the conditions seem as identical as they can be when processing fluctuating data within a defined range. The changing input value is between 0 and 1, and it is used to calculate a char value that is checked for range before being written to the output buffer.

• There are no arrays of variable size in the struct; also no pointers.
• I am using the notation “fvar[2]”, not “*(fvar+2)” to read the values – problem?
• I am using float instead of cl_float in defining the struct on the host. Perhaps not recommended but if that were the problem, again why would it run so well for a while.

I am Missing Something here, of course. I have mysterious clues. In the calculation that generates the char values, there are a number of individual steps, but they all involve straightforward arithmetic, and these steps don’t change over the life of a kernel’s operation. Values are repeatedly forced within range. Here’s the clue, though: when I “want” the kernel to fail quicker, I can force this input value high in its range. Never mind that it takes several attempts, or that it runs fine before failure with a value higher than that when it finally does fail. (Also, removing a line like " T = 1. - T; " where T was already between 0 and 1 made the error go away once [or, did it just fail to fail in the 5 minutes I tested that run?] – and that change should have no procedural effect, I’d think.)

Writing this down has given me an idea or two. One is to get my .cl file to work again as an xcode .c file so I can step through the intermediate values in the calculations. (I had a single chunk of code just over a week ago, with lots of #if…#else…#endif so I could copy one into the other, change one #define and go. Then I got lazy.)

…

Thanks for your input, Maxim. It probably is something in my kernel, but in my case my CPU seems easier to please than my GPU. 8)

david.garcia · July 14, 2011, 6:29pm

This is weird. Have you tried turning off optimizations in the CL compiler? You can pass “-cl-opt-disable” in the “options” argument of clBuildProgram().

Photovore · July 14, 2011, 9:09pm

Well, it’s exciting to know that I can do that!!! … but, no, it doesn’t seem to help.

I’ve been weeding out little things that shouldn’t matter … like a ternary operator that wasn’t supposed to work, and I cleaned up some math where I had workarounds to get the right function invocations from _builtin_overload because e.g. as it turns out a " - 1. " in an expression was taken as a double and needed a terminal f. There may be dozens of unnecessary little conversions going on that just aren’t tidy, and who knows what I’ll find. About ready to try copying my .cl into a .c file … let’s see if that can be done successfully in mid-sentence … well, that took about 5 minutes.

Now I have the same chunk of code (with the magic of #define) running on gpu/opencl, cpu/opencl, and cpu/xcode-c isolated with data types and algorithm tweaks that fit in opencl-c. (I also have the same algorithms working with doubles in a single thread and as a GCD fanned-out block – 5 different ways to run them, with performance jumps all along the way.) Now I can use xcode’s debugger to step through almost the exact code my kernel’s running. It doesn’t fail there, but I can look at everything it does while running a program that fails on the gpu. (My kernel includes an interpreter, and only some programs fail.) Surely I’ll find clues there! I’ll keep you posted…
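A minimal sketch of the "one source, one #define" idea described above, assuming the switch is the __OPENCL_VERSION__ macro that OpenCL C compilers predefine. The macro names KERNEL, GLOBAL and WORK_ITEM_ID, the scale_inputs function and the run_on_host driver are all hypothetical stand-ins, not the actual code from this thread:

```c
/* Hypothetical portability macros. __OPENCL_VERSION__ is predefined when the
   file is compiled as OpenCL C, so the same source can also be built as plain
   host C and stepped through in a debugger. */
#ifdef __OPENCL_VERSION__
  #define KERNEL __kernel
  #define GLOBAL __global
  #define WORK_ITEM_ID() ((int)get_global_id(0))
#else
  #define KERNEL
  #define GLOBAL
  static int host_work_item = 0;          /* set by the host loop below */
  #define WORK_ITEM_ID() (host_work_item)
#endif

/* Illustrative body only; it stands in for the real kernel. */
KERNEL void scale_inputs(GLOBAL const float *in, GLOBAL float *out, float gain)
{
    int i = WORK_ITEM_ID();
    out[i] = in[i] * gain;
}

#ifndef __OPENCL_VERSION__
/* Host-side driver: runs every "work item" serially in a single thread. */
static void run_on_host(const float *in, float *out, float gain, int n)
{
    for (host_work_item = 0; host_work_item < n; ++host_work_item)
        scale_inputs(in, out, gain);
}
#endif
```

The host build will not reproduce GPU-specific behaviour (an out-of-bounds read may go unnoticed on the CPU), but it does let you watch intermediate values for the exact inputs that fail on the device.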

StefanK · July 15, 2011, 1:05am

I have very little experience with OpenCL, but the only case where I get crashes after a while is when I use structures that “would” need some padding bytes. For the compiler, something like:
struct t
{
float4 t;
float t1;
float t2;
float t3;
};
is not valid, whereas

struct t1
{
float4 t;
float t1;
float t2;
float t3;
float padding;
};
is.

Maybe it helps…
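A rough host-side sketch of how a padding mismatch like this could arise, assuming the host mirrors the device struct with plain floats. float4_like is a hypothetical stand-in for the device's float4 (16 bytes, 16-byte aligned); the struct names are made up for illustration, and the sizes assume a 4-byte float and a C11 compiler:

```c
#include <stddef.h>

/* Stand-in for the device's float4: 16 bytes with 16-byte alignment. */
typedef struct { _Alignas(16) float s[4]; } float4_like;

/* Layout as the device would see it: the 16-byte alignment of the first
   member rounds the struct size up from 28 to 32 bytes. */
struct params_device_view {
    float4_like t;
    float t1;
    float t2;
    float t3;
};

/* Naive host mirror built from plain floats: only 4-byte alignment,
   so no trailing padding is added. */
struct params_host_naive {
    float t[4];
    float t1;
    float t2;
    float t3;
};

/* Host mirror with the padding written out explicitly. */
struct params_host_padded {
    float t[4];
    float t1;
    float t2;
    float t3;
    float padding;
};

/* Compile-time checks, so a layout mismatch fails the build instead of
   showing up minutes into a run. */
_Static_assert(sizeof(struct params_device_view) == 32, "device view is 32 bytes");
_Static_assert(sizeof(struct params_host_naive) == 28, "naive mirror is 4 bytes short");
_Static_assert(sizeof(struct params_host_padded) == 32, "padded mirror matches");
```

With an array of such structs, the naive mirror would drift by 4 bytes per element relative to the device layout, which is one way a kernel could end up reading the wrong fields.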

Photovore · July 16, 2011, 2:46pm

Thanks for the suggestion, Stefan. I will look into that next, as I did not pay attention to padding when I designed the structure-full-of-params, and it is certainly possible that tweaks of alignment are necessary.

I have reduced the failure by at least an order of magnitude in the last few days, though I didn’t know it until this morning. I have combed through the 1300+ lines of kernel, added a trailing “f” to ~150 float constants, put in over a dozen unnecessary range-checks within the kernel, and other stuff, all of which did not help at all. THEN, this morning, I put in a range-check before uploading data to the kernel, which I had done many times before but it did not help, BUT, combined with whatever I’d done before, it looked like it was fixed!!! Instead of failing in 10->90 seconds, it ran for over 5 minutes. I thought that was it!! BUT, it did eventually fail after 15 minutes. SO, I must keep my nose on my work! I will keep you guys (and gals) posted on what eventually fixes it!

ajs2 · July 18, 2011, 5:43pm

Is anything recorded on the System Console when the program hangs or crashes? Both types of failures should produce entries in the system log.

It is certainly worth filing a radar with Apple to further diagnose the problem.

Photovore · July 18, 2011, 11:15pm

Thanks, ajs2. The only error in the log is the one I put there when the enqueued read shows an Invalid Command Queue (error -36, I think).
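For reference, -36 is indeed CL_INVALID_COMMAND_QUEUE in the standard cl.h header. A small sketch of a logging helper follows; the function name cl_error_name is hypothetical, and the numeric values are copied here from cl.h only so the snippet compiles without the OpenCL headers:

```c
/* Hypothetical helper: maps a few OpenCL status codes to their names for
   log messages. Values are the ones defined in cl.h. */
static const char *cl_error_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -30: return "CL_INVALID_VALUE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    default:  return "unlisted OpenCL status";
    }
}
```

Logging the status of every enqueue and of clFinish, not just the final read, can help show which call first reports the failure.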

I need to dig further before I can justify reporting this as a bug. I’m afraid that I may have posted here prematurely. My kernel is pretty complex, interpreting a terse language which describes the calculations to perform for each work item. So, though each operation it performs is well-defined and simply constrained, a whole program is more complex – I have to step through a program that fails on the gpu, and try to find its fatal flaw. In the meantime, I am putting in many unnecessary range checks etc., none of which have helped so far. (I must do this stepping-through in single-thread xcode C, where it doesn’t fail, since we can’t step through an operating kernel … and hope that I can find insights there.)

When I find out what I’m doing wrong, I’ll post it here, although the solution to my problem may not relate very strongly to the original title of this thread!!!

Photovore · August 8, 2011, 11:35pm

Hi folks. I found my problem two days ago, and, as I thought, it has little to do with the title of this post – except that it only fails on the GPU, and only under certain conditions. It seems I did post here prematurely, and shall try not to do so in future. It was just me being stupid. But, I did say that I’d report on it when solved, so, if anyone’s still interested, here it is:

My kernel is running an interpreter. (The language is very terse, i.e. each character represents a variable or an operator, so that the cost of interpretation vs. executing compiled code is minimized. Also, the kernel’s performance under opencl is excellent, because all kernel instances in a workgroup are interpreting the same piece of script, so all branching is in lockstep.)

Over 20 years ago, I incorporated a conversion function into my software which, under certain rare conditions, sets one of its outputs to “undefined” – where “undefined” is #defined as 32767. (The normal range for this variable is 0…1.) I’ve used that function ever since and never even noticed when that value went “undefined”, it happened so rarely (and had insignificant effect when it did happen).

There is at least one circumstance where the interpreter can take that value and run it through a tweening function – i.e. which pair of values in an array does it fall between, and then do something with its position between those two values. As the tweener was expecting a max value of 1.0, giving it 32767 ran it off the end of that array, looking for a float value greater than that. Then it would take the next float from that who-knows-where-we-are-now array and pick a distance between, and then use that in further calculations.
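A minimal sketch of a tweener that cannot run off the end of its array, assuming a sorted table of at least two knot values. The names tween_position and UNDEFINED_VALUE are hypothetical; this is not the original routine, just one way to bound the search and clamp the input:

```c
#include <stddef.h>

#define UNDEFINED_VALUE 32767.0f   /* the sentinel value named in the post */

/* Hypothetical bounded tweener. Finds the pair of knots that bracket x and
   returns (segment index + fractional position within that segment).
   Assumes knots[] is sorted ascending and count >= 2. */
static float tween_position(const float *knots, size_t count, float x)
{
    /* Clamp first, so a sentinel such as 32767 (or a NaN) cannot push
       the search past the last knot. */
    if (!(x >= knots[0]))
        x = knots[0];
    if (x > knots[count - 1])
        x = knots[count - 1];

    /* The search is bounded by count, not by the data it happens to read. */
    size_t i = 0;
    while (i + 2 < count && x > knots[i + 1])
        ++i;

    float span = knots[i + 1] - knots[i];
    float frac = (span > 0.0f) ? (x - knots[i]) / span : 0.0f;
    return (float)i + frac;
}
```

For example, with knots {0, 0.25, 0.5, 1}, an input of 0.375 falls halfway through segment 1, while UNDEFINED_VALUE is clamped to the last knot instead of scanning whatever memory follows the array. Whether clamping or rejecting the sentinel outright is the right behaviour depends on what "undefined" is supposed to mean downstream.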

That’s as far as I went; I had found the problem, and I didn’t bother tracking the consequences of reading off the end of that array. Maybe it ran all the way to the end of that physical chunk of memory on the GPU without finding a float greater than 32767. Maybe it did find such a value and using it made something fail later.

In either case, as I said, I have used that subroutine for > 2 decades without problems … until trying it on my MacBook Pro’s nVidia GT 330M GPU.

So, it’s fixed, and it was just me being stupid. Sorry to bother everybody. Thanks for your attention. 'Nuff said!

== Dave

david.garcia · August 9, 2011, 5:28am

It’s great to see how the mystery was solved. Thanks!