OpenCL Spec: What Happens on Event/Command Failure

guillona · January 29, 2013, 8:30am

Hi,

I am having a hard time understanding the specification right now. What is supposed to happen if an enqueued command fails? Which function call reports the error? If I use out-of-order command queues and events, at what point do I find out about a problem from a previous event? What happens to child events?

Thanks for your help.

bknafla · January 31, 2013, 4:46am

Events are the key: when enqueueing a command you can retrieve an event associated with it. Once you have waited for that event to finish, directly or indirectly (via a later blocking enqueue call or clFinish), you can query its CL_EVENT_COMMAND_EXECUTION_STATUS via clGetEventInfo() to learn whether the command succeeded or an error occurred.
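A minimal sketch of how the queried status could be interpreted. The constants below mirror the values of CL_COMPLETE, CL_RUNNING, CL_SUBMITTED and CL_QUEUED from CL/cl.h and are redefined only so the snippet compiles without an OpenCL SDK; the names EX_*, cmd_outcome and classify_status are hypothetical and not part of OpenCL.

```c
/* Local copies of the execution-status values from CL/cl.h
   (assumption: redefined here to avoid needing the OpenCL headers). */
enum { EX_COMPLETE = 0, EX_RUNNING = 1, EX_SUBMITTED = 2, EX_QUEUED = 3 };

/* Hypothetical host-side classification of a command's state. */
typedef enum { CMD_PENDING, CMD_SUCCEEDED, CMD_FAILED } cmd_outcome;

/* Interpret a value obtained with, for example:
 *   cl_int status;
 *   clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
 *                  sizeof(status), &status, NULL);
 * A negative status means the command terminated abnormally. */
cmd_outcome classify_status(int status) {
    if (status < 0) {
        return CMD_FAILED;      /* error code from the implementation */
    }
    if (status == EX_COMPLETE) {
        return CMD_SUCCEEDED;   /* finished without error */
    }
    return CMD_PENDING;         /* queued, submitted or running */
}
```

Note that the return value of clGetEventInfo() itself only says whether the query worked; the command's own result is in the status value it writes.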

You can also register a completion callback with the event via the clSetEventCallback() call to have a supplied callback function called should the event state change.
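As a rough sketch of what such a callback might do, the function below records the first error status it sees. It has the same parameter shape as an OpenCL event callback (cl_event, cl_int, void *), but takes the event as void * so it compiles without CL/cl.h; failure_record and on_event_complete are hypothetical names. With the real headers it would be registered along the lines of clSetEventCallback(ev, CL_COMPLETE, on_event_complete, &rec).

```c
#include <stddef.h>

/* Hypothetical record owned by the host and passed as user_data. */
typedef struct {
    int failed;       /* 1 once any command has reported an error */
    int first_error;  /* first negative status seen, 0 if none */
} failure_record;

/* Callback body: a negative exec_status means the command was
   abnormally terminated, so remember the first such status. */
void on_event_complete(void *event, int exec_status, void *user_data) {
    failure_record *rec = (failure_record *)user_data;
    (void)event;  /* unused in this sketch */
    if (rec != NULL && exec_status < 0 && !rec->failed) {
        rec->failed = 1;
        rec->first_error = exec_status;
    }
}
```

The callback may run on a thread owned by the implementation, so real code would need to protect the shared record (for example with a mutex or atomics), which this sketch leaves out.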

Each enqueue call lets you specify a list of events that must complete before the enqueued command can run. Note, though, that completion covers both outcomes: success or failure.

The philosophy behind OpenCL’s asynchronous commands seems to be that you should build command queues that are correct and work. Flush the queue so that all commands (eventually) run, and sync as little as possible, e.g., via a last enqueue call that blocks or by enqueueing a barrier and waiting on its event. To be sure that no error occurred you then go through all the command events you stored and check that they were successful.
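The final check could look roughly like the sketch below. It assumes the execution status of every stored event has already been read into a plain int array after the sync point (for example with clGetEventInfo()); first_failed_index is a hypothetical helper name, not an OpenCL function.

```c
#include <stddef.h>

/* Scan the collected execution statuses and return the index of the
   first one that is negative (an error code), or -1 if none is. */
long first_failed_index(const int *statuses, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        if (statuses[i] < 0) {
            return (long)i;
        }
    }
    return -1;
}
```

Returning the index rather than a yes/no answer lets the host map the failure back to the command that produced the event.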

My view of error handling in OpenCL might be incorrect though, and more experienced devs will probably want to chime in.

guillona · January 31, 2013, 6:14am

Hi.

The issue is from OpenCL 1.2 Section 5.9. The execution status can be an error code (must be negative), and the standard specifies:

The error code is a negative integer value and indicates that the command was abnormally terminated.

Later it specifies:

If the execution of a command is terminated, the command-queue associated with this terminated command, and the associated context (and all other command-queues in this context) may no longer be available. The behavior of OpenCL API calls that use this context (and command-queues associated with this context) are now considered to be implementation-defined.

Given these two pieces of information from the standard, my interpretation is that if any executed command terminates abnormally, the entire context and all command_queues are in an invalid state, and the program should be terminated abnormally.

bknafla · January 31, 2013, 6:54am

Ah, good find!

Hm, so events can still be queried, though if the wrong event is queried it does not even show that such a “catastrophic” error happened.

I’m unsure how usable the context callback is as it might be triggered after API calls use the now invalid context and queues…

Does this mean that the behavior of a clFinish() call is undefined the moment an error happens while it waits?
Does this mean that a command only fails if the user set it up badly, and not because of internal problems? All the error checks on creating resources and enqueueing commands suggest this. Would a terminated command therefore never happen in good code, and be nothing to worry about after successful testing on the target platforms?

guillona · January 31, 2013, 8:51am

Well, I don’t know about your programs, but I cannot assume mine are so well-behaved.

Consider a situation in which there are two completely independent event graphs: one operates on memory A, the other on B. As the specification stands, if any operation on A fails, I must assume that all operations on B are now implementation-defined as well. I have no choice but to terminate the entire application! I would much rather destroy A myself and isolate the error.

Also unspecified is what happens to other events. If I have e1->e2->e3, and e1 terminates abnormally, what is the result of e2 and e3? Are they executed? What error is returned?

Some clarification is required on the “intention” of the specification here. If the intention is that the system is now in an inconsistent state, then this should be clearly stated.