Suggestions for next release of OpenCL

And you will still need to cache values once in private or local memory, because access speed is a thing.

@Salabar: To my knowledge, you cannot, for example, put an image object inside a struct. Imagine a function that returns something the user might need in order to process things further, e.g. a random seed or a coefficient. However, to compute this value, the function needs to sample 4 image objects and a float4. Without uniforms, I have to transform code that ideally looks like this:


// note: secretly a few uniforms/UBOs prepended here. User would never know, because of #line pragma.
__kernel void filter(... parameters the user cares about ...) {
    // maybe here the user gets some input data from buffers...
    type result = sub_function_the_user_wrote(info);
    // maybe here the user writes the results to some output buffer
}

type sub_function_the_user_wrote(info) {
    float coefficient = get_coefficient();
    // do calculations and return result
}

into this:


__kernel void filter(... parameters the user cares about ..., image irrelevantA, image irrelevantB, image irrelevantC, image irrelevantD, float irrelevantE) {
    // maybe here the user gets some input data from buffers...
    type result = sub_function_the_user_wrote(info, irrelevantA, irrelevantB, irrelevantC, irrelevantD, irrelevantE);
    // maybe here the user writes the results to some output buffer
}

type sub_function_the_user_wrote(info, image irrelevantA, image irrelevantB, image irrelevantC, image irrelevantD, float irrelevantE) {
    float coefficient = get_coefficient(irrelevantA, irrelevantB, irrelevantC, irrelevantD, irrelevantE);
    // do calculations and return results
}

(If there were writable global variables, we could at least put the images into those.)

With this new code, we have the following new issues:

[ul]
[li]Conceptually, get_coefficient() is a pure function whose result does not change within the execution of a single kernel instance, so there is no logical reason why it should take parameters.[/li]
[li]The code is badly decoupled: if I now want to give the user access to a new function get_other_coefficient(), the user has to change every function signature all the way down the chain to the function where he/she actually invokes it![/li]
[li]It is terribly bloated and doesn’t scale well: every time I add a new get_something() function, the user might have to add 4 new parameters to all of their functions! It also looks terrible. Note that I cannot do things like “atlas” the images into one image, since depending on the sampling mode, that would not end well.[/li]
[li]It is brittle and smelly: with so many parameters on each function, the user can easily swap two of them, producing no compile-time or runtime error. The results will simply be wrong, and likely in very subtle ways. Additionally, when the user no longer needs get_other_coefficient(), he/she is unlikely to actually remove the extra parameters (they just stay in), leaving unused parameters behind, which is a code smell.[/li]
[li]It requires significant extra logic on my part: I have to know what the user wants in each parameter and pass it into the kernel when appropriate. In the case of GL et al., I can simply always populate the uniform buffers with my (static) data, and the user may use them or not; it doesn’t matter. Admittedly this point is moot if you get to use CL 2.1, where you have clGetKernelArgInfo(). (Why did this take so long to appear? Maybe I get to use 2.1 in 6-8 years…)[/li]
[/ul]

Note how uniforms (buffers/variables/SRVs) solve the problem easily, elegantly and completely: the user never needs to see what happens behind the scenes of the library. The alternatives are brittle and ugly, do not scale as more code is added (by libraries or by the user), and require control over a bigger part of the execution process on the client side. No longer can I just say “library.populateYourConstants()”; I have to work together with the library on how to invoke the kernel.
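
To make concrete what I am asking for, here is a sketch of what it could look like. The __uniform block syntax and the clSetProgramUniform* host calls are entirely made up for illustration (this is the feature request, not existing OpenCL), and the host-side names (program, lut_image_a, weights) are assumed:

/* kernel side -- hypothetical __uniform block, declared once by the library header */
__uniform library_constants {
    read_only image2d_t lutA, lutB, lutC, lutD;
    float4 weights;
};

float get_coefficient(int2 pos) {   /* no pass-through parameters needed anymore */
    return read_imagef(lutA, pos).x * weights.x;
}

__kernel void filter(__global const float *in, __global float *out) {
    size_t i = get_global_id(0);
    out[i] = in[i] * get_coefficient((int2)(i, 0));
}

/* host side -- hypothetical calls, set once by the library, visible to every kernel */
clSetProgramUniformImage(program, "library_constants.lutA", lut_image_a);
clSetProgramUniform(program, "library_constants.weights", sizeof(cl_float4), &weights);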

I see. Still, I think uniforms are too simplistic a solution. They make the runtime perform more implicit operations, which in turn may potentially harm performance or performance portability. And they have all of the downsides of global variables. What if I want to do some operation simultaneously multiple times with different sets of uniforms? Perhaps something like this could be more flexible:

OpenCL C keyword __bundle. In a kernel, bundles are similar to structs, but they are allowed to contain samplers, __global and __constant pointers, scalar values and other bundles (no recursive definitions), and they are immutable. Arrays of bundles and pointers to bundles are not allowed. Bundles can serve as arguments to functions and kernels.
From the runtime’s standpoint, a bundle is an object created by
clCreateBundle(cl_program, “bundle_type_name”) (bundles can be shared amongst kernels from different programs if identical bundles are defined in each program)
and modified by
clSetBundleField(. . .);
The two functions behave equivalently to clCreateKernel and clSetKernelArg, respectively. The runtime should recognize a type mismatch when clSetKernelArg or clSetBundleField is called on a bundle argument and return CL_BUNDLE_TYPE_MISMATCH.
Such an object corresponds well with Vulkan’s descriptor sets. And, I believe, it does solve the problem fairly cleanly, while keeping the OpenCL framework consistent.
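
To illustrate, using only the names proposed above (the exact signatures and the field-index convention for clSetBundleField are my guess, since they were left open; program, kernel, coeff_buf and gain are assumed host-side names):

/* kernel side */
__bundle filter_state {                     /* hypothetical keyword */
    __constant float *coeffs;
    float             gain;
};

float get_coefficient(__bundle filter_state s, int i) {
    return s.coeffs[i] * s.gain;
}

__kernel void filter(__global float *out, __bundle filter_state s) {
    size_t i = get_global_id(0);
    out[i] = get_coefficient(s, (int)i);
}

/* host side -- hypothetical entry points from the proposal */
cl_bundle state = clCreateBundle(program, "filter_state", &err);
clSetBundleField(state, 0, sizeof(cl_mem),   &coeff_buf);   /* field index, like clSetKernelArg */
clSetBundleField(state, 1, sizeof(cl_float), &gain);
clSetKernelArg(kernel, 1, sizeof(cl_bundle), &state);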

I do not believe there would be any performance impact from introducing uniforms, as they are nothing more than a more flexible syntactic equivalent of kernel parameters. They are really just that: a syntactic device that makes your code more maintainable and simpler.

If I had all the resources in the world and were willing to add a compiler project with millions of lines of code to my tiny project of a few hundred kLOC and a handful of developers, and to subsequently maintain and hack around in that compiler project, I could, theoretically, perfectly replicate uniforms by transforming them into kernel parameters. I hope you agree, though, that this is really not an option for many projects. Certainly not for any in the scientific field!

What if I want to do some operation simultaneously multiple times with different sets of uniforms?

Do you mean inside a single kernel invocation or in different invocations? Inside a single kernel invocation: You cannot, end of story. Same as in all other APIs. For different invocations: no problems, there is really no difference in this regard to normal kernel parameters. You set them on a per-kernel basis, then you kick off the kernel, done. It’s no different to setting the kernel parameters and then kicking off the kernel.

Your bundle idea is certainly much better than what we have now, but I think it’s still far less elegant than the simple model uniforms give you. Having bundles still means that you will have to pass some $LIBRARY’s bundle into every single function of that library that you use. Since it’s just a single parameter now, that does reduce the code smell, but it doesn’t completely solve it. If I want to offer my library in, e.g., a “modularized” way, so that the user only needs to include what he/she actually wants to use (in order to avoid bloating the kernel and the bundle), the user will still need to potentially accept many bundles into the kernel function and then pass the correct bundle down into the function chain that will end up actually using a given bundle.

I still think the uniform solution is much better, though. It is a very simple problem, and uniforms are a very simple and elegant solution. They do not require any substantial additions to the API (e.g. look at the few, simple functions GL has for handling UBOs) and (guessing from my – admittedly limited – knowledge of compiler architecture) require no major changes on the implementor’s side to realize: their code generation most likely already works in such a way that it takes your kernel parameters and plucks them apart into individual buffers which are then made available to the code generated for your kernel function. For those kinds of implementations, something like UBOs is simply another injection point into that system that uploads buffers and makes them available to the code. For more exotic types of implementations (like FPGAs) there should also be no issues, as any CL C code that uses uniforms can be converted, through a series of syntactic modifications, into a semantically equivalent program that takes those uniforms as kernel parameters. No efficiency is lost.

As to uniforms having the same downsides as global variables: to the kernel itself, uniforms are simply constants. So I do not think they introduce any downsides that passing parameters into your function or having __constant data doesn’t already introduce. Even the purest of the purest functional languages used in the whitest of ivory towers do not deny you the capability of adding global constants (such as PI) and then using them inside any pure function (which may well be nullary).

You set them on a per-kernel basis, then you kick off the kernel, done. It’s no different to setting the kernel parameters and then kicking off the kernel.
But it means parameters cannot be shared amongst a family of kernels. I finally understood what you want. :smiley: Well, for that you clearly just take an open-source C -> SPIR-V compiler and inject a few branches of syntax tree, taken from some uniform{} block, into each function declaration and each non-built-in function call. Unnecessary parameters in ordinary functions will be optimized out automatically upon inlining. In kernels, well, it will probably require some actual thought if you care about those. It positively will not be hard. It also adds a 100k-line dependency, but oh well.

the user will still need to potentially accept many bundles into the kernel function and then pass the correct bundle down into the function chain that will end up actually using a given bundle.

Create an uber-bundle and make it an argument of every single end-user function of your library. They can simply be wrappers: user_calls_this(uber_bundle){ function_that_is_actually_called(uber_bundle.bundle_of_concrete_module);}
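
Sketched out with the hypothetical __bundle syntax from above (the module names are invented for illustration), that would look roughly like this:

__bundle module_a_state { __constant float *coeffs; };   /* one bundle per library module */
__bundle module_b_state { float gain; };
__bundle uber_bundle    { __bundle module_a_state a; __bundle module_b_state b; };

float module_a_work(__bundle module_a_state s) {         /* internal, sees only its own module */
    return s.coeffs[0];
}

/* the only function exposed to the user: a thin wrapper over the uber-bundle */
float user_calls_this(__bundle uber_bundle u) {
    return module_a_work(u.a);
}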

But it means parameters cannot be shared amongst a family of kernels.

Not entirely sure what you mean by that. Having to set the uniforms and parameters again for a new kernel doesn’t seem like a big deal to me (same thing as in GL). Or, in the case of UBOs in GL, you set the buffer once and it is then available in all programs that declare it (so it is “shared” in a way, I guess).
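
For reference, this is roughly how that sharing works in GL: fill the buffer once, bind it to a binding point, and every program whose uniform block is assigned to that binding point reads the same data (standard GL 3.1+ calls; the ubo, constants, progA and progB names are assumed):

glBindBuffer(GL_UNIFORM_BUFFER, ubo);
glBufferData(GL_UNIFORM_BUFFER, sizeof(constants), &constants, GL_STATIC_DRAW);
glBindBufferBase(GL_UNIFORM_BUFFER, 0, ubo);                 /* binding point 0 */

/* both programs see the same UBO through binding point 0 */
glUniformBlockBinding(progA, glGetUniformBlockIndex(progA, "Constants"), 0);
glUniformBlockBinding(progB, glGetUniformBlockIndex(progB, "Constants"), 0);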

Well, for that you clearly just take an open-source C -> SPIR-V compiler and inject a few branches of syntax tree, taken from some uniform{} block, into each function declaration and each non-built-in function call.

I’m going to take a doubtful stance on that unless proven otherwise. I don’t think this is a viable strategy unless you have plenty of left-over resources on your team. I have not looked at any of the open-source compilers, but getting used to the codebase of one, adding uniforms, traversing the tree backwards from the leaf nodes to the root kernel function to figure out which parameters need to be passed where, emitting some sort of RTTI so that the client can know what parameters to set, integrating the compiler’s API into your app so that you can generate the bytecode on the fly, … sounds like significantly more work than what one or two people could do over the course of a week. Additionally, it puts the burden of maintaining a larger codebase on you: having to update the compiler, port your patches, and so on.

I think there are also good reasons for not wanting to precompile your stuff to SPIR/SPIR-V, e.g. for integration with debuggers and other tools. These would of course not know about uniforms (which are now secretly transformed into parameters) unless I patch those as well. Another thing to note is that even in 2.0, SPIR ingestion is still an optional feature, so that adds another dimension of compatibility issues to the scenario (not that there’s anything to be done about that now, of course).

And again, this problem is so easily and elegantly solved in all other APIs (that I know of) by about 10 lines of code that I really think there is a strong case for adding UBOs/uniforms/etc.

Hi. I have a small future-proof suggestion regarding clEnqueueMap*.
A description of CL_MAP_WRITE should not contain the phrase “guaranteed to contain the latest bits in the region being mapped”. Such a description makes this flag equivalent to CL_MAP_READ|CL_MAP_WRITE. Imagine a runtime developer who wants to avoid cache contamination by allocating host memory from an uncacheable region of RAM (1) when the CL_MAP_WRITE flag is used. In this case, applications written by undisciplined programmers will suffer a performance hit because they read from a buffer mapped for writing, and there is nothing in the OpenCL standard to warn you off from doing that. Khronos probably has technical writers better than me, but I think the description should go something like this: “It is guaranteed that non-overwritten contents of the pointer returned by clEnqueueMap{Buffer|Image} will remain unchanged upon unmapping”. One of the ways to achieve this is to copy the whole buffer (just what we do today), but my suggestion leaves a bit more room for interpretation while requiring no functional alterations to existing implementations.

(1) Honestly, no clue if such a region is actually a thing on any existing platform. The idea itself does not seem completely unreasonable, though.
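
To illustrate the usage pattern the reworded guarantee would encourage: map for writing, only ever store through the returned pointer, and never read from it (standard OpenCL 1.x calls; queue, buf, n and produce_value() are assumed):

cl_int err;
float *p = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, n * sizeof(float),
                                       0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = produce_value(i);   /* write only; reading p[i] back could hit uncached memory */
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);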

I would like to be able to easily find out which device (one or more) is connected to the display. Currently, there is no way to do this in OpenCL. I believe Apple has some proprietary functions that allow for this, but it’s obviously not cross-platform. Please add this to the next version of OpenCL. Thanks.

Currently, the only way to asynchronously build a program object is to pass a callback to clBuildProgram. This inevitably leads to spaghetti code, because most useful OpenCL programs need to wait, at some point, for the program object to be built. The alternative of using synchronous compilation is a waste of time that would be better spent doing additional host setup.

I believe that this is because clBuildProgram uses the wrong abstraction. Callbacks are good for one-off functions which the main program does not actually rely on, like context error handlers or buffer object destruction cleanup, but they are not suitable for handling events that the main task must await at some point.

For this reason, I propose the addition of some sort of clBuildProgramWithEvent command which, instead of a callback, would use an OpenCL user event object to asynchronously notify users once a program is built. This is more idiomatic OpenCL, integrates much better with the command queue infrastructure, and thus allows more elegant OpenCL programs to be built. I am actually currently building a library-based implementation of this, but it occurred to me that the functionality is generally useful enough, and simple enough to implement, to potentially warrant inclusion into core OpenCL.

If the idea is successful, a later OpenCL release might decide to ultimately deprecate clBuildProgram and make clBuildProgramWithEvent the One True Way to build programs.
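
For what it is worth, the library-based version I mentioned boils down to something like this (a sketch, not the actual code; the wrapper name is mine, the rest are standard OpenCL 1.1+ calls):

#include <CL/cl.h>

/* fires once the asynchronous build finishes; a real version would query
   CL_PROGRAM_BUILD_STATUS here and set a negative status on failure */
static void CL_CALLBACK on_built(cl_program prog, void *user_data)
{
    clSetUserEventStatus((cl_event)user_data, CL_COMPLETE);
}

/* returns an event that completes when the build does; usable in wait lists */
cl_event buildProgramWithEvent(cl_context ctx, cl_program prog,
                               cl_uint ndev, const cl_device_id *devs,
                               const char *options)
{
    cl_event done = clCreateUserEvent(ctx, NULL);
    clBuildProgram(prog, ndev, devs, options, on_built, done);
    return done;
}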

[QUOTE=HadrienG;39615]Currently, the only way to asynchronously build a program object is to pass a callback to clBuildProgram. This inevitably leads to spaghetti code, because most useful OpenCL programs need to wait, at some point, for the program object to be built. The alternative of using synchronous compilation is a waste of time that would be better spent doing additional host setup.

I believe that this is because clBuildProgram uses the wrong abstraction. Callbacks are good for one-off functions which the main program does not actually rely on, like context error handlers or buffer object destruction cleanup, but they are not suitable for handling events that the main task must await at some point.[/quote]

Callback continuations are a time-honored and tested asynchronous technique, used in many libraries in actual production code today; Boost.Asio, a library that is currently being standardized into C++ as a Technical Specification, is a prominent example.

With callback continuations, you can choose how you want to synchronize with the asynchronous operation. That’s the purpose of the callback: to decide how that works. You could put them into some kind of concurrent queue to be read at a time when the program finds it convenient. Whatever mechanism you can think of, this function will support.

Callback continuations only lead to “spaghetti code” if you allow it to.

[QUOTE=HadrienG;39615]For this reason, I propose the addition of some sort of clBuildProgramWithEvent command which, instead of a callback, would use an OpenCL user event object to asynchronously notify users once a program is built. This is more idiomatic OpenCL, integrates much better with the command queue infrastructure, and thus allows more elegant OpenCL programs to be built. I am actually currently building a library-based implementation of this, but it occurred to me that the functionality is generally useful enough, and simple enough to implement, to potentially warrant inclusion into core OpenCL.

If the idea is successful, a later OpenCL release might decide to ultimately deprecate clBuildProgram and make clBuildProgramWithEvent the One True Way to build programs. [/quote]

Why does OpenCL need to provide something that you can do with its existing facilities? And why would they remove the ability to use any other synchronization mechanism? It seems to me that this is the whole point behind clBuildProgram, that the OpenCL developers don’t want to get into the business of telling you how to synchronize.

I most certainly do not dispute the usefulness of callbacks in general, I just believe that in this specific case, the event abstraction provided by OpenCL is a better fit.

With callback continuations, you can choose how you want to synchronize with the asynchronous operation. That’s the purpose of the callback: to decide how that works. You could put them into some kind of concurrent queue to be read at a time when the program finds it convenient. Whatever mechanism you can think of, this function will support.

OpenCL implements two different models for asynchronous processing, callbacks and events. The idea, as far as I understand it (but feel free to prove me wrong), is that events are used for anything which a command queue operation may wait upon, and callbacks are useful as a more flexible abstraction in other scenarios.

For example, the use of callbacks for context error delivery is justified, because it would be idiotic for a command queue operation to wait for an error to occur. What happens instead is that the context error interrupts the normal control flow of a program, typically in order to print a debug message on stderr and/or halt the program.

However, program object build completion is neither an exceptional event, nor a bookkeeping operation that silently happens in the background. It is an active part of the processing of an OpenCL program, which command queue items may rightfully wait upon. In a complex OpenCL program having multiple computation stages, each handled by a different program object, it would not be ridiculous to asynchronously start all program compilations in one place, then have each computation stage wait for the compilation of the associated program object to finish before proceeding.

Which is why I believe that events would be a better fit in this case. That would be more consistent with the general design of OpenCL, which is that everything which can take a lot of time is asynchronous, and any asynchronous process may wait for another asynchronous process through events.

Callback continuations only lead to “spaghetti code” if you allow it to.

Spaghetti code happens whenever one tries to synchronize with an asynchronous callback. And because OpenCL limits what can be done inside of a callback, this is bound to be needed at some point.

Consider how you would implement such main thread-build process synchronization, and I believe you will quickly find out that you would either:
1/Waste system resources by e.g. having two threads: one thread A which synchronously compiles the program, and one thread B which waits on a condition variable that will be set by thread A upon completion
2/Use a user-defined event, the way I did, to integrate your event wait into the normal flow of OpenCL asynchronous event processing.

If there is only one sane solution to a problem, then it should be integrated upstream in my opinion.

Why does OpenCL need to provide something that you can do with its existing facilities? And why would they remove the ability to use any other synchronization mechanism? It seems to me that this is the whole point behind clBuildProgram, that the OpenCL developers don’t want to get into the business of telling you how to synchronize.

Flexibility is fine and good, but good standards are also about having a consistent design and taking a stand in situations where, as I believe is the case here, there is only one reasonable way to do things.

In my experience, releasing temporary buffers in the middle of a chain of asynchronously executing kernels is currently cumbersome.
It requires either a synchronization with the device, to guarantee that all pending operations using the buffer have finished, or a clumsy event callback on a marker with a wait list.
The drawback of the first is that it introduces needless synchronization just to release memory; the disadvantage of the second, besides the horrible syntax, is that there is no guarantee as to when the callback will be invoked.
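
For the record, the workaround in question looks roughly like this (standard OpenCL 1.2 calls; queue and tmp_buf are assumed):

/* frees the temporary once everything enqueued before the marker has completed */
static void CL_CALLBACK free_temp(cl_event e, cl_int status, void *user_data)
{
    clReleaseMemObject((cl_mem)user_data);   /* the actual release of the temporary */
}

cl_event marker;
clEnqueueMarkerWithWaitList(queue, 0, NULL, &marker);   /* empty wait list: covers all prior commands */
clSetEventCallback(marker, CL_COMPLETE, free_temp, tmp_buf);
/* ...and the marker event itself still has to be released somewhere, too */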

An event, semantically, is something both host and device control, whereas compilation is purely a host-side operation. I’m not sure whether using cl_event itself won’t break anything today or somewhere in the future. If there is an actual need for this, I’d add clBuildProgramOffline, which returns a program handle right away. If you try to create a kernel before the program is compiled, that thread waits for compilation to complete. We’d also need “isCompiled” and “WaitForCompilation” functions.

But what are the use cases for this anyway? If you are building a real-time application, you by default want to compile whatever you can upfront. SPIR-V allows more or less straightforward threaded compilation built into the driver, so you won’t need to bother with it on the application side. And even if you want to use compile-time constants as an optimisation, the previous point allows you to freely do this:
SetUserEvent
BuildWithIL
ResetUserEvent
(other threads wait for event, ofc)

[QUOTE=HelsaBryan;39618]In my experience, releasing temporary buffers in the middle of a chain of asynchronously executing kernels is currently cumbersome.
It requires either a synchronization with the device, to guarantee that all pending operations using the buffer have finished, or a clumsy event callback on a marker with a wait list.
The drawback of the first is that it introduces needless synchronization just to release memory; the disadvantage of the second, besides the horrible syntax, is that there is no guarantee as to when the callback will be invoked.[/QUOTE]
Actually, you do not need to do this. Just release that buffer right away; the specification for clReleaseMemObject guarantees that the buffer won’t actually be freed until no queued command relies on it anymore:

“After the memobj reference count becomes zero and commands queued for execution on a command-queue(s) that use memobj have finished, the memory object is deleted. If memobj is a buffer object, memobj cannot be deleted until all sub-buffer objects associated with memobj are deleted.”
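
So this pattern is already safe as written (standard calls; queue, the kernels and tmp_buf are assumed, and both kernels have tmp_buf set as an argument):

clEnqueueNDRangeKernel(queue, k1, 1, NULL, &gws, NULL, 0, NULL, NULL);  /* writes tmp_buf */
clEnqueueNDRangeKernel(queue, k2, 1, NULL, &gws, NULL, 0, NULL, NULL);  /* reads  tmp_buf */
clReleaseMemObject(tmp_buf);   /* deallocation is deferred until k1 and k2 have finished */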

Hold on a second.

You say that you want clBuildProgramWithEvent, because you want to be able to queue up work based on that program’s compilation. How exactly could you do that?

After all, you don’t queue up work with programs; you queue up work with kernels. And you can’t get a kernel from a program if that program has not [i]finished[/i] compilation. Which means that this event you want to have the queue wait on must have already fired. So what’s the point of waiting on an event that’s already over?

Your whole argument is based on a pattern of usage that is impossible.

But let’s say that it did work. Let’s say that we (somehow) allow you to get a kernel from a not-yet-compiled program, and then you enqueue some work based on that.

What happens if compilation fails? Should clBuildProgramWithEvent signal the event even if it fails? That makes little sense, as it would allow you to invoke an operation with a non-existent kernel taken from a non-existent program. If it doesn’t signal the event… well, how does it communicate to you that it has failed? Furthermore, how do you abandon the work in the queue that will never be able to execute? I’m not entirely up on OpenCL 2.1, but I’m fairly sure it doesn’t have an API for removing work that you’ve enqueued. And you have to get that work out of the queue somehow, since by the nature of a queue, the later items must wait for the previous items before they can be executed.

Your idea does not make sense in many ways.

Spaghetti code happens whenever one tries to synchronize with an asynchronous callback.

If you’re going to call any callback synchronization “spaghetti code”, then you have a very loose definition of that term. This form of synchronization happens all the time in continuation-based programming. It’s not an anti-pattern or even a code-smell. It’s just what you do.

1/Waste system resources by e.g. having two threads: one thread A which synchronously compiles the program, and one thread B which waits on a condition variable that will be set by thread A upon completion

Pre-emptively declaring that the alternative is a “waste” is not a reasonable strategy for winning an argument. Especially when the other alternative is impossible, as shown above.

You don’t have to use threads for synchronization here. All you need is to have something to do between the time you start the compilation and the time you need it finished. Whether that “something” happens because you’re locked on a mutex, or because you’re checking an atomic variable or what-have-you, as long as you can put that time to good use, you’re fine.

User events are also host-side, which is why I use them in my own library-based implementation for the BuildProgramWithEvent concept I’m proposing here. They are really a great idea; I’d like to personally thank whoever introduced them into the OpenCL spec. But point taken, kernel creation is also something to watch out for in an asynchronous compilation scenario.

But what are the use cases for this anyway? If you are building a real-time application, you by default want to compile whatever you can upfront. SPIR-V allows more or less straightforward threaded compilation built into the driver, so you won’t need to bother with it on the application side. And even if you want to use compile-time constants as an optimisation, the previous point allows you to freely do this:
SetUserEvent
BuildWithIL
ResetUserEvent
(other threads wait for event, ofc)

Due to the actions of a certain hardware manufacturer, I’m currently stuck in the past with OpenCL 1.2, so for me SPIR is unfortunately still some extension that no one supports and instead I’m stuck with binaries that may stop working with a mere driver update. But I believe that even with SPIR, you still need to go through a build stage where the intermediate representation is converted into device binaries. The only difference is that the process is much faster, so perhaps it is less of a performance problem to do it synchronously.

[QUOTE=Alfonse Reinheart;39621]Hold on a second.

You say that you want clBuildProgramWithEvent, because you want to be able to queue up work based on that program’s compilation. How exactly could you do that?

After all, you don’t queue up work with programs; you queue up work with kernels. And you can’t get a kernel from a program if that program has not [i]finished[/i] compilation. Which means that this event you want to have the queue wait on must have already fired. So what’s the point of waiting on an event that’s already over?

Your whole argument is based on a pattern of usage that is impossible.[/quote]
You are right; I had overlooked this point. True asynchronous compilation would also require kernel creation commands to be able to wait for program compilation to complete. We would effectively need to make a lot more things asynchronous to make this work, including kernel creation and argument setup.

But let’s say that it did work. Let’s say that we (somehow) allow you to get a kernel from a not-yet-compiled program, and then you enqueue some work based on that.

What happens if compilation fails? Should clBuildProgramWithEvent signal the event even if it fails? That makes little sense, as it would allow you to invoke an operation with a non-existent kernel taken from a non-existent program. If it doesn’t signal the event… well, how does it communicate to you that it has failed? Furthermore, how do you abandon the work in the queue that will never be able to execute? I’m not entirely up on OpenCL 2.1, but I’m fairly sure it doesn’t have an API for removing work that you’ve enqueued. And you have to get that work out of the queue somehow, since by the nature of a queue, the later items must wait for the previous items before they can be executed.

Your idea does not make sense in many ways.

OpenCL already has a mechanism for dealing with asynchronous actions that fail, which is probably the longest OpenCL error code in the spec: CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST.

Any command with an event wait list can throw that error if it relies on an event that failed. Since event failure is transitive, this means that ultimately, at some point, the user will be informed of the compilation failure. It is then possible to isolate the failure point by checking the events in the wait list, moving up the asynchronous call chain until the culprit is found.
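
Concretely, with the user-event scheme sketched earlier: on a failed build the callback sets the event to a negative status, and everything downstream fails with that very error code (this is standard behaviour of user events; the build_done and compute_done event names are assumed):

/* in the build callback, on failure: */
clSetUserEventStatus(build_done, CL_BUILD_PROGRAM_FAILURE);   /* any negative status aborts dependants */

/* host code that waits on a command which listed build_done in its wait list: */
cl_int err = clWaitForEvents(1, &compute_done);
/* err == CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST; walk the event chain to find the cause */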

If you’re going to call any callback synchronization “spaghetti code”, then you have a very loose definition of that term. This form of synchronization happens all the time in continuation-based programming. It’s not an anti-pattern or even a code-smell. It’s just what you do.

Again, callbacks are a fine abstraction if they live a life of their own and no other code ever needs to synchronize with them. For example, if you have N independent computations, each relying on a different program object, then using the program object compilation callback to perform each computation works well.

But as soon as unrelated callbacks need to synchronize with one another, as is the case in my pipelined example where each computation relies on data from the previous one, callbacks become clunky. We effectively move outside the comfort zone of that abstraction.

Pre-emptively declaring that the alternative is a “waste” is not a reasonable strategy for winning an argument. Especially when the other alternative is impossible, as shown above.

You don’t have to use threads for synchronization here. All you need is to have something to do between the time you start the compilation and the time you need it finished. Whether that “something” happens because you’re locked on a mutex, or because you’re checking an atomic variable or what-have-you, as long as you can put that time to good use, you’re fine.

Consider this scenario. Don’t pay too much attention to the numbering; in a parallel world the events may occur in (almost) any order:

[ol]
[li]Program 1 compilation starts[/li]
[li]Program 2 compilation starts[/li]
[li]Program 1 compilation finishes, kernel 1 is created and computation 1 starts[/li]
[li]Program 2 compilation finishes, kernel 2 is created and waits for computation 1 to complete[/li]
[li]Computation 1 finishes, computation 2 proceeds[/li]
[/ol]
The problem lies at the fourth step. If we implement this naively using callbacks, one for computation 1 and one for computation 2, a race condition may occur if program 2 compilation finishes before computation 1 starts. In that case, kernel 2 will wait for an event which has not yet been created by the callback associated with the compilation of program 1, and the OpenCL runtime will say “nope” and return CL_INVALID_EVENT_WAIT_LIST.

Thus, we need synchronization, but how can we do that? If the callback handling kernel 2 sleeps on a mutex, then in a non-preemptive OpenCL implementation where callbacks are scheduled only when a clXyz command is run, the callback for program 1 compilation will never be called, and the program will deadlock.

Without using events, or reinventing them, the only way to solve this problem in such an OpenCL implementation (which is perfectly legal according to the spec) is to have the callback associated with program 2 compilation create a thread and immediately make said thread block on a mutex. In a chain of N computations, this generalizes to creating N-1 threads.

But blocked threads are system resources. They waste RAM, and they may also waste CPU time on operating systems where mutex synchronization is implemented using spinlocks, which is why it is bad practice to have many threads around whose sole purpose is to wait on some kind of semaphore. This is exactly the problem which the OpenCL event model has been designed to address, hence my proposal to extend it to program compilation scenarios.

User events are also host-side, which is why I use them in my own library-based implementation for the BuildProgramWithEvent

The question is, can a driver developer do anything you cannot? To implement this function, they’d need to create a thread and call the synchronous function, exactly the same way you would do it. As I’ve already mentioned in this topic, unless something cannot be implemented efficiently without native support, it should go in a utility library, not in the core specification. The same can be said about clEnqueueTask. It does literally nothing for the OpenCL computation model and I doubt anyone ever used it, so why is it even there?

I don’t think an implementation needs to create an additional thread here. You can instead have a small host-side command queue which takes care of the compilation process.

The beauty of asynchronous programming using command queues is that command processing is separate from command issue, so drivers are free to either execute commands synchronously one by one (as with in-order queues) or fire up multiple threads to execute independent commands in parallel. Moreover, there doesn’t need to be a 1:1 mapping between tasks and threads, which allows the implementation to use exactly the number of threads that makes sense for a given CPU/OS.

Similarly, the simplest (and often most efficient) callbacks implementations do not create additional threads. They switch between the main program and callback code only at explicit synchronization points, which in OpenCL would be the clXyz API calls, in a cooperative multitasking fashion.

Correcting myself as I went past the edit delay…

No, we don’t need to make everything asynchronous. We only need to synchronize with the program build operation at the kernel creation stage, the way Salabar discussed earlier. Kernel argument setup and so on are “lightweight” operations, and are thus best done synchronously without any asynchronous operation overhead. This also makes error management easier: we can simply have kernel creation throw CL_BUILD_PROGRAM_FAILURE if an asynchronous program build fails.

With this, we still get the benefits of asynchronous program compilation (we’re effectively dealing with program object futures, which cause no synchronization until we really need to wait for them), without needing to make the whole kernel setup asynchronous. This is much less complex than I feared.