Suggestions for next release of OpenCL

Well, yeah. There are plenty of ways to implement this, each of them available to you, and you get to choose which one fits your task best. But if you are using an API function, you are stuck with the one the driver developer liked the most. There is always the option to disregard the built-in stuff, but then what was the point?

The point I want to make here, though whether I will manage to convince everyone is still an open question, is that there are really only two reasonable ways to build program objects :

1. Synchronously build the program and return.
2. Asynchronously build the program, but still make sure that it will be built when needed (e.g. in the body of clCreateKernel).
Scenario 1 is already well taken care of by clBuildProgram without a callback. On the other hand, I believe that callbacks are a poor fit for scenario 2, and that an event-driven or future-based abstraction would work better.

The future-based scenario could be implemented as simply as by having clCreateKernel accept as an argument a program which is in the process of being built, and synchronize on it. That would address all the scenarii that I have presented so far, at minimal spec modification costs (one change to clBuildProgram to allow for asynchronous builds without callbacks, and one to clCreateKernel to allow for program build synchronization).
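
To make the proposed flow concrete, here is a minimal sketch of how usage might look under those two spec changes (this is not current OpenCL behavior: it assumes clBuildProgram may return before the build completes when no callback is given, and that clCreateKernel blocks until the build finishes instead of failing; the kernel name is just an example):

#include <CL/cl.h>

cl_kernel build_then_get(cl_program program) {
    cl_int err;
    // Assumed to return immediately under the proposal (no callback given).
    clBuildProgram(program, 0, nullptr, "", nullptr, nullptr);
    // ...unrelated host-side work could happen here while the build runs...

    // Proposed behavior: block until the build completes, instead of
    // failing with CL_INVALID_PROGRAM_EXECUTABLE.
    return clCreateKernel(program, "my_kernel", &err);
}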

The asynchronous scenario is more difficult to implement, as it involves essentially making the whole kernel creation stage asynchronous. So I could relatively easily be convinced that it is overkill.

I can be convinced that some concepts cannot be implemented efficiently (e.g. OpenGL uniforms). But if we can never trust the implementations, then there is no point in developing standards. That would be a very sad conclusion to reach on these forums :(

The reason I am trying to have my proposal merged into the standard, rather than simply implementing it myself in my library, is that 1/ OpenCL implementations can do it more efficiently than I can, and 2/ I would hate it, as a developer, if OpenCL libraries entered a state of fragmentation where they differ even on basic features that almost directly map to the spec.


Don’t get me wrong, I’m not against cool stuff added to OpenCL. But every new feature increases the complexity of the API. More complexity means more things that can go wrong at runtime or in a developer’s head (you can’t become a guru of a language if it gets bigger with each passing day). And if some functionality is redundant, it means more redundant complexity.

What I think Khronos should do is form a small working group that will develop an official cross-vendor utility toolkit. Want asynchronous compilation? Here it is. Wanna query indices of named kernel arguments? Our library can parse OpenCL C/SPIR-V to retrieve them. Wanna see a runtime exception when trying to bind a read-only buffer to a __global parameter or violating the __restricted keyword? Use this wrapper. There can be a ton of “quality of life” ideas that can be implemented without driver involvement.

I believe this is the way to go for the future of OpenCL. It is good for IHVs - they can focus more on optimisation of actually relevant code. It is good for developers - they know they can rely on an open-source utility toolkit for generic tasks, and if they don’t like something in it, an alternative solution can work just as fast.

Btw, SYCL literally horrifies me and I don’t consider it a universal solution. I’m talking about something more lightweight.

[QUOTE=HadrienG;39628]The point I want to make here, though whether I will manage to convince everyone is still an open question, is that there are really only two reasonable ways to build program objects :

1. Synchronously build the program and return.
2. Asynchronously build the program, but still make sure that it will be built when needed (e.g. in the body of clCreateKernel).

Scenario 1 is already well taken care of by clBuildProgram without a callback. On the other hand, I believe that callbacks are a poor fit for scenario 2, and that an event-driven or future-based abstraction would work better.

The future-based scenario could be implemented as simply as by having clCreateKernel accept as an argument a program which is in the process of being built, and synchronize on it. That would address all the scenarii that I have presented so far, at minimal spec modification costs (one change to clBuildProgram to allow for asynchronous builds without callbacks, and one to clCreateKernel to allow for program build synchronization).[/quote]

Here’s what confuses me. Earlier, you talked about “wasting” resources by using threads and so forth, to avoid having to wait on something. Yet all of a sudden, now you’re willing to wait, which requires you to use threads in order to keep the CPU busy.

Furthermore, what you’re talking about is a high level API. What you want is something you could build right now on top of the existing OpenCL infrastructure. Plus, that infrastructure serves needs other than your own.

For example, let’s say you need to compile 20 programs. And while you probably could go through the trouble of figuring out which programs are needed before others, who cares. You just want to wait until they’re all done, and you have other work you could be doing. So instead of using a heavy-weight synchronization like 20 promises/futures or whatever, you just have your callback atomically bump an integer. When it reads “20”, the process is done and you can go on to actually use them.
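
A minimal sketch of that counter-based approach (variable and function names are mine, error handling omitted), using clBuildProgram’s callback parameter:

#include <CL/cl.h>
#include <atomic>

static std::atomic<int> builds_done{0};

// Invoked by the OpenCL implementation when one program's build finishes
// (successfully or not); it only bumps the counter.
static void CL_CALLBACK on_build_done(cl_program, void*) {
    builds_done.fetch_add(1, std::memory_order_release);
}

// Kick off every build without blocking the calling thread.
void start_builds(cl_program* programs, int count) {
    for (int i = 0; i < count; ++i)
        clBuildProgram(programs[i], 0, nullptr, "", on_build_done, nullptr);
}

// Cheap check the host can perform between chunks of other work.
bool all_built(int count) {
    return builds_done.load(std::memory_order_acquire) == count;
}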

I see no reason why they should add this feature that serves your particular use case.

Oh and FYI: The plural of “scenario” in English is “scenarios”. Even in Italian, it is “scenari”, not “scenarii”, which is considered archaic.

  1. Implementing a basic promise/future is not a hard problem. It’s not like the OpenCL implementation is going to make use of specialized asynchronous compute units or queue processes or anything. It’ll just use whatever features are available to the OS, just like you would. Equally importantly… the efficiency of such an implementation is irrelevant. So unless you do something boneheaded in your promise/future, your application’s performance will not be impacted.

  2. That would only matter if you and some other code were interoperating within OpenCL. That is, both of you were talking to OpenCL and you needed to cooperate. If so, either you’re creating programs and they’re picking out kernels, or you’re picking out kernels from programs they create. In the first case, all you need to do is use your synchronization mechanism to make sure that the program is finished before calling the receiving API. In the second case, they will already have made sure their programs are finished before calling you. Either way, there’s no need for them to use your layered APIs directly.

I agree that standards should indeed strive to remain lean, but I’m less sure about the need for an officially supported utility library. See how that turned out for GLUT/FreeGLUT: many people try to use it these days because it has an official look to it (and old tutorials use it), in spite of alternatives like glfw usually being a better fit in terms of stability and ease of use nowadays.

If a feature is useful enough that everyone will likely need it, and hard enough to get right that everyone will likely get it wrong the first time (either from a correctness or an efficiency point of view), then it belongs in the standard. Which is why, for example, OpenCL command queues are standard: we could imagine a world in which the structure of command queue objects was publicly specified and people would build them themselves instead of calling clEnqueueSomething, but that would be needlessly error-prone for little gain.

Otherwise, Khronos probably do not need to bother themselves with the feature at all. Neither in the standard, nor in any officially supported library.

Want asynchronous compilation? Here it is.

OpenCL is an asynchronous library. In its core design, it is built upon the point of view that building an asynchronous infrastructure is hard and best left to driver and OS vendors, while users work with higher-level abstractions of asynchronous processes like events (a point of view which I agree with, especially in a programming language with no built-in concurrency primitives like C). This is why it already supports asynchronous compilation, albeit through callbacks, which IMO are a poor fit.

Wanna query indices of named kernel arguments? Our library can parse OpenCL C/SPIR-V to retrieve them.

But could it also do so with built-in kernels, binaries, vendor-specific IL, and all the other forms of program objects that are supported by OpenCL ?

Once OpenCL makes the choice to support binary/vendor-specific blobs, it also needs to help the poor souls using them make some sense of why their program is crashing.

Wanna see a runtime exception when trying to bind a read-only buffer to a __global parameter or violating the __restricted keyword? Use this wrapper.

Yes, I agree about that one. We can’t retrofit exceptions into C just because they are usually a better fit than error codes, so only a wrapper may implement OpenCL exceptions correctly (even though translating every possible error code to a similarly named exception, while important from a usability point of view, certainly is a pain).

But could it also do so with built-in kernels, binaries, vendor-specific IL, and all the other forms of program objects that are supported by OpenCL ?

There could also be a wrapper that remembers all the data needed for a binary serialized program and a plugin for each IL.

even though translating every possible error code to a similarly named exception

I mentioned these two examples because both of them are undefined behavior that an OpenCL driver doesn’t have to react to in any way. This is done for performance, but it is only relevant for CPU-bound tasks. In the other 90% of cases, having error checking will save a few hours of debugging.

Anyway, those are just examples I’ve come up with in 10 minutes. If a CLUT were to become a thing, a thorough analysis would have to be performed first.

Indeed, after considering the implications of never blocking at all, which are basically to make everything kernel-related event driven, I had to change my mind a bit on this one, and accept that in order to minimize standard changes, it’s probably best not to go all the way in this direction. Even if that means a bit of lost CPU time, which I am no fan of.

Instead, what I now consider an acceptable compromise is this :

Start building program 1
Start building program 2

Start building program 20
If there is some CPU-side processing to do, do it here
Set up kernel 1 – Synchronize with program 1’s compilation
Enqueue work with kernel 1
If there is some CPU-side processing to do, do it here
Set up kernel 2 – Synchronize with program 2’s compilation
Enqueue work with kernel 2, which has the end of the kernel 1 work in its wait list

In this scenario, we keep the device driver better fed with work than in a fully synchronous scenario (it still has compilation work to do while we set up kernels and enqueue device work), so we reap some asynchronicity benefits. But futures are less efficient than fully asynchronous processing primitives, because they make it more difficult to do CPU processing as you mentioned : one would need to periodically poll for program build status, which is not ideal. This future-based design is more of a compromise between conflicting concerns, which I am proposing as an answer to the very valid concerns that emerged in this thread :

- A reasonable OpenCL suggestion should not imply rewriting 20 pages of the standard.
- It should minimize the need for blocking.
- It should not leave users with a solution for asynchronous program builds which is either race-prone, deadlock-prone or inefficient in non-trivial cases.

Furthermore, what you’re talking about is a high level API. What you want is something you could build right now on top of the existing OpenCL infrastructure. Plus, that infrastructure serves needs other than your own.

Define high-level, though. OpenCL is not about writing data to memory-mapped registers, binding PCI interrupt listeners, deciphering gibberish hardware packets which violate their own spec half of the time, and sacrificing threads to poll the hardware status bytes that do not come with interrupt support because some hardware engineer decided that interrupt controller chips and interrupt lines were too expensive.

What I am trying to find, and what I think specifications like OpenCL are also about, is the right level of abstraction. Something which does not hide too much hardware power, and incurs as little run-time overhead as possible, but still optimizes API usability and implements a proper separation of concerns between device drivers and application developers.

For example, let’s say you need to compile 20 programs. And while you probably could go through the trouble of figuring out which programs are needed before others, who cares. You just want to wait until they’re all done, and you have other work you could be doing. So instead of using a heavy-weight synchronization like 20 promises/futures or whatever, you just have your callback atomically bump an integer. When it reads “20”, the process is done and you can go on to actually use them.

So, what you are proposing is this ?

Main program:
Start building program 1, bind to callback A
Start building program 2, bind to callback A

Start building program 20, bind to callback A
Do some CPU-side processing
Callback A:
Increment static counter
Return if static counter has not yet reached 20
Otherwise, set up kernel 1
Enqueue associated work
Set up kernel 2
Enqueue associated work, with kernel 1 work in its wait list

There are quite a number of problems with this solution, the most obvious being that a naive implementation of this design will deadlock in any non-preemptive OpenCL implementation. The reason is that unless you regularly run some clXyz function in your CPU work, you will never give the callback scheduler any chance to run, and thus your callback will never be invoked by the OpenCL implementation.

Thus, you need to remember to regularly poll the CL implementation for something in your CPU processing, making your implementation equivalent to the design I proposed above… except this time, the obvious implementation of your algorithm deadlocks, and the polling has no clear justification, which illustrates the poor usability of callbacks.

Moreover, your proposal is less efficient at keeping the driver fed with work. If there is one long-running compilation task for program 20, for example, all but one driver thread will be idle, whereas in the design I propose they could already be processing the work associated to program 1, program 2, and so on.

I see no reason why they should add this feature that serves your particular use case.

Asynchronous program builds are a very valid use case of the OpenCL standard, so much so, in fact, that they are already featured in it. The problem is that the abstraction it uses to this end, callbacks, is a poor fit, leading to many concurrency problems such as deadlocks and races in this specific scenario. This is why I believe events or futures would be a better fit, and propose that the standard be amended to use such a synchronization primitive instead in the long run.

Oh and FYI: The plural of “scenario” in English is “scenarios”. Even in Italian, it is “scenari”, not “scenarii”, which is considered archaic.

Ah, great! I’ve been struggling with this horrible spelling for god knows how long, it’s good to know that whoever taught me that was either misinformed or a few centuries late. Thanks!

  1. Implementing a basic promise/future is not a hard problem. It’s not like the OpenCL implementation is going to make use of specialized asynchronous compute units or queue processes or anything. It’ll just use whatever features are available to the OS, just like you would. Equally importantly… the efficiency of such an implementation is irrelevant. So unless you do something boneheaded in your promise/future, your application’s performance will not be impacted.

Basically, in my implementation, I have to ask the context for a user event, then in a callback poll the build status for every device to check that the program was built successfully, in order to decide what the final event status should be. Then I need to set the event status, and somewhere else in my program, another piece of code needs to react to that event. That large number of API calls can be avoided in the implementation, because by the time it would invoke a build callback, said implementation already knows very well whether the build went well or not. In a future-based or asynchronous scenario, all the implementation would have to do instead is to flip a bit in the “built” field of its cl_program structure, or set the “status” field of an event to CL_COMPLETE or CL_BUILD_PROGRAM_FAILURE.

That latter solution means far fewer round trips between the API and the user program, and no need to set up a callback, which in turn I would expect to lead to much better efficiency. Maybe you’re right that such efficiency would be irrelevant though.
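
For reference, here is a rough sketch of the workaround described above (my own simplification: error handling is omitted and the fixed-size device array is an arbitrary assumption). A user event is created, the build callback checks CL_PROGRAM_BUILD_STATUS for every device, and the event is then completed or failed so that other code can wait on it:

#include <CL/cl.h>

// Build callback: inspect the per-device build status and complete or fail
// the user event accordingly.
static void CL_CALLBACK on_built(cl_program prog, void* user_data) {
    cl_event done = static_cast<cl_event>(user_data);
    cl_uint num_devices = 0;
    clGetProgramInfo(prog, CL_PROGRAM_NUM_DEVICES,
                     sizeof(num_devices), &num_devices, nullptr);
    cl_device_id devices[16];  // assume at most 16 devices for brevity
    clGetProgramInfo(prog, CL_PROGRAM_DEVICES, sizeof(devices), devices, nullptr);

    cl_int status = CL_COMPLETE;
    for (cl_uint i = 0; i < num_devices; ++i) {
        cl_build_status bs = CL_BUILD_NONE;
        clGetProgramBuildInfo(prog, devices[i], CL_PROGRAM_BUILD_STATUS,
                              sizeof(bs), &bs, nullptr);
        if (bs != CL_BUILD_SUCCESS) status = CL_BUILD_PROGRAM_FAILURE;
    }
    clSetUserEventStatus(done, status);  // other code waits on this event
}

// Create the user event and kick off the asynchronous build.
cl_event start_async_build(cl_context context, cl_program program) {
    cl_int err;
    cl_event build_done = clCreateUserEvent(context, &err);
    clBuildProgram(program, 0, nullptr, "", on_built, build_done);
    return build_done;  // put this in the wait lists of dependent commands
}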

  2. That would only matter if you and some other code were interoperating within OpenCL. That is, both of you were talking to OpenCL and you needed to cooperate. If so, either you’re creating programs and they’re picking out kernels, or you’re picking out kernels from programs they create. In the first case, all you need to do is use your synchronization mechanism to make sure that the program is finished before calling the receiving API. In the second case, they will already have made sure their programs are finished before calling you. Either way, there’s no need for them to use your layered APIs directly.

Okay, let’s put it another way: why would Khronos standards have extensions, new releases, and a “suggestions” forum thread if they were considered so perfectly designed as to never require any change, because any new feature can be implemented in libraries?

When one designs a standard, one cannot think about everything, no matter how much effort is expended in this direction. Sometimes, better ways to do things emerge, and the standard should be updated to support these new ways. I believe that what I am proposing here is such a better way to do things.

But where would the data come from initially? Would the vendors behind vendor-specific program formats be expected to provide wrapper-specific metadata, creating fragmentation in the OpenCL ecosystem? Or would users be expected to tag their binaries themselves, in an error-prone process?

Conversely, is it really reasonable to expect wrapper libraries to know of, and keep track of, vendor-specific ILs, when vendor implementations of OpenCL already know everything there is to know about said IL?

Oh, and by the way…

I take issue with you considering future-based synchronization to be heavy-weight in this case.

All OpenCL program objects already have a CL_PROGRAM_BUILD_STATUS for every device they are associated with. To implement program futures, all that would need to be done is to add to the cl_program struct a hidden global build status, which reflects whether the program has been built for all target devices (or the build has failed).

In fact, it is safe to assume that vendor implementations of OpenCL already have such a global build status around, because they need it in order to know when they should fire up an asynchronous build callback.

Once you have such a global build status somewhere, synchronizing with a program which is asynchronously being built is nothing more than waiting on a condition variable, which has highly optimized implementations in every OS and library-based concurrency framework out there. Sure, that may not be as lightweight as incrementing an atomic counter (good for you that you remembered to make it atomic, by the way, you avoided yet another callback-originated race condition), but it is not what I would call a heavyweight synchronization primitive overall. It is certainly on the same order of complexity as calling a callback at the right time, which happens behind the scenes in your proposal.
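
To make the comparison concrete, this is roughly the amount of machinery involved, sketched with standard C++ synchronization primitives (the struct and field names are invented for illustration, not taken from any real implementation):

#include <mutex>
#include <condition_variable>

// Hypothetical internals of an implementation's program object.
struct program_build_state {
    std::mutex              build_mutex;
    std::condition_variable build_cv;
    bool                    build_finished = false;  // the hidden global build status
};

// Called by the implementation's build worker once all device builds are done.
void mark_built(program_build_state& p) {
    {
        std::lock_guard<std::mutex> lock(p.build_mutex);
        p.build_finished = true;
    }
    p.build_cv.notify_all();
}

// What a blocking clCreateKernel could do before looking up the kernel.
void wait_for_build(program_build_state& p) {
    std::unique_lock<std::mutex> lock(p.build_mutex);
    p.build_cv.wait(lock, [&] { return p.build_finished; });
}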

Higher level than is absolutely necessary to have a functioning abstraction that can be implemented on multiple kinds of hardware. What you suggest is not absolutely necessary.

[QUOTE=HadrienG;39633]So, what you are proposing is this ?

Main program:
Start building program 1, bind to callback A
Start building program 2, bind to callback A

Start building program 20, bind to callback A
Do some CPU-side processing
Callback A:
Increment static counter
Return if static counter has not yet reached 20
Otherwise, set up kernel 1
Enqueue associated work
Set up kernel 2
Enqueue associated work, with kernel 1 work in its wait list
…[/quote]

No, the callbacks just bump the count. They don’t set up kernels or enqueue work or anything.

Setting up kernels and enqueuing work happens sequentially, and it only happens after all of the programs have finished compiling.

I admit that I’m not the most experienced person with regard to the OpenCL standard, but I don’t think that behavior would be legal, relative to the standard. I could not find anything about the behavior of host asynchronous processes in OpenCL that says that implementations are allowed to progress them only if the user regularly calls OpenCL functions.

If callbacks can only be called when you call an OpenCL function, that would make most uses of continuation functions pointless. Just take your original example. Your callback wants to get a kernel from the program and enqueue work (all perfectly fine to do during a callback). But if that’s the case, why does the rest of your application need to even talk to OpenCL again until after the callback has done its job? Maybe if you have some more OpenCL work to do, but what if you’re out of OpenCL work?

Asynchronous calls are allowed to be synchronous; they’re allowed to complete by building the program immediately. But I don’t think they’re allowed to only advance when arbitrary OpenCL functions are called.

Building programs is usually an initialization-time step. That will have no impact on overall application performance or throughput. Not to mention, doing what you suggest requires that I pre-sort my programs; not all of them are necessarily independent of each other. So I would have to put them in order, so that I wouldn’t try to enqueue something that’s dependent on another program’s completion.

Oh, and I don’t recall the OpenCL specification making any ordering guarantees. That is, if you call clBuildProgram twice, I don’t think it guarantees that the second will finish only after the first.

Not to mention, you don’t just build a program and use it. You also need data, which tends to have to be loaded from disk or wherever. I can be doing that while the compile is going on. If you load buffer data first then compile, what are you going to do while your compile is moving forward?

Stop. Why do you need a user event? OpenCL events are for controlling how queues progress, and asynchronous program building is not based on queues.

You would only need to have some form of thread synchronization primitive. This is standard CPU threading stuff; having your callback tell the rest of the system that the process succeeded or failed. It doesn’t require any OpenCL interaction (beyond the clGetProgramBuildInfo stuff).

Note that OpenCL’s implementation of this would also need to do this kind of thread synchronization. By using the callback method, the user decides what kind of synchronization to use. Some user might want a mutex. Another might want an atomic counter. Or whatever.

I prefer letting users decide how best to do this, rather than forcing programmers to use the model that works best for your preferred use case.

sigh Far too many people think that if you’re arguing against a suggestion, that you’re against all changes.

I don’t think your idea is a good one. I think your idea is too focused on your particular use cases, at the expense of others. I think your idea makes it harder to implement different models of asynchronous processing.

That doesn’t mean I’m against any changes. This is just about your suggestion.

And I believe that it would require needless overhead from the OpenCL implementation, as well as remove entirely legitimate asynchronous models.

Once you have such a global build status somewhere, synchronizing with a program which is asynchronously being built is nothing more than waiting on a condition variable, which has highly optimized implementations in every OS and library-based concurrency framework out there.

Yes, and the implementation would need one such condition variable for every program object you compile. If what I need is to know when all 20 of the programs have completed, all I need is an atomic integer, which can be shared among 20 programs.

A condition variable per program is heavy-weight by comparison.

This is why synchronization should be left to the user to decide how it ought to be done.

Beyond mere functionality and portability, I believe that usability is also a design goal for a hardware abstraction layer. Otherwise, operating systems wouldn’t bother with APIs and simply tell clients to make raw kernel calls.

No, the callbacks just bump the count. They don’t set up kernels or enqueue work or anything.

Setting up kernels and enqueuing work happens sequentially, and it only happens after all of the programs have finished compiling.

Then please provide, as I did, an outline of the algorithm you would use, and in particular of how the code that sets up kernels would synchronize with the compilation callbacks.

At this point, I do not understand how you intend to simultaneously perform CPU work and wait for the callback to be called without using either multiple threads or polling.

I admit that I’m not the most experienced person with regard to the OpenCL standard, but I don’t think that behavior would be legal, relative to the standard. I could not find anything about the behavior of host asynchronous processes in OpenCL that says that implementations are allowed to progress them only if the user regularly calls OpenCL functions.

This may be a pessimistic outlook, but whenever I read specifications like OpenCL, my brain mentally goes through a quick list of things which can go wrong in implementations. In my opinion, the right way to look at specification conformance is not “this is not explicitly allowed, and so won’t happen” but rather “this is not explicitly forbidden, and thus may happen”.

All that the OpenCL spec has to say on this matter is “This callback function may be called asynchronously by the OpenCL implementation.” Notice the “may” in this sentence, because it is important. Through this wording, implementations are perfectly allowed to call callbacks synchronously, at explicit synchronization points. And in fact, depending on the hardware/OS combination you are running, this may be the best way.

To understand why, put yourself in the place of an OpenCL implementor. Your implementation is composed of two parts: a client-side library, which resides in the application’s address space and obeys normal C library rules, and a system-side device driver, which may have special OS-specific abilities (including, perhaps most importantly, that of interacting with the device).

When an OpenCL application runs, it does its own thing most of the time, and only interacts with the OpenCL API through client library function calls. In standard C, the communication may only go in this direction (application calls library), because there is no language-defined abstraction which allows the library to wake up following an injunction from the device driver, interrupt the normal control flow of the application, and jump to callback code. So if we’re aiming for maximal portability across operating systems and sticking with the standard abstractions of C, callbacks may only be called by the client-side library when the application code makes an OpenCL API call.

Now, of course, if you leave the realm of standard C and enter that of OS-specific functionality, then things can be different. On some operating systems, you get an abstraction, like UNIX signals, which allows an application process to be asynchronously interrupted by a driver process. But such abstractions are not universal. Some OSs have them, some don’t. And even when an OS has one, it may be too slow or buggy to be practical, or the security model may not allow the OpenCL driver to interrupt the application with such a signal. It really depends a lot on which OS you are running. So I believe you should not count on OpenCL implementations to use such OS-specific functionality.

If callbacks can only be called when you call an OpenCL function, that would make most uses of continuation functions pointless. Just take your original example. Your callback wants to get a kernel from the program and enqueue work (all perfectly fine to do during a callback). But if that’s the case, why does the rest of your application need to even talk to OpenCL again until after the callback has done its job? Maybe if you have some more OpenCL work to do, but what if you’re out of OpenCL work?

Asynchronous calls are allowed to be synchronous; they’re allowed to complete by building the program immediately. But I don’t think they’re allowed to only advance when arbitrary OpenCL functions are called.

See above. A standard C implementation of OpenCL callbacks simply cannot do it any other way, and we can’t count on OS-specific functionality always being available and good enough. Now you start to understand why I believe callbacks are clunky for anything which client code must synchronize on ;)

Building programs is usually an initialization-time step. That will have no impact on overall application performance or throughput. Not to mention, doing what you suggest requires that I pre-sort my programs; not all of them are necessarily independent of each other. So I would have to put them in order, so that I wouldn’t try to enqueue something that’s dependent on another program’s completion.

Oh, and I don’t recall the OpenCL specification making any ordering guarantees. That is, if you call clBuildProgram twice, I don’t think it guarantees that the second will finish only after the first.

Not to mention, you don’t just build a program and use it. You also need data, which tends to have to be loaded from disk or wherever. I can be doing that while the compile is going on. If you load buffer data first then compile, what are you going to do while your compile is moving forward?

I am sorry, I do not understand your point here. In a future-based program compilation process, you can perfectly well do this:

Load program 1 source, start building it
Load program 2 source, start building it

Load program 20 source, start building it – At this point, all programs are asynchronously being built, but the main thread is not blocked yet
Load buffer data from disk and send it to buffer 1
Load buffer data from disk and send it to buffer 2

Set up kernel 1 – This causes synchronization with the compilation process of program 1, which is hopefully finished

I also do not rely on OpenCL-provided ordering guarantees here. All I ask for is an OpenCL implementation where clCreateKernel blocks if passed as an argument a program which is not built yet, instead of aborting with an error code.

Stop. Why do you need a user event? OpenCL events are for controlling how queues progress, and asynchronous program building is not based on queues.

You would only need to have some form of thread synchronization primitive. This is standard CPU threading stuff; having your callback tell the rest of the system that the process succeeded or failed. It doesn’t require any OpenCL interaction (beyond the clGetProgramBuildInfo stuff).

You are right that there are many synchronization primitives that I could use here. I am using the OpenCL event synchronization primitive because it fits my needs (even if it may not have been designed for this specific purpose), and it is guaranteed to be there on any OpenCL-supported platform.

C, being an old programming language without any built-in concurrency primitive, cannot provide me with what I need, unless I use OS-specific functionality and thus lose portability. OpenCL chose to use C as its preferred implementation language, so if I want to propose a solution which can be implemented in the core OpenCL spec, I need to abide by the limitations of C.

Note that OpenCL’s implementation of this would also need to do this kind of thread synchronization. By using the callback method, the user decides what kind of synchronization to use. Some user might want a mutex. Another might want an atomic counter. Or whatever.

I prefer letting users decide how best to do this, rather than forcing programmers to use the model that works best for your preferred use case.

OpenCL is about platform independence. In the C programming language, synchronization mechanisms such as those you mentioned are either CPU- or OS-specific, which is very much the opposite of platform independence. C11 tries to change that, but knowing how fast compilers and toolchains are at supporting new programming language releases, it may take decades before we can assume C11 support in every OpenCL-supported C build environment. Moreover, the atomics and thread support library of C11 is marked as an optional feature in the standard, which means that even in a very distant future, we may never be able to count on it being available.

This means that OpenCL needs to provide its own synchronization primitives in order to allow for platform-independent asynchronous processing. It cannot just throw in the towel and leave it up to the user to find a cross-platform thread synchronization library that fits his or her needs. That would go against the very basic goals of the specification.

sigh Far too many people think that if you’re arguing against a suggestion, that you’re against all changes.

I don’t think your idea is a good one. I think your idea is too focused on your particular use cases, at the expense of others. I think your idea makes it harder to implement different models of asynchronous processing.

That doesn’t mean I’m against any changes. This is just about your suggestion.

Fine by me! :) Even though this conversation may look like an endless and pointless argument to an outside eye, I personally greatly appreciate your feedback on my proposal, which allows me to improve it. Even if I don’t manage to have it integrated upstream, it will be all the more awesome as a wrapper-specific feature.

And I believe that it would require needless overhead from the OpenCL implementation, as well as remove entirely legitimate asynchronous models.

…that is, deadlock- and race-prone ones, I know. We have already established that you do not like OpenCL users very much, and want all of them to go through the concurrency nightmares that you like yourself :)

Yes, and the implementation would need one such condition variable for every program object you compile. If what I need is to know when all 20 of the programs have completed, all I need is an atomic integer, which can be shared among 20 programs.

A condition variable per program is heavy-weight by comparison.

This is why synchronization should be left to the user to decide how it ought to be done.

sigh

When you see callbacks in a library, do you take it for granted that they come for free, and never wonder how they are managed by the implementation?

Any OpenCL implementation built upon sane engineering principles, where the program compilation process is decoupled from the part of the implementation that sends callback notifications, will need a condition variable to implement asynchronous callbacks. You are basically trying to implement a synchronization primitive on top of another, and then lecturing me about the efficiency of concurrent code.

I believe that I understand very well the inefficiency of what you are proposing, thank you very much.

Correcting myself again, because actually things are worse than I thought. In an OpenCL implementation where program object compilation is done in the device driver, we are basically nitpicking about the overhead of client-side thread synchronization mechanisms, while fully asynchronous callbacks as you advocate would involve the use of an interprocess communication primitive (the UNIX signal or equivalent), which in many operating systems implies a full round trip through the OS kernel with all the overhead that implies, followed by the client-side overhead of the callback call itself.

If all you want is event signalling, it is possible to build something much lighter than this. Especially if event latency is not a critical concern.

The property CL_DEVICE_HOST_UNIFIED_MEMORY should be tiered. My Mullins APU’s GPU quite rightfully reports this property to be true, but it doesn’t actually mean jack for a use case I’m currently investigating. More precisely, AMD’s driver can allocate a read-only buffer that is uncacheable (effectively write-only) by the CPU, but that is the only way for non-HSA APUs to exploit the nature of the hardware.

Gah, I apparently was misled by AMD’s documentation: non-HSA APUs CAN use zero-copy buffers. I’m not sure if the driver uses pinning or not. Writing this in order not to confuse anyone else.

Proposal:
Memory object flag “CL_MEM_TRANSIENT”. Similarly to this concept in Vulkan, the contents of transient buffers or images are undefined when a kernel launches and after it finishes. This allows efficient, application-controlled register spilling in the case of low-occupancy kernels.
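
A sketch of how the proposed flag might be used; note that CL_MEM_TRANSIENT is hypothetical and the numeric value below is only a placeholder so the sketch compiles:

#include <CL/cl.h>

// CL_MEM_TRANSIENT is the proposed, hypothetical flag; this value is a
// placeholder, NOT a real OpenCL enum.
#define CL_MEM_TRANSIENT (1u << 31)

// A scratch buffer whose contents would be undefined before a kernel starts
// and after it finishes, so the driver never has to preserve them; usable as
// application-controlled spill space.
cl_mem make_scratch(cl_context ctx, size_t bytes) {
    cl_int err;
    return clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_TRANSIENT,
                          bytes, nullptr, &err);
}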

This is in fact something that is sorely missing in OpenCL: a way to guarantee host buffer pinning for memory transfers. The only moderately reliable approach I’ve found is to use ALLOC_HOST_PTR, and even that doesn’t really guarantee that the host-side memory will be pinned.
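
For illustration, the usual attempt looks something like this (a sketch; whether the allocation is actually pinned remains implementation-defined, which is exactly the problem being pointed out):

#include <CL/cl.h>

// Ask the implementation to allocate host-accessible memory itself; this is
// the closest OpenCL gets to requesting pinned, zero-copy host memory.
void* map_host_visible(cl_context ctx, cl_command_queue q,
                       size_t bytes, cl_mem* out_buf) {
    cl_int err;
    *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                              bytes, nullptr, &err);
    // Map for writing; on a zero-copy capable device this should not copy,
    // but nothing in the spec guarantees the backing memory is pinned.
    return clEnqueueMapBuffer(q, *out_buf, CL_TRUE, CL_MAP_WRITE,
                              0, bytes, 0, nullptr, nullptr, &err);
}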

Except that SPIR-V is now independent of OpenCL (e.g., it works with Vulkan too), so it is impossible to lock its version number to OpenCL. The exceptions are SPIR 1.2 and 2.0, which were exclusive to OpenCL.

Seems this should be the official suggestion thread… so I am duplicating my post here.

I used CUDA for quite a while for deep learning and recently I started to look into OpenCL. In general the OpenCL spec looks great, except for one place where I think CUDA actually did a better job.
It is about thread safety. OpenCL is thread-safe for everything except clSetKernelArg, and that is the place which creates pain.

To put it short, OpenCL’s kernel launching relies on clSetKernelArg, which is not thread-safe, while CUDA’s calling convention, which uses void** to pass the arguments, is thread-safe (because the arguments are allocated by the caller).

cudaLaunchKernel(kernel, grid-configurations, void** args);

I understand clearly that OpenCL’s guideline says a separate kernel object should be created per thread. Still, it is hard to expose a PURE function interface in a thread-safe way. For example, consider the following code.

#include <functional>
#include <string>
#include <CL/cl.h>

class CLWorkspace {
 public:
  // Singleton giving access to the shared OpenCL context/queue.
  static CLWorkspace* Global() {
    static CLWorkspace inst;
    return &inst;
  }
  // Compile the source once and hand back a callable that launches the kernel.
  std::function<Tensor(Tensor, Tensor)> GetFunc(const std::string& source) {
    cl_kernel k;  // compile logic (clCreateProgramWithSource + clBuildProgram + clCreateKernel)
    return [k](Tensor a, Tensor b) {
      // launch logic (clSetKernelArg + clEnqueueNDRangeKernel), return the result Tensor
    };
  }
};

int main() {
auto myadd = CLWorkspace::Global()->GetFunc("myadd.cl"); 
Tensor a, b;
Tensor c = myadd(a, b);
return 0;
}

The above code is quite valid and represents a way to quickly get a std::function that can add two tensors together. It works great for the front-end user because it abstracts away details such as the kernel handle (nobody wants to juggle a list of kernel handles when performing multiple operations). Unfortunately, it is not thread-safe, and the corresponding myadd can only work on one thread.
This makes things hard for users, because a user will simply assume that the created resource is a function and can be called from multiple threads. In advanced deep learning systems where multi-threaded scheduling is used, this causes trouble when the function gets called from different threads.

Adding a function like CUDA’s would resolve this problem and make the entire API thread-safe.
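
To make the suggestion concrete, here is a purely hypothetical sketch of what such an entry point could look like, modeled on cudaLaunchKernel; the name and signature are invented and not part of any OpenCL version:

#include <CL/cl.h>

// Hypothetical, invented API: the arguments are supplied by the caller at
// launch time instead of being stored in the kernel object, so there is no
// shared mutable state and the call can be made from any thread.
cl_int clEnqueueNDRangeKernelWithArgs(
    cl_command_queue queue,
    cl_kernel        kernel,
    cl_uint          work_dim,
    const size_t*    global_work_size,
    const size_t*    local_work_size,
    cl_uint          num_args,
    const size_t*    arg_sizes,               // size of each argument
    const void**     arg_values,              // pointer to each argument's data
    cl_uint          num_events_in_wait_list,
    const cl_event*  event_wait_list,
    cl_event*        event);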

Feature request: extend vector types to n = 1 where it makes sense, e.g., for “cl_float1 foo = 1.f;” allow “foo.s0” and “foo.x”, but no “foo.hi” or “foo.lo”. This could unify C code that handles built-in vector types with that of built-in scalar types, which currently has to be treated separately.

UP-VOTE: a callback on completion of any/all events in a wait list.

Presently this is implicitly possible by picking some other operation that takes an event wait list as an argument (e.g. clEnqueueNDRangeKernel) and using the status event to trigger a call-back. If it can be done in all of these cases surely it can be done without being associated with a specific operation?
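
For what it’s worth, the existing workaround does not even need a “real” operation: clEnqueueMarkerWithWaitList (OpenCL 1.2) produces an event that completes once everything in the wait list has completed, and clSetEventCallback attaches the callback to it. A sketch (names mine, error handling omitted):

#include <CL/cl.h>

static void CL_CALLBACK on_all_done(cl_event, cl_int /*status*/, void* /*user_data*/) {
    // Fires once every event in the wait list has completed (or one has failed).
}

// Attach a callback that fires when all events in 'deps' have completed.
void notify_when_all_done(cl_command_queue q, cl_uint n, const cl_event* deps,
                          cl_event* out_marker) {
    clEnqueueMarkerWithWaitList(q, n, deps, out_marker);
    clSetEventCallback(*out_marker, CL_COMPLETE, on_all_done, nullptr);
    // The caller releases *out_marker when it is no longer needed.
}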