Official OpenCL 2.0 Feedback thread

[LEFT]Khronos Releases OpenCL 2.0 Provisional Specification for Public Review
New generation of industry open standard for cross-platform parallel programming delivers increased flexibility, functionality and performance[/LEFT]

July 22nd 2013 – SIGGRAPH - Anaheim, CA – The Khronos™ Group today announced the ratification and public release of the OpenCL™ 2.0 provisional specification. OpenCL 2.0 is a significant evolution of the open, royalty-free standard that is designed to further simplify cross-platform, parallel programming while enabling a significantly richer range of algorithms and programming patterns to be easily accelerated. As the foundation for these increased capabilities, OpenCL 2.0 defines an enhanced execution model and a subset of the C11 and C++11 memory model, synchronization and atomic operations. The release of the specification in provisional form is to enable developers and implementers to provide feedback before specification finalization, which is expected within 6 months. The OpenCL 2.0 provisional specification and reference cards are available at

“The OpenCL working group has combined developer feedback with emerging hardware capabilities to create a state-of–the-art parallel programming platform - OpenCL 2.0,” said Neil Trevett, chair of the OpenCL working group, president of the Khronos Group and vice president of mobile content at NVIDIA. “OpenCL continues to gather momentum on both desktop and mobile devices. In addition to enabling application developers it is providing foundational, portable acceleration for middleware libraries, engines and higher-level programming languages that need to take advantage of heterogeneous compute resources including CPUs, GPUs, DSPs and FPGAs.”

Updates and additions to OpenCL 2.0 include:

Shared Virtual Memory
Host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices.

Dynamic Parallelism
Device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.

Generic Address Space
Functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.

Improved image support including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or a multi-sampled OpenGL texture for improved OpenGL interop.

C11 Atomics
A subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.

Pipes are memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.

Android Installable Client Driver Extension
Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.

OpenCL BOF at SIGGRAPH, Anaheim, CA July 24th 2013
There is an OpenCL BOF “Birds of a Feather” Meeting on Wednesday July 24th at 4-5PM at the Hilton Anaheim, California Ballroom A & B, where attendees are invited to meet OpenCL implementers and developers and learn more about the new OpenCL 2.0 specification.

[li]OpenCL BOF[/li][li]OpenCL homepage[/li][/ul]


it’s great to see depth texture support, but only for 32 bit float, and 16 bit unsigned normalized integer formats. I suppose these would be GL_DEPTH_COMPONENT16 and GL_DEPTH_COMPONENT32F in OpenGL. Why is there no support for the good old 24 bit format like the GL_DEPTH24_STENCIL8 format? Can we get support for that, please?

Pipes: What is the idea behind the new pipe objects? Maybe I am too narrow-minded but I just come up with a good use case for them.

clGetKernelArgInfo: Could you add a cl_kernel_arg_info constant named CL_KERNEL_ARG_INFO_TYPE_SIZE? It would be possible to infer this information from the type name but that sounds not too nice for me.

General: Could you fix the documentation and documentation links. For example, enqueue_kernel is not listed in the side bar of the online manpages and I get asked for a password when trying to access get_default_queue.

work_group_prefixsum_{inclusive,exclusive}{add,min,max} functions are not named correctly, since they are not necessarily additions. Is it too late to change them to
{inclusive,exclusive}prefix{add,min,max} or something else that removes the “sum” part of the name?

Section 7.4 requires (for single-precision floating point numbers) an accuracy of 2.5 ulp for reciprocal and division. However, the fmod, modf, remainder, remquo functions require 0 ulp. This seems strange, since their implementation requires a division.

For me, the spec is ok but lacks these important features:

  1. A way to specify different-sized images in an image array. Critical for OpenCL-accelerated hybrid renderers ( like MentalRay, FurryBall, Octane, etc… ) and rasterizers.
    You added the get_image_width/height() which is very good… now let us to use image opaque pointers in a 2D/3D image array so we can do
    __kernel void main ( image2d_array_t arr )
    int w = get_image_width ( arr[0] ) ; // w = 128
    int h = get_image_height ( arr[0] ) ; // h = 512

    int w2 = get_image_width ( arr[1] ) ; // w = 16
    int h2 = get_image_height ( arr[1] ) ; // h = 64
    We need a system to fetch images from a large array without restrictions !
    May be you should allow us too to pass image2d_t handles to CL pipes ? :stuck_out_tongue:

  2. C++ support ( almost a partial feature set including virtual abstract functions/interfaces and some templates perhaps ). That’s critical to reduce code size and implementation. Without that, we must implement a zillion of functions, for instance to perform lighting ( with point lights, spot lights, directional lights, ambient lights… ). It would be much better to let us to use an interface class like ILight and then do virtual ILight::doLighting() as Directcompute/DX11.

  3. An extension to transmit kernels’s GPU debug info to CPU-side debugger app.
    Printf sucks and to debug using CPU CL devices is not good because the GPU behaves differently.

  4. malloc/free/new/delete C++'s operators INSIDE the kernel’s source so we can use linked lists and other kind of containers !

Nice work.

I have just a suggestion:

Very often, my kernels have the following structure:

  1. copy data from global to local memory. barrier.
  2. a subset of work-items in the work-group perform operations on the local data.
  3. barrier.
  4. another subset of work-items in the work-group perform operations on the local data in another order.
  5. barrier. copy the local data to the global memory.

the problem is that in step 2) or 4) a significant number of work-items may be “idling” *.
Of course, for avoiding idling work-items, it is possible to put steps 2) and 4) in different kernels, but then it requires a local/global memory transfer.

Then comes my suggestion: add a function for explicitly describing to the OpenCL compiler the range of active work-items between two barriers. This could enhance performance and power efficiency.

Maybe it is possible to do that with the new functions, but I have not seen how.

*: actually, I have been said that on SIMD hardware, the “idling” is only apparent, because all the work-items perform the same operations, and only the results of the operations are not written.

I would like to suggest the addition of a function that implements the concept of “wait for any event” (maybe clWaitForAnyEvent?), as described below:

The Khronos Group · GitHub

Regarding the importance of Pipes:

This is one of the most important new features of OpenCL 2.0 for 2 reasons:

  1. Pipes enable a very general way to achieve near-optimal memory performance by ‘hiding’ the memory read/write latency behind the queueing mechanism. Optimizing memory access patterns is one of the most difficult areas of OpenCL code optimization, and Pipes give us a way to achieve excellent memory utilization with a very simple and general approach.

  2. Pipes can also mitigate performance issues that arise from code that spends most of it’s time in a loop where that loop contains conditional code paths that are only triggered rarely. For example, let’s say we are searching a large dataset for certain types of patterns, and when a match is found then additional processing of that match is needed. Currently the GPU cores must pay the cost of that additional processing code on every iteration of the loop, even when no match is found! (because of the way SIMD works). With pipes, every time a match is found we can simply write an entry to a pipe and delegate the additional processing to a separate kernel that is reading from that pipe. Essentially we can now extract the seldom used (but always paid for) code from the loop and delegate it to a separate dedicated thread. The speedup to the original thread can be substantial depending on how much processing needs to be done on each match, and how often matches are actually encountered.

Both of these uses are going to significantly speed up real-life production OpenCL code that I run every day.

One thing that appears to be missing from the spec:

We need a way to read/write Pipes from the host side (i.e., not within a kernel). Not all hardware will support this, but where it is supported it will give a fantastic capability to stream data to a live kernel, and receive streamed results back in a very straightforward way. Doesn’t the HSA spec allow for host side queue read/writes as well? Why is this not in OpenCL?


Thanx :slight_smile: as mentioned in this thread, it would be better to have a kernel timeout feature for sloppy kernels. All my profiling sessions on time consuming kernels ended up crashing X

In the SIGGRAPH BOF presentation it is stated that “Writes to 3D images is now a core feature” and in the press release above “Improved image support including sRGB images and 3D image writes”. However, in the references pages for OpenCL 2.0, it still states that writes to 3D images is an optional feature “Writes to a 3D image memory object are only supported when the cl_khr_3d_image_writes extension is implemented”.

I find this confusing. Will writes to 3D images be available as a core feature and not just through an extension?

I truly hope that writes to 3D images will be mandatory as it gives huge performance gains in medical image processing field where we mostly work on 3D images. Currently, only AMD and Intel supports this feature on their GPUs.

I’d like a name for the architecture and sub-architecture via clGetDeviceInfo. Optionally also Series, Type and Version to split up the now useless CL_DEVICE_NAME. For CUDA you have compute capability, but on OpenCL you need have to have a lookup-table. For AMD you need an alike lookup-table as you get only a code-name, which is sort of a sub-architecture.

This information is very useful to support the decision which optimisation(s) can be used. For this we now only have information about the local memory, global mem cache, etc.

[QUOTE=smistad;29533]In the SIGGRAPH BOF presentation it is stated that “Writes to 3D images is now a core feature” and in the press release above “Improved image support including sRGB images and 3D image writes”. However, in the references pages for OpenCL 2.0, it still states that writes to 3D images is an optional feature “Writes to a 3D image memory object are only supported when the cl_khr_3d_image_writes extension is implemented”.

I find this confusing. Will writes to 3D images be available as a core feature and not just through an extension?

I truly hope that writes to 3D images will be mandatory as it gives huge performance gains in medical image processing field where we mostly work on 3D images. Currently, only AMD and Intel supports this feature on their GPUs.[/QUOTE]

Writes to 3D images is a core feature in OpenCL 2.0. This looks like a bug in the reference pages.

The 24-bit depth / stencil format is supported as an optional extension. These are described in the OpenCL 2.0 extension specification (refer to section 9.11)

I would really like a way to uniquely identify devices, particularly in a way that would allow identifying between platforms. What I am working on essentially is a system to distribute different tasks (unrelated processes working on different problems) between all devices in a system. However the same device could still be available on multiple platforms, and I would like to avoid scheduling a task on 2 different platforms that are using the same physical device. For example, a common and easy to detect without a feature would be if you have both the AMD and Intel platforms installed, both will expose the same physical CPU device. In that particular case it is easy to detect, but more difficult cases arise with multiple GPUs. What I want is the PCIe ID, or some device hash that would be based on it for PCIe devices for uniquely identifying a device within the system.

Hi everyone.

Just a quick note that I have read the OpenCL 2.0 specification and have made extensive notes to provide feedback. I am currently writing it up, and it will take me a week or so to compile all of my notes into something easily read. I haven’t seen much activity on this topic, and I wanted to reassure committee members that feedback is coming, so don’t pass the spec yet!

There’s a discrepancy in the OpenCL C Specification document (version 11):

  • Paragraph 6.5.1 (global) states that: “Variables defined at program scope (…) can also be declared in the global address space.”
  • Paragraph 6.5.3 (constant) states that: “Variables in the program scope must be declared in the __constant address space.”

awesome, thank you!
I hope it gets at least AMD/NV support

I don’t like the SVM mechanism you proposed in the spec… because:

  1. The existence of clSVMAlloc() sounds like a bad idea. The whole point of shared virtual memory(SVM) should be to REUSE an existing void* pointer externally-allocated using C/C++ malloc(), VirtualAlloc(), etc…
    Some libraries have their own memory manager and they are closed-source. How would I call clSVMAlloc() then?

Also, clSVMAlloc() requires a CL context which usually is linked to an specific cl_platformId and cl_deviceId set. Now, imagine I alloc a C/C++ struct and I want to share it to ALL the CL’s platforms, contexts and devices…
It would be impossible to share because I should create a void* pointer for each context with clSVMAlloc() … what if create a context per device in order to launch many kernels from different threads? The resource could not be shared !

  1. To pass the CL_MEM_READ_WRITE/CL_MEM_READ_ONLY, etc… at creation time in clSVMAlloc() may be not a good idea: I would like to use the resource as readOnly for some kernels while writeOnly for others…
    The read/write specification should be done at launch/execution time, not at creation time !

  2. clSetKernelArgSVMPointer() is a bit confusing… why I must use a different function to pass kernel’s arguments?
    Why not to use clSetKernelArg() as you do with other params?

To resume: I don’t like the mechanism you specified. I propose this instead:

  1. Let the user to allocate bytes using his own traditional C/C++ malloc/free calls (with a 4K-aligned requisite perhaps :p).
  2. Create the svmRead, svmWrite and svmReadWrite attributes and use them in the kernel’s args:

void myKernel ( const svmRead float3 input1, / “input1” will be marked CL_MEM_READ_ONLY due to “const” and svmRead */
svmWrite float output1, / “output1” will be marked CL_MEM_WRITE_ONLY */
svmReadWrite float output2 ) / “output2” will be marked CL_MEM_READ_WRITE */

Also, it would be a good idea to add some “cached” ones like svmReadCached and svmReadNonCached, so the user could control much better the need of hardware caches.

That’s much simpler and flexible… and you won’t need the clSetKernelArgSVMPointer(), clSVMAlloc() and clSVMFree() functions at all.

  1. Also, the 2.0 spec is lacking a very important feature we’re demanding for ages… The existence of a flag to disable the GPU driver’s watchdog.
    Currently, if a kernel takes more than 3-5 seconds it’s aborted and the graphics driver is reset ><

You simply cannot predict the time some tasks gonna take, because they are very branched or because your app is designed to run over very heterogeneous hardware with very different speeds ( like a Geforce 8500GTS vs a Titan ).
If have a very time consuming task that I cannot split effectively, I should get a way to indicate the driver I don’t want a stop by the watchdog omg!

So, please, add a CL_LONG_EXECUTION_TIME_KERNEL flag to clEnqueueNDRangeKernel() and clEnqueueTask().

2, Also, would be good to add a flag to indicate if the CL is a SoC/APU using shared memory, so we can plan better the buffer’s flags to use.
So, pls, add a CL_DEVICE_INTEGRATED flag to clGetDeviceInfo() as CUDA does.

  1. Finally, please, add compression formats to images… so we can use texture compression :smiley: