My wishlist - feedback OpenCL 3.0

Hope I am on the right forum :slight_smile:

Driver library level: when more than one GPU is installed (multi-GPU), support cross-API passing of some universal GPU UUID, so host applications and libraries (e.g. plug-ins) can know which GPU is which across language boundaries (outside of interop; even CPU-based apps have GPU processing nodes). Only a driver that supports multiple APIs knows this, so it is a job for driver makers.

I am not clear about OCL 3 drivers and sharing. The context is applications interfacing with dynamic libraries (plug-ins).
1.2 does have OGL sharing as part of the specs. Is that coming back as an optional module?
Or is it going away in OCL 3 drivers (meaning some apps won’t work anymore, since we currently all use OGL to OCL (or CUDA) sharing for that, i.e. so OCL, OGL or CUDA native apps can let the plugin library use the API it wants)?

SPIR-V might address the shader/compute unit part of the equation, but my wish is that there was a common cross-API library to move images (textures) and buffers around, also separating such functionality from a particular language/framework/API. It’s the same hardware somewhere down there.

Overlays, images, vertex list - are 3 topics one might consider outside of compute units/shaders.

Actually (answering myself) I see there seems to already be CUDA-VK interop since CUDA 10 (I don’t use CUDA right now). It seems OpenCL loses a lot of its potency without sharing/interop carefully defined for third-party dynamic library scenarios, where a lot of the usage is. I don’t see why OpenGL sharing would not be part of the reference card in 2020. In a way, maybe support the same sharing from the get-go, as Qt v6 plans by year-end? It’s very little API, as long as there is a clear way to enumerate and check what is supported on a given platform. Too much layering, converting and intercepting quickly turns bad.

Hi Revisionfx, thanks for the feedback!

We’re currently looking at improving multi-GPU support via extensions to OpenCL 3.0. I can’t share details right now, but I can point you to features in Vulkan that we’ve used as inspiration for similar features in OpenCL:

First, regarding a device UUID, we’ve been looking at what Vulkan returns via VkPhysicalDeviceIDProperties. If we added a similar query to OpenCL it would allow uniquely identifying devices both within OpenCL and across APIs. This last part is interesting for interop - more on this in a bit.
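A minimal sketch of how a host and a plug-in could agree on a GPU if both APIs reported a comparable UUID. It assumes each API hands back a 16-byte identifier per device (as Vulkan's `deviceUUID` does); the device names, the sample UUID bytes and the `match_devices` helper are all hypothetical and not taken from any driver.

```python
from typing import Dict, List, Tuple

UUID_SIZE = 16  # Vulkan's VK_UUID_SIZE; assumed here to apply to the other API too


def match_devices(api_a: Dict[str, bytes], api_b: Dict[str, bytes]) -> List[Tuple[str, str]]:
    """Pair device handles from two APIs that report the same UUID."""
    for uuid in list(api_a.values()) + list(api_b.values()):
        if len(uuid) != UUID_SIZE:
            raise ValueError("device UUID must be 16 bytes")
    # Index the second API by UUID so each lookup is a dictionary hit.
    by_uuid = {uuid: name for name, uuid in api_b.items()}
    return [(name, by_uuid[uuid]) for name, uuid in api_a.items() if uuid in by_uuid]


# Made-up sample data: the two APIs enumerate the same two GPUs in opposite order.
gpu0 = bytes(range(16))
gpu1 = bytes(range(16, 32))
opencl_devices = {"cl_device_0": gpu0, "cl_device_1": gpu1}
vulkan_devices = {"vk_physical_device_0": gpu1, "vk_physical_device_1": gpu0}

pairs = match_devices(opencl_devices, vulkan_devices)
```

The enumeration index alone would pair the wrong devices here, which is the case a UUID query is meant to cover.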

We’ve also been looking at an extension to provide PCI bus/device/function information similar to VK_EXT_pci_bus_info. This will provide another mechanism to uniquely identify OpenCL devices in the system, at least for devices on a PCI bus.
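As a rough illustration of the PCI-based identification, the sketch below keys devices by domain/bus/device/function, the four values `VK_EXT_pci_bus_info` reports. The `PciAddress` type and the sample addresses are hypothetical; the string form follows the common `dddd:bb:dd.f` hexadecimal convention.

```python
from typing import NamedTuple


class PciAddress(NamedTuple):
    """Hypothetical key built from PCI domain, bus, device and function numbers."""
    domain: int
    bus: int
    device: int
    function: int

    def __str__(self) -> str:
        # Conventional hexadecimal "domain:bus:device.function" notation.
        return f"{self.domain:04x}:{self.bus:02x}:{self.device:02x}.{self.function:x}"


# Made-up addresses for two cards; sorting tuples gives a stable PCI order.
addresses = [PciAddress(0, 0x65, 0, 0), PciAddress(0, 0x01, 0, 0)]
pci_order = [str(a) for a in sorted(addresses)]
```

Two APIs that report the same four numbers for a device can use the tuple as a shared key, at least for devices that sit on a PCI bus.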

Do either (or both) of these mechanisms sound like they will meet your needs?

Regarding interop, we’ve been looking at modernizing the OpenCL interop mechanisms using a method similar to Vulkan’s “external memory”. Neil touched on this briefly in his IWOCL OpenCL Update, see slide 14. We plan to use this mechanism for sharing with Vulkan initially, but hopefully other APIs also in the near future. This mechanism pairs very well with a UUID query, since a UUID or LUID can match a device in another API with an OpenCL device to potentially share memory more efficiently.

One note, since you noticed we removed OpenGL sharing from the reference card: we actually removed all extensions from the reference card purely to save space. Previous OpenCL reference cards were 20+ pages and were a bit too cumbersome. We know the OpenGL sharing extension is still important; it is still documented in the OpenCL extensions spec, and OpenCL 3.0 implementations may still choose to support it.
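Since support is per implementation, a plug-in would check at run time. A minimal sketch, assuming the device's extension list has already been fetched as the usual space-separated string; the sample string is made up and `has_extension` is a hypothetical helper, not an OpenCL call.

```python
def has_extension(extension_string: str, name: str) -> bool:
    """Return True if `name` appears as a whole token in a space-separated list."""
    return name in extension_string.split()


# Made-up sample of what a device might report; not real driver output.
sample_extensions = "cl_khr_icd cl_khr_gl_sharing cl_khr_gl_event"

gl_sharing_available = has_extension(sample_extensions, "cl_khr_gl_sharing")
```

Matching whole tokens rather than substrings avoids a false positive when one extension name is a prefix of another.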

If you have any additional feedback, definitely let us know. Thanks!

long email with multiple topics

To better abstract the cross-API problem, as a hypothesis first imagine a host application that is CPU based (it passes images via RAM) yet implements compute-intensive nodes/operators on the GPU. These nodes are dynamic libraries. A user system has 2 GPUs (for simplicity the same model, and then what counts as the same model: are RTX 270 and 280 the same model?), and during render, for example (which could be in the same process space or not), it alternates which GPU renders which frame (often these systems actually chunk in segments of something like 10 frames instead of purely alternating, and some actually have branches splitting compute even in an interactive session on a complex project). If you are a third party to that application, and that application is currently running on an nVidia card whose driver library might support DirectX, OpenCL, OpenGL, CUDA and Vulkan… chances are that the plugin library and the host application don’t speak the same API. Right now there is no service from the device driver vendor to report across APIs which GPU is which, so that tasks can be executed on some agreed GPU.
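To make the chunked alternation concrete, here is a small sketch of how a host might assign frames to GPUs. The function name and the default chunk of 10 frames are illustrative assumptions; real hosts use their own schedulers.

```python
def gpu_for_frame(frame: int, num_gpus: int, chunk: int = 10) -> int:
    """Return the index of the GPU that renders `frame` when work is split in chunks."""
    if num_gpus < 1 or chunk < 1:
        raise ValueError("num_gpus and chunk must be positive")
    return (frame // chunk) % num_gpus


# With 2 GPUs: frames 0-9 go to GPU 0, 10-19 to GPU 1, 20-29 back to GPU 0.
assignment = [gpu_for_frame(f, num_gpus=2) for f in (0, 9, 10, 19, 20)]
```

The index this returns is only meaningful if the host and the plug-in agree on which physical GPU "0" and "1" are, which is exactly the missing service.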

For GPU UUID/LUID, which is better: PCI bus level or not? Does PCI bus level address an eGPU on Thunderbolt 3 (possibly with 2 GPUs inside the eGPU), an embedded GPU, a dual GPU sharing one PCI connection, or…? I think in nVidia’s case they even have an environment variable to switch from their fastest-first heuristic table to PCI order, and I read that in VK the order is reversed compared to CUDA… We need one way that is universal and GPU API agnostic (and eventually non-GPU compute devices should be reported too).

There is a small set of things that, if done at API level, implies an N to N mapping; in a perfect world N to 1 would be better. Sometimes many APIs are better than trying to fit everything in one. The slide you refer to is not very clear. You do need to maintain at API level the associated semantics, e.g. an image/texture uses the hardware interpolator. There are other data types than images one might want to share on the GPU (e.g. audio, point data streams (a list of something)). I see there is now Shared Virtual Memory in the specs; I am still getting up to speed.

With regard to interop/sharing, I am all for images and buffers being mapped outside of an API per se. In a way like Apple image textures and buffers from a CVL surface (Metal and OpenGL today). It’s how it works, for example, in Apple Video Pro Apps today. And it’s traditionally how people have used OpenGL to mix and match OpenCL, CUDA and GL processing with shaders.

The only thing I ever read is usually the reference card. It would still be a good idea to create a reference sheet addendum documenting that the existing OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_gl_sharing is still in the 3.0 docs. We don’t want this to be an EXT per vendor.

I heard the word module mentioned at the Zoom panel. Is this the terminology for a set of values and/or API calls in a color-coded theme in the reference card? Some other APIs call these suites. I haven’t looked at 3.0 in much detail yet, but the reference card is not clear to me as to what is optional and what is not. Shouldn’t there be, for each of these modules, a top-level query, a moduleSupported(“moduleString”, version) sort of thing? In retrospect, if you look at how DirectX interop/sharing evolved, it’s kind of bad. It should just be “Direct3D”, 12 … which gives you the correct module pointer, as well as a simple way, for older versions of apps versus a dynamic library, to regress until one version is a match.
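A sketch of what I mean by such a query, written in Python for brevity. Everything here is hypothetical: the `PROVIDED` table stands in for whatever a driver would expose, and neither `module_supported` nor `negotiate` exists in OpenCL.

```python
from typing import Dict, Optional, Set

# Hypothetical table of what a driver exposes: module name -> supported versions.
PROVIDED: Dict[str, Set[int]] = {"Direct3D": {9, 11, 12}, "OpenGL": {3, 4}}


def module_supported(name: str, version: int,
                     provided: Dict[str, Set[int]] = PROVIDED) -> bool:
    """Top-level query: is this exact module version available?"""
    return version in provided.get(name, set())


def negotiate(name: str, wanted: int,
              provided: Dict[str, Set[int]] = PROVIDED) -> Optional[int]:
    """Regress from `wanted` downwards until one version is a match."""
    for version in range(wanted, 0, -1):
        if module_supported(name, version, provided):
            return version
    return None


newest = negotiate("Direct3D", 12)    # exact match
fallback = negotiate("Direct3D", 10)  # regresses past 10 to the next one down
missing = negotiate("Metal", 2)       # module not offered at all
```

An older plug-in asks for the version it was built against and walks down; a host only needs the one enumeration entry point.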

In general OpenCL should decouple itself from VK as opposed to trying to be like it (except for those who are paid by the line of code). I think the 3 classes of tasks where compute APIs like OpenCL have historically been used are: video processing (loosely defined as more than streaming a video for playback), video analysis (as a general term that includes all variants of Machine Learning), and things like crypto-mining. For the latter, the task is well defined, and whether a compute device qualifies is based on whether it provides the right answer or not. Those are classic compute tasks, and trying to do this as a GLSL shader in VK is a bad idea.

Also at the panel there was a discussion about OpenCL over Metal. I am not sure about all these layers, interceptors and converters, particularly if they pass through GLSL like in MoltenVK. It seems that right now, if a host is natively Vulkan today, VK should just render OpenGL and have OpenCL interop from that as a first pass. I am not aware of a commercial professional product that runs natively on Vulkan and is shipping yet. On Mac it’s a bit silly if one actually tried to use a converter chain like OpenCL to VK GLSL to Metal shaders in the end. What could go wrong? :slight_smile: - Comment: I am sure someone at nVidia has written a CUDA over Metal library; it is not clear nVidia would sell CUDA as software only, like Apple selling their OS for Hackintosh or Microsoft doing the same for a VM like Parallels on a Mac… The point is that in such a case there is nothing about graphics rendering in the usage. The only thing used in video processing that is graphics in nature that I can think of is image meshing (i.e. warping), which is still faster in a render scheme than via a compute API. So instead of worrying about graphics library interop, another approach would be to have a hook for a client graphics library and pass it a small set of abstracted calls (not that many are needed). Sort of the reverse of a compute barrier in a GPU API, for processing that is not a graphics rendering pipeline per se.

There might be some bloat in the specs right now, similar to the VK specs, in the API. For example, who came up with CL_UNORM_INT_101010_2? I know this nonsense started with V2. There is a flattening of pixel component byte order and some colorspaces, e.g. sRGB, into one long list. These should be separate. sRGB(…) should be struck out of the image formats; it’s a colorspace. If you want to predefine some colorspaces, fine, but don’t do it at the pixel component/data type level.

I understand you want edge computing to support OpenCL too, and those devices won’t be supporting OpenGL (or Vulkan). Right now there is a bit of a double language: OpenCL 1.2 C (the lowest common denominator for Objective-C, C#, C and C++) as the compliance level versus OpenCL 3.0 features as optional.

// Getting in OT space here:

I like google: “Mixed OpenGL - Metal”, having issue with web interface not liking links

Imagine if we could use the AMD/Intel embedded graphics to render the GUI and still have a viewport rendering some compute-intensive 3D thing… I understand a driver owns a screen right now, but on Windows you can set an application to power saving and it will run on the embedded graphics, so this sort of driver synchronization might be possible and useful, even if one would need to implement it as IPC or something. I remember when I first started to support OpenCL I got tired of swapping graphics cards, so I figured I could install the AMD driver, then the NVIDIA driver, and if there was a monitor connected to an AMD card and one to an NVIDIA card I could test both in parallel.

BTW and OT: I discovered this while googling following the panel, google: “ISPC”
I don’t think we are candidates for Intel-like high-level parallel_for single source, but I am going to try this toolkit. We do have a layer of templated inlines calling macros, basically different code paths, and there is an SSE/AVX crack in the house, but it’s not me. I can’t write intrinsic code from scratch without a cheat sheet in front of me, and at about one word per minute… :slight_smile: - being able to go from OpenCL to vectorized CPU code and back and to test the translation sounds perfect for me :slight_smile:

Here’s how I would represent pixels in this API to remove historical baggage.

pixel channel num: 1,2,3,4,8,16,…//after 4 switches to power of 2 indexing wise
pixel raw data type (storage type): unsigned_BYTE, unsigned_SHORT, unsigned_INT, unsigned_HALF_FLOAT, unsigned_FLOAT
(maybe ok to have 64b channels although not of much use in practice for pixel arrays – BIG)
pixel byte order (packed channel format, 0-1-2-3): RGBA, BGRA, ABGR /* ABGR 4 chan only,under 4 maps to BGRA */
only 3 I know in use. It does not represent color just an index. Note Metal does not have ABGR.
e.g.: 1 channel - implicit, 2 Channels - RG or BG from same RGBA,BGRA,…, enumeration – 3 channels RGB or BGR
no one uses palettes anymore or 565 etc, you don’t need to be backwards compatible to IRIS GL - however this would be pixel compression/packing scheme.
Don’t put in API as it can generate a lot of useless auto-generated code branches.
So instead: pixel compressed data type? [raw versus compressed, a different list]
pixel color space: does not belong to this API
(sRGB,YUV,LAB,rec 709, rec 2020,…). Imply either a LUT or a transfer function.
By convention YUV would map to RGBA: GbGr(A), bGrG(A)
Finally, some buffers require pixel parametrization as well, often black point, white point and gamma or log, perhaps an over-range boolean. For example, one might be streaming 10-bit RGB in a 32-bit container per pixel. The point here is that 10-bit is 0-1023 (1023 being the white point), for example. This scheme is probably only of use in an FPGA; on a normal GPU it would be scaled to 16 bit. Similarly for 16_235… All this usually belongs in codec space, not in internal processing between IO. Similarly for depth buffer handling…
so maybe a special pixel mapping: double black; double white; gamma or log (linear = gamma 1.0); maybe a boolean over-range, with a callback pointer in the app for the transfer function. Basic colorspaces can be enumerated, but beyond that they should be support code examples, not API entries. Cameras these days have arbitrary log-gamma curves that can’t be directly represented in an API like this. I am forgetting one case: planar formats (e.g. 4:2:2, where channels are not the same size) and mosaicking… Perhaps also a mention of rowbytes (it’s common to use negative rowbytes to identify top-down as opposed to bottom-up scanline order).
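A rough sketch of the black/white/gamma mapping, assuming a plain power-law gamma (log curves and the transfer-function callback are left out). The `normalize` name and its parameters are illustrative only.

```python
import math


def normalize(code: float, black: float, white: float,
              gamma: float = 1.0, over_range: bool = False) -> float:
    """Map a stored code value to a 0..1 float using black and white points."""
    if white == black:
        raise ValueError("black and white points must differ")
    x = (code - black) / (white - black)
    if not over_range:
        x = min(max(x, 0.0), 1.0)  # clip values below black or above white
    # Power-law gamma; gamma 1.0 means linear. The sign is kept for sub-black values.
    return math.copysign(abs(x) ** gamma, x)


ten_bit_white = normalize(1023, black=0, white=1023)                  # 10-bit full range
video_black = normalize(16, black=16, white=235)                      # 8-bit 16_235 range
super_white = normalize(255, black=16, white=235, over_range=True)    # kept above 1.0
clipped_white = normalize(255, black=16, white=235)                   # clipped to 1.0
```

The same function covers 0-1023 and 16_235 by changing two numbers, which is why these belong in a parameter block rather than in the list of pixel formats.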

Ooops typo - too early :slight_smile:

pixel raw data type (storage type): unsigned_BYTE, unsigned_SHORT, signed_INT, signed_HALF_FLOAT, signed_FLOAT
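Putting the list together, a minimal sketch of such a pixel descriptor. All names are hypothetical; signedness is left out of the storage enum on purpose, and there is no colorspace field, per the point above that it does not belong here.

```python
from dataclasses import dataclass
from enum import Enum


class Storage(Enum):
    BYTE = "byte"
    SHORT = "short"
    INT = "int"
    HALF_FLOAT = "half_float"
    FLOAT = "float"


BYTES_PER_CHANNEL = {Storage.BYTE: 1, Storage.SHORT: 2, Storage.INT: 4,
                     Storage.HALF_FLOAT: 2, Storage.FLOAT: 4}


class Order(Enum):
    RGBA = "RGBA"
    BGRA = "BGRA"
    ABGR = "ABGR"


@dataclass(frozen=True)
class PixelFormat:
    channels: int
    storage: Storage
    order: Order = Order.RGBA

    def __post_init__(self) -> None:
        c = self.channels
        # Allowed counts: 1, 2, 3, 4, then powers of two (8, 16, ...).
        if c < 1 or (c > 4 and c & (c - 1)):
            raise ValueError("channels must be 1-4 or a power of two above 4")
        # ABGR is 4-channel only; with fewer channels it maps to BGRA.
        if self.order is Order.ABGR and c < 4:
            object.__setattr__(self, "order", Order.BGRA)

    @property
    def bytes_per_pixel(self) -> int:
        return self.channels * BYTES_PER_CHANNEL[self.storage]

    @property
    def channel_names(self) -> str:
        # 1 channel is implicit and more than 4 are just indexed, so no names.
        return self.order.value[:self.channels] if 2 <= self.channels <= 4 else ""


half_rgba = PixelFormat(4, Storage.HALF_FLOAT)
bgr8 = PixelFormat(3, Storage.BYTE, Order.ABGR)
```

Three small enumerations replace the long flattened list, and compressed or packed layouts would get their own separate list.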

I gave some more thought to the device UUID; at least for GPUs it’s really an OS responsibility. For example, on Windows 10, if you look at the Performance tab in Task Manager, it reports GPU 0, GPU 1, …; for example it reports the integrated graphics as 0 and my nVidia card as 1. This should be the reference for all compute/GPU APIs. It is the only way I can think of for a dynamic library inside a host application that does not directly share GPU images/buffers to work. So all OpenCL should do is report the OS index.