long email with multiple topics
To better abstract the problem of X-API, as a hypothesis imagine first a host application that is CPU-based (passes images via RAM) yet implements compute-intensive nodes/operators on the GPU. These nodes are dynamic libraries. A user's system has 2 GPUs (for simplicity the same model - and then, what counts as the same model: are the RTX 2070 and 2080 the same model?), and during a render, for example (which could be in the same process space or not), it alternates which GPU renders which frame (often these systems actually chunk in segments of, say, 10 frames instead of purely alternating - and some actually have branches breaking up the compute even in an interactive session on a complex project). If you are a third party to that application, and that application is currently running on an nVidia card whose driver library might support DirectX, OpenCL, OpenGL, CUDA and Vulkan… chances are that the plugin library and the host application don't speak the same API. Right now there is no service from the device driver vendor to report, cross-API, which GPU is which, so that tasks can be executed on some agreed GPU.
For GPU UUID/LUID, which is better: PCI-bus-level addressing or not? Does PCI-bus-level addressing cover an eGPU on Thunderbolt 3 (and with 2 GPUs inside the eGPU), or an embedded GPU, or dual GPUs sharing one PCI connection, or… I think in nVidia's case they even have an environment variable to switch from their fastest-first heuristic table to PCI order, and I read that in Vulkan the enumeration order is upside down compared to CUDA… We need one way that is universal and GPU-API agnostic (and eventually non-GPU compute devices should be reported too).
There is a small set of things that, if done at the API level, implies an N-to-N mapping; in a perfect world it would be N-to-1. Sometimes many APIs are better than trying to fit everything into one. The slide you refer to is not very clear. You do need to maintain at the API level the semantics associated with each type - e.g. an image/texture uses the hardware interpolator. There are data types other than images that one might want to share on the GPU (e.g. audio, point data streams - a list of something). I see there is now Shared Virtual Memory in the specs - still getting up to speed on that.
With regards to interop/sharing, I am all for images and buffers being mapped outside of an API per se - in a way like Apple creating an image texture and buffer from a CV surface (Metal and OpenGL today). It's how it works, for example, in Apple's Pro Video apps today. And it's traditionally how people have used OpenGL to mix and match OpenCL, CUDA and GL processing with shaders.
The only thing I ever read is usually the reference cards. It would still be a good idea to create a reference spec sheet addendum documenting that the existing OpenCL/specs/3.0-unified/html/OpenCL_Ext.html#cl_khr_gl_sharing is still in the 3.0 doc. We don't want this to be a per-vendor EXT.
I heard at the Zoom panel the mention of the word "module" - is this the terminology for a set of values and/or API calls under a color-coded theme in the reference card? Some other APIs call these suites. I haven't looked at 3.0 in much detail yet, but the reference card is not clear to me as to what is optional and what is not. Shouldn't there be, for each of these modules, a top-level query - a moduleSupported("moduleString", version) sort of thing? Retrospectively, if you look at how DirectX interop/sharing evolved, it's kind of bad. It should just be "Direct3D", 12 … which gives you the correct module pointer, as well as - for older versions of apps vs. the dynamic library - a simple way to regress until one version is a match.
In general OpenCL should decouple itself from Vulkan, as opposed to trying to be like it (except for those who are paid by lines of code). I think the three classes of tasks where a compute API like OpenCL is historically used are: video processing (loosely defined as more than streaming a video for playback), video analysis (as a general term that includes all variants of machine learning), and things like crypto-mining. For the latter, the task is well defined, and whether a compute device qualifies is based on whether it provides the right answer or not. Those are classic compute tasks, and trying to do them as a GLSL shader in Vulkan is a bad idea.
Also at the panel there was a discussion about OpenCL over Metal. I am not sure about all these layers - interceptors, converters - particularly if they pass through GLSL like in MoltenVK. It seems right now Vulkan should just render OpenGL, with OpenCL interop built from that as a first pass, if a host is natively Vulkan today. I am not aware of a commercial professional product running natively on Vulkan and shipping yet. On Mac it would be a bit silly if one actually tried to use a converter chain like OpenCL to Vulkan GLSL to Metal shaders at the end. What could go wrong? Comment: I am sure someone at nVidia has written a CUDA-over-Metal library; it's not clear nVidia would sell CUDA software-only the way Apple sells their OS for Hackintoshes, or Microsoft does for VMs like Parallels on a Mac… The point is that in such a case there is nothing graphics-rendering about the usage. The only graphics-native thing used in video processing I can think of is image meshing (i.e. warping), which is still faster in a render scheme than via a compute API. So instead of worrying about graphics-library interop, another approach would be to have a hook for a client graphics library and pass it a small set of abstracted calls (not that many are needed). Sort of the reverse of a compute barrier in a GPU API, for processing that is not a graphics rendering pipeline per se.
There might be some bloating in the specs right now, similar to the Vulkan specs - for example, who came up with CL_UNORM_INT_101010_2? I know this nonsense started with V2. There is a flattening of pixel component byte order and some colorspaces, e.g. sRGB, into one long list. These should be separate. sRGB(…) should be struck out of the image formats - it's a colorspace. If you want to predefine some colorspaces, fine, but don't do it at the pixel component/data type level.
I understand you want edge computing to support OpenCL too, and those devices won't be supporting OpenGL (or Vulkan). Right now there is a bit of a dual-language situation: OpenCL 1.2 C (the lowest common denominator across Objective-C, C#, C and C++) as the compliance baseline, versus OpenCL 3.0 features as optional.
// Getting into OT space here:
Try a Google search for "Mixed OpenGL - Metal" - I'm having issues with the web interface not liking links.
Imagine if we could use the AMD/Intel embedded graphics to render the GUI and still have a viewport rendering some compute-intensive 3D thing… I understand a driver owns a screen right now, but on Windows you can set an application to power-savings mode and it will run on the embedded graphics card, so this sort of driver synchronization might be possible and useful, even if one would need to implement it as IPC or something. I remember when I first started to support OpenCL, I got tired of changing graphics cards, so I figured I could install the AMD driver and then the NVIDIA driver, and if there was a monitor connected to an AMD card and one to an NVIDIA card, I could test both in parallel.
BTW, and OT: I discovered this while googling after the panel - google: "ISPC".
I don't think we are a candidate for Intel-style high-level parallel_for single source, but I am going to try this toolkit. We do have a layer of templated inlines calling macros - basically different code paths - and there is an SSE/AVX crack in the house, but it's not me: I can't write intrinsics code from scratch without a cheat sheet in front of me, and at about one word per minute… Being able to go from OpenCL to vectorized CPU code and back, and to test the translation, sounds perfect to me.