vkCreateDevice failed with ERROR_INITIALIZATION_FAILED on CentOS 7 Cluster

Hello everyone,

We’re trying to get Vulkan (and the SDK) running on our GPU cluster, which powers a multi-projector CAVE. However, we can’t get it to work even on a single node of the cluster.
The main issue is that every call to vkCreateDevice, no matter which application makes it, fails with ERROR_INITIALIZATION_FAILED. We tried all steps on two clusters with different hardware:

Hardware

  • Cluster 1: 2x Quadro P6000
  • Cluster 2: 1x GTX 780ti
  • Shared cluster filesystem, but the SDK was explicitly tested on the local filesystem of a single node with a regular monitor attached to one GPU.

Software

  • CentOS 7.8
  • Packages:
    • vulkan.x86_64 (1.1.97.0-1.el7)
    • vulkan-devel.x86_64 (1.1.97.0-1.el7)
    • vulkan-filesystem.noarch (1.1.97.0-1.el7)
    • gcc 4.8.5 (CentOS default, unloaded)
    • gcc 7.3.0 (loaded via module load gcc/7)
    • Nvidia Unix Driver 450.80.02 [tested with various other versions as well]
  • nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P6000        Off  | 00000000:3B:00.0 Off |                    0 |
| 26%   18C    P8     9W / 250W |     65MiB / 22916MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Quadro P6000        Off  | 00000000:86:00.0 Off |                    0 |
| 26%   24C    P8     9W / 250W |    177MiB / 22916MiB |      0%   E. Process |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     70106      G   /usr/bin/X                         62MiB |
|    1   N/A  N/A     70106      G   /usr/bin/X                         64MiB |
|    1   N/A  N/A     70164      G   /usr/bin/gnome-shell              109MiB |
+-----------------------------------------------------------------------------+
  • nvidia_icd.json
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}
  • Alternatively tried with
{
    "file_format_version" : "1.0.0",
    "ICD": {
        "library_path": "/lib64/libGLX_nvidia.so.0",
        "api_version" : "1.2.133"
    }
}

Issues with installed system libraries

  • Calling vulkaninfo works and produces the following output: [cannot link or upload file]
  • However, any other Vulkan application fails to create the device, e.g. a different vulkaninfo from the SDK:
    • ERROR at /tmp/vulkan/1.2.154.0/source/Vulkan-Tools/vulkaninfo/vulkaninfo.h:1515:vkCreateDevice failed with ERROR_INITIALIZATION_FAILED
    • Any other application also fails (vkcube, hologram, Unreal Engine)

Because the system libraries did not work, we tried it with the SDK and got the following issues:

Issues with vulkansdk source build (with and without system libraries installed)

  • Dependencies all install well
  • ./vulkansdk all runs through flawlessly (with gcc 7)
  • source setup-env.sh works and sets the correct paths
  • Pre-built /tmp/vulkan/1.2.154.0/x86_64/bin/vulkaninfo fails at vkCreateDevice [cannot link or upload file]
  • Source-built vulkaninfo fails exactly in the same way
  • Source and pre-built vkcube and API-Samples > 02 also fail.

Attempts to fix the issue and get more information:

  • vulkaninfo logs:

    • System package
      • [cannot link or upload file]
    • SDK
      • [cannot link or upload file]
    • SDK with validation layers turned on and setting an explicit icd path to the nvidia driver
      • export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump:VK_LAYER_KHRONOS_validation
      • VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json
      • [cannot link or upload file]
    • With the above and additionally enabled all output from the loader:
      • VK_LOADER_DEBUG=all
      • [cannot link or upload file]
  • Running vkvia for different configurations yields the following output, all of them failing at vkCreateDevice:

    • vkvia with SDK
      • [cannot link or upload file]
    • vkvia with validation layers turned on and setting an explicit icd path to the nvidia driver
      • export VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump:VK_LAYER_KHRONOS_validation
      • VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json
      • [cannot link or upload file]
    • vkvia with the above and additionally enabled all output from the loader:
      • VK_LOADER_DEBUG=all
      • [cannot link or upload file]
  • Additionally, we tried running strace -f to see if anything suspicious was happening there, but found nothing interesting for now.

    • strace -f vulkaninfo with system libraries
    • strace -f vulkaninfo with SDK libraries
    • strace -f vulkaninfo with SDK libraries and all debug output set
  • We also tried setting some other variables, but didn’t create logs for them because none of them changed anything:

    • __NV_PRIME_RENDER_OFFLOAD=1
    • __VK_LAYER_NV_optimus=NVIDIA_only
    • __GLX_VENDOR_LIBRARY_NAME=nvidia
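The debug settings listed above (validation and api_dump layers, explicit ICD path, full loader output) can be bundled into one small wrapper so every test run uses the same environment and writes to its own log. This is only a sketch: the function name run_with_vk_debug and the log file name are made up for illustration, and the ICD path is the one from this post.

```shell
#!/bin/sh
# Run a command with the Vulkan loader/layer debug settings from this post
# and capture stdout and stderr in a log file.
# Usage: run_with_vk_debug <logfile> <command> [args...]
run_with_vk_debug() {
    log=$1
    shift
    # env passes the variables to the child process only; the calling
    # shell's environment stays untouched.
    env VK_LOADER_DEBUG=all \
        VK_ICD_FILENAMES=/etc/vulkan/icd.d/nvidia_icd.json \
        VK_INSTANCE_LAYERS=VK_LAYER_LUNARG_api_dump:VK_LAYER_KHRONOS_validation \
        "$@" >"$log" 2>&1
}

# Only attempt the run if vulkaninfo is actually on the PATH.
if command -v vulkaninfo >/dev/null 2>&1; then
    run_with_vk_debug vulkaninfo-debug.log vulkaninfo \
        || echo "vulkaninfo failed, see vulkaninfo-debug.log"
fi
```

The same wrapper works for vkvia or vkcube by swapping the command at the end.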

We’re at our wits’ end here. Any input on what else we can try, or on what we might have missed in the logs or in the steps we took, would be greatly appreciated. We also created an issue on the LunarG Vulkan SDK issue tracker.

Unfortunately, I don’t seem to be able to upload or link to the logs or the issue here. The logs are attached to the issue on the LunarG Vulkan issue tracker.

Thank you all in advance for any help!
David Gilbert

It turns out that the Compute Mode was set to Exclusive_Process via nvidia-smi -c 3. Changing the Compute Mode back to Default (nvidia-smi -c 0) fixed the issue.
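For anyone hitting the same symptom, here is a rough sketch of how to list the GPUs that are not in Default compute mode. It assumes the compute_mode field of nvidia-smi's --query-gpu option and its comma-separated output; the helper name non_default_gpus is made up for illustration.

```shell
#!/bin/sh
# Print every GPU whose compute mode is not "Default".
# Reads CSV lines of the form "index, name, compute_mode" on stdin
# (assumed to be the layout nvidia-smi prints with --format=csv,noheader).
non_default_gpus() {
    awk -F', *' '$3 != "Default" { print "GPU " $1 " (" $2 "): " $3 }'
}

# Query the driver only if nvidia-smi is installed on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,compute_mode --format=csv,noheader \
        | non_default_gpus
fi
```

If this prints anything, nvidia-smi -c 0 (run as root) switches the GPUs back to Default, as described above. The nvidia-smi table in the original post already hints at the problem: both P6000 cards show "E. Process" in the Compute M. column.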

It would be awesome if anyone here could shed some light on why this happens, and it might be worth mentioning this issue in the driver docs somewhere.

Hmm, interesting. Thanks for following up with the resolution. According to the nvidia-smi docs, the compute mode resets to default when the machine reboots. That suggests to me that some init script (or systemd unit, etc.) sets it on your system. I don’t really have a good idea how to track this down, though. Perhaps run grep -r nvidia-smi /etc to see if anything there changes the mode?
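A slightly narrower version of that grep, as a sketch: it lists only files that call nvidia-smi with a compute-mode flag (-c or --compute-mode), and it takes the directories to search as arguments. The function name find_compute_mode_setters is made up, and the options assume GNU grep, which is what CentOS ships.

```shell
#!/bin/sh
# List files under the given directories that invoke nvidia-smi with a
# compute-mode flag, e.g. "nvidia-smi -c 3" or "nvidia-smi --compute-mode=3".
# Usage: find_compute_mode_setters <dir> [dir...]
find_compute_mode_setters() {
    # -r recurse, -I skip binary files, -l print file names only,
    # -s stay quiet about unreadable files, -E extended regex.
    grep -rIlsE -e 'nvidia-smi.*[[:space:]](-c|--compute-mode)' "$@"
}
```

Something like find_compute_mode_setters /etc /usr/lib/systemd/system would cover init scripts and systemd units in the usual CentOS 7 locations; site-specific script directories would need to be added by hand.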

Yeah, we talked to the cluster admin, and there was actually a script doing just that. It was left over from when these machines were part of a general GPU cluster used as a, well, compute cluster.

It fixed a few other issues we had as well, but I still don’t really understand why it was blocking simple Vulkan applications such as vulkaninfo.