After a VK_ERROR_DEVICE_LOST, what options are there?

d.hubbard · April 15, 2019, 5:15pm

This is not quite a strong enough point to be worth opening a bug on GitHub, so I’m just noting it here.

From this blog post about VK_NV_device_diagnostic_checkpoints, vkGetQueueCheckpointDataNV can be used after VK_ERROR_DEVICE_LOST.

Here’s the relevant quote from the Vulkan spec:

If the device encounters an error during execution, the implementation will return a VK_ERROR_DEVICE_LOST error to the application at a certain point during host execution. When this happens, the application can call vkGetQueueCheckpointDataNV to retrieve information on the most recent diagnostic checkpoints that were executed by the device.

If I’m reading the spec correctly, this is sort of analogous to VK_EXT_debug_utils and VkDebugUtilsLabelEXT. But a debug label is only reported to the app in a VkDebugUtilsMessengerCallbackDataEXT, which is only sent to the app when a log message is generated.

How much interest is there in a new extension that lets an app request a VkDebugUtilsMessengerCallbackDataEXT from a VkQueue after VK_ERROR_DEVICE_LOST, the way NVIDIA devices can do now with a vendor extension?

d.hubbard · April 15, 2019, 7:21pm

It looks like VK_AMD_buffer_marker can do the same thing:

Record a unique 32-bit value at a unique offset in a host-visible buffer set aside for debug markers with vkCmdWriteBufferMarkerAMD
The host can inspect the buffer after receiving a VK_ERROR_DEVICE_LOST error.

Edit: Strike that, the spec for this extension specifically states it can be used after device loss.
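
To make the marker idea in the two steps above concrete, here is a minimal host-only sketch in C. It does not call into Vulkan at all: the struct stands in for the mapped host-visible buffer, and the marker writes that vkCmdWriteBufferMarkerAMD would perform on the GPU are simulated by a plain store. Every name in it (MarkerBuffer, MARKER_UNWRITTEN, simulate_marker_write, first_unwritten_marker) is hypothetical, and it assumes markers land in submission order, which a real pipelined device does not strictly guarantee.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names throughout; nothing here calls into Vulkan. */
#define MAX_MARKERS 64u
#define MARKER_UNWRITTEN 0u /* assumed sentinel: slot never written by the GPU */

/* Stand-in for the mapped, host-visible marker buffer. */
typedef struct {
    uint32_t slots[MAX_MARKERS];
    size_t count; /* number of slots in use, one per recorded command */
} MarkerBuffer;

/* Clear every slot before submission so stale values cannot be mistaken
 * for progress. */
void marker_buffer_reset(MarkerBuffer *mb, size_t count)
{
    assert(count <= MAX_MARKERS);
    mb->count = count;
    for (size_t i = 0; i < MAX_MARKERS; ++i) {
        mb->slots[i] = MARKER_UNWRITTEN;
    }
}

/* Simulates the device executing the marker write for command `index`.
 * In a real application this store would come from the GPU via
 * vkCmdWriteBufferMarkerAMD, not from host code. */
void simulate_marker_write(MarkerBuffer *mb, size_t index, uint32_t value)
{
    assert(index < mb->count);
    assert(value != MARKER_UNWRITTEN);
    mb->slots[index] = value;
}

/* Host-side inspection after device loss: returns the index of the first
 * slot that was never written, or mb->count if every marker landed. */
size_t first_unwritten_marker(const MarkerBuffer *mb)
{
    for (size_t i = 0; i < mb->count; ++i) {
        if (mb->slots[i] == MARKER_UNWRITTEN) {
            return i;
        }
    }
    return mb->count;
}
```

With four slots reserved and only slots 0 and 1 written before the simulated loss, first_unwritten_marker returns 2, which points at the third recorded command as the first one whose marker never landed. In a real application the same scan would run over the mapped buffer memory once a queue operation returns VK_ERROR_DEVICE_LOST.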

krOoze · April 16, 2019, 7:11pm

How much interest is there in a new extension that lets an app request a VkDebugUtilsMessengerCallbackDataEXT from a VkQueue after VK_ERROR_DEVICE_LOST, the way NVIDIA devices can do now with a vendor extension?

Layers used to report any VK_ERROR_*. That was considered bad, as Validation Layers are supposed to catch logical errors in the application (and be enabled only in debug builds), not runtime errors. I think that feature has been moved to (or at least considered for) the new “assistant” layer.

At this point I’m only hazarding a guess but it should be possible to prove the theory and see what happens.

Not sure what you are referring to. The extensions explicitly say they work even after VK_ERROR_DEVICE_LOST, as you quote. Where is the hazardous guess?

d.hubbard · April 16, 2019, 7:31pm

Are we talking about the same thing? VK_ERROR_DEVICE_LOST is obviously bad, and enabling validation layers can help, but they aren’t going to catch everything. Here are some extensions (not layers) that can help debug an issue that the layers didn’t catch:

Just for argument’s sake, I’ll propose a test:

Write an application that has no validation errors and uses descriptor dynamic indexing to pass an image array to the shader
Modify the shader so that it accesses beyond the bounds of the image array

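This should produce a VK_ERROR_DEVICE_LOST on many implementations (out-of-bounds descriptor indexing is undefined behavior, so the exact outcome is not guaranteed) without any errors from the validation layers. (Yes, this is exactly the case GPU-Assisted Validation is designed to catch, but ignore that for the sake of argument.)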

Are we talking about the same thing? Here is my entire post. I speculate about VK_AMD_buffer_marker: that extension (not layer) contains no guarantee of being available after VK_ERROR_DEVICE_LOST, but I speculate I could get it to work because it writes to a buffer.

krOoze · April 16, 2019, 8:28pm

Are we talking about the same thing?

Probably not. You suggested that VkDebugUtilsMessengerCallbackDataEXT (ergo Validation Layer message) should be reported. That is the part I responded to.

VK_AMD_buffer_marker: that extension (not layer) contains no guarantee of being available after VK_ERROR_DEVICE_LOST

Oh okay. I assumed you quoted the guarantees in your bullet points, rather than guesses.

The spec says:

The primary purpose of these markers is to facilitate the development of debugging tools for tracking which pipelined command contributed to device loss.

So apparently it is the extension’s raison d’être. That should be reiterated in the normative parts of the spec. I suggest you open an issue.

d.hubbard · April 16, 2019, 8:45pm

Ah, I overlooked that language in VK_AMD_buffer_marker.

Thanks! I’ve updated the post above.