Frozen Computer

I am looking for starting points for a (I assume) timing problem. My Vulkan app runs for about an hour then the whole computer freezes and requires the power to be disconnected before the system will reboot.

The system has an Nvidia GeForce GTX 970 and an AMD Ryzen 9 3900X.

Any starting points welcome.

Hello there,

we need more information about the specific problem in order to help you:

  • What exactly do you think could be the cause of the problem? What do you mean by “timing problem”? A synchronisation issue?
  • What operating system are you using?
  • What exactly is your Vulkan application doing? Graphics or compute tasks?
  • Please post some code if you think it’s the cause of the problem.
  • Do you use Vulkan validation layers?
  • Have you tried debugging your application?
  • Are you using a graphics debugger like RenderDoc?
    https://www.saschawillems.de/blog/2016/05/28/tutorial-on-using-vulkans-vk_ext_debug_marker-with-renderdoc/
  • Which programming language are you using?

It’s quite impressive to me that your app is freezing your computer. Even if you have an application which fully loads CPU and GPU, I feel like this shouldn’t completely freeze your machine.

Are you sure it’s a Vulkan issue? Could the problem originate from other parts of the code? What other parts are there in your code?

My feeling is that the cause could be a memory management problem. On Windows, 64-bit applications are no longer restricted to 2 GB of RAM, so as far as I know they can allocate as much as they want. Does your memory consumption increase slowly over time? Use Task Manager to check.

In case you are using C++:

  • Do you use RAII?
  • Do you use standard library containers instead of manual memory management?
  • Do you use smart pointers?
  • Do you use “naked” new and delete anywhere in the code?
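To make the RAII and smart-pointer questions concrete, here is a minimal sketch of wrapping a create/destroy pair in a std::unique_ptr with a custom deleter. Resource, createResource, destroyResource and liveResources are hypothetical names invented for this illustration; they stand in for whatever handle type the library really owns.

```cpp
#include <memory>

// Hypothetical resource with an explicit create/destroy pair.
struct Resource { int id; };

static int liveResources = 0;  // counts outstanding resources (illustration only)

Resource* createResource(int id) { ++liveResources; return new Resource{id}; }
void destroyResource(Resource* r) { --liveResources; delete r; }

// Deleter so that unique_ptr calls destroyResource instead of plain delete.
struct ResourceDeleter {
    void operator()(Resource* r) const noexcept { destroyResource(r); }
};

using ResourcePtr = std::unique_ptr<Resource, ResourceDeleter>;

// Factory: ownership sits in the smart pointer from the moment of creation.
ResourcePtr makeResource(int id) { return ResourcePtr(createResource(id)); }
```

With this shape the destroy call runs automatically when the ResourcePtr goes out of scope, including on early returns and exceptions, so no naked delete is needed at the call site.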

best regards,
Johannes

Thanks for taking the time to reply Johannes.

The app is just a harness for a Vulkan library being developed for our civil engineering package. The crash is in graphics-only logic.

The timing theory is a bit of a guess, but it is based on a couple of things. The crash is a bit random, but I am generating about 400 fps, so I thought a resource problem would show up sooner. Also, I have just had to replace my old, rather slow Intel motherboard with a newer, much faster AMD board, and I did not see this problem before then.

I am running Windows 10 64bit.

It is hard to post code as I am unsure exactly where the crash is happening.

My general logic may be the cause. My renderer is split into tasks such as TINs, grid TINs, polylines, points, text, pipes, culverts etc. Each part builds its own command buffer and submits it, then flags a fence on completion. Then I wait on all the fences before presenting the frame buffer. This logic is not currently threaded, but that is the long-term plan. I am thinking that maybe I should build these as secondary command buffers and have a thread just to manage the submit.

Because this is not a game, the real application does not continually update the frame; it redraws only if something changes, as there is no animation.

The project being rendered is extremely large, using all of my graphics card’s 4 GB plus several GB of the 16 GB host memory.

I use all the standard validation layers and I have tried to use the new synchronization layer, but the documentation on that is misleading, saying to set VK_DEBUG_REPORT_ERROR_BIT_EXT with vkCreateDebugUtilsMessengerEXT, which you cannot do. So I put in the logic as best I could, but I don’t know if it is actually active. (Checking that is my next task.)

RenderDoc crashes when I try to load this project, perhaps because of the size of the data. It works with smaller test cases, but they do not trigger the computer freeze either.

Thanks for the link to Sascha’s tutorial, that is new to me.

I use Visual Studio with C++.

In general I use RAII logic and STL containers, no smart pointers, and new/delete almost always live inside wrapper classes.

I consume a lot of memory when the app first starts but it does not seem to increase after that.

Hi,
please forgive me for popping this spooky suggestion.
Over the years I have met a particular problem that shows up in high-level situations too. The latest was receiving two identical messages sent over my phone. In earlier programming of mine, I would notice this kind of doubling of output. My latest work is a visual free-hand 2D polygon editor. It works OK, unless I perform one of the demanding tasks as the first thing after opening the program (say, right-clicking an island that is to be taken out of a tree and the polygon recalculated). That will close my app. It could well be uninitialized pointers at play, or other errors. But at the time the windowing system shows a message about a time-out of the graphics driver, the console shows my debug message twice: “exits click()”. When I’ve clicked away the message from the windowing system and my program has shut down, the console is still open and shows “exits click()” only once.
Since this error happens at a high level too, one would have thought that it would have been fixed many years ago. I sometimes think that it could be foul, low-level hacker infiltration, but that’s partly a paranoid thought and partly out of my reach. If indeed my code fails (calls click() twice) and I get wrong debug output about it, I cannot fix it.
After reading the earlier posts about the same problem and following it online, I’ve got the impression that an Nvidia graphics driver seems to be the common denominator, but don’t hold me to it.

// edit:
I’m using an old PC too. Maybe it’s a hardware bug.
// more edits:
After reading an-intel-r-hd-graphics-4000-war-story I decided to delete all my glFinish() calls.
The first program run still had the shut-down bug, but none since then has. This is just light testing, but it seems to work.

Hi Carsten, I do think the reason we see more of this on Nvidia is just that their cards are very common, especially in the commercial world. I think a Windows time-out will always produce something unusual.

Hi Tonyo,
I hope you don’t feel that I am derailing your thread.
Your problem reminds me of a symptom from a long time ago. I had crazy long functions, and sometimes execution would enter them but not come back … it just froze.
Anyway … the glFinish() was not God’s gift to debuggers. I’ve started to debug by sending debug output to a file. The last 4 halts have happened at void functions without return … no other error types have appeared. I hope that this is the final problem … it looks promising.
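When debugging by writing to a file, it matters that each line leaves the program's own buffer before the next statement runs, otherwise the last entries before a halt are lost. Below is a minimal sketch of such a logger; CrashLog and the file name are invented for this example.

```cpp
#include <fstream>
#include <string>

// Minimal line logger that flushes after every message, so the last line
// written before a crash has already left the program's own buffer.
class CrashLog {
public:
    // Opens (and truncates) the log file at the given path.
    explicit CrashLog(const std::string& path) : out_(path) {}

    // Appends one line and flushes the stream immediately.
    void write(const std::string& message)
    {
        out_ << message << '\n';
        out_.flush();
    }

    // True while the underlying stream is usable.
    bool ok() const { return out_.good(); }

private:
    std::ofstream out_;
};
```

A call such as log.write("enter click()") at the top of a function and log.write("exits click()") at the bottom then brackets the halt. Note that flush() only hands the data to the operating system; if the whole machine freezes, as in the original post, the text may still not reach the disk without an OS-level sync.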