Rebooting or killing jobs remotely after a crash?

NoChance · February 7, 2011, 4:00pm

Hi everyone!

I work at a research university and I’m beginning to learn OpenCL for a numerical computations research project…

The plan is to buy a desktop for the office to run computations…
The issue is that I’ll frequently need to work remotely, logging into the machine from home or from a different office.

The question is: what happens when my program crashes?

(1) If the GPU locks up, will I be able to log in remotely and kill the process, or just reboot the machine remotely?

(2) If the machine has 2 GPUs (for example, one on-die with the CPU), is it possible to run my computations on one GPU while the other GPU does “normal” GPU duties? My programs won’t be outputting to a monitor, just to data files. And in that case, if I’m actually at the desktop, can program crashes be handled more easily?

Any advice or suggestions welcome!

Thanks!

NoChance

chai · February 9, 2011, 12:38pm

ATI published a guide to using Stream (and therefore their OpenCL implementation) over ssh:
http://developer.amd.com/gpu_assets/App … motely.pdf

Chances are you won’t be able to just kill the process. In the lockups I’ve had, the host process (which runs on the CPU) could not pre-empt a kernel that was already executing on the GPU. A reboot would be a good option. You could also get remote power-up working, but sometimes you might need a hard reset. I haven’t had too much trouble with OpenCL lockups: usually some kind of error pops up and crashes the app before everything locks up. I have definitely played with this behavior on DirectX 11, though.
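For the remote case, one thing that may help is launching the job under a small watchdog so a hung run gets killed without anyone having to log in. The following is only a minimal sketch in Python using the standard library; `run_with_watchdog` and the timeout values are hypothetical names chosen for illustration, not part of any OpenCL tooling. As noted above, a process blocked inside the GPU driver may not respond even to a kill, so the sketch reports that case as "stuck", which is the point where a reboot is probably the only option.

```python
import subprocess
import sys


def run_with_watchdog(cmd, timeout_s, kill_grace_s=5):
    """Run cmd as a child process and kill it if it exceeds timeout_s seconds.

    Returns (status, returncode) where status is one of:
      "finished": the job exited on its own
      "killed":   the job timed out and was killed successfully
      "stuck":    the job timed out and did not die after being killed
                  (assumption: it is blocked in the driver; a reboot is likely needed)
    """
    proc = subprocess.Popen(cmd)
    try:
        returncode = proc.wait(timeout=timeout_s)
        return "finished", returncode
    except subprocess.TimeoutExpired:
        proc.kill()  # forceful kill of the host process only
        try:
            proc.wait(timeout=kill_grace_s)
        except subprocess.TimeoutExpired:
            return "stuck", None
        return "killed", proc.returncode


if __name__ == "__main__":
    # Stand-in for a hung compute job: a child process that just sleeps.
    hung_job = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_with_watchdog(hung_job, timeout_s=1))
```

Note that this only kills the host process. Whether the GPU itself recovers afterwards depends on the driver, so it is worth keeping a remote reboot path available as well.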

You CAN run on one GPU and let the other one do whatever it does, but it probably won’t help you at all with debugging a crash. It just means you have another available GPU.
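Picking the non-display GPU for computation could look roughly like the sketch below. It assumes the third-party `pyopencl` package is installed for device enumeration; `pick_compute_gpu` and the name-matching "hint" are hypothetical helpers made up for this example, and matching on device names is just one simple heuristic, not something the OpenCL runtime provides.

```python
def pick_compute_gpu(gpu_names, display_gpu_hint):
    """Return the index of the first GPU whose name does not contain the hint.

    gpu_names:        list of device name strings, in enumeration order
    display_gpu_hint: substring identifying the GPU that drives the monitor
                      (hypothetical convention used only in this sketch)
    Falls back to index 0 if every GPU matches the hint.
    """
    if not gpu_names:
        raise ValueError("no GPU devices found")
    hint = display_gpu_hint.lower()
    for index, name in enumerate(gpu_names):
        if hint not in name.lower():
            return index
    return 0  # only the display GPU is available


def list_gpu_names():
    """List GPU device names across all OpenCL platforms (requires pyopencl)."""
    import pyopencl as cl  # third-party; assumed to be installed

    names = []
    for platform in cl.get_platforms():
        for device in platform.get_devices(device_type=cl.device_type.GPU):
            names.append(device.name)
    return names
```

A typical use would be `pick_compute_gpu(list_gpu_names(), "integrated")`, with the hint adjusted to whatever the display GPU reports as its name on the machine in question.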