马春杰杰

Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

Recently, one of our four-GPU servers has frequently been hitting the following error during use:

Unable to determine the device handle for GPU 0000:05:00.0: GPU is lost. Reboot the system to recover this GPU

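From what I have found, this usually shows up after running with a batch size that is too large: once the batch size is reduced and the job is rerun, the error appears.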

Solution:

My preliminary diagnosis is that the GPUs are being started and stopped too frequently, so I enabled persistence mode:

sudo nvidia-smi -pm 1

So far it has tested fine.

sipl@sipl:~$ sudo nvidia-smi -pm 1
[sudo] password for sipl:
Enabled persistence mode for GPU 00000000:05:00.0.
Enabled persistence mode for GPU 00000000:06:00.0.
Enabled persistence mode for GPU 00000000:09:00.0.
Enabled persistence mode for GPU 00000000:0A:00.0.
All done.
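
To double-check that the setting took effect on every card, the persistence state can be queried back from the driver. The following is only a rough sketch: the helper name count_non_persistent is made up for illustration, and it assumes the CSV layout "<pci.bus_id>, <persistence_mode>" that nvidia-smi prints for this query.

```shell
#!/bin/sh
# Hypothetical helper (not part of nvidia-smi): counts GPUs whose
# persistence mode is anything other than "Enabled".
# Reads CSV lines of the form "<pci.bus_id>, <persistence_mode>" on stdin.
count_non_persistent() {
    awk -F', *' '$2 != "Enabled" { n++ } END { print n + 0 }'
}

# Only query the driver when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=pci.bus_id,persistence_mode --format=csv,noheader \
        | count_non_persistent
fi
```

If the command prints 0, all GPUs report persistence mode as enabled; any other number is the count of GPUs that still need it.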

This setting is lost after every reboot, so I recommend adding the command to /etc/rc.local so that it is applied automatically at boot.
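
A minimal sketch of what /etc/rc.local might look like for this purpose. It assumes the distribution still runs /etc/rc.local at boot and that the file is executable (for example via sudo chmod +x /etc/rc.local); the guard around nvidia-smi is my own addition, not something the setup above requires.

```shell
#!/bin/sh
# /etc/rc.local (sketch): executed as root near the end of boot,
# so no sudo is needed here.

# Enable persistence mode on all GPUs, but only when the NVIDIA
# tools are present, so a machine without the driver skips this step.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi -pm 1
fi

exit 0
```

After the next reboot, running nvidia-smi should show persistence mode as On for each GPU if the script was picked up.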