Reference: What to do when the GPU cannot be used in Docker (Failed to initialize NVML: Unknown Error). The fix follows the article cited within that post, reproduced here as Appendix 1: SOLVED Docker with GPU: "Failed to initialize NVML: Unknown Error".

Prerequisites: you need to be on the server's docker admin list (i.e. allowed to run docker there); you do not need full admin rights on the server itself. When I created my container I asked the administrator to add me to the docker list, so if you are able to create containers you already meet this condition.

Problem: nvidia-smi works fine on the host, but inside the container it fails with the error in the title.

Fix: apply the method from the referenced article, with a few differences described below.
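Before applying anything, it is worth confirming you are hitting this exact symptom (the container ID below is a placeholder for your own):

# On the host: the driver works
nvidia-smi

# Inside the container: this is where it fails
docker exec -it <container_id> nvidia-smi
# Failed to initialize NVML: Unknown Error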

My container did not have /etc/nvidia-container-runtime/config.toml, so I created one myself. Note that creating this file requires the admin (sudo) password inside the container, not the admin password for the docker command on the host server.

# Inside the container

cd /etc/nvidia-container-runtime/
sudo touch config.toml
sudo vim config.toml

# Paste in the config.toml content below
# Press ESC, then type :wq to save and quit

The content of config.toml was copied from the host server; it is reproduced below:

disable-require = false
#swarm-resource = "DOCKER_RESOURCE_GPU"
#accept-nvidia-visible-devices-envvar-when-unprivileged = true
#accept-nvidia-visible-devices-as-volume-mounts = false

[nvidia-container-cli]
#root = "/run/nvidia/driver"
#path = "/usr/bin/nvidia-container-cli"
environment = []
#debug = "/var/log/nvidia-container-toolkit.log"
#ldcache = "/etc/ld.so.cache"
load-kmods = true
#no-cgroups = false
#user = "root:video"
ldconfig = "@/sbin/ldconfig.real"

[nvidia-container-runtime]
#debug = "/var/log/nvidia-container-runtime.log"
log-level = "info"

# Specify the runtimes to consider. This list is processed in order and the PATH
# searched for matching executables unless the entry is an absolute path.
runtimes = [
    "docker-runc",
    "runc",
]

mode = "auto"

[nvidia-container-runtime.modes.csv]
mount-spec-path = "/etc/nvidia-container-runtime/host-files-for-container.d"
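If you would rather not paste the file by hand, copying the host's file into the container should give the same result (a sketch only, assuming the host keeps the file at the same path; <container_id> is a placeholder for your own container):

# On the host: copy the host's config into the container
docker cp /etc/nvidia-container-runtime/config.toml <container_id>:/etc/nvidia-container-runtime/config.toml

# Check that the file landed inside the container
docker exec <container_id> cat /etc/nvidia-container-runtime/config.toml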

There is no need to restart the docker daemon; restarting the container is enough, and that only requires being on the server's docker admin list. The linked article restarts docker with sudo systemctl restart docker, which needs server admin rights, a much higher privilege level, and I am only on the docker list. I first ran sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi (update 1.18: I did not even need this step; if the problem comes back I will first try just restarting my container). Then I restarted my container from the host: I used docker ps -a to look up my container_id (36e1b3a9c2af), docker stop to stop the container, and docker start to start it again.
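The full restart sequence on the host was roughly the following (36e1b3a9c2af is the ID from my setup; substitute your own):

# Find the container ID
docker ps -a

# Stop and start the container
docker stop 36e1b3a9c2af
docker start 36e1b3a9c2af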

Then it worked.
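To double-check the fix, nvidia-smi inside the restarted container should now succeed:

docker exec -it 36e1b3a9c2af nvidia-smi
# should now print the usual GPU table instead of the NVML error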

Appendix 1: I've bumped into the same issue after a recent update of nvidia-related packages. Fortunately, I managed to fix it.

Method 1, recommended

Kernel parameter
The easiest way to ensure the presence of the systemd.unified_cgroup_hierarchy=false param is to check /proc/cmdline:

cat /proc/cmdline

This is of course tied to how your boot loader passes kernel parameters. You can also hijack this file to set the parameter at runtime: https://wiki.archlinux.org/title/Kernel_parameters#Hijacking_cmdline

nvidia-container configuration
In the file /etc/nvidia-container-runtime/config.toml set the parameter no-cgroups = false. After that, restart docker and run a test container:

sudo systemctl restart docker

sudo docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
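A compact way to run the check and the config change above from a shell (a sketch only; the sed pattern assumes the no-cgroups line looks like it does in the config reproduced earlier, so review the file before and after):

# Confirm the kernel parameter is present on the kernel command line
grep -o "systemd.unified_cgroup_hierarchy=false" /proc/cmdline

# Force no-cgroups = false (also uncomments the line if it is commented out)
sudo sed -i 's/^#\?no-cgroups *=.*/no-cgroups = false/' /etc/nvidia-container-runtime/config.toml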

Method 2
Actually, you can try to bypass cgroups v2 by setting (in the file mentioned above) no-cgroups = true. Then you must manually pass all GPU devices to the container; check this answer for the list of required mounts: https://github.com/NVIDIA/nvidia-docker/issues/1447#issuecomment-851039827 (a sketch of the device flags follows the debug command below). For debugging purposes, just run:

sudo systemctl restart docker

sudo docker run --rm --gpus all --privileged -v /dev:/dev nvidia/cuda:11.0-base nvidia-smi
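For reference, the manual device passing that Method 2 needs usually looks something like the following (a sketch only; the authoritative list of devices and mounts is in the GitHub comment linked above, and the device names can differ per system):

sudo docker run --rm --gpus all \
  --device /dev/nvidia0 \
  --device /dev/nvidiactl \
  --device /dev/nvidia-modeset \
  --device /dev/nvidia-uvm \
  --device /dev/nvidia-uvm-tools \
  nvidia/cuda:11.0-base nvidia-smi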

Good luck. (szalinski, last edited 2021-06-04 23:41:06)
