马春杰杰

How do you add a MIG GPU to an LXD container?

1. First, enable MIG

sudo nvidia-smi -i 0 -mig 1 # GPU 0
sudo nvidia-smi -i 1 -mig 1 # GPU 1
sudo nvidia-smi -r # reset the GPUs so the change takes effect

If the reset does not work, reboot the machine.
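Whether a reset or reboot is still needed can be checked by comparing the current and pending MIG mode. The sketch below is only an illustration: mig_mode_pending is a hypothetical helper name, and it assumes the CSV rows produced by the nvidia-smi query fields index, mig.mode.current and mig.mode.pending.

```shell
# mig_mode_pending: reads "index, current, pending" CSV rows on stdin and
# prints the GPUs whose pending MIG mode differs from the current one,
# i.e. the ones that still need a reset or reboot.
mig_mode_pending() {
  awk -F', *' '$2 != $3 { print "GPU " $1 ": current=" $2 ", pending=" $3 }'
}

# Typical use on a machine with GPUs (not run here):
#   nvidia-smi --query-gpu=index,mig.mode.current,mig.mode.pending \
#     --format=csv,noheader | mig_mode_pending
```

If the helper prints nothing, the requested mode is already active on every GPU.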

2. Next, decide how to partition the GPU. The split is not arbitrary; only a fixed set of profiles is available:

nvidia-smi mig -lgip

root@ubuntu:~# nvidia-smi mig -lgip
+-------------------------------------------------------------------------------+
| GPU instance profiles:                                                        |
| GPU   Name               ID    Instances   Memory     P2P    SM    DEC   ENC  |
|                                Free/Total   GiB              CE    JPEG  OFA  |
|===============================================================================|
|   0  MIG 1g.5gb          19     7/7        4.75       No     14     0     0   |
|                                                               1     0     0   |
+-------------------------------------------------------------------------------+
|   0  MIG 1g.5gb+me       20     1/1        4.75       No     14     1     0   |
|                                                               1     1     1   |
+-------------------------------------------------------------------------------+
|   0  MIG 1g.10gb         15     4/4        9.75       No     14     1     0   |
|                                                               1     0     0   |
+-------------------------------------------------------------------------------+
|   0  MIG 2g.10gb         14     3/3        9.75       No     28     1     0   |
|                                                               2     0     0   |
+-------------------------------------------------------------------------------+
|   0  MIG 3g.20gb          9     2/2        19.62      No     42     2     0   |
|                                                               3     0     0   |
+-------------------------------------------------------------------------------+
|   0  MIG 4g.20gb          5     1/1        19.62      No     56     2     0   |
|                                                               4     0     0   |
+-------------------------------------------------------------------------------+
|   0  MIG 7g.40gb          0     1/1        39.38      No     98     5     0   |
|                                                               7     1     1   |
+-------------------------------------------------------------------------------+
|   1  MIG 1g.5gb          19     7/7        4.75       No     14     0     0   |
|                                                               1     0     0   |
+-------------------------------------------------------------------------------+
|   1  MIG 1g.5gb+me       20     1/1        4.75       No     14     1     0   |
|                                                               1     1     1   |
+-------------------------------------------------------------------------------+
|   1  MIG 1g.10gb         15     4/4        9.75       No     14     1     0   |
|                                                               1     0     0   |
+-------------------------------------------------------------------------------+
|   1  MIG 2g.10gb         14     3/3        9.75       No     28     1     0   |
|                                                               2     0     0   |
+-------------------------------------------------------------------------------+
|   1  MIG 3g.20gb          9     2/2        19.62      No     42     2     0   |
|                                                               3     0     0   |
+-------------------------------------------------------------------------------+
|   1  MIG 4g.20gb          5     1/1        19.62      No     56     2     0   |
|                                                               4     0     0   |
+-------------------------------------------------------------------------------+
|   1  MIG 7g.40gb          0     1/1        39.38      No     98     5     0   |
|                                                               7     1     1   |
+-------------------------------------------------------------------------------+

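This example uses the MIG 1g.10gb profile (ID 15 in the table above), which splits each GPU into 4 instances with about 10 GB of memory each (9.75 GiB per the table):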

sudo nvidia-smi mig -i 0 -cgi 15,15,15,15 -C
sudo nvidia-smi mig -i 1 -cgi 15,15,15,15 -C
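
Reading the profile table by eye works, but the same information can be pulled out with awk. The sketch below is only an illustration: list_profiles and cgi_list are hypothetical helper names, and the parsing assumes the -lgip table layout shown above (GPU index, profile name, profile ID, free/total instances).

```shell
# list_profiles: reads `nvidia-smi mig -lgip` output on stdin and prints
# "<gpu> <profile id> <name> <free> <total>" for each profile row.
list_profiles() {
  awk '$1 == "|" && $3 == "MIG" {
    split($6, inst, "/")            # "4/4" -> free, total
    print $2, $5, $4, inst[1], inst[2]
  }'
}

# cgi_list: repeats a profile ID <count> times, comma-separated,
# which is the form passed to -cgi in the commands above.
cgi_list() {
  awk -v id="$1" -v n="$2" 'BEGIN {
    for (i = 0; i < n; i++) { s = s sep id; sep = "," }
    print s
  }'
}
```

On a machine with GPUs, nvidia-smi mig -lgip | list_profiles lists the candidates, and sudo nvidia-smi mig -i 0 -cgi "$(cgi_list 15 4)" -C is equivalent to the first command above.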

Then verify with nvidia-smi -L:

(base) mcj@dell-PowerEdge-T640:~$ nvidia-smi -L
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-d13056a0-cfd8-5e3c-4eeb-33ca57386459)
  MIG 1g.10gb     Device  0: (UUID: MIG-0a8dad45-3186-5221-ae18-52e54ffee0ac)
  MIG 1g.10gb     Device  1: (UUID: MIG-79900084-ae1b-5c15-9918-3ded20a37e2e)
  MIG 1g.10gb     Device  2: (UUID: MIG-5e6f7df1-3da8-5314-bff5-1db7b6411e22)
  MIG 1g.10gb     Device  3: (UUID: MIG-fda37763-7efa-5567-95c3-dccb40fd17ca)
GPU 1: NVIDIA A100-PCIE-40GB (UUID: GPU-6ecfad43-f4ab-47df-e591-aa3076d8f8c6)
  MIG 1g.10gb     Device  0: (UUID: MIG-c132ee1e-8747-52a2-b34d-eeda899be9b1)
  MIG 1g.10gb     Device  1: (UUID: MIG-5eb02d73-0cc5-514b-bef4-0fb6f5f2d0d3)
  MIG 1g.10gb     Device  2: (UUID: MIG-952e3f05-64e1-5cf6-afb6-819955b3695e)
  MIG 1g.10gb     Device  3: (UUID: MIG-74c4eef8-ed1a-5921-a2d4-f9955e5736ac)
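
The MIG UUIDs are what the LXD step needs, so it can help to extract them together with the index of the parent GPU. This is a minimal sketch assuming the nvidia-smi -L layout shown above; list_mig_uuids is a hypothetical helper name.

```shell
# list_mig_uuids: reads `nvidia-smi -L` output on stdin and prints
# "<gpu index> <MIG UUID>" for every MIG device.
list_mig_uuids() {
  awk '
    /^GPU [0-9]+:/ { gpu = $2; sub(/:$/, "", gpu); next }   # remember parent GPU
    /^[[:space:]]+MIG / {
      if (match($0, /MIG-[0-9a-f-]+/))                       # the UUID itself
        print gpu, substr($0, RSTART, RLENGTH)
    }
  '
}

# Typical use on a machine with GPUs (not run here):
#   nvidia-smi -L | list_mig_uuids
```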

3. Finally, assign the MIG instances to LXD containers

lxc config device add t3 mig0 gpu \
gputype=mig \
mig.uuid=MIG-0a8dad45-3186-5221-ae18-52e54ffee0ac \
pci=0000:3b:00.0

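The pci value can be looked up with: nvidia-smi --query-gpu=pci.bus_id --format=csv

Pick the entry for the GPU that hosts the MIG instance. Note that nvidia-smi prints the bus ID with an 8-digit domain in upper case (for example 00000000:3B:00.0), while the command above uses the shorter lower-case form 0000:3b:00.0.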
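
The bus ID conversion and the device command can be scripted. The sketch below is an illustration under one assumption: that LXD wants the lower-case, 4-digit-domain form used in the command above (the only form this post shows). to_lxd_pci and mig_device_cmd are hypothetical helper names, and mig_device_cmd only prints the command rather than running it.

```shell
# to_lxd_pci: converts a bus ID as printed by nvidia-smi
# (e.g. 00000000:3B:00.0) to the form used above (0000:3b:00.0).
to_lxd_pci() {
  printf '%s\n' "$1" \
    | tr '[:upper:]' '[:lower:]' \
    | sed -E 's/^[0-9a-f]{4}([0-9a-f]{4}:)/\1/'   # drop the 4 extra domain digits
}

# mig_device_cmd <container> <device name> <MIG UUID> <bus id>
# Prints the lxc command for review; pipe it to sh to execute it.
mig_device_cmd() {
  printf 'lxc config device add %s %s gpu gputype=mig mig.uuid=%s pci=%s\n' \
    "$1" "$2" "$3" "$(to_lxd_pci "$4")"
}
```

For example, mig_device_cmd t3 mig0 MIG-0a8dad45-3186-5221-ae18-52e54ffee0ac 00000000:3B:00.0 prints the same command as the one shown in this step, on a single line.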

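4. Disable MIG

Delete the existing MIG instances

MIG has a two-level structure: Compute Instances (CI) live inside GPU Instances (GI). They must be deleted in that order: CIs first, then GIs.

List the existing GIs and CIs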

nvidia-smi mig -lgi # list GPU Instances

nvidia-smi mig -lci # list Compute Instances

How to delete them (GPU 0 as an example):

# Delete all CIs on GPU 0

sudo nvidia-smi mig -i 0 -dci -gi ALL -C

# Delete all GIs on GPU 0

sudo nvidia-smi mig -i 0 -dgi ALL -C

Do the same for GPU 1:

sudo nvidia-smi mig -i 1 -dci -gi ALL -C

sudo nvidia-smi mig -i 1 -dgi ALL -C
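
To make the CI-before-GI order explicit, the teardown can first be printed as a dry run. This is a sketch under assumptions: it expects the -lgi table layout described in NVIDIA's MIG user guide (GPU, name, profile ID, instance ID, placement), mig_teardown_plan is a hypothetical helper name, and the printed commands select instances with -gi only. They leave out the -C flag used above, on the assumption that -C belongs to instance creation; check nvidia-smi mig -h on your driver before relying on either form.

```shell
# mig_teardown_plan: reads `nvidia-smi mig -lgi` output on stdin and prints,
# for each GPU instance, the command that destroys its compute instances
# followed by the command that destroys the GPU instance itself.
mig_teardown_plan() {
  awk '$1 == "|" && $3 == "MIG" {
    # $2 = GPU index, $5 = profile ID, $6 = GPU instance ID
    printf "sudo nvidia-smi mig -i %s -dci -gi %s\n", $2, $6
    printf "sudo nvidia-smi mig -i %s -dgi -gi %s\n", $2, $6
  }'
}

# Typical use on a machine with GPUs (not run here):
#   nvidia-smi mig -lgi | mig_teardown_plan        # review the plan
#   nvidia-smi mig -lgi | mig_teardown_plan | sh   # then execute it
```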

Disable MIG mode

Once all instances are gone, MIG mode can be turned off:

sudo nvidia-smi -i 0 -mig 0
sudo nvidia-smi -i 1 -mig 0

Reset the GPUs so the change takes effect

sudo nvidia-smi -r

After that, nvidia-smi -L should show whole cards again (for example, A100-SXM4-40GB), with no MIG sub-devices.
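
That check can be reduced to a single number. The helper below is a small sketch that assumes MIG devices appear in nvidia-smi -L as indented lines starting with "MIG", as in the listing from step 2; count_mig_devices is a hypothetical name.

```shell
# count_mig_devices: reads `nvidia-smi -L` output on stdin and prints how
# many MIG sub-devices it lists (0 means only whole cards remain).
count_mig_devices() {
  # grep -c exits non-zero when the count is 0, so mask the status
  grep -c '^[[:space:]]*MIG ' || true
}

# Typical use on a machine with GPUs (not run here):
#   nvidia-smi -L | count_mig_devices
```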

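Some GPUs do not accept the ALL form. In that case, the script below is the better option for disabling MIG in one go (paste it straight into a terminal):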

cat > clear_mig_all.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Count the GPUs (|| true keeps set -e from aborting when grep finds none)
gpu_count=$(nvidia-smi -L | grep -c '^GPU ' || true)
if [[ "$gpu_count" -eq 0 ]]; then
  echo "No GPUs found."
  exit 0
fi

for g in $(seq 0 $((gpu_count-1))); do
  echo ">>> GPU $g: deleting CIs / GIs"
  # Collect the GI instance IDs on this GPU from the table printed by nvidia-smi mig -lgi.
  # Each data row ends with: <profile ID> <instance ID> <start:size> |
  # so the instance ID is $(NF-2); $(NF-3) would be the profile ID.
  gi_ids=$(nvidia-smi mig -lgi -i "$g" | awk '/^\|[[:space:]]*'"$g"'[[:space:]]/ {print $(NF-2)}' || true)

  # Per GI: delete its CIs first, then the GI (some drivers reject ALL, so go one by one)
  for gi in $gi_ids; do
    sudo nvidia-smi mig -i "$g" -dci -gi "$gi" -C >/dev/null 2>&1 || true
    sudo nvidia-smi mig -i "$g" -dgi -gi "$gi" -C >/dev/null 2>&1 || true
  done

  # Disable MIG mode (harmless if it is already off)
  echo ">>> GPU $g: disabling MIG mode"
  sudo nvidia-smi -i "$g" -mig 0 >/dev/null 2>&1 || true
done

echo ">>> Resetting all GPUs to apply the change"
sudo nvidia-smi -r

echo "Done. nvidia-smi -L should now list only whole-card devices."
EOF

chmod +x clear_mig_all.sh
./clear_mig_all.sh

Output:

>>> GPU 0: deleting CIs / GIs
>>> GPU 0: disabling MIG mode
>>> GPU 1: deleting CIs / GIs
>>> GPU 1: disabling MIG mode
>>> Resetting all GPUs to apply the change
The following GPUs could not be reset:
  GPU 00000000:3B:00.0: In use by another client
  GPU 00000000:D9:00.0: In use by another client

2 devices are currently being used by one or more other processes (e.g., Fabric Manager, CUDA application, graphics application such as an X server, or a monitoring application such as another instance of nvidia-smi). Please first kill all processes using these devices and all compute applications running in the system.
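
When the reset is refused like this, the first step is to see which GPUs are affected and what is holding them. The helper below is a minimal sketch that assumes the wording of the message above; busy_gpus is a hypothetical name.

```shell
# busy_gpus: reads the output of `nvidia-smi -r` on stdin and prints the
# PCI bus ID of every GPU reported as "In use by another client".
busy_gpus() {
  awk '/In use by another client/ {
    sub(/:$/, "", $2)   # strip the colon that follows the bus ID
    print $2
  }'
}

# Typical use on a machine with GPUs (not run here):
#   sudo nvidia-smi -r 2>&1 | busy_gpus
```

A command such as sudo fuser -v /dev/nvidia* can then show which processes hold the device nodes. Stopping those processes (possibly including containers that still have the GPU attached) and rerunning sudo nvidia-smi -r is the likely next step; rebooting, as suggested in step 1, is the fallback.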