使用 Linux vfio 将 Nvidia GPU 透传给 QEMU 虚拟机

Linux 上虚拟机 GPU 透传需要使用 vfio 的方式。主要是因为在 vfio 方式下对虚拟设备的权限和 DMA 隔离上做的更好。但是这么做也有个缺点，这个物理设备在主机和其他虚拟机都不能使用了。

qemu 直接使用物理设备本身命令行是很简单的，关键在于事先在主机上对系统、内核和物理设备的一些配置。

单纯从 qemu 的命令行来看，其实和普通虚拟机启动就差了最后那个 -device 的选项。这个选项也比较容易理解，就是把主机上的设备 0000:00:01.0 传给了虚拟机使用。

$ qemu-system-x86_64 -m 4096 -smp 4 --enable-kvm \
  -drive file=~/guest/fedora.img \
  -device vfio-pci,host=0000:00:01.0

系统及硬件准备

BIOS 中打开 IOMMU

设备直通在 x86 平台上需要打开 iommu 功能。这是 Intel 虚拟技术 VT-d(Virtualization Technology for Device IO) 中的一个部分。有时候这部分的功能没有被打开。

打开的方式在 BIOS 设置中 Security->Virtualization->VT-d 这个位置。当然不同的 BIOS 位置可能会略有不同。记得在使用直通设备前要将这个选项打开。

内核配置勾选 IOMMU

INTEL_IOMMU
│ Location: │
│ -> Device Drivers │
│ (2) -> IOMMU Hardware Support (IOMMU_SUPPORT [=y])

内核启动参数 enable IOMMU

BIOS 中打开，内核编译选项勾选还不够。还需要在引导程序中添加上内核启动参数

# 对应编辑 /etc/default/grub, 设置 GRUB_CMDLINE_LINUX=
$ cat /etc/default/grub
...
GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt vfio_iommu_type1.allow_unsafe_interrupts=1 rdblacklist=nouveau nouveau.modeset=0"
...
 
# 重新生成 grub 引导配置文件
$ grub2-mkconfig -o /boot/grub2/grub.cfg
 
# 将 vfio 相关 module 设置为开机 load
$ cat /etc/modules-load.d/vfio.conf
vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

Setting up IOMMU Kernel parameters

找到 nvidia GPU BusID

record PCI addresses and hardware IDs of the GPU

$ lspci -k | grep -i nvidia -A 3                   
41:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
        Subsystem: Device 1b4c:11bf
        Kernel driver in use: vfio-pci
        Kernel modules: nouveau
41:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
        Subsystem: Device 1b4c:11bf
        Kernel driver in use: snd_hda_intel
        Kernel modules: snd_hda_intel
# pci address => 41:00.0,41:00.1
# device id => 1b4c:11bf
 
# 这里找到了两张 nvidia 卡，它们的 device id 都是 1b4c:11bf, 一张是 Audio device
# 这样是不能 passthrough 进去的，因为:
# vfio-pci use your vendor and device id pair to identify which device they need to bind to at boot,
# if you have two GPUs sharing such an ID pair you will not be able to get your passthough driver to bind with just one of them
# 使用下面的脚本解决这种情况：
 
$ cat /usr/bin/vfio-pci-override.sh
#!/bin/sh
 
for i in $(find /sys/devices/pci* -name boot_vga); do
    if [ $(cat "$i") -eq 0 ]; then
        GPU="${i%/boot_vga}"
        AUDIO="$(echo "$GPU" | sed -e "s/0$/1/")"
        echo "vfio-pci" > "$GPU/driver_override"
        if [ -d "$AUDIO" ]; then
            echo "vfio-pci" > "$AUDIO/driver_override"
        fi
    fi
done
 
modprobe -i vfio-pci
 
# 把脚本传入 /etc/modprobe.d/vfio.conf
$ cat /etc/modprobe.d/vfio.conf
install vfio-pci /usr/bin/vfio-pci-override.sh
options vfio-pci ids=10de:1c82 disable_vga=1

使用 vfio 管理 GPU

# /etc/modprobe.d/vfio.conf, ids 为 lspci 找到的 hardware id, 多个设备的话用','分割
$ cat /etc/modprobe.d/vfio.conf
options vfio-pci ids=10de:134d disable_vga=1
 
# 禁用 NVIDIA nouveau 开源驱动, /etc/modprobe.d/blacklist.conf
$ cat /etc/modprobe.d/blacklist.conf
blacklist nouveau
 
# kvm 模块配置, /etc/modprobe.d/kvm.conf
$ cat /etc/modprobe.d/kvm.conf
options kvm ignore_msrs=1

重启系统，启动完成后查看当前的 nvidia GPU 是否被 vfio-pci 模块使用, 确认 IOMMU 功能确实打开。

$ dmesg | grep -e DMAR -e IOMMU | grep enabled
 
# 如果能搜索到
DMAR: IOMMU enabled
# 表示上述配置成功。
 
# 查看 GPU 是否被 vfio-pci 使用
# 另外注意检查看看 41:00.1 Audio device 是否也被 vfio-pci 使用
$ lspci -k | grep -i -e nvidia -A 3
41:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel driver in use: vfio-pci # GTX 1050 Ti GPU 被 vfio-pci 使用
    Kernel modules: nouveau
41:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel driver in use: vfio-pci # 发现 Audio device 也被 vfio-pci 使用了
    Kernel modules: snd_hda_intel
...

# list GPU IOMMU group
$ find /sys/kernel/iommu_groups/ -type l | grep 41:00
/sys/kernel/iommu_groups/27/devices/0000:41:00.0
/sys/kernel/iommu_groups/27/devices/0000:41:00.1
 
# 找到 IOMMU Group 管理的 PCI 设备
#!/bin/bash
shopt -s nullglob
for d in /sys/kernel/iommu_groups/*/devices/*; do
  n=${d#*/iommu_groups/*}; n=${n%%/*}
  printf 'IOMMU Group %s ' "$n"
  lspci -nns "${d##*/}"
done

使用 qemu 透传 nvidia GPU

准备好 centos7 镜像，然后在虚拟机里面安装 nvidia 官方闭源驱动和 cuda SDK

# 我从服务器上拷贝过来的是 vmdk 的镜像，先把它转换成 qcow2 的格式
$ /usr/local/qemu-2.9.0/bin/qemu-img convert -f vmdk -O qcow2 centos-7.3.1611-20180104.vmdk centos-7.3.1611-20180104.qcow2
 
# 使用 qemu 启动，注意-cpu 需要 kvm=off 参数
# kvm=off will hide the kvm hypervisor signature, this is required for NVIDIA cards
# since its driver will refuse to work on an hypervisor and result in Code 43 on windows
$ cat startvm.sh
#!/bin/sh
/usr/local/qemu-2.9.0/bin/qemu-system-x86_64 -enable-kvm \
-m 4096 -cpu host,kvm=off -smp 4,sockets=1,cores=4,threads=1 \
-drive file=./centos-7.3.1611-20180104.qcow2 \
-device vfio-pci,host=41:00.0,multifunction=on,addr=0x16 \
-device vfio-pci,host=41:00.1 \
-net nic,model=e1000 -net user,hostfwd=tcp::5022-:22 \
-vnc :1
 
# 这台虚拟机开了 vnc 和 ssh 端口转发，可以使用 vnc 或者 ssh 访问
 
# 从 host 进入虚拟机
$ ssh 127.0.0.1 -p 5022
 
# 查看虚拟机透传进来的显卡
$ lspci -k | grep -i nvidia -A 3
00:04.0 Audio device: NVIDIA Corporation Device 0fb9 (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
00:16.0 VGA compatible controller: NVIDIA Corporation GP107 (rev a1)
    Subsystem: Device 1b4c:11bf
    Kernel modules: nouveau

安装 nvidia 驱动和 Cuda

nvidia 驱动需要从官方下载，如果先安装 cuda 的话会一同安装 nvidia 驱动。接下来采用虚拟机先安装驱动再安装 cuda 的步骤。

参考： installing-nvidia-drivers-centos-7 NVIDIA CUDA GETTINGS STARTED GUIDE FOR LINUX

安装 nvidia 驱动

下载地址： http://www.nvidia.com/object/unix.html

# update 后如果更新内核，需要重启
$ yum -y update
 
# 安装 gcc 、make 、glibc 等工具和库
$ yum -y groupinstall "Development Tools"
$ yum -y install kernel-devel
 
# Download the latest NVIDIA driver for unix.
$ wget http://us.download.nvidia.com/XFree86/Linux-x86_64/390.42/NVIDIA-Linux-x86_64-390.42.run
$ yum -y install epel-release
$ yum -y install dkms
 
# Edit /etc/default/grub. Append the following  to “GRUB_CMDLINE_LINUX”
rd.driver.blacklist=nouveau nouveau.modeset=0
 
# Generate a new grub configuration to include the above changes.
$ grub2-mkconfig -o /boot/grub2/grub.cfg
 
# Edit/create /etc/modprobe.d/blacklist.conf and append:
blacklist nouveau
 
# Backup your old initramfs and then build a new one
$ mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r)-nouveau.img
$ dracut /boot/initramfs-$(uname -r).img $(uname -r)
 
# 重启 again
 
# Run the NVIDIA driver installer and enter yes to all options.
$ sh NVIDIA-Linux-x86_64-*.run
 
# 装好后再一次重启，lspci -k 看下 gpu 使用的驱动是否是 nvidia
$ lspci -k | grep -i nvidia -A 3
00:04.0 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
00:16.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
    Kernel driver in use: nvidia # 发现已经使用 nvidia 驱动
    Kernel modules: nouveau, nvidia_drm, nvidia
 
# 执行 nvidia-smi 看下输出和温度
$ nvidia-smi
Thu Mar 15 01:31:09 2018      
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.42                 Driver Version: 390.42                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:00:16.0 Off |                  N/A |
| 40%   32C    P0    N/A / 100W |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
 
$ nvidia-smi -q -d TEMPERATURE
 
==============NVSMI LOG==============
 
Timestamp                           : Thu Mar 15 01:32:42 2018
Driver Version                      : 390.42
 
Attached GPUs                       : 1
GPU 00000000:00:16.0
    Temperature
        GPU Current Temp            : 32 C
        GPU Shutdown Temp           : 102 C
        GPU Slowdown Temp           : 99 C
        GPU Max Operating Temp      : N/A
        Memory Current Temp         : N/A
        Memory Max Operating Temp   : N/A

安装 cuda

下载地址： https://developer.nvidia.com/cuda-downloads 这里选择 runfile ，以后为了方便也可以选择 rpm(network)的方式，会自动帮我们安装 nvidia 驱动

$ wget https://developer.nvidia.com/compute/cuda/9.1/Prod/local_installers/cuda_9.1.85_387.26_linux
 
# Say no to installing the NVIDIA driver.
# The standalone driver you already installed is typically newer than what is packaged with CUDA.
# Use the default option for all other choices.
$ sh cuda_*.run
 
# 添加 CUDA 相关的环境变量
export PATH=$PATH:/usr/local/cuda/bin
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
 
# make samples
$ cd ~/NVIDIA_CUDA-9.1_Samples; make -j 4
$ cd bin/x86_64/linux/release
$ ./deviceQuery # 查询 gpu 信息
./deviceQuery Starting...
 
 CUDA Device Query (Runtime API) version (CUDART static linking)
 
Detected 1 CUDA Capable device(s)
 
Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version / Runtime Version          9.1 / 9.1
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 4040 MBytes (4236312576 bytes)
  ( 6) Multiprocessors, (128) CUDA Cores/MP:     768 CUDA Cores
  GPU Max Clock rate:                            1481 MHz (1.48 GHz)
  Memory Clock rate:                             3504 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
...
 
$ ./bandwidtTest # 使用 cuda 测试 gpu bandwidth
Running on...
 
 Device 0: GeForce GTX 1050 Ti
 Quick Mode
 
 Host to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            9719.0
 
 Device to Host Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            9215.8
 
 Device to Device Bandwidth, 1 Device(s)
 PINNED Memory Transfers
   Transfer Size (Bytes)    Bandwidth(MB/s)
   33554432            95525.1
 
Result = PASS
 
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

作者: 云联壹云李泽玺

GitHub: https://github.com/yunionio/cloudpods

开源地址： https://www.cloudpods.org/

Cloudpods 是一个开源的 Golang 实现的云原生的多云和混合云融合平台。Cloudpods 不仅可以管理本地的虚拟机和物理机资源，还可以管理其他公有云和私有云平台的资源。