Nvidia GPU Virtualization

How to virtualize a GPU into multiple instances?

Bruce-Lee-LY
4 min read · Sep 22, 2023

1 Background

As Nvidia GPUs play an increasingly important role in rendering, encoding/decoding, and computing, major software vendors are studying them in ever greater depth, even though Nvidia's largely closed-source ecosystem imposes significant constraints on such work. Under hardware cost pressure, improving GPU utilization and squeezing out GPU performance have gradually become focal points in the infrastructure field. To achieve time-division and space-division multiplexing of a GPU's memory and compute resources, the industry naturally turned to software-defined GPUs, and GPU virtualization came into being.

2 GPU Virtualization

In deep learning, the software call stack of an Nvidia GPU looks roughly as follows, from top to bottom:

  • User App: the business layer, e.g. training or inference tasks
  • Framework: the framework layer, such as TensorFlow, PyTorch, Paddle, MegEngine, etc.
  • CUDA Runtime: the CUDA runtime and its surrounding ecosystem libraries, such as cudart, cuBLAS, cuDNN, cuFFT, cuSPARSE, etc.
  • CUDA User Driver: the user-mode CUDA driver libraries, such as libcuda (cuda), NVML, etc.
  • CUDA Kernel Driver: the kernel-mode CUDA driver, e.g. nvidia.ko; see open-gpu-kernel-modules
  • Nvidia GPU HW: the GPU hardware

In theory, GPU virtualization can be implemented at any of these layers, but from an engineering perspective, considering feasibility, maintainability, overhead, and ease of deployment, the CUDA driver layers or the hardware are the more appropriate places to implement it.
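
To make the layering concrete, the same device-memory allocation can be issued through the CUDA Runtime or one level lower through the user-mode CUDA driver. Below is a minimal sketch, assuming the CUDA toolkit headers and a working driver are installed (compile with something like gcc demo.c -lcudart -lcuda; the file name is just for illustration):

```c
#include <stdio.h>
#include <cuda.h>              /* user-mode driver API (libcuda) */
#include <cuda_runtime_api.h>  /* runtime API (libcudart) */

int main(void) {
    /* Runtime-API path: cudart initializes the driver and uses the
     * device's primary context implicitly on first use. */
    void *rt_ptr = NULL;
    if (cudaMalloc(&rt_ptr, 1 << 20) == cudaSuccess) {
        printf("runtime API: allocated 1 MiB at %p\n", rt_ptr);
        cudaFree(rt_ptr);
    }

    /* Driver-API path: the same operation one layer down, with explicit
     * initialization and context management. */
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr drv_ptr;
    if (cuInit(0) == CUDA_SUCCESS &&
        cuDeviceGet(&dev, 0) == CUDA_SUCCESS &&
        cuCtxCreate(&ctx, 0, dev) == CUDA_SUCCESS) {
        if (cuMemAlloc(&drv_ptr, 1 << 20) == CUDA_SUCCESS) {
            printf("driver API: allocated 1 MiB at 0x%llx\n",
                   (unsigned long long)drv_ptr);
            cuMemFree(drv_ptr);
        }
        cuCtxDestroy(ctx);
    }
    return 0;
}
```

Everything below the user-mode driver, i.e. the kernel driver and the hardware, is reached through these same libraries, which is why the driver layers are the natural interception points.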

2.1 User-Mode Virtualization

Currently, the most common approach is to hijack the user-mode CUDA driver's dynamic library (see the open-source cuda_hook project). By intercepting calls to the CUDA driver API, GPU memory and compute resources can be isolated per client. The approach requires zero intrusion into user code, is highly flexible, and is easy to deploy both on bare metal and in containers.
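
The interception typically works like an LD_PRELOAD shim: the hooking library exports the same symbols as libcuda, applies its accounting policy, and forwards to the real driver. Below is a minimal sketch of the idea (not the actual cuda_hook code); the GPU_MEM_LIMIT_BYTES variable and the library name are hypothetical, and a production hook must also handle frees, the versioned _v2 symbols, cuGetProcAddress, and many more entry points:

```c
/* gpu_mem_hook.c
 * build: gcc -shared -fPIC gpu_mem_hook.c -o libgpuhook.so -ldl
 * use:   LD_PRELOAD=./libgpuhook.so GPU_MEM_LIMIT_BYTES=1073741824 ./your_app */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdlib.h>

/* Minimal stand-ins for the driver-API types so the shim builds without
 * the CUDA headers; they mirror the definitions in cuda.h. */
typedef int CUresult;
typedef unsigned long long CUdeviceptr;
#define CUDA_SUCCESS             0
#define CUDA_ERROR_OUT_OF_MEMORY 2

static size_t g_allocated;  /* bytes handed out to this process so far */

/* Exported under the same name as the driver-API symbol, so a library
 * injected with LD_PRELOAD shadows libcuda.so. */
CUresult cuMemAlloc(CUdeviceptr *dptr, size_t bytesize) {
    static CUresult (*real_alloc)(CUdeviceptr *, size_t);
    if (!real_alloc)  /* resolve the genuine driver symbol once */
        real_alloc = (CUresult (*)(CUdeviceptr *, size_t))
                         dlsym(RTLD_NEXT, "cuMemAlloc");

    /* Hypothetical per-process quota taken from the environment. */
    const char *env = getenv("GPU_MEM_LIMIT_BYTES");
    size_t limit = env ? (size_t)strtoull(env, NULL, 10) : 0;
    if (limit && g_allocated + bytesize > limit)
        return CUDA_ERROR_OUT_OF_MEMORY;   /* deny: quota exceeded */

    CUresult rc = real_alloc(dptr, bytesize);
    if (rc == CUDA_SUCCESS)
        g_allocated += bytesize;           /* (frees are not tracked here) */
    return rc;
}
```

Compute-resource isolation follows the same pattern, typically by rate-limiting the kernel-launch entry points such as cuLaunchKernel per client.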

2.2 Kernel-Mode Virtualization

A deployment that hijacks the CUDA driver's dynamic library can be tampered with by users, which is generally unacceptable on public clouds. The advantage of working in kernel mode is that it can prevent such tampering to a certain extent; however, because Nvidia's stack is closed source, isolating GPU memory and compute resources in the kernel is technically difficult. At present, kernel-mode solutions have been deployed by Alibaba Cloud, Tencent Cloud, and Baidu Cloud.

2.3 Hardware Virtualization

Nvidia’s official hardware virtualization solution, MIG (Multi-Instance GPU), provides hardware-level isolation starting with the Ampere architecture. Its isolation is the most complete of the three approaches, but a single GPU can be partitioned into at most 7 instances.
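
MIG instances are created administratively (for example with nvidia-smi mig), and each instance then appears as its own device. Below is a minimal sketch that queries MIG state through NVML, assuming a MIG-capable GPU and linking with -lnvidia-ml:

```c
#include <stdio.h>
#include <nvml.h>

int main(void) {
    if (nvmlInit() != NVML_SUCCESS)
        return 1;

    nvmlDevice_t gpu;
    if (nvmlDeviceGetHandleByIndex(0, &gpu) == NVML_SUCCESS) {
        /* MIG mode has a current and a pending (after-reset) value. */
        unsigned int current = 0, pending = 0;
        if (nvmlDeviceGetMigMode(gpu, &current, &pending) == NVML_SUCCESS)
            printf("MIG mode: current=%u pending=%u\n", current, pending);

        /* Each created GPU instance shows up as its own MIG device handle. */
        unsigned int max_mig = 0;
        if (nvmlDeviceGetMaxMigDeviceCount(gpu, &max_mig) == NVML_SUCCESS) {
            for (unsigned int i = 0; i < max_mig; ++i) {
                nvmlDevice_t mig;
                if (nvmlDeviceGetMigDeviceHandleByIndex(gpu, i, &mig) != NVML_SUCCESS)
                    continue;  /* slot not populated */
                char name[NVML_DEVICE_NAME_BUFFER_SIZE];
                if (nvmlDeviceGetName(mig, name, sizeof(name)) == NVML_SUCCESS)
                    printf("MIG device %u: %s\n", i, name);
            }
        }
    }
    nvmlShutdown();
    return 0;
}
```

A workload is then pinned to one instance, for example by exporting CUDA_VISIBLE_DEVICES with that MIG device's UUID.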

3 Other

3.1 vGPU

Nvidia’s official virtual GPU (vGPU) solution is mainly used to deliver graphics-rich virtual desktops and workstations. GPU resources can be repartitioned so that a single GPU is shared among multiple virtual machines, or multiple GPUs can be assigned to a single virtual machine to boost the performance of demanding workloads.

3.2 MPS (Multi-Process Service)

MPS is Nvidia’s official solution for merging the contexts of multiple processes: kernels from multiple processes are funneled through the MPS server (or submitted directly to the GPU on newer architectures) for execution, avoiding frequent context switching between processes on the GPU. The drawback is weaker fault containment; in particular, a fault in one client can spread to the other processes sharing the server, which is generally intolerable.
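
On Volta and newer GPUs, each MPS client's compute share is usually capped with the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable, which is read when the client connects to the MPS server at context creation. Below is a minimal sketch of a client that requests roughly a quarter of the SMs, assuming the MPS control daemon (nvidia-cuda-mps-control) is already running and linking with -lcuda:

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

int main(void) {
    /* Ask MPS for ~25% of the device's SMs for this client. The variable
     * must be set before the context is created (i.e. before cuInit /
     * cuCtxCreate), because it is read when the client connects to MPS. */
    setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "25", 1);

    CUdevice dev;
    CUcontext ctx;
    if (cuInit(0) != CUDA_SUCCESS || cuDeviceGet(&dev, 0) != CUDA_SUCCESS)
        return 1;
    if (cuCtxCreate(&ctx, 0, dev) != CUDA_SUCCESS)
        return 1;

    /* Kernels launched in this context now run through the shared MPS
     * server context instead of a per-process context. */
    size_t free_b = 0, total_b = 0;
    cuMemGetInfo(&free_b, &total_b);
    printf("context up under MPS: %zu / %zu bytes free\n", free_b, total_b);

    cuCtxDestroy(ctx);
    return 0;
}
```

In practice the variable is usually exported in the shell that launches each client rather than set programmatically as above.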

3.3 Remote GPU

Moving the GPU to a remote server enables GPU pooling: it breaks the fixed CPU-to-GPU ratio of a single machine and extends GPU virtualization across the network, which makes it possible to soak up fragmented GPU capacity in a cluster and improve overall GPU utilization. VirtAI Tech’s OrionX solution is currently a leader in this area.
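
The underlying idea is API remoting: the same interception point as in section 2.1, except that the hijacked call is serialized and forwarded over the network to the machine that actually owns the GPU. Below is a purely conceptual client-side stub; the wire format, opcode, port, and GPU_SERVER_ADDR variable are all hypothetical and unrelated to OrionX's actual protocol:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical wire format: one request/reply pair per forwarded call. */
struct remote_call  { uint32_t opcode; uint64_t arg;    };  /* e.g. OP_MEM_ALLOC, bytes  */
struct remote_reply { int32_t  status; uint64_t handle; };  /* driver status, remote ptr */

enum { OP_MEM_ALLOC = 1 };

/* What a hijacked cuMemAlloc would do in a remote-GPU setup: ship the
 * request to the node that owns the GPU, return the remote handle. */
static int remote_mem_alloc(const char *server_ip, uint16_t port,
                            uint64_t bytes, uint64_t *handle) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, server_ip, &addr.sin_addr);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }

    struct remote_call  req = { OP_MEM_ALLOC, bytes };
    struct remote_reply rep = {0};
    int ok = write(fd, &req, sizeof(req)) == sizeof(req) &&
             read(fd, &rep, sizeof(rep)) == sizeof(rep) &&
             rep.status == 0;
    if (ok) *handle = rep.handle;
    close(fd);
    return ok ? 0 : -1;
}

int main(void) {
    const char *srv = getenv("GPU_SERVER_ADDR");  /* hypothetical pool-node IP */
    uint64_t handle = 0;
    if (srv && remote_mem_alloc(srv, 9999, 1 << 20, &handle) == 0)
        printf("remote GPU allocated 1 MiB, handle=0x%llx\n",
               (unsigned long long)handle);
    return 0;
}
```

A real pooling system adds a daemon on the GPU node, batching, data-path optimizations, and scheduling on top of this basic request/reply pattern.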
