Nvidia GPU Pooling: Remote GPU

How to implement a remote GPU service?

Bruce-Lee-LY
6 min read · Sep 22, 2023

1 Background

Thanks to their powerful compute capabilities for deep learning, Nvidia GPUs have long dominated the data center. Although GPU virtualization enables multi-task co-location, which improves GPU utilization and alleviates the long-tail effect, absolute GPU utilization is still not high and the long-tail phenomenon persists.

Technical evolution in a series of similar infrastructure fields, such as NIC pooling, storage pooling, memory pooling, and CPU pooling, has suggested the same direction for GPUs. Faced with GPU machines that rely on PCIe and NVLink for small-scale interconnects, people want to call GPUs across the TOR switch, across computer rooms, and even across regions, thereby reducing the cluster's overall GPU fragmentation rate. GPU pooling emerged to meet this need, and its key technical difficulty lies in implementing the remote GPU.

In addition, combined with GPU virtualization, GPU pooling not only enables maximal sharing of GPU resources but also breaks the fixed CPU-to-GPU ratio of a single machine; in theory, any ratio can be provisioned.

2 Remote GPU

In the field of deep learning, the Nvidia GPU software call stack is roughly as follows, from top to bottom (a small example after the list traces a call through these layers):

  • User APP: the business layer, such as training or inference tasks.
  • Framework: the framework layer, such as TensorFlow, PyTorch, Paddle, MegEngine, etc.
  • CUDA Runtime: the CUDA runtime and its surrounding ecosystem libraries, such as cudart, cublas, cudnn, cufft, cusparse, etc.
  • CUDA User Driver: the user-mode CUDA driver, such as cuda, nvml, etc.
  • CUDA Kernel Driver: the kernel-mode CUDA driver (see open-gpu-kernel-modules), such as nvidia.ko.
  • Nvidia GPU HW: the GPU hardware.
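To make the layering concrete, here is a minimal sketch that issues a couple of CUDA Runtime calls; the comments note which layer each call passes through. It assumes a machine with the CUDA toolkit and driver installed and is purely illustrative.

```cpp
// stack_walk.cpp: compile with `nvcc stack_walk.cpp -o stack_walk`, or with
// g++ plus -I/usr/local/cuda/include and linking against libcudart.
#include <cstdio>
#include <cuda_runtime.h>   // CUDA Runtime layer (libcudart)

int main() {
    int count = 0;
    // A Runtime API call: libcudart translates it into CUDA Driver API calls
    // in libcuda (CUDA User Driver), which talk to nvidia.ko (CUDA Kernel
    // Driver) via ioctl and finally reach the GPU hardware.
    if (cudaGetDeviceCount(&count) != cudaSuccess) {
        std::fprintf(stderr, "no usable GPU or driver found\n");
        return 1;
    }
    std::printf("visible GPUs: %d\n", count);

    void *dev = nullptr;
    if (cudaMalloc(&dev, 1 << 20) == cudaSuccess) {  // 1 MiB of device memory
        cudaFree(dev);
    }
    return 0;
}
```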

Currently, remote GPUs are implemented mainly either with a DPU or by intercepting the API at the CUDA runtime/driver layer, as sketched below.
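A minimal sketch of the API-interception approach: it hooks a single Runtime API call with LD_PRELOAD and simply calls through to the real libcudart, whereas a remote-GPU client would serialize the request and send it to a server instead. The file and library names are illustrative.

```cpp
// hook_cudamalloc.cpp: build with
//   g++ -shared -fPIC hook_cudamalloc.cpp -o libhook.so -I/usr/local/cuda/include -ldl
// and run an unmodified CUDA program with LD_PRELOAD=./libhook.so ./app
#include <dlfcn.h>
#include <cstdio>
#include <cuda_runtime.h>

extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
    // Resolve the real implementation from the next library in search order
    // (the genuine libcudart), once.
    using Fn = cudaError_t (*)(void **, size_t);
    static Fn real = reinterpret_cast<Fn>(dlsym(RTLD_NEXT, "cudaMalloc"));

    std::fprintf(stderr, "[hook] cudaMalloc(%zu bytes)\n", size);
    // A remote-GPU client would serialize this request and forward it over
    // the network here; this sketch just calls through to the local library.
    return real(devPtr, size);
}
```

Note that LD_PRELOAD only intercepts applications that link libcudart dynamically; products in this space also hook the driver API in libcuda for broader coverage.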

2.1 Fungible

Fungible connects GPUs over the network via DPUs and a PCI vSwitch, replacing the traditional direct PCIe attachment. The number of GPUs that can be attached also exceeds the traditional 8:1, reaching 55:1.

Testing showed that the DPU solution attaching 2 GPUs has little impact on performance, although in some cases it exhibits higher latency and lower bandwidth than PCIe. Attaching 10 GPUs to one DPU, however, reduces the available physical bandwidth.

In January 2023, Fungible was acquired by Microsoft and joined Microsoft’s data center infrastructure engineering team, focusing on providing a variety of DPU solutions, network innovation and hardware system improvements.

2.2 rCUDA

rCUDA, a development project of the Parallel Architecture Group at the Universitat Politecnica de Valencia in Spain, provides a remote GPU virtualization solution that transparently supports concurrent remote use of CUDA-enabled devices. It can be deployed in a cluster so that a single non-MPI application can use all of the GPUs in the cluster, increasing GPU utilization and reducing overall cost, and it also allows applications running in virtual machines to access GPUs installed in remote physical machines. The latest research results and plans are posted on the project's official website for those who want to learn more.

2.3 Bitfusion

The GPU resource pool provided by Bitfusion works by intercepting all CUDA service accesses at the CUDA driver level, passing these requests and their data over the network to the Bitfusion server, and handing them to the real CUDA driver on the server side.
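The forwarding idea can be sketched roughly as below. This is not Bitfusion's actual protocol; the wire format and names are invented for illustration. An intercepted call is packed into a small request record on the client and sent to a server process that owns the physical GPU, which replays it with the real CUDA driver.

```cpp
// remote_call_sketch.cpp: a hypothetical wire format for forwarding intercepted
// CUDA calls; error handling and the server-side replay loop are omitted.
#include <cstdint>
#include <sys/socket.h>

// One request record per intercepted API call.
enum class ApiId : uint32_t { MemAlloc = 1, MemcpyHtoD = 2, LaunchKernel = 3 };

struct CallHeader {
    ApiId    id;            // which API the client intercepted
    uint32_t payload_size;  // size of the serialized arguments that follow
};

// Client side: serialize an intercepted memory allocation and ship it to the
// server; the server replays it with the real driver and returns the device
// pointer plus a status code (response handling not shown).
bool forward_mem_alloc(int sock, uint64_t bytes) {
    CallHeader hdr{ApiId::MemAlloc, static_cast<uint32_t>(sizeof(bytes))};
    if (send(sock, &hdr, sizeof(hdr), 0) != static_cast<ssize_t>(sizeof(hdr)))
        return false;
    return send(sock, &bytes, sizeof(bytes), 0) ==
           static_cast<ssize_t>(sizeof(bytes));
}
```

Since bulk transfers such as memcpy buffers dominate the traffic, the latency and bandwidth of the interconnect largely determine how close remote-GPU performance comes to a local GPU.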

In July 2019, VMware, a cloud computing company owned by Dell, acquired Bitfusion. The solution was subsequently integrated into the vSphere platform and consists of two main parts:

  • Bitfusion Server: GPUs are installed in a vSphere server (requires vSphere 7 or above), which runs the Bitfusion server. The Bitfusion server virtualizes the physical GPU resources and shares them among multiple users.
  • Bitfusion Client: the Bitfusion client is a Linux virtual machine running on other vSphere servers (requires vSphere 6.7 or above), and machine learning workloads run in these virtual machines. Bitfusion transmits their GPU service requests to the Bitfusion server over the network and returns the results once the computation completes. To the machine learning workload, the remote GPU is completely transparent and appears to be local GPU hardware.

2.4 OrionX

VirtAI Tech, a Chinese provider of AI accelerator virtualization and resource pooling, was founded in April 2019 by a team from the Dell EMC China Research Institute. It mainly provides users with AI accelerator virtualization and resource pooling software and solutions. Its Orion vGPU software is system software that provides resource pooling and virtualization capabilities for GPUs in the cloud or data center. Through an efficient communication mechanism, CUDA applications can run on any physical machine, container, or virtual machine in the cloud or data center without a locally attached physical GPU, while being served by hardware compute power from the GPU resource pool.

OrionX is mainly implemented by intercepting the APIs of the CUDA Runtime/Driver and its surrounding ecosystem libraries. Its core consists of the following components:

  • Orion Controller: responsible for resource management of the entire GPU pool. It responds to vGPU requests from Orion Clients and allocates Orion vGPU resources from the pool to the CUDA applications behind those clients.
  • Orion Server: the back-end service that exposes GPU resources. It is deployed on every CPU and GPU node and takes over all physical GPUs in the machine. When an Orion Client application runs, it is connected to an Orion Server through the resource scheduling of the Orion Controller; the Orion Server then provides an isolated running environment and real GPU hardware compute power for all of the application's CUDA calls.
  • Orion Client: simulates the Nvidia CUDA runtime environment and provides an API-compatible reimplementation for CUDA programs. Working with the other Orion components, it presents a configured number of virtual GPUs (Orion vGPUs) to CUDA applications. For applications that use the CUDA dynamic libraries, the operating system environment can be set up so that the dynamic linker resolves the application against the libraries provided by the Orion Client at runtime (a toy sketch of this idea follows the list). Because the Orion Client simulates the Nvidia CUDA runtime environment, CUDA applications run on Orion vGPUs transparently and without modification.
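A toy illustration of the "simulated CUDA runtime" idea, assuming the client library is placed ahead of the real libcudart on the library search path: instead of reporting the host's physical GPUs, the stub reports the number of virtual GPUs granted by the pool. The environment variable name and behavior are invented for illustration and are not OrionX's actual interface.

```cpp
// vgpu_stub.cpp: built as a drop-in replacement for the CUDA runtime library
// that an unmodified application resolves against at load time.
#include <cstdlib>
#include <cuda_runtime.h>

extern "C" cudaError_t cudaGetDeviceCount(int *count) {
    // Hypothetical: the number of vGPUs leased from the pool's controller is
    // passed in via an environment variable set by the scheduler.
    const char *granted = std::getenv("VGPU_COUNT");
    *count = granted ? std::atoi(granted) : 0;
    // A real client would have negotiated a vGPU lease with the controller and
    // would forward every subsequent CUDA call to the server owning the GPU.
    return cudaSuccess;
}
```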

3 Other

3.1 Technical Difficulties

Because Nvidia's stack is largely closed source, remote GPUs face many technical difficulties, such as the kernel launch mechanism, the context mechanism, and hidden APIs. There are also engineering difficulties, such as developing and maintaining interceptions for thousands of APIs; refer to the cuda_hook open source code.
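A common way to keep thousands of interceptions manageable is to generate the wrappers from a macro or table, roughly as sketched below; the macro and helper names are illustrative and are not cuda_hook's actual code. Versioned entry points (the _v2/_v3 renames in cuda.h) and the hidden, undocumented APIs add further to this burden.

```cpp
// hook_table.cpp: build as a shared library and interpose it on the CUDA
// driver API (e.g. via LD_PRELOAD); error handling is omitted for brevity.
#include <dlfcn.h>
#include <cuda.h>   // CUDA Driver API types (CUresult, CUfunction, CUstream)

// Resolve a symbol from the real driver library, loading it once.
static void *real_sym(const char *name) {
    static void *handle = dlopen("libcuda.so.1", RTLD_NOW | RTLD_LOCAL);
    return handle ? dlsym(handle, name) : nullptr;
}

// One macro invocation per hooked entry point: it emits a wrapper with the
// driver signature that lazily resolves and forwards to the real function.
// Logging, virtualization, or network forwarding logic would live in the body.
#define HOOK(ret, name, params, args)                              \
    extern "C" ret name params {                                    \
        using Fn = ret (*)params;                                    \
        static Fn real = reinterpret_cast<Fn>(real_sym(#name));     \
        return real args;                                            \
    }

HOOK(CUresult, cuInit, (unsigned int flags), (flags))
HOOK(CUresult, cuLaunchKernel,
     (CUfunction f, unsigned int gridX, unsigned int gridY, unsigned int gridZ,
      unsigned int blockX, unsigned int blockY, unsigned int blockZ,
      unsigned int sharedMemBytes, CUstream stream, void **kernelParams,
      void **extra),
     (f, gridX, gridY, gridZ, blockX, blockY, blockZ, sharedMemBytes, stream,
      kernelParams, extra))
```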

3.2 GPU Live Migration

GPU pooling combined with virtualization enables better elastic scaling of cloud services. However, when cluster resources are tight or the fragmentation rate is high, the success rate of elastic scaling drops. In that situation, live migration of GPU tasks is needed so that, together with cluster-wide planning and scheduling, the success rate of elastic scaling can be improved.
