Nvidia GPU Virtual Memory Management

How to manage the virtual and physical addresses of GPU memory?

Bruce-Lee-LY
Sep 26, 2023

1 Common Memory Management

1.1 CUDA Runtime API

In CUDA programming, programmers usually use the following synchronous CUDART APIs to allocate and free GPU memory. For example, calling cudaMalloc with the required size returns the virtual address of the GPU memory; after use, cudaFree is called to release it.

__host__ __device__ cudaError_t cudaMalloc(void **devPtr, size_t size);
__host__ cudaError_t cudaMallocManaged(void **devPtr, size_t size, unsigned int flags = cudaMemAttachGlobal);
__host__ cudaError_t cudaMallocPitch(void **devPtr, size_t *pitch, size_t width, size_t height);

__host__ __device__ cudaError_t cudaFree(void *devPtr);
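
A minimal usage sketch, with error handling reduced to a bare check and an arbitrary 1 MiB buffer size:

#include <cuda_runtime.h>

// Sketch: allocate, use and free GPU memory with the synchronous CUDART API.
void runtime_alloc_example() {
    float *d_buf = nullptr;
    size_t bytes = 1 << 20;  // 1 MiB, arbitrary size for illustration

    // cudaMalloc returns the virtual address of the allocation in d_buf on success.
    if (cudaMalloc((void **)&d_buf, bytes) != cudaSuccess) {
        return;
    }

    // ... launch kernels that read and write d_buf ...

    cudaFree(d_buf);  // release the allocation when done
}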

In addition, CUDART provides asynchronous (stream-ordered) APIs for allocating and freeing GPU memory. Programmers only need to pass in a CUDA stream when calling them, and what is returned is likewise the virtual address of the GPU memory.
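
For reference, the stream-ordered allocation APIs, available since CUDA 11.2, have the following signatures:

__host__ cudaError_t cudaMallocAsync(void **devPtr, size_t size, cudaStream_t hStream);
__host__ cudaError_t cudaFreeAsync(void *devPtr, cudaStream_t hStream);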

1.2 CUDA Driver API

Nvidia also provides a set of APIs in the user-mode CUDA driver for allocating and freeing GPU memory. The returned results are no different from those of the CUDART API, but the usage differs: before calling cuMemAlloc, programmers need to manually initialize the driver (cuInit) and create a context (cuCtxCreate), whereas for cudaMalloc these steps are done implicitly and are transparent to the programmer.

CUresult cuMemAlloc(CUdeviceptr* dptr, size_t bytesize);
CUresult cuMemAllocManaged(CUdeviceptr* dptr, size_t bytesize, unsigned int flags);
CUresult cuMemAllocPitch(CUdeviceptr* dptr, size_t* pPitch, size_t WidthInBytes, size_t Height,
unsigned int ElementSizeBytes);

CUresult cuMemFree(CUdeviceptr dptr);
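
A minimal sketch of this explicit initialization sequence follows; device 0 and the 1 MiB size are arbitrary choices for illustration.

#include <cuda.h>

// Sketch: initialize the driver, create a context, then allocate and free device memory.
CUresult driver_alloc_example() {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;

    cuInit(0);                  // must be called before any other driver API
    cuDeviceGet(&dev, 0);       // pick device 0 for illustration
    cuCtxCreate(&ctx, 0, dev);  // explicit context creation

    CUresult res = cuMemAlloc(&dptr, 1 << 20);  // 1 MiB
    if (res == CUDA_SUCCESS) {
        // ... use dptr ...
        cuMemFree(dptr);
    }

    cuCtxDestroy(ctx);
    return res;
}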

2 Virtual Memory Management

2.1 Feature

As far as the commonly used GPU memory management APIs are concerned, programmers only obtain the virtual address of the GPU memory. If an allocation needs to be resized dynamically (such as growing a vector on the GPU), the user must explicitly allocate a larger piece of GPU memory, copy the data from the original GPU memory to the new one, free the original GPU memory, and then keep tracking the newly allocated address. This often reduces application performance and raises the peak GPU memory usage.

The VMM APIs introduced in CUDA 10.2 provide applications with a way to directly manage the unified virtual address space, decoupling the virtual and physical addresses of GPU memory and allowing programmers to handle them separately. The VMM APIs let programmers map physical memory to, and unmap it from, virtual address ranges as appropriate. With their help, the need to dynamically resize GPU memory can be met much better: you only need to allocate additional physical memory and map it into the extension of the original virtual address range. There is no need to change the tracked GPU memory address, nor to copy data from the original GPU memory to the new memory. The VMM APIs therefore help programmers build more efficient dynamic data structures and better control memory usage in their applications. Refer to Introducing Low-Level GPU Virtual Memory Management.
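
A simplified sketch of this growth pattern, using the VMM APIs listed in section 2.2 below: it assumes a sufficiently large virtual address range was reserved up front with cuMemAddressReserve and that extra is already a multiple of the allocation granularity; the function name grow_mapping and its parameters are illustrative, not taken from the CUDA samples.

// Illustrative sketch: extend an existing VMM-backed buffer in place.
// [base, base + reserved) was reserved up front; the first 'mapped' bytes are already backed.
CUresult grow_mapping(CUdeviceptr base, size_t mapped, size_t extra,
                      const CUmemAllocationProp *prop, const CUmemAccessDesc *access) {
    CUmemGenericAllocationHandle handle;

    // Allocate new physical memory for the extension only.
    CUresult res = cuMemCreate(&handle, extra, prop, 0);
    if (res != CUDA_SUCCESS) return res;

    // Map it right after the already-mapped region; the tracked pointer 'base'
    // and the existing data stay untouched.
    res = cuMemMap(base + mapped, extra, 0, handle, 0);
    if (res == CUDA_SUCCESS) {
        res = cuMemSetAccess(base + mapped, extra, access, 1);
    }

    cuMemRelease(handle);  // drop the handle; the mapping keeps the physical memory alive
    return res;
}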

2.2 API

The VMM APIs belong to the CUDA driver API and mainly include the allocation granularity query API, the virtual address management APIs, the physical memory management APIs, the mapping management APIs and the memory access management APIs.

// Calculates either the minimal or recommended granularity.
CUresult cuMemGetAllocationGranularity(size_t* granularity, const CUmemAllocationProp* prop,
CUmemAllocationGranularity_flags option);

// Allocate an address range reservation.
CUresult cuMemAddressReserve(CUdeviceptr* ptr, size_t size, size_t alignment, CUdeviceptr addr,
unsigned long long flags);
// Free an address range reservation.
CUresult cuMemAddressFree(CUdeviceptr ptr, size_t size);

// Create a CUDA memory handle representing a memory allocation of a given size described by the given properties.
CUresult cuMemCreate(CUmemGenericAllocationHandle* handle, size_t size, const CUmemAllocationProp* prop,
unsigned long long flags);
// Release a memory handle representing a memory allocation which was previously allocated through cuMemCreate.
CUresult cuMemRelease(CUmemGenericAllocationHandle handle);
// Retrieve the contents of the property structure defining properties for this handle.
CUresult cuMemGetAllocationPropertiesFromHandle(CUmemAllocationProp* prop, CUmemGenericAllocationHandle handle);

// Maps an allocation handle to a reserved virtual address range.
CUresult cuMemMap(CUdeviceptr ptr, size_t size, size_t offset, CUmemGenericAllocationHandle handle,
unsigned long long flags);
// Unmap the backing memory of a given address range.
CUresult cuMemUnmap(CUdeviceptr ptr, size_t size);

// Get the access flags set for the given location and ptr.
CUresult cuMemGetAccess(unsigned long long* flags, const CUmemLocation* location, CUdeviceptr ptr);
// Set the access flags for each location specified in desc for the given virtual address range.
CUresult cuMemSetAccess(CUdeviceptr ptr, size_t size, const CUmemAccessDesc* desc, size_t count);

3 Usage

Referring to cuda-samples, sample code for allocating and freeing GPU memory with the VMM APIs is provided below.

3.1 Memory Application

GPU memory allocation mainly includes the following steps: querying the allocation granularity, reserving a virtual address range, allocating physical memory, mapping the virtual address range to the physical memory, releasing the physical memory handle (note that the physical memory is not actually freed here, because the mapping still holds a reference to it), and setting access permissions.

cudaError_t vmm_alloc(void **ptr, size_t size) {
    // Describe the physical memory: pinned device memory on the current device.
    int currentDevice = 0;
    cudaGetDevice(&currentDevice);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = currentDevice;

    // Query the minimum granularity; sizes passed to the VMM APIs must be multiples of it.
    size_t granularity = 0;
    if (cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    // Round the requested size up to a multiple of the granularity.
    size = ((size - 1) / granularity + 1) * granularity;

    // Reserve a virtual address range of the rounded-up size.
    // Error-path cleanup (cuMemAddressFree / cuMemRelease) is omitted for brevity.
    CUdeviceptr dptr;
    if (cuMemAddressReserve(&dptr, size, 0, 0, 0) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    // Allocate the backing physical memory.
    CUmemGenericAllocationHandle allocationHandle;
    if (cuMemCreate(&allocationHandle, size, &prop, 0) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    // Map the physical memory into the reserved virtual address range.
    if (cuMemMap(dptr, size, 0, allocationHandle, 0) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    // Release the handle; the physical memory stays alive as long as it is mapped.
    if (cuMemRelease(allocationHandle) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    // Grant the device read/write access to the mapped range.
    CUmemAccessDesc accessDescriptor = {};
    accessDescriptor.location.id = prop.location.id;
    accessDescriptor.location.type = prop.location.type;
    accessDescriptor.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
    if (cuMemSetAccess(dptr, size, &accessDescriptor, 1) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    *ptr = (void *)dptr;

    return cudaSuccess;
}

3.2 Memory Release

GPU memory release mainly includes the following steps: querying the allocation granularity (to recompute the rounded-up size), unmapping the virtual address range from the physical memory (note that the physical memory is actually freed at this point, since its handle was already released during allocation), and freeing the virtual address reservation.

cudaError_t vmm_free(void *ptr, size_t size) {
    if (!ptr) {
        return cudaSuccess;
    }

    // Rebuild the same property description used at allocation time.
    int currentDevice = 0;
    cudaGetDevice(&currentDevice);

    CUmemAllocationProp prop = {};
    prop.type = CU_MEM_ALLOCATION_TYPE_PINNED;
    prop.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    prop.location.id = currentDevice;

    // Query the granularity so the size can be rounded up exactly as in vmm_alloc.
    size_t granularity = 0;
    if (cuMemGetAllocationGranularity(&granularity, &prop, CU_MEM_ALLOC_GRANULARITY_MINIMUM) != CUDA_SUCCESS) {
        return cudaErrorMemoryAllocation;
    }

    size = ((size - 1) / granularity + 1) * granularity;

    // Unmap the range (this drops the last reference to the physical memory),
    // then free the virtual address reservation.
    if (cuMemUnmap((CUdeviceptr)ptr, size) != CUDA_SUCCESS ||
        cuMemAddressFree((CUdeviceptr)ptr, size) != CUDA_SUCCESS) {
        return cudaErrorInvalidValue;
    }

    return cudaSuccess;
}

4 Issues

4.1 P2P Access

With CUDART, peer-to-peer access between devices can be enabled by directly calling the cudaDeviceEnablePeerAccess API, while with VMM, peer-to-peer access requires calling the cuMemSetAccess API to grant the peer device access to the GPU memory.
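
A sketch of what this might look like, assuming dptr and size come from a vmm_alloc-style mapping owned by device owner; the function name enable_peer_access and its parameters are illustrative.

// Sketch: grant a second device read/write access to a VMM mapping.
CUresult enable_peer_access(CUdeviceptr dptr, size_t size, int owner, int peerDevice) {
    CUmemAccessDesc desc[2] = {};

    desc[0].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc[0].location.id = owner;       // keep access for the owning device
    desc[0].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    desc[1].location.type = CU_MEM_LOCATION_TYPE_DEVICE;
    desc[1].location.id = peerDevice;  // grant access to the peer device
    desc[1].flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;

    return cuMemSetAccess(dptr, size, desc, 2);
}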

4.2 Bandwidth

In a past project, the author compared the H2D, D2H and D2D bandwidth of memory allocated with the VMM APIs against memory allocated with cuMemAlloc (Tesla V100, CUDA 10.2, CUDA driver 470.80, with the host memory being either pageable or pinned), and found that the VMM bandwidth was slightly lower. Parallel, asynchronous and small-packet optimizations were tried, but the effect was not obvious. Puzzled, the author reported the issue to Nvidia; after investigation, its US research engineers said it was an internal bug in the CUDA driver. After they provided a fixed version, retesting showed no obvious bandwidth difference between the two.
