Nvidia GPU Memory Pool-BFC

How to pool GPU memory?

Bruce-Lee-LY
Sep 26, 2023

1 Memory Pool

The term GPU memory pool may sound unfamiliar, while the ordinary memory pool is a better-known concept. The biggest difference between the two is that the underlying APIs for requesting and releasing GPU memory versus host memory are different; the upper-layer framework and principles are essentially the same, so common memory pool management techniques carry over directly.

A memory pool is an efficient memory management mechanism on Linux systems. It pre-allocates a number of memory blocks and organizes them into a pool so that memory can be handed out quickly when it is needed. Compared with traditional allocation, memory pool management greatly improves the efficiency of allocation and release and reduces memory fragmentation.
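As a minimal illustration of the idea (a hypothetical sketch, not tied to any particular library), a fixed-size block pool pre-allocates one slab and recycles blocks through a free list:

#include <cstddef>
#include <vector>

// Minimal fixed-size memory pool sketch: pre-allocates `count` blocks of
// `block_size` bytes and recycles them through a free list, so allocation
// and release are O(1) and cause no per-request system calls.
class FixedBlockPool {
 public:
  FixedBlockPool(size_t block_size, size_t count)
      : storage_(block_size * count) {
    for (size_t i = 0; i < count; ++i)
      free_list_.push_back(storage_.data() + i * block_size);
  }
  void* Allocate() {
    if (free_list_.empty()) return nullptr;  // pool exhausted
    void* p = free_list_.back();
    free_list_.pop_back();
    return p;
  }
  void Release(void* p) { free_list_.push_back(static_cast<char*>(p)); }

 private:
  std::vector<char> storage_;     // one up-front allocation
  std::vector<char*> free_list_;  // blocks currently available
};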

1.1 Linux Kernel Memory Pool

In the Linux kernel, pooled memory management mainly uses the following two methods:

(1) Buddy System

The buddy system manages and allocates memory in units of pages, and every block in the pool has a size of 2^n pages. All free page frames are grouped into 11 linked lists, holding blocks of 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, and 1024 contiguous page frames respectively. When a block of memory needs to be allocated, the allocator finds the smallest available block in the pool that satisfies the request and hands it to the user. When a block is no longer in use, the allocator releases it and checks its "buddy" block: if the buddy is also free, the two are merged into a larger block, and the check continues upward until no further merging is possible.

The advantage of the buddy system is that it reduces memory fragmentation and improves the efficiency of allocation. However, it also has drawbacks: it is awkward for allocating and releasing blocks of irregular sizes, and rounding requests up to a power of two can waste memory.
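The core of the scheme can be sketched in a few lines: a request is rounded up to the nearest power-of-two order, and the buddy of a block is found by flipping one bit of its page-frame index, which is what makes merging on free so cheap (a simplified sketch, not the actual kernel code):

#include <cstddef>
#include <cstdint>

// Order = log2 of the number of pages in a block; the kernel keeps orders
// 0..10 (1 to 1024 contiguous page frames), i.e. 11 free lists.
constexpr int kMaxOrder = 10;

// Smallest order whose block (2^order pages) can hold `pages` pages.
int OrderForPages(size_t pages) {
  int order = 0;
  while ((size_t{1} << order) < pages && order < kMaxOrder) ++order;
  return order;
}

// A block of order `order` starting at page frame number `pfn` has exactly
// one buddy of the same size; XOR-ing the aligned index flips between the
// two halves of the parent block, so free+merge only needs this check.
uint64_t BuddyPfn(uint64_t pfn, int order) {
  return pfn ^ (uint64_t{1} << order);
}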

(2) Slab Allocator

The slab allocator is a finer-grained allocation method layered on top of the large blocks provided by the buddy system. It is aimed at small objects that are frequently allocated and released, typically kernel data structures. Whenever such an object is requested, the slab allocator hands out a unit of that size from a slab list, and when the object is freed it is put back into the list rather than returned directly to the buddy system, which avoids internal fragmentation.

The slab allocator efficiently handles workloads that frequently allocate and release small objects such as kernel data structures. At the same time, it saves a significant amount of memory and avoids excessive fragmentation.
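The behavior can be illustrated with a per-type object cache that keeps freed objects for reuse instead of returning them to the underlying page allocator (a simplified sketch, not the kernel's slab implementation):

#include <vector>

// Simplified slab-style cache for one object type T: freed objects are kept
// in a per-cache list and reused on the next allocation, so frequently used
// structures never go back to the underlying page allocator.
template <typename T>
class ObjectCache {
 public:
  T* Alloc() {
    if (!free_objects_.empty()) {  // reuse a previously freed object
      T* obj = free_objects_.back();
      free_objects_.pop_back();
      return obj;
    }
    return new T();  // fall back to the backing allocator
  }
  void Free(T* obj) { free_objects_.push_back(obj); }  // keep it for reuse

 private:
  std::vector<T*> free_objects_;
};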

1.2 Other Memory Pool

(1) tcmalloc

Tcmalloc is an efficient and stable memory allocator developed by Google. It uses per-thread caches with size classes (a slab-like technique), which effectively avoids memory fragmentation. It performs very well in multi-threaded environments, is fast, and is suitable for large-scale, high-concurrency distributed cluster systems. It also exposes tunable parameters that can be adjusted for different application scenarios and hardware configurations to make allocation more efficient.

On the downside, tcmalloc cannot be mixed with the system's built-in memory allocator, and its optimizations increase the size of the binary, which affects loading speed.

(2) jemalloc

Jemalloc is a general-purpose memory allocator originally developed for FreeBSD. It manages memory mainly through hierarchical (arena-based) allocation and delivers excellent allocation and release performance in multi-threaded environments, performing especially well in virtual machines.

Jemalloc uses a multi-level allocation scheme with fixed size classes, which can easily waste memory, and it is slightly slower than tcmalloc in allocation speed.

(3) mimalloc

Mimalloc is a lightweight memory allocator developed by Microsoft. It performs very well in both performance and code size; compared with tcmalloc and jemalloc, its performance is roughly 10% higher.

Mimalloc cannot coexist with other allocators and must completely replace the existing memory allocator, so it may be limited in some special scenarios.

(4) dlmalloc

Dlmalloc is the allocator from which glibc's malloc is derived. Thanks to its chunking design, memory fragmentation is reduced during allocation, which improves memory utilization. It also supports multi-threaded use and can meet the needs of high-concurrency, high-load applications. For large-scale allocation, dlmalloc is relatively fast and well suited to high-concurrency, high-throughput scenarios.

Dlmalloc's chunking can waste memory, so it is less suitable for scenarios with a small memory footprint.

2 BFC GPU Memory Pool

Similar to a host memory pool, a GPU memory pool mainly solves two kinds of problems: improving the efficiency of frequent GPU memory allocation and release, and reducing GPU memory fragmentation. Beyond that, the GPU driver may perform implicit synchronization when allocating or freeing GPU memory, which hurts program performance, and a pool avoids paying that cost on every request.

The BFC (Best-Fit with Coalescing) GPU memory management algorithm proposed by TensorFlow is a simplified version of the dlmalloc algorithm. It supports multi-threaded scenarios, allocates and recycles GPU memory blocks very efficiently, and keeps GPU memory fragmentation as low as possible.

2.1 Core Data Structures

(1) Chunk

Chunk is the GPU memory block structure; its members include the size, the pointer, the usage status, and the predecessor/successor pointers of a doubly linked list.

 // A Chunk points to a piece of memory that's either entirely free or entirely
// in use by one user memory allocation.
//
// An AllocationRegion's memory is split up into one or more disjoint Chunks,
// which together cover the whole region without gaps. Chunks participate in
// a doubly-linked list, and the prev/next pointers point to the physically
// adjacent chunks.
//
// Since a chunk cannot be partially in use, we may need to split a free chunk
// in order to service a user allocation. We always merge adjacent free
// chunks.
//
// Chunks contain information about whether they are in use or whether they
// are free, and contain a pointer to the bin they are in.
struct Chunk {
size_t size = 0; // Full size of buffer.

// We sometimes give chunks that are larger than needed to reduce
// fragmentation. requested_size keeps track of what the client
// actually wanted so we can understand whether our splitting
// strategy is efficient.
size_t requested_size = 0;

// allocation_id is set to -1 when the chunk is not in use. It is assigned a
// value greater than zero before the chunk is returned from
// AllocateRaw, and this value is unique among values assigned by
// the parent allocator.
int64_t allocation_id = -1;
void* ptr = nullptr; // pointer to granted subbuffer.

// If not kInvalidChunkHandle, the memory referred to by 'prev' is directly
// preceding the memory used by this chunk. E.g., It should start
// at 'ptr - prev->size'
ChunkHandle prev = kInvalidChunkHandle;

// If not kInvalidChunkHandle, the memory referred to by 'next' is directly
// following the memory used by this chunk. E.g., It should be at
// 'ptr + size'
ChunkHandle next = kInvalidChunkHandle;

// What bin are we in?
BinNum bin_num = kInvalidBinNum;

// Optional count when this chunk was most recently made free.
uint64 freed_at_count = 0;

bool in_use() const { return allocation_id != -1; }

#ifdef TENSORFLOW_MEM_DEBUG
// optional debugging info
const char* op_name = nullptr;
uint64 step_id = 0;
int64 action_count = 0;
#endif

string DebugString(BFCAllocator* a,
bool recurse) TF_NO_THREAD_SAFETY_ANALYSIS {
string dbg;
strings::StrAppend(
&dbg, " Size: ", strings::HumanReadableNumBytes(size),
" | Requested Size: ", strings::HumanReadableNumBytes(requested_size),
" | in_use: ", in_use(), " | bin_num: ", bin_num);
if (recurse && prev != BFCAllocator::kInvalidChunkHandle) {
Chunk* p = a->ChunkFromHandle(prev);
strings::StrAppend(&dbg, ", prev: ", p->DebugString(a, false));
}
if (recurse && next != BFCAllocator::kInvalidChunkHandle) {
Chunk* n = a->ChunkFromHandle(next);
strings::StrAppend(&dbg, ", next: ", n->DebugString(a, false));
}
#ifdef TENSORFLOW_MEM_DEBUG
strings::StrAppend(&dbg, ", for: ", op_name ? op_name : "UNKNOWN",
", stepid: ", step_id,
", last_action: ", action_count);
#endif
return dbg;
}
};

(2) Bin

A Bin is a collection of free chunks of similar sizes. Each bin has a fixed size; there are 21 bins in total, whose sizes double from 256B up to 512M. A free chunk is stored in the bin whose range covers it, i.e. its size lies between the current bin's size and the next bin's size.

  // A Bin is a collection of similar-sized free chunks.
// Allocated chunks are never in a Bin.
struct Bin {
// All chunks in this bin have >= bin_size memory.
size_t bin_size = 0;

class ChunkComparator {
public:
explicit ChunkComparator(BFCAllocator* allocator)
: allocator_(allocator) {}
// Sort first by size and then use pointer address as a tie breaker.
bool operator()(const ChunkHandle ha,
const ChunkHandle hb) const TF_NO_THREAD_SAFETY_ANALYSIS {
const Chunk* a = allocator_->ChunkFromHandle(ha);
const Chunk* b = allocator_->ChunkFromHandle(hb);
if (a->size != b->size) {
return a->size < b->size;
}
return a->ptr < b->ptr;
}

private:
BFCAllocator* allocator_; // The parent allocator
};

typedef std::set<ChunkHandle, ChunkComparator> FreeChunkSet;
// List of free chunks within the bin, sorted by chunk size.
// Chunk * not owned.
FreeChunkSet free_chunks;
Bin(BFCAllocator* allocator, size_t bs)
: bin_size(bs), free_chunks(ChunkComparator(allocator)) {}
};
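To make the bin scheme concrete, here is a small sketch assuming the 256B-granularity, 21-bin layout described above; BinNumToSize/BinNumForSize are written out in simplified form (the log uses a GCC/Clang builtin):

#include <algorithm>
#include <cstddef>
#include <cstdint>

constexpr int kNumBins = 21;           // bins 0..20
constexpr int kMinAllocationBits = 8;  // 256-byte granularity

// Bin i collects free chunks whose size is at least 256 << i bytes.
size_t BinNumToSize(int bin_num) { return size_t{256} << bin_num; }

// First bin that could contain a free chunk large enough for `bytes`.
int BinNumForSize(size_t bytes) {
  uint64_t v = std::max<size_t>(bytes, 256) >> kMinAllocationBits;
  int log2_floor = 63 - __builtin_clzll(v);  // floor(log2(v)), v >= 1
  return std::min(kNumBins - 1, log2_floor);
}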

2.2 Helper Class

The two helper classes AllocationRegion and RegionManager mainly record the mapping between each GPU memory pointer handed to the user and the corresponding chunk, so that the memory can be located and recycled later.

 // BFCAllocator allocates memory into a collection of disjoint
// AllocationRegions. Each AllocationRegion corresponds to one call to
// SubAllocator::Alloc(). (Actually, if a subsequent call to
// SubAllocator::Alloc() returns another region immediately adjacent to the
// last, it will be used to extend the first AllocationRegion, not create a
// separate one.)
//
// An AllocationRegion contains one or more Chunks, covering all of its
// memory. Its primary job is to map pointers to ChunkHandles.
//
// This class is thread-compatible.
class AllocationRegion {
public:
AllocationRegion(void* ptr, size_t memory_size)
: ptr_(ptr),
memory_size_(memory_size),
end_ptr_(
static_cast<void*>(static_cast<char*>(ptr_) + memory_size_)) {
DCHECK_EQ(0, memory_size % kMinAllocationSize);
const size_t n_handles =
(memory_size + kMinAllocationSize - 1) / kMinAllocationSize;
handles_.resize(n_handles, kInvalidChunkHandle);
}

AllocationRegion() = default;
AllocationRegion(AllocationRegion&& other) { Swap(&other); }
AllocationRegion& operator=(AllocationRegion&& other) {
Swap(&other);
return *this;
}

void* ptr() const { return ptr_; }
void* end_ptr() const { return end_ptr_; }
size_t memory_size() const { return memory_size_; }
void extend(size_t size) {
memory_size_ += size;
DCHECK_EQ(0, memory_size_ % kMinAllocationSize);

end_ptr_ = static_cast<void*>(static_cast<char*>(end_ptr_) + size);
const size_t n_handles =
(memory_size_ + kMinAllocationSize - 1) / kMinAllocationSize;
handles_.resize(n_handles, kInvalidChunkHandle);
}
ChunkHandle get_handle(const void* p) const {
return handles_[IndexFor(p)];
}
void set_handle(const void* p, ChunkHandle h) { handles_[IndexFor(p)] = h; }
void erase(const void* p) { set_handle(p, kInvalidChunkHandle); }

private:
void Swap(AllocationRegion* other) {
std::swap(ptr_, other->ptr_);
std::swap(memory_size_, other->memory_size_);
std::swap(end_ptr_, other->end_ptr_);
std::swap(handles_, other->handles_);
}

size_t IndexFor(const void* p) const {
std::uintptr_t p_int = reinterpret_cast<std::uintptr_t>(p);
std::uintptr_t base_int = reinterpret_cast<std::uintptr_t>(ptr_);
DCHECK_GE(p_int, base_int);
DCHECK_LT(p_int, base_int + memory_size_);
return static_cast<size_t>(((p_int - base_int) >> kMinAllocationBits));
}

// Metadata about the allocation region.
void* ptr_ = nullptr;
size_t memory_size_ = 0;
void* end_ptr_ = nullptr;

// Array of size "memory_size / kMinAllocationSize". It is
// indexed by (p-base) / kMinAllocationSize, contains ChunkHandle
// for the memory allocation represented by "p"
std::vector<ChunkHandle> handles_;

TF_DISALLOW_COPY_AND_ASSIGN(AllocationRegion);
};

// RegionManager aggregates one or more "AllocationRegions" and provides
// a layer of indirection from pointers to the underlying ChunkHandle,
// allowing allocation across multiple discontiguous memory regions.
//
// This class is thread-compatible.
class RegionManager {
public:
RegionManager() {}
~RegionManager() {}

void AddAllocationRegion(void* ptr, size_t memory_size) {
// Insert sorted by end_ptr.
auto entry =
std::upper_bound(regions_.begin(), regions_.end(), ptr, &Comparator);
regions_.insert(entry, AllocationRegion(ptr, memory_size));
}

// Adds an alloation region for the given ptr and size, potentially
// extending a region if ptr matches the end_ptr of an existing region.
// If a region is extended, returns a pointer to the extended region so that
// the BFC allocator can reason about chunkification.
AllocationRegion* AddOrExtendAllocationRegion(void* ptr,
size_t memory_size) {
// Insert sorted by end_ptr.
auto entry =
std::upper_bound(regions_.begin(), regions_.end(), ptr, &Comparator);
// Check if can be coalesced with preceding region.
if (entry != regions_.begin()) {
auto preceding_region = entry - 1;
if (preceding_region->end_ptr() == ptr) {
if (VLOG_IS_ON(1)) {
LOG(INFO) << "Extending region " << preceding_region->ptr()
<< " of "
<< strings::HumanReadableNumBytes(
preceding_region->memory_size())
<< " by " << strings::HumanReadableNumBytes(memory_size)
<< " bytes";
}
preceding_region->extend(memory_size);
return &*preceding_region;
}
}
VLOG(1) << "Inserting new region " << ptr << " of "
<< strings::HumanReadableNumBytes(memory_size);
regions_.insert(entry, AllocationRegion(ptr, memory_size));
return nullptr;
}

std::vector<AllocationRegion>::iterator RemoveAllocationRegion(
std::vector<AllocationRegion>::iterator it) {
return regions_.erase(it);
}

ChunkHandle get_handle(const void* p) const {
return RegionFor(p)->get_handle(p);
}

void set_handle(const void* p, ChunkHandle h) {
return MutableRegionFor(p)->set_handle(p, h);
}
void erase(const void* p) { return MutableRegionFor(p)->erase(p); }

const std::vector<AllocationRegion>& regions() const { return regions_; }

private:
static bool Comparator(const void* ptr, const AllocationRegion& other) {
return ptr < other.end_ptr();
}

AllocationRegion* MutableRegionFor(const void* p) {
return const_cast<AllocationRegion*>(RegionFor(p));
}

const AllocationRegion* RegionFor(const void* p) const {
auto entry =
std::upper_bound(regions_.begin(), regions_.end(), p, &Comparator);

if (entry != regions_.end()) {
return &(*entry);
}

LOG(FATAL) << "Could not find Region for " << p;
return nullptr;
}

private:
std::vector<AllocationRegion> regions_;
};
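As a worked example of the pointer-to-handle mapping above (assuming kMinAllocationSize is 256 bytes, i.e. kMinAllocationBits == 8, matching the 256B granularity of the smallest bin), a pointer 10 KB past the start of a region lands in handle slot 40:

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Worked example of AllocationRegion::IndexFor: every 256-byte slot of a
// region has one entry in handles_, so a pointer is mapped to its chunk by
// a subtraction and a shift.
int main() {
  const std::uintptr_t base = 0x7f0000000000;    // hypothetical region start
  const std::uintptr_t p    = base + 10 * 1024;  // pointer handed to the user
  const std::size_t index = (p - base) >> 8;     // -> handles_[40]
  std::printf("handle index = %zu\n", index);    // prints 40
}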

2.3 GPU Memory Allocation Process

BFC has two allocation modes, controlled by the allow_growth option. When allow_growth is true, the pool grows incrementally on demand: GPU memory regions are actually allocated according to the size of each request (a new region is opened only when two conditions are met: no free chunk is found in the bins, and the GPU memory already acquired by the pool plus the size currently requested does not exceed the pool's total_memory). When allow_growth is false, the pool is allocated in one shot: a region of total_memory bytes is acquired up front and then carved up among users according to their allocation requests.

struct Options {
bool allow_growth = true;

// If true, the allocator may sleep for a period of time when it can't
// fulfill an allocation request, in the hopes that another thread will free
// up memory in the meantime.
//
// If false, the allocator will never sleep, even if
// AllocationAttributes::attr_retry_on_failure is true.
bool allow_retry_on_failure = true;

// Whether the allocator will deallocate free regions to avoid OOM due to
// memory fragmentation.
bool garbage_collection = false;

// Controls when a chunk should be split, if its size exceeds the requested
// allocation size.
double fragmentation_fraction = 0;
};
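The practical difference between the two modes is how large each call to SubAllocator::Alloc() is. The sketch below is a simplified, illustrative view of that growth policy; the function name and the 2MB seed are assumptions for illustration rather than the exact implementation:

#include <algorithm>
#include <cstddef>

// Simplified view of how the region size to request from the sub-allocator
// is chosen. With allow_growth the pool starts small and doubles as needed;
// with allow_growth == false it grabs the whole budget up front.
size_t NextRegionBytes(bool allow_growth, size_t total_memory,
                       size_t rounded_bytes, size_t& curr_region_bytes) {
  if (!allow_growth) return total_memory;  // one-shot allocation
  if (curr_region_bytes == 0)
    curr_region_bytes = std::min<size_t>(total_memory, 2 << 20);  // ~2MB seed
  while (curr_region_bytes < rounded_bytes)  // grow until the request fits
    curr_region_bytes *= 2;
  return std::min(curr_region_bytes, total_memory);
}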

The allocation path first searches the bins for a free chunk (FindChunkPtr). If one is found, the allocator decides whether the chunk needs to be split (SplitChunk) and, after processing, returns it to the user. If none is found there are two cases: when allow_growth is true, the allocator tries to open a new GPU memory region (Extend) and retries; if extending fails, OOM is returned. When allow_growth is false, OOM is returned directly.

void* BFCAllocator::AllocateRawInternal(size_t unused_alignment,
size_t num_bytes,
bool dump_log_on_failure,
uint64 freed_before) {
if (num_bytes == 0) {
VLOG(2) << "tried to allocate 0 bytes";
return nullptr;
}
// First, always allocate memory of at least kMinAllocationSize
// bytes, and always allocate multiples of kMinAllocationSize bytes
// so all memory addresses are nicely byte aligned.
size_t rounded_bytes = RoundedBytes(num_bytes);

// The BFC allocator tries to find the best fit first.
BinNum bin_num = BinNumForSize(rounded_bytes);

mutex_lock l(lock_);
if (!timestamped_chunks_.empty()) {
// Merge timestamped chunks whose counts have become safe for general use.
MergeTimestampedChunks(0);
}
void* ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}

// Try to extend
if (Extend(unused_alignment, rounded_bytes)) {
ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
}

if ((freed_before == 0) && (!timestamped_chunks_.empty())) {
// We're unable to satisfy an allocation request without a specific
// timestamp requirement. Rather than fail, try merging any held-out
// timestamped chunks more aggressively until a free chunk of the necessary
// size is formed.
if (MergeTimestampedChunks(rounded_bytes)) {
ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
}
}

// Reaching this point means that no chunks can satisfy the request. Also,
// the unallocated bytes cannot satisfy the request. Before giving up, let's
// try deallocating free regions so that suballocator can combine them with
// the unallocated bytes and form a larger region.
if (DeallocateFreeRegions(rounded_bytes) &&
Extend(unused_alignment, rounded_bytes)) {
ptr = FindChunkPtr(bin_num, rounded_bytes, num_bytes, freed_before);
if (ptr != nullptr) {
AddTraceMe("MemoryAllocation", ptr);
return ptr;
}
}

// We searched all bins for an existing free chunk to use and
// couldn't find one. This means we must have run out of memory,
// Dump the memory log for analysis.
MaybeWriteMemoryMap();
if (dump_log_on_failure) {
LOG(WARNING)
<< "Allocator (" << Name() << ") ran out of memory trying "
<< "to allocate " << strings::HumanReadableNumBytes(num_bytes)
<< " (rounded to " << rounded_bytes << ")"
<< "requested by op "
<< tensorflow::profiler::ScopedMemoryDebugAnnotation::
CurrentAnnotation()
.pending_op_name
<< "\nIf the cause is memory fragmentation maybe the environment "
<< "variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will "
<< "improve the situation. \nCurrent allocation summary follows."
<< "\nCurrent allocation summary follows.";
DumpMemoryLog(rounded_bytes);
LOG(WARNING) << RenderOccupancy();
}
return nullptr;
}
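FindChunkPtr itself is a best-fit scan: starting from the bin computed for the rounded size, it walks the size-sorted free sets and takes the first chunk that is large enough, splitting it when it is much larger than the request. The snippet below is a simplified stand-in with toy types, only meant to show the shape of that search:

#include <cstddef>
#include <map>
#include <set>

// Simplified stand-ins for the allocator's internals: bins hold free chunk
// sizes in ascending order, and the search starts at the bin for the rounded
// request and works upward through larger bins.
struct MiniBin { std::multiset<size_t> free_chunk_sizes; };

// Returns the size of the chunk that would be handed out (0 on failure).
// A real implementation would also split the chunk and mark it in use.
size_t BestFit(std::map<int, MiniBin>& bins, int start_bin,
               size_t rounded_bytes) {
  for (auto it = bins.lower_bound(start_bin); it != bins.end(); ++it) {
    auto& sizes = it->second.free_chunk_sizes;
    auto fit = sizes.lower_bound(rounded_bytes);  // smallest chunk that fits
    if (fit != sizes.end()) {
      size_t chosen = *fit;
      sizes.erase(fit);  // remove from the free set
      return chosen;     // caller splits if chosen is much larger than needed
    }
  }
  return 0;  // nothing fits; the caller falls back to Extend()
}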

2.4 GPU Memory Recycling Process

During recycling, the allocator first finds the corresponding chunk from the incoming GPU memory pointer (region_manager_.get_handle), then decides whether it should be merged with the physically previous and next chunks (TryToCoalesce). After that, the resulting chunk is put back into the corresponding bin (InsertFreeChunkIntoBin).

void BFCAllocator::DeallocateRawInternal(void* ptr) {
if (ptr == nullptr) {
VLOG(2) << "tried to deallocate nullptr";
return;
}
mutex_lock l(lock_);

// Find the chunk from the ptr.
BFCAllocator::ChunkHandle h = region_manager_.get_handle(ptr);
CHECK(h != kInvalidChunkHandle);
// Record chunk information before it's freed.
Chunk* chunk = ChunkFromHandle(h);
void* chunk_ptr = chunk->ptr;
int64_t req_bytes = chunk->requested_size;
int64_t alloc_bytes = chunk->size;

MarkFree(h);

// Consider coalescing it.
if (timing_counter_) {
InsertFreeChunkIntoBin(h);
timestamped_chunks_.push_back(h);
} else {
InsertFreeChunkIntoBin(TryToCoalesce(h, false));
}

// TraceMe needs to be added after MarkFree and InsertFreeChunkIntoBin for
// correct aggregation stats (bytes_in_use, fragmentation).
AddTraceMe("MemoryDeallocation", chunk_ptr, req_bytes, alloc_bytes);

if (VLOG_IS_ON(4)) {
LOG(INFO) << "F: " << RenderOccupancy();
}
}
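TryToCoalesce can be pictured as follows: if the physically previous or next chunk is also free, it is removed from its bin and folded into the current chunk before reinsertion. The sketch below uses a toy chunk type to show just the merging step:

#include <cstddef>

// Toy chunk for illustrating coalescing: prev/next point to the physically
// adjacent chunks, just like the BFC Chunk structure above.
struct ToyChunk {
  size_t size = 0;
  bool in_use = false;
  ToyChunk* prev = nullptr;
  ToyChunk* next = nullptr;
};

// Merge `c` with free physical neighbours and return the surviving chunk.
// A real implementation would also remove the absorbed chunks from their
// bins and release their handles.
ToyChunk* Coalesce(ToyChunk* c) {
  if (c->next && !c->next->in_use) {  // absorb the following chunk
    ToyChunk* n = c->next;
    c->size += n->size;
    c->next = n->next;
    if (c->next) c->next->prev = c;
  }
  if (c->prev && !c->prev->in_use) {  // fold into the preceding chunk
    ToyChunk* p = c->prev;
    p->size += c->size;
    p->next = c->next;
    if (p->next) p->next->prev = p;
    return p;
  }
  return c;
}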

3 Other

In general, once you understand ordinary memory pools, you will find that the principle of the TensorFlow BFC memory pool is fairly simple and the source code is not hard to read. In addition, for a minimalist implementation of BFC you can refer to memory_pool.
