# Nvidia Tensor Core-CUDA HGEMM Advanced Optimization

## How to extremely optimize CUDA HGEMM with Tensor Core?

# 1 Background

GEMM (General Matrix Multiplication) matrix multiplication is one of the most commonly used and time-consuming algorithms in deep learning, especially in the fields of CNN, RNN, transformer and other fields. In these fields, a large number of matrix multiplication operations need to be calculated and processed quickly. Therefore, efficient matrix multiplication implementation is crucial for the performance and accuracy of deep learning tasks.

HGEMM (Half-precision General Matrix Multiplication) with the support of the Tensor Core hardware unit on Nvidia GPU, can greatly increase the calculation speed while maintaining accuracy. The resulting performance advantages can significantly improve deep learning. The speed of implementation of inference and training tasks.

The emergence of Tensor Core has brought breakthrough progress to the optimization of HGEMM. Using Tensor Core to optimize the HGEMM on Nvidia GPU has become one of the hot research areas in GPU accelerated computing in recent years.

# 2 Result

This article mainly uses handwritten WMMA API and MMA PTX CUDA HGEMM kernel to call Tensor Core, then performs performance tuning, and compares it with cublas’ Tensor Core performance. By exploring various matrix tiling and optimization methods, currently, the performance between 256 ~ 16384 dimensions is not lower than 95% of cublas performance, and the performance exceeds cublas in many scenarios. The code is open sourced in cuda_hgemm.

## 2.1 Test Conditions

- HGEMM: C (M * N, Half, Row Major) = A (M * K, Half, Row Major) * B (K * N, Half, Col Major)
- CUDA: 11.3
- GPU: RTX3090、RTX A6000

## 2.2 Equipment Specifications

The device specifications of RTX3090 and RTX A6000 are as follows.

## 2.3 RTX3090

## 2.4 RTX A6000

# 3 Matrix Tiling

## 3.1 Block Tiling

For a fixed-size input matrix, usually the larger the block tiling size, the greater the calculation amount within a single block, but the smaller the number of blocks that need to be calculated in parallel. This is a trade-off between the block dimension calculation amount and the degree of parallelism.

Generally speaking, for input matrices of different sizes, different blocking strategies need to be adopted to better achieve a balance between block dimension calculation volume and parallelism. The smaller the size of the input matrix, the smaller the block tiling size. The larger the size of the input matrix, the larger the block tiling size. On the other hand, combined with the limitations of hardware specifications, the block tiling size is generally a combination between 32, 64, 128, and 256.

## 3.2 Warp Tiling

After determining the block tiling size, you need to continue to determine the warp tiling size. Generally speaking, the larger the warp tiling size, the greater the calculation amount within a single warp, but the smaller the number of warps that need to be calculated in parallel. This is the trade-off between warp dimension calculation amount and parallelism.

If the number of warps is too small, warp occupancy will be too low, that is, the number of eligible warps in the scheduler is too small, which may seriously affect the instruction issuance cycle inside the scheduler, and ultimately affect the performance of the kernel. On the other hand, combined with the limitations of hardware specifications, the number of warps is generally 4, 8, 16, and combined with the block tiling size, the optimal warp tiling size can be determined, generally a combination between 8, 16, 32, 64, and 128.

# 4 Memory Access Optimization

## 4.1 Wide Instruction Memory Access

The data of A matrix and B matrix are loaded from global memory to shared memory in 16Bytes (int4 or float4) as much as possible to reduce the number of loading instructions, improve the calculation memory access ratio of instructions, and reduce the delay of instruction issuance.

## 4.2 Data Reuse

As shown in the figure below, when block calculates the green block of the C matrix, it needs the entire light blue block of the A matrix and the entire light yellow block of the B matrix. For different warps in the block, when different warps calculate the same row or column, the data of matrix A or matrix B that needs to be loaded is the same, that is, there may be repeated loading of data of matrix A or matrix B inside the block.

The solution is to first load the data of the A matrix or B matrix into the shared memory to achieve sharing within the block, and then the warp in the block loads the data from the shared memory to the register for calculation. Since the access latency of shared memory is much smaller than that of global memory, it can significantly alleviate the bandwidth bottleneck of kernel.

## 4.3 Asynchronous Copy

The cp.async is a non-blocking asynchronous copy PTX instruction that can copy data from global memory to shared memory. It not only has a certain cache control strategy, but can also independently control multiple groups of emission instructions. To a certain extent, the optimization scope of data migration and the pipeline parallel optimization strategy are expanded.

`cp.async.ca.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}`

[dst], [src], cp-size{, src-size}{, cache-policy} ;

cp.async.cg.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}

[dst], [src], 16{, src-size}{, cache-policy} ;

cp.async.ca.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}

[dst], [src], cp-size{, ignore-src}{, cache-policy} ;

cp.async.cg.shared{::cta}.global{.level::cache_hint}{.level::prefetch_size}

[dst], [src], 16{, ignore-src}{, cache-policy} ;

.level::cache_hint = { .L2::cache_hint }

.level::prefetch_size = { .L2::64B, .L2::128B, .L2::256B }

cp-size = { 4, 8, 16 }

## 4.4 Eliminate Bank Conflict

Generally speaking, the shared memory of Nvidia GPU is divided into 32 banks, each bank defaults to 4Bytes, totaling 128Bytes. Normally, when a thread in a single warp accesses different banks of shared memory, only one video memory request is required. If a thread in a single warp accesses different fields in the same bank of shared memory, bank conflict will occur, and the memory request will become serial execution, significantly increasing the memory access delay.

Therefore, eliminating the bank conflict of shared memory is almost the basic operation of all CUDA kernel optimization, usually using padding or permuted methods.

**(1) Padding**

As shown in the figure below, assuming that the blue area is the shared memory area to be accessed by a single warp, if it is accessed directly, an obvious bank conflict will occur.

If when applying for shared memory, each row applies for 4 additional yellow areas of the bank. At this time, the shared memory area to be accessed by a single warp happens to be in a different bank, thus avoiding the occurrence of bank conflict.

The padding method can effectively solve the bank conflict problem of shared memory, but it will apply for additional shared memory and increase the usage of shared memory inside the block.

**(2) Permuted**

Can the bank conflict problem be solved without applying for additional shared memory? The answer is yes.

As shown in the figure below, assuming that matrix A is first loaded from global memory to shared memory, the same color area represents a memory access transaction of a single warp. If the corresponding location is stored, and then the ldmatrix PTX instruction is used to load it from shared memory to the register, obvious bank conflict will occur. If you perform a permuted operation when loading into shared memory, and then use the ldmatrix PTX instruction to load from shared memory into a register, bank conflict will be avoided.

## 4.5 Improve L2 Cache Hit Rate

The block division problem was mentioned earlier. If the block division has been completed, how should the calculation sequence of these block divisions be designed during actual calculations. The most direct way is to calculate by row. This calculation method will cause a problem and significantly reduce the L2 cache hit rate.

Generally speaking, for the C matrix block in the same row, the required block data of the A matrix is the same. Similarly, for the C matrix block of the same column, the required block data of the B matrix Are the same. If calculated by rows, for matrix A, the data is the same, and the L2 cache hit rate is very high. But for matrix B, the data is different, and the L2 cache hit rate is very low. Taken together, considering the capacity of the L2 cache , the L2 cache hit rate will not be too high.

Therefore, the swizzle calculation method as shown in the figure below can be adopted, that is, the “cattle farming” calculation, taking into account the L2 cache hit rate of the A matrix and the B matrix, and improving the overall L2 cache hit rate. At the same time, the block size and L2 cache capacity can be combined to adjust the “cattle farming” step size to obtain the highest L2 cache hit rate.

## 4.6 Improve Register Reuse Rate

The warp tiling problem was mentioned earlier. If the warp tiling has been completed, then how to design the calculation sequence of the warp internal tile (Tensor Core calculation size) during actual calculation. Here, the cutlass source code provides an idea.

For devices with computing capabilities of 8.0 and above, adopt the “Right Left Right Left” calculation method to improve the reuse rate of the A matrix register.

For devices with computing capabilities below 8.0, adopt the “Down Up Down Up” calculation method to improve the reuse rate of B matrix registers.

# 5 Pipeline Optimization

Generally speaking, the calculation process of CUDA HGEMM is to first copy the block data of A matrix and B matrix from global memory to shared memory in batches according to K dimensions, and then copy the data in shared memory to the register batch by batch for matrix multiplication. Calculate until all data is completed matrix multiplication calculation, the schematic diagram is as follows.

## 5.1 Double Buffer

Double buffer is a data prefetching method and is also one of the commonly used methods in GEMM optimization. It mainly applies for two shared memory buffers and loads data alternately. When one of the buffers is loading data, the other buffer has completed data loading and can perform matrix multiplication calculations. In this way, data loading and matrix multiplication calculations can be parallelized and the delay in data loading can be hidden. As shown in the figure below, the workflow of double buffer is shown. Light blue is the data loading buffer, and dark blue is the matrix multiplication calculation buffer. The two pipelines can be executed in parallel.

## 5.2 Stage

On the basis of double buffer, think about this problem. If in step 1, data loading and matrix multiplication calculation are executed in parallel, but the matrix multiplication calculation is executed first and the data loading is not completed, then the Tensor Core computing unit will appear. When waiting for data, that is, the SM will be idle, and the overall utilization of the SM will be low. How to solve this problem is how to prevent the Tensor Core computing unit from waiting for data.

Since it is a bandwidth bottleneck problem, more shared memory buffers can be introduced. Data loading of multiple buffers can be completed during prefetching, thus avoiding the situation where the Tensor Core computing unit is waiting for data. The figure below shows the workflow when stage is 3. The data loading of each step is parallel to the matrix multiplication calculation of two steps, further hiding the delay of data loading. Similarly, when the shared memory is sufficient, the stage can continue to increase until the delay in data loading can be fully hidden. Therefore, the choice of stage depends on the device bandwidth and Tensor Core computing power.

# 6 Other

## 6.1 Optimization Method

This article mainly introduces the general optimization methods and some special optimization methods of CUDA HGEMM. Later, we may talk about the optimization experience in detail on a certain optimization point. For different GPU and CUDA versions, the optimization strategies to achieve optimal performance are different.

## 6.2 Source Code

All optimization methods used in this article are open sourced in cuda_hgemm, including the implementation code of WMMA API and MMA PTX. The block tiling size (256*128) and warp tiling size (64*64) are fixed, and may be analyzed together with the source code in the future.