Nvidia Tensor Core: A Preliminary Exploration

What is Tensor Core?

Bruce-Lee-LY
Sep 25, 2023

1 Background

In deep-learning-based image processing built on convolutional networks, the compute-intensive convolution operator has long been a focus of engineering optimization. Convolutions are generally lowered to matrix multiplications, so optimizing matrix multiplication has naturally become one of the optimization directions that deep learning frameworks care about most. In view of this, Nvidia provides a dedicated hardware solution, the Tensor Core, which accelerates matrix multiplication, enables mixed-precision computation, and improves throughput while maintaining accuracy.

2 Hardware Unit

Like the CUDA Core, the Tensor Core is an arithmetic unit inside the SM, but it specializes in matrix multiply-accumulate operations. Taking the SM of Turing TU102/TU104/TU106 as an example, each SM is divided into 4 processing blocks, and each processing block contains 16 FP32 Cores, 16 INT32 Cores, 2 Tensor Cores, 1 Warp Scheduler, and 1 Dispatch Unit.

3 Architecture

Since the first generation of Tensor Core was introduced with the Volta architecture, it has improved significantly with each subsequent architecture generation, and the set of supported data types has gradually grown.

3.1 Volta Tensor Core

The first-generation Tensor Core supports mixed-precision matrix multiplication with FP16 inputs and FP32 accumulation, delivering more than 100 TFLOPS of deep learning performance, over 5 times that of the Pascal architecture. Compared with Pascal, peak TFLOPS for training improves by up to 12 times, peak TFLOPS for inference by up to 6 times, and overall training and inference performance by about 3 times.

3.2 Turing Tensor Core

The second-generation Tensor Core offers a range of precisions for deep learning training and inference (from FP32 to FP16 to INT8 and INT4), delivering up to 500 trillion tensor operations per second.

3.3 Ampere Tensor Core

The third-generation Tensor Core adds new precisions, Tensor Float 32 (TF32) and 64-bit floating point (FP64), to accelerate and simplify AI applications, speeding up AI by up to 20 times.

3.4 Hopper Tensor Core

The fourth-generation Tensor Core introduces a new 8-bit floating-point precision (FP8), providing 6 times higher performance than FP16 for trillion-parameter model training. FP8 is used by the Transformer Engine, which mixes FP8 and FP16 precision to greatly accelerate Transformer training while preserving accuracy. FP8 also substantially speeds up large language model inference, with performance up to 30 times higher than Ampere.

4 Calling Tensor Core

In addition to calling Tensor Core through libraries in the CUDA ecosystem such as cuBLAS and cuDNN, Nvidia also provides the following lower-level ways to program Tensor Core.
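
As a point of reference for the library route, a mixed-precision GEMM that lets cuBLAS dispatch to Tensor Core can be written roughly as follows. This is a minimal sketch assuming CUDA 11 or later (where the compute type is spelled CUBLAS_COMPUTE_32F); the function name, buffer names, column-major layouts, and omitted error checking are illustrative assumptions, not code from the original article.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C (FP32, m x n) = A (FP16, m x k) * B (FP16, k x n), column-major.
// With FP16 inputs and FP32 accumulation, cuBLAS is free to use Tensor Cores.
void gemm_with_tensor_cores(cublasHandle_t handle, int m, int n, int k,
                            const __half *d_A, const __half *d_B, float *d_C) {
    const float alpha = 1.0f, beta = 0.0f;
    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,   // A: m x k, lda = m
                 d_B, CUDA_R_16F, k,   // B: k x n, ldb = k
                 &beta,
                 d_C, CUDA_R_32F, m,   // C: m x n, ldc = m
                 CUBLAS_COMPUTE_32F,   // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT);
}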

4.1 WMMA (Warp-level Matrix Multiply Accumulate) API

For CUDA devices with compute capability 7.0 and above, the CUDA C++ WMMA API can be used to call Tensor Core; it supports mixed-precision matrix multiplication of the form D = AB + C.

template<typename Use, int m, int n, int k, typename T, typename Layout=void> class fragment;

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm);
void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm, layout_t layout);
void store_matrix_sync(T* mptr, const fragment<...> &a, unsigned ldm, layout_t layout);
void fill_fragment(fragment<...> &a, const T& v);
void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool satf=false);
  • fragment: Tensor Core data storage class; supports the matrix_a, matrix_b, and accumulator uses
  • load_matrix_sync: Tensor Core data loading API; loads matrix data from global or shared memory into a fragment
  • store_matrix_sync: Tensor Core result storage API; stores the computed result from a fragment to global or shared memory
  • fill_fragment: fragment filling API; fills a fragment with a constant value
  • mma_sync: Tensor Core matrix multiply-accumulate API; computes D = AB + C or C = AB + C
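
Putting these pieces together, the following is a minimal sketch (not code from the original article) of a warp-level kernel that computes one 16x16 tile of D = AB + C; the kernel name, tile shapes, and layouts are illustrative assumptions.

#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// One warp computes a single 16x16 tile of D = A * B + C with FP16 inputs and
// FP32 accumulation (compute capability 7.0+).
__global__ void wmma_16x16x16_kernel(const half *a, const half *b,
                                     const float *c, float *d) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    // Load the 16x16 input tiles; the leading dimension is 16 for one tile.
    wmma::load_matrix_sync(a_frag, a, 16);
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::load_matrix_sync(acc_frag, c, 16, wmma::mem_row_major);

    // acc = A * B + acc, executed cooperatively by the whole warp.
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    // Write the accumulator tile back to global memory.
    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Launched with a single warp, e.g. wmma_16x16x16_kernel<<<1, 32>>>(a, b, c, d);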

4.2 WMMA PTX (Parallel Thread Execution)

For CUDA devices with compute capability 7.0 and above, the WMMA PTX instructions can also be used to call Tensor Core; they support mixed-precision matrix multiplication of the form D = AB + C.

wmma.load.a.sync.aligned.layout.shape{.ss}.atype r, [p] {, stride};
wmma.load.b.sync.aligned.layout.shape{.ss}.btype r, [p] {, stride};
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride};

wmma.store.d.sync.aligned.layout.shape{.ss}.type [p], r {, stride};

wmma.mma.sync.aligned.alayout.blayout.shape.dtype.ctype d, a, b, c;
  • wmma.load: Tensor Core data loading instructions; load matrix data from global or shared memory into Tensor Core registers
  • wmma.store: Tensor Core result storage instruction; stores the computed result from Tensor Core registers to global or shared memory
  • wmma.mma: Tensor Core matrix multiply-accumulate instruction; computes D = AB + C or C = AB + C
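
As an illustration of how one of these instructions is instantiated from CUDA C++ via inline PTX, the sketch below (an assumption-laden example, not from the original article) wraps the m16n16k16 row-major FP16 matrix_a load from shared memory, which occupies 8 .b32 registers per thread. The b/c loads, wmma.mma, and wmma.store follow the same pattern with different register counts.

#include <cstdint>

__device__ void wmma_load_a_m16n16k16(uint32_t (&a)[8], const void *p_a,
                                      uint32_t ldm) {
    // Convert the generic pointer to a shared-memory address for the
    // .shared state space qualifier.
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(p_a));
    asm volatile(
        "wmma.load.a.sync.aligned.row.m16n16k16.shared.f16 "
        "{%0, %1, %2, %3, %4, %5, %6, %7}, [%8], %9;\n"
        : "=r"(a[0]), "=r"(a[1]), "=r"(a[2]), "=r"(a[3]),
          "=r"(a[4]), "=r"(a[5]), "=r"(a[6]), "=r"(a[7])
        : "r"(addr), "r"(ldm));
}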

4.3 MMA (Matrix Multiply Accumulate) PTX

For CUDA devices with compute capability 7.0 and above, the MMA PTX instructions can also be used to call Tensor Core; they support mixed-precision matrix multiplication of the form D = AB + C.

ldmatrix.sync.aligned.shape.num{.trans}{.ss}.type r, [p];

mma.sync.aligned.m8n8k4.alayout.blayout.dtype.f16.f16.ctype d, a, b, c;
mma.sync.aligned.m16n8k8.row.col.dtype.f16.f16.ctype d, a, b, c;
mma.sync.aligned.m16n8k16.row.col.dtype.f16.f16.ctype d, a, b, c;
  • ldmatrix: Tensor Core data loading instruction; loads matrix data from shared memory into Tensor Core registers
  • mma: Tensor Core matrix multiply-accumulate instruction; computes D = AB + C or C = AB + C
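
The sketch below (again an illustrative example rather than the article's code) shows how ldmatrix and an m16n8k16 mma are typically issued from CUDA C++ via inline PTX; note that this mma shape requires compute capability 8.0 and ldmatrix requires 7.5. Function and variable names are assumptions; the per-thread fragment layouts follow the PTX ISA.

#include <cstdint>

// Load four 8x8 FP16 tiles (a 16x16 A fragment) from shared memory. Each
// thread supplies the shared-memory address of one 16-byte row segment, as
// the ldmatrix instruction requires.
__device__ void ldmatrix_x4(uint32_t (&r)[4], const void *p_smem) {
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(p_smem));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(addr));
}

// D = A * B + C for one warp: A is 16x16 FP16 (4 x .b32 per thread),
// B is 16x8 FP16 (2 x .b32 per thread), C and D are 16x8 FP32 (4 floats each).
__device__ void mma_m16n8k16(float (&d)[4], const uint32_t (&a)[4],
                             const uint32_t (&b)[2], const float (&c)[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}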

4.4 SASS

Tensor Core can also be studied at the SASS level: the compiler ultimately lowers the WMMA/MMA operations above to SASS instructions such as HMMA. SASS is not exposed as a public programming interface, so it is mainly learned from the SASS instruction set listings and by disassembling compiled kernels (for example with cuobjdump or nvdisasm).
