Nvidia Tensor Core: A Preliminary Exploration
What is a Tensor Core?
1 Background
In deep-learning-based image processing, the compute-intensive convolution operator has long been a focus of engineering optimization. Convolutions are generally lowered to matrix multiplications, so optimizing matrix multiplication has naturally become one of the optimization directions that deep learning frameworks care about most. Against this background, Nvidia provides a dedicated hardware solution, the Tensor Core, which accelerates matrix multiplication and enables mixed-precision computation, improving throughput while maintaining accuracy.
2 Hardware Unit
Like the CUDA Core, the Tensor Core is an arithmetic unit, but it is specialized for matrix multiplication. The figure below shows the internal structure of a Turing TU102/TU104/TU106 SM: the SM is divided into 4 processing blocks, and each processing block contains 16 FP32 Cores, 16 INT32 Cores, 2 Tensor Cores, 1 Warp Scheduler, and 1 Dispatch Unit.
3 Architecture
Since the first-generation Tensor Core was introduced with the Volta architecture, Tensor Core has been substantially improved with each subsequent architecture, and the set of supported data types has gradually grown.
3.1 Volta Tensor Core
The first-generation Tensor Core supports mixed-precision matrix multiplication with FP16 and FP32, delivering more than 100 TFLOPS of deep learning performance, over 5 times that of the Pascal architecture. Compared with Pascal, peak TFLOPS for training is improved by up to 12 times, peak TFLOPS for inference is improved by up to 6 times, and overall training and inference performance is improved by about 3 times.
3.2 Turing Tensor Core
The second-generation Tensor Core offers a range of precisions for deep learning training and inference (from FP32 to FP16 to INT8 and INT4), delivering up to 500 trillion tensor operations per second.
3.3 Ampere Tensor Core
The third-generation Tensor Core introduces the new Tensor Float 32 (TF32) precision and adds 64-bit floating point (FP64) support to accelerate and simplify AI applications, speeding up AI by up to 20 times.
3.4 Hopper Tensor Core
The fourth-generation Tensor Core adds a new 8-bit floating-point precision (FP8), delivering 6 times higher performance than FP16 for trillion-parameter model training. FP8 is used by the Transformer Engine, which applies a mixed FP8/FP16 precision mode to greatly accelerate Transformer training while preserving accuracy. FP8 also significantly speeds up large language model inference, with performance up to 30 times higher than Ampere.
4 Calling Tensor Core
Besides calling Tensor Core indirectly through libraries in the CUDA ecosystem such as cuBLAS and cuDNN, Nvidia also provides the following ways to program Tensor Core directly.
4.1 WMMA (Warp-level Matrix Multiply Accumulate) API
On CUDA devices with compute capability 7.0 and above, Tensor Core can be called through the CUDA C++ WMMA API (declared in <mma.h>, namespace nvcuda::wmma), which supports mixed-precision matrix multiply-accumulate operations of the form D = AB + C.
```cpp
template<typename Use, int m, int n, int k, typename T, typename Layout=void> class fragment;

void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm);
void load_matrix_sync(fragment<...> &a, const T* mptr, unsigned ldm, layout_t layout);
void store_matrix_sync(T* mptr, const fragment<...> &a, unsigned ldm, layout_t layout);
void fill_fragment(fragment<...> &a, const T& v);
void mma_sync(fragment<...> &d, const fragment<...> &a, const fragment<...> &b, const fragment<...> &c, bool satf=false);
```
- fragment: Tensor Core data storage class; its Use parameter can be matrix_a, matrix_b, or accumulator
- load_matrix_sync: Tensor Core data loading API; loads matrix data from global or shared memory into a fragment
- store_matrix_sync: Tensor Core result storing API; stores computation results from a fragment to global or shared memory
- fill_fragment: fragment filling API; fills a fragment with a constant value
- mma_sync: Tensor Core matrix multiply-accumulate API; computes D = AB + C or C = AB + C (a minimal usage sketch follows this list)
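As a concrete reference, the following is a minimal sketch of this API: one warp computes a single 16x16 tile of D = AB + C. The kernel name, the row-major A / column-major B layouts, and the leading dimensions of 16 are illustrative assumptions, not requirements of the API.

```cpp
#include <mma.h>
#include <cuda_fp16.h>

using namespace nvcuda;

// Minimal sketch: one warp computes one 16x16x16 mixed-precision tile,
// D = A*B + C, with FP16 inputs and FP32 accumulation.
__global__ void wmma_16x16x16_kernel(const half *A, const half *B,
                                     const float *C, float *D) {
    // Declare the fragments held collectively by the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> d_frag;

    // Load the A, B, and C tiles from global memory (leading dimension 16).
    wmma::load_matrix_sync(a_frag, A, 16);
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(c_frag, C, 16, wmma::mem_row_major);

    // D = A*B + C, executed cooperatively by the whole warp on Tensor Core.
    wmma::mma_sync(d_frag, a_frag, b_frag, c_frag);

    // Store the 16x16 result tile back to global memory.
    wmma::store_matrix_sync(D, d_frag, 16, wmma::mem_row_major);
}
```

Such a kernel would be launched with a single warp, e.g. wmma_16x16x16_kernel<<<1, 32>>>(dA, dB, dC, dD); a real GEMM tiles larger matrices across many warps and typically stages data through shared memory.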
4.2 WMMA PTX (Parallel Thread Execution)
On CUDA devices with compute capability 7.0 and above, Tensor Core can also be called through WMMA PTX instructions (for example via inline PTX in CUDA C++), which likewise support mixed-precision matrix multiply-accumulate operations of the form D = AB + C.
```
wmma.load.a.sync.aligned.layout.shape{.ss}.atype r, [p] {, stride};
wmma.load.b.sync.aligned.layout.shape{.ss}.btype r, [p] {, stride};
wmma.load.c.sync.aligned.layout.shape{.ss}.ctype r, [p] {, stride};
wmma.store.d.sync.aligned.layout.shape{.ss}.type [p], r {, stride};
wmma.mma.sync.aligned.alayout.blayout.shape.dtype.ctype d, a, b, c;
```
- wmma.load: Tensor Core data loading instructions; load matrix data from global or shared memory into Tensor Core registers
- wmma.store: Tensor Core result storing instruction; stores computation results from Tensor Core registers to global or shared memory
- wmma.mma: Tensor Core matrix multiply-accumulate instruction; computes D = AB + C or C = AB + C (a minimal inline-PTX sketch follows this list)
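As a concrete reference, the following sketch issues the wmma.mma instruction through inline PTX, assuming the A/B/C fragments have already been loaded into registers (for example with wmma.load). The helper name and the specific m16n16k16 row/col FP16-input, FP32-accumulator variant are illustrative assumptions; the register counts follow that shape's fragment layout (eight registers per operand).

```cpp
#include <cstdint>

// Sketch: issue one wmma.mma.sync through inline PTX. The a/b operands hold
// packed f16x2 pairs in 32-bit registers; c/d are FP32 accumulators.
__device__ void wmma_mma_f16_f32(float d[8], const uint32_t a[8],
                                 const uint32_t b[8], const float c[8]) {
    asm volatile(
        "wmma.mma.sync.aligned.row.col.m16n16k16.f32.f32 "
        "{%0, %1, %2, %3, %4, %5, %6, %7}, "
        "{%8, %9, %10, %11, %12, %13, %14, %15}, "
        "{%16, %17, %18, %19, %20, %21, %22, %23}, "
        "{%24, %25, %26, %27, %28, %29, %30, %31};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3]),
          "=f"(d[4]), "=f"(d[5]), "=f"(d[6]), "=f"(d[7])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(a[4]), "r"(a[5]), "r"(a[6]), "r"(a[7]),
          "r"(b[0]), "r"(b[1]), "r"(b[2]), "r"(b[3]),
          "r"(b[4]), "r"(b[5]), "r"(b[6]), "r"(b[7]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]),
          "f"(c[4]), "f"(c[5]), "f"(c[6]), "f"(c[7]));
}
```

Compared with the C++ WMMA API, the PTX route exposes the same warp-wide fragments but leaves register management to the programmer.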
4.3 MMA (Matrix Multiply Accumulate) PTX
On CUDA devices with compute capability 7.0 and above, Tensor Core can also be called through MMA PTX instructions, which support mixed-precision matrix multiply-accumulate operations of the form D = AB + C. Note that not every shape is available on every architecture: compute capability 7.0 only supports the m8n8k4 shape, while ldmatrix and the larger shapes require Turing (7.5) or Ampere (8.0) and later.
```
ldmatrix.sync.aligned.shape.num{.trans}{.ss}.type r, [p];
mma.sync.aligned.m8n8k4.alayout.blayout.dtype.f16.f16.ctype d, a, b, c;
mma.sync.aligned.m16n8k8.row.col.dtype.f16.f16.ctype d, a, b, c;
mma.sync.aligned.m16n8k16.row.col.dtype.f16.f16.ctype d, a, b, c;
```
- ldmatrix: Tensor Core data loading instruction; loads matrix data from shared memory into Tensor Core registers
- mma: Tensor Core matrix multiply-accumulate instruction; computes D = AB + C or C = AB + C (see the inline-PTX sketch after this list)
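As a concrete reference, the following sketch pairs an ldmatrix load with an m16n8k16 mma via inline PTX. The helper names, the x4 ldmatrix variant, and the FP16-input/FP32-accumulator mma shape are illustrative assumptions; this particular shape requires an Ampere-class (sm_80) or newer GPU.

```cpp
#include <cstdint>

// Sketch: load four 8x8 b16 tiles from shared memory into four 32-bit
// registers of the warp (one A fragment for m16n8k16).
__device__ void ldmatrix_x4(uint32_t r[4], const void *smem_ptr) {
    // ldmatrix addresses shared memory; convert the generic pointer first.
    uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
    asm volatile(
        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
        : "=r"(r[0]), "=r"(r[1]), "=r"(r[2]), "=r"(r[3])
        : "r"(addr));
}

// Sketch: D = A*B + C for one m16n8k16 tile. A uses four registers of packed
// f16x2, B uses two, and C/D use four FP32 registers each.
__device__ void mma_m16n8k16_f16_f32(float d[4], const uint32_t a[4],
                                     const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
        "{%0, %1, %2, %3}, {%4, %5, %6, %7}, {%8, %9}, {%10, %11, %12, %13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```

Unlike wmma.load, ldmatrix only reads from shared memory, but it distributes the data directly into the per-thread register layout that the mma instructions expect, which is why it is commonly used in hand-written MMA kernels.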
4.4 SASS
Tensor Core can also be studied at the level of the SASS instruction set, for example by disassembling compiled kernels (e.g. with cuobjdump or nvdisasm) and inspecting the matrix multiply-accumulate instructions that the WMMA and MMA paths above compile down to.