Nvidia Tensor Core: Getting Started with MMA PTX Programming

How to program using MMA PTX?

Bruce-Lee-LY
13 min read · Sep 25, 2023

1 PTX (Parallel Thread Execution)

What is PTX? Nvidia officially describes it as a low-level parallel thread execution virtual machine and instruction set architecture (ISA).

There are two ways to understand what that means.

One way is to draw an analogy with LLVM. Those familiar with LLVM know its full name was once Low Level Virtual Machine. Setting aside the naming of the LLVM project and its underlying virtual machine, what matters here is LLVM's core concept, IR (Intermediate Representation), whose behavior is quite similar to PTX. IR connects front-end programming languages with back-end target code: it makes new programming languages easier to implement, makes it easy to generate target code for different hardware platforms, and enables general compile-time and runtime optimization. PTX, likewise, sits between the GPU programming language CUDA C++ and the GPU hardware's SASS instructions, and can use NVRTC for runtime compilation and optimization. At some level it can be regarded as GPU device-independent code, so PTX can be understood as a kind of "CUDA IR".

The other way is not to over-interpret it. After all, the point of Nvidia keeping things closed source is to spare developers this kind of confusion.
Going back to PTX itself: I am used to CUDA C++ programming, and it seems I have never seen PTX, yet it has always been there. The figure below shows the process of compiling CUDA with NVCC. The compilation of a .cu file is split into two parts: one compiles the host code, and the other compiles the device code. A .ptx file is generated during the compilation of the device code, although the focus is usually on the final product of compilation. The NVCC compilation process itself will not be discussed here; we may cover it another time.
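If you want to look at the PTX (and the SASS discussed later) yourself, the standard toolchain can dump both. The commands below are one way to do it; the file name mma_naive.cu is just a placeholder:

nvcc -arch=sm_86 -ptx mma_naive.cu -o mma_naive.ptx       # emit the device code as PTX
nvcc -arch=sm_86 -cubin mma_naive.cu -o mma_naive.cubin   # compile down to a cubin
cuobjdump -sass mma_naive.cubin                           # disassemble the SASS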

2 MMA (Matrix Multiply Accumulate) PTX

For CUDA devices with compute capability 7.0 and above, the MMA PTX instructions can be used to invoke Tensor Cores, which support mixed-precision matrix multiply-accumulate operations of the form D = A * B + C.

mma.sync.aligned.m8n8k4.alayout.blayout.dtype.f16.f16.ctype  d, a, b, c;
mma.sync.aligned.m16n8k8.row.col.dtype.f16.f16.ctype d, a, b, c;
mma.sync.aligned.m16n8k16.row.col.dtype.f16.f16.ctype d, a, b, c;
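For example, with row-major A, col-major B, and FP16 accumulation (.dtype = .ctype = .f16), the m16n8k16 form instantiates as follows. The register names are illustrative; each brace lists the 32-bit registers a single thread contributes:

mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16
    {%Rd0, %Rd1},              // D fragment: 2 x .f16x2 registers per thread
    {%Ra0, %Ra1, %Ra2, %Ra3},  // A fragment: 4 x .f16x2 registers per thread
    {%Rb0, %Rb1},              // B fragment: 2 x .f16x2 registers per thread
    {%Rc0, %Rc1};              // C fragment: 2 x .f16x2 registers per thread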

Taking m16n8k16 FP16 as an example, the distribution of each tile's elements across the threads of a warp is shown in the figure below. Notice that the fragments computed by each thread are not contiguous.

The row and column indices of the matrix A fragments can be calculated as follows:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                      for ai where 0 <= i < 2 || 4 <= i < 6
       groupID + 8                                  otherwise

col =  (threadID_in_group * 2) + (i & 0x1)          for ai where i <  4
       (threadID_in_group * 2) + (i & 0x1) + 8      for ai where i >= 4

The row and column indices of the matrix B fragments can be calculated as follows:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  (threadID_in_group * 2) + (i & 0x1)          for bi where i <  2
       (threadID_in_group * 2) + (i & 0x1) + 8      for bi where i >= 2

col =  groupID

The row and column indices of the matrix C or D fragments can be calculated as follows:

groupID           = %laneid >> 2
threadID_in_group = %laneid % 4

row =  groupID                                      for ci where i <  2
       groupID + 8                                  for ci where i >= 2

col =  (threadID_in_group * 2) + (i & 0x1)          for ci where i = {0,..,3}
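To make the mapping concrete, here is a small demo kernel (the name and output format are mine; the formulas are the matrix A formulas above) that prints which elements of the 16x16 A tile each lane owns. Launch it with a single warp, e.g. printAFragmentLayout<<<1, 32>>>():

#include <cstdio>

// Hypothetical demo: evaluate the m16n8k16 matrix A fragment formulas per lane.
__global__ void printAFragmentLayout() {
    const unsigned lane_id = threadIdx.x % 32;
    const unsigned groupID = lane_id >> 2;
    const unsigned threadID_in_group = lane_id % 4;

    // Each lane holds eight FP16 elements a0..a7, packed into four 32-bit registers.
    for (unsigned i = 0; i < 8; ++i) {
        const unsigned row = ((i < 2) || (i >= 4 && i < 6)) ? groupID : groupID + 8;
        const unsigned col = threadID_in_group * 2 + (i & 0x1) + ((i < 4) ? 0 : 8);
        printf("lane %2u: a%u -> A[%2u][%2u]\n", lane_id, i, row, col);
    }
}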

3 LDMATRIX PTX

Because the fragments computed by the threads in a warp are not contiguous when the MMA PTX instructions operate on a tile, and the index calculation is fairly involved, Nvidia provides the LDMATRIX PTX instruction to work together with the MMA PTX instructions.

ldmatrix.sync.aligned.shape.num{.trans}{.ss}.type r, [p];

.shape = {.m8n8};
.num = {.x1, .x2, .x4};
.ss = {.shared};
.type = {.b16};

LDMATRIX is a warp-level data-loading instruction that loads one or more 8x8 matrices whose consecutive rows need not be stored contiguously in memory. The 8 addresses required for each matrix are provided by 8 threads, depending on the value of .num. Each address corresponds to the start of a matrix row. Addresses addr0-addr7 correspond to the rows of the first matrix, addresses addr8-addr15 to the rows of the second matrix, and so on, as shown in the table below.

When reading 8x8 matrices, a contiguous group of four threads loads 16 bytes, and the matrix addresses must be aligned accordingly. Each thread in the warp loads a fragment of a row: thread 0 receives the first fragment in its register r, and so on, so a group of four threads loads an entire row of the matrix, as shown in the table below. Notice that the data distribution of LDMATRIX across the threads of a warp matches that of the MMA PTX instructions.

Two caveats are worth noting. First, the LDMATRIX PTX instruction can only load data from shared memory. Second, for CUDA devices with compute capability sm_75 and below, all threads in the LDMATRIX instruction must supply valid addresses; otherwise, the behavior is undefined. When .num is .x1 or .x2, the addresses held by the lower threads can be copied to the higher threads to achieve the expected behavior.
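In CUDA C++ the instruction is usually wrapped in a small inline-PTX macro. The sketches below follow the syntax above and mirror the LDMATRIX_X2/LDMATRIX_X4 wrappers used by the example kernel in the next section (cuda_hgemm contains the definitive versions):

// Load one, two, or four 8x8 b16 matrices from shared memory.
// addr must be a shared-memory address, e.g. from __cvta_generic_to_shared.
#define LDMATRIX_X1(R, addr) \
    asm volatile("ldmatrix.sync.aligned.x1.m8n8.shared.b16 {%0}, [%1];\n" : "=r"(R) : "r"(addr))

#define LDMATRIX_X2(R0, R1, addr) \
    asm volatile("ldmatrix.sync.aligned.x2.m8n8.shared.b16 {%0, %1}, [%2];\n" : "=r"(R0), "=r"(R1) : "r"(addr))

#define LDMATRIX_X4(R0, R1, R2, R3, addr)                                             \
    asm volatile("ldmatrix.sync.aligned.x4.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];\n" \
                 : "=r"(R0), "=r"(R1), "=r"(R2), "=r"(R3)                             \
                 : "r"(addr))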

4 Example

Talk is cheap, show me the code. As in the introduction to Nvidia Tensor Core WMMA API programming, we take m16n8k16 as an example and implement an HGEMM: C = A * B, where the matrices A (M * K, row major), B (K * N, col major) and C (M * N, row major) are all FP16.

The programming model of MMA PTX is similar to the WMMA API: the naive kernel is built around the idea that each warp processes one tile of matrix C. First, determine the coordinates of the tile of C that the current warp processes and declare the shared memory and registers needed to compute the tile. Then traverse K in steps of MMA_K, loading the required tiles of A and B from global memory into registers via shared memory with LDMATRIX PTX and performing the computation with MMA PTX. Finally, write the result from the registers back to matrix C through shared memory. Once all blocks have finished, matrix C is complete. This example has some difficulty, but not much. The source code is in cuda_hgemm.
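Besides the LDMATRIX wrappers sketched above, the kernel references two more helpers that are worth showing as sketches (again, cuda_hgemm has the definitive versions): HMMA16816, which wraps the m16n8k16 MMA instruction, and div_ceil for rounded-up division.

// mma.m16n8k16 with FP16 accumulators: {RD0, RD1} = {RA0..RA3} * {RB0, RB1} + {RC0, RC1}.
#define HMMA16816(RD0, RD1, RA0, RA1, RA2, RA3, RB0, RB1, RC0, RC1)                                                    \
    asm volatile("mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 {%0, %1}, {%2, %3, %4, %5}, {%6, %7}, {%8, %9};\n" \
                 : "=r"(RD0), "=r"(RD1)                                                                                \
                 : "r"(RA0), "r"(RA1), "r"(RA2), "r"(RA3), "r"(RB0), "r"(RB1), "r"(RC0), "r"(RC1))

// Ceiling division, used both for the K-tile count and for grid sizing.
__host__ __device__ inline size_t div_ceil(size_t a, size_t b) { return (a + b - 1) / b; }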

#define MMA_M 16
#define MMA_N 8
#define MMA_K 16

#define WARP_SIZE 32

__global__ void mmaNaiveKernel(const half *__restrict__ A, const half *__restrict__ B, half *__restrict__ C, size_t M,
                               size_t N, size_t K) {
    const size_t K_tiles = div_ceil(K, MMA_K);

    // Coordinates of the C tile this warp computes.
    const size_t warp_row = blockIdx.y * MMA_M;
    const size_t warp_col = blockIdx.x * MMA_N;

    if (warp_row >= M || warp_col >= N) {
        return;
    }

    __shared__ half A_shmem[MMA_M][MMA_K];
    __shared__ half B_shmem[MMA_N][MMA_K];
    __shared__ half C_shmem[MMA_M][MMA_N];

    const size_t lane_id = threadIdx.x % WARP_SIZE;

    // Accumulator fragment: four FP16 values packed into two 32-bit registers.
    uint32_t RC[2] = {0, 0};

#pragma unroll
    for (size_t i = 0; i < K_tiles; ++i) {
        // Stage the 16x16 A tile in shared memory: each of the 32 lanes copies
        // 16 bytes (8 halves), two lanes per row.
        *((int4 *)(&A_shmem[lane_id / 2][0]) + lane_id % 2) =
            *((int4 *)(&A[(warp_row + lane_id / 2) * K + i * MMA_K]) + lane_id % 2);

        // The 8x16 B tile (col major) needs only 16 lanes, two per column.
        if (lane_id < MMA_N * 2) {
            *((int4 *)(&B_shmem[lane_id / 2][0]) + lane_id % 2) =
                *((int4 *)(&B[i * MMA_K + (warp_col + lane_id / 2) * K]) + lane_id % 2);
        }

        __syncthreads();

        uint32_t RA[4];
        uint32_t RB[2];

        // ldmatrix.x4: load the 16x16 A tile as four 8x8 matrices into RA[0..3].
        uint32_t A_shmem_lane_addr = __cvta_generic_to_shared(&A_shmem[lane_id % 16][(lane_id / 16) * 8]);
        LDMATRIX_X4(RA[0], RA[1], RA[2], RA[3], A_shmem_lane_addr);

        // ldmatrix.x2: load the 8x16 B tile as two 8x8 matrices into RB[0..1].
        uint32_t B_shmem_lane_addr = __cvta_generic_to_shared(&B_shmem[lane_id % 8][((lane_id / 8) % 2) * 8]);
        LDMATRIX_X2(RB[0], RB[1], B_shmem_lane_addr);

        // Tensor Core computation: RC = RA * RB + RC.
        HMMA16816(RC[0], RC[1], RA[0], RA[1], RA[2], RA[3], RB[0], RB[1], RC[0], RC[1]);

        __syncthreads();
    }

    // Write the accumulators back following the C-fragment layout:
    // RC[0] covers row groupID, RC[1] covers row groupID + 8.
    *((uint32_t *)(&C_shmem[lane_id / 4][0]) + lane_id % 4) = RC[0];
    *((uint32_t *)(&C_shmem[lane_id / 4 + 8][0]) + lane_id % 4) = RC[1];

    __syncthreads();

    // 16 lanes copy the 16x8 C tile to global memory, one row (16 bytes) each.
    if (lane_id < MMA_M) {
        *((int4 *)(&C[(warp_row + lane_id) * N + warp_col])) = *((int4 *)(&C_shmem[lane_id][0]));
    }
}

void mmaNaive(half *A, half *B, half *C, size_t M, size_t N, size_t K) {
    dim3 block(WARP_SIZE);
    dim3 grid(div_ceil(N, MMA_N), div_ceil(M, MMA_M));

    mmaNaiveKernel<<<grid, block>>>(A, B, C, M, N, K);
}
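For completeness, a minimal host-side driver might look like the following. The sizes are placeholders and error checking is omitted:

#include <cuda_fp16.h>

int main() {
    // Hypothetical sizes; multiples of the MMA tile keep the naive kernel simple.
    const size_t M = 256, N = 256, K = 256;

    half *A, *B, *C;
    cudaMalloc(&A, M * K * sizeof(half));
    cudaMalloc(&B, K * N * sizeof(half));
    cudaMalloc(&C, M * N * sizeof(half));

    // ... fill A (row major) and B (col major), e.g. via cudaMemcpy ...

    mmaNaive(A, B, C, M, N, K);
    cudaDeviceSynchronize();

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}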

5 Underlying Code

Let’s dig a little deeper into the MMA naive kernel above and look at its corresponding SASS on an RTX A6000 (sm_86, CUDA 11.3).

 Function : _Z14mmaNaiveKernelPK6__halfS1_PS_mmm
.headerflags @"EF_CUDA_SM86 EF_CUDA_PTX_SM(EF_CUDA_SM86)"
/*0000*/ IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] ; /* 0x00000a00ff017624 */
/* 0x000fc400078e00ff */
/*0010*/ S2R R2, SR_CTAID.X ; /* 0x0000000000027919 */
/* 0x000e280000002500 */
/*0020*/ S2R R0, SR_CTAID.Y ; /* 0x0000000000007919 */
/* 0x000e620000002600 */
/*0030*/ IMAD.SHL.U32 R2, R2, 0x8, RZ ; /* 0x0000000802027824 */
/* 0x001fe400078e00ff */
/*0040*/ IMAD.SHL.U32 R0, R0, 0x10, RZ ; /* 0x0000001000007824 */
/* 0x002fc600078e00ff */
/*0050*/ ISETP.GE.U32.AND P0, PT, R2, c[0x0][0x180], PT ; /* 0x0000600002007a0c */
/* 0x000fe40003f06070 */
/*0060*/ ISETP.GE.U32.AND P1, PT, R0, c[0x0][0x178], PT ; /* 0x00005e0000007a0c */
/* 0x000fe40003f26070 */
/*0070*/ ISETP.GE.U32.AND.EX P0, PT, RZ, c[0x0][0x184], PT, P0 ; /* 0x00006100ff007a0c */
/* 0x000fc80003f06100 */
/*0080*/ ISETP.GE.U32.OR.EX P0, PT, RZ, c[0x0][0x17c], P0, P1 ; /* 0x00005f00ff007a0c */
/* 0x000fda0000706510 */
/*0090*/ @P0 EXIT ; /* 0x000000000000094d */
/* 0x000fea0003800000 */
/*00a0*/ S2R R9, SR_TID.X ; /* 0x0000000000097919 */
/* 0x000e220000002100 */
/*00b0*/ IMAD.MOV.U32 R10, RZ, RZ, c[0x0][0x188] ; /* 0x00006200ff0a7624 */
/* 0x000fe200078e00ff */
/*00c0*/ ULDC.64 UR4, c[0x0][0x118] ; /* 0x0000460000047ab9 */
/* 0x000fe20000000a00 */
/*00d0*/ IMAD.MOV.U32 R5, RZ, RZ, 0x4 ; /* 0x00000004ff057424 */
/* 0x000fe200078e00ff */
/*00e0*/ CS2R R16, SRZ ; /* 0x0000000000107805 */
/* 0x000fe2000001ff00 */
/*00f0*/ IMAD.MOV.U32 R8, RZ, RZ, c[0x0][0x18c] ; /* 0x00006300ff087624 */
/* 0x000fe200078e00ff */
/*0100*/ LOP3.LUT P0, RZ, R10.reuse, 0xf, RZ, 0xc0, !PT ; /* 0x0000000f0aff7812 */
/* 0x040fe4000780c0ff */
/*0110*/ SHF.R.U64 R10, R10, R5, c[0x0][0x18c] ; /* 0x000063000a0a7619 */
/* 0x000fe40000001205 */
/*0120*/ LOP3.LUT P0, RZ, RZ, c[0x0][0x18c], RZ, 0xc0, P0 ; /* 0x00006300ffff7a12 */
/* 0x000fc4000000c0ff */
/*0130*/ SHF.R.U32.HI R8, RZ, 0x4, R8 ; /* 0x00000004ff087819 */
/* 0x000fe40000011608 */
/*0140*/ SEL R7, RZ, 0xffffffff, !P0 ; /* 0xffffffffff077807 */
/* 0x000fc80004000000 */
/*0150*/ ISETP.NE.U32.AND P1, PT, R10, R7, PT ; /* 0x000000070a00720c */
/* 0x000fc80003f25070 */
/*0160*/ ISETP.NE.AND.EX P1, PT, R8, R7, PT, P1 ; /* 0x000000070800720c */
/* 0x000fe40003f25310 */
/*0170*/ LOP3.LUT R19, R9.reuse, 0x1f, RZ, 0xc0, !PT ; /* 0x0000001f09137812 */
/* 0x041fe200078ec0ff */
/*0180*/ IMAD.SHL.U32 R5, R9, 0x4, RZ ; /* 0x0000000409057824 */
/* 0x000fc600078e00ff */
/*0190*/ ISETP.GT.U32.AND P2, PT, R19.reuse, 0xf, PT ; /* 0x0000000f1300780c */
/* 0x040fe20003f44070 */
/*01a0*/ IMAD.SHL.U32 R4, R19, 0x4, RZ ; /* 0x0000000413047824 */
/* 0x000fe200078e00ff */
/*01b0*/ LOP3.LUT R5, R5, 0xc, RZ, 0xc0, !PT ; /* 0x0000000c05057812 */
/* 0x000fc800078ec0ff */
/*01c0*/ LOP3.LUT R4, R4, 0x70, RZ, 0xc0, !PT ; /* 0x0000007004047812 */
/* 0x000fca00078ec0ff */
/*01d0*/ IMAD.IADD R18, R4, 0x1, R5 ; /* 0x0000000104127824 */
/* 0x000fe200078e0205 */
/*01e0*/ @!P1 BRA 0x580 ; /* 0x0000039000009947 */
/* 0x000fea0003800000 */
/*01f0*/ SHF.R.U64 R13, R19, 0x1, RZ ; /* 0x00000001130d7819 */
/* 0x000fe200000012ff */
/*0200*/ IMAD.SHL.U32 R4, R9, 0x10, RZ ; /* 0x0000001009047824 */
/* 0x000fe200078e00ff */
/*0210*/ SEL R20, RZ, 0x1, !P0 ; /* 0x00000001ff147807 */
/* 0x000fe20004000000 */
/*0220*/ IMAD.SHL.U32 R5, R19, 0x8, RZ ; /* 0x0000000813057824 */
/* 0x000fe200078e00ff */
/*0230*/ IADD3 R7, P3, R2, R13.reuse, RZ ; /* 0x0000000d02077210 */
/* 0x080fe20007f7e0ff */
/*0240*/ CS2R R16, SRZ ; /* 0x0000000000107805 */
/* 0x000fe2000001ff00 */
/*0250*/ IADD3 R11, P1, R0, R13, RZ ; /* 0x0000000d000b7210 */
/* 0x000fe40007f3e0ff */
/*0260*/ LOP3.LUT R21, R4, 0x10, RZ, 0xc0, !PT ; /* 0x0000001004157812 */
/* 0x000fe200078ec0ff */
/*0270*/ IMAD.X R6, RZ, RZ, RZ, P3 ; /* 0x000000ffff067224 */
/* 0x000fe200018e06ff */
/*0280*/ LOP3.LUT R4, R5, 0x8, RZ, 0xc0, !PT ; /* 0x0000000805047812 */
/* 0x000fe200078ec0ff */
/*0290*/ IMAD.X R12, RZ, RZ, RZ, P1 ; /* 0x000000ffff0c7224 */
/* 0x000fe200008e06ff */
/*02a0*/ IADD3 R20, P0, R20, R10, RZ ; /* 0x0000000a14147210 */
/* 0x000fe20007f1e0ff */
/*02b0*/ IMAD R6, R6, c[0x0][0x188], RZ ; /* 0x0000620006067a24 */
/* 0x000fc400078e02ff */
/*02c0*/ IMAD.MOV.U32 R5, RZ, RZ, RZ ; /* 0x000000ffff057224 */
/* 0x000fe400078e00ff */
/*02d0*/ IMAD R12, R12, c[0x0][0x188], RZ ; /* 0x000062000c0c7a24 */
/* 0x000fe400078e02ff */
/*02e0*/ IMAD.SHL.U32 R10, R9, 0x20, RZ ; /* 0x00000020090a7824 */
/* 0x000fe400078e00ff */
/*02f0*/ IMAD R27, R7.reuse, c[0x0][0x18c], R6 ; /* 0x00006300071b7a24 */
/* 0x040fe400078e0206 */
/*0300*/ IMAD.WIDE.U32 R6, R7, c[0x0][0x188], R4.reuse ; /* 0x0000620007067a25 */
/* 0x100fe200078e0004 */
/*0310*/ LOP3.LUT R23, R10.reuse, 0x1e0, RZ, 0xc0, !PT ; /* 0x000001e00a177812 */
/* 0x040fe400078ec0ff */
/*0320*/ LOP3.LUT R25, R10, 0xe0, RZ, 0xc0, !PT ; /* 0x000000e00a197812 */
/* 0x000fe200078ec0ff */
/*0330*/ IMAD.SHL.U32 R9, R9, 0x2, RZ ; /* 0x0000000209097824 */
/* 0x000fe200078e00ff */
/*0340*/ LEA R24, P1, R6, c[0x0][0x168], 0x1 ; /* 0x00005a0006187a11 */
/* 0x000fe200078208ff */
/*0350*/ IMAD.WIDE.U32 R4, R11, c[0x0][0x188], R4 ; /* 0x000062000b047a25 */
/* 0x000fc600078e0004 */
/*0360*/ LOP3.LUT R10, R9, 0x10, RZ, 0xc0, !PT ; /* 0x00000010090a7812 */
/* 0x000fe200078ec0ff */
/*0370*/ IMAD R31, R11, c[0x0][0x18c], R12 ; /* 0x000063000b1f7a24 */
/* 0x000fe200078e020c */
/*0380*/ LOP3.LUT R12, R19, 0x10, RZ, 0xc0, !PT ; /* 0x00000010130c7812 */
/* 0x000fe200078ec0ff */
/*0390*/ IMAD.IADD R27, R7, 0x1, R27 ; /* 0x00000001071b7824 */
/* 0x000fe200078e021b */
/*03a0*/ LEA R22, P3, R4, c[0x0][0x160], 0x1 ; /* 0x0000580004167a11 */
/* 0x000fe200078608ff */
/*03b0*/ IMAD.IADD R31, R5, 0x1, R31 ; /* 0x00000001051f7824 */
/* 0x000fe400078e021f */
/*03c0*/ IMAD R21, R13, 0x20, R21 ; /* 0x000000200d157824 */
/* 0x000fe200078e0215 */
/*03d0*/ LEA.HI.X R27, R6, c[0x0][0x16c], R27, 0x1, P1 ; /* 0x00005b00061b7a11 */
/* 0x000fe200008f0c1b */
/*03e0*/ IMAD.IADD R23, R23, 0x1, R12 ; /* 0x0000000117177824 */
/* 0x000fe200078e020c */
/*03f0*/ LEA.HI.X R31, R4, c[0x0][0x164], R31, 0x1, P3 ; /* 0x00005900041f7a11 */
/* 0x000fe200018f0c1f */
/*0400*/ IMAD.IADD R25, R25, 0x1, R10 ; /* 0x0000000119197824 */
/* 0x000fc400078e020a */
/*0410*/ IMAD.X R30, RZ, RZ, R8, P0 ; /* 0x000000ffff1e7224 */
/* 0x000fc400000e0608 */
/*0420*/ ISETP.GT.U32.AND P0, PT, R19, 0xf, PT ; /* 0x0000000f1300780c */
/* 0x000fe20003f04070 */
/*0430*/ IMAD.MOV.U32 R4, RZ, RZ, R22 ; /* 0x000000ffff047224 */
/* 0x000fe400078e0016 */
/*0440*/ IMAD.MOV.U32 R5, RZ, RZ, R31 ; /* 0x000000ffff057224 */
/* 0x000fcc00078e001f */
/*0450*/ LDG.E.128.CONSTANT R4, [R4.64] ; /* 0x0000000404047981 */
/* 0x000ea8000c1e9d00 */
/*0460*/ @!P0 IMAD.MOV.U32 R26, RZ, RZ, R24 ; /* 0x000000ffff1a8224 */
/* 0x000fca00078e0018 */
/*0470*/ @!P0 LDG.E.128.CONSTANT R8, [R26.64] ; /* 0x000000041a088981 */
/* 0x0000e2000c1e9d00 */
/*0480*/ IADD3 R22, P1, R22, 0x20, RZ ; /* 0x0000002016167810 */
/* 0x000fe40007f3e0ff */
/*0490*/ IADD3 R24, P3, R24, 0x20, RZ ; /* 0x0000002018187810 */
/* 0x000fc60007f7e0ff */
/*04a0*/ IMAD.X R31, RZ, RZ, R31, P1 ; /* 0x000000ffff1f7224 */
/* 0x000fe400008e061f */
/*04b0*/ IMAD.X R27, RZ, RZ, R27, P3 ; /* 0x000000ffff1b7224 */
/* 0x001fe200018e061b */
/*04c0*/ STS.128 [R21], R4 ; /* 0x0000000415007388 */
/* 0x004fe80000000c00 */
/*04d0*/ @!P0 STS.128 [R21+0x200], R8 ; /* 0x0002000815008388 */
/* 0x008fe80000000c00 */
/*04e0*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*04f0*/ LDSM.16.M88.2 R28, [R25+0x200] ; /* 0x00020000191c783b */
/* 0x020fe80000000100 */
/*0500*/ LDSM.16.M88.4 R12, [R23] ; /* 0x00000000170c783b */
/* 0x000e280000000200 */
/*0510*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*0520*/ IADD3 R20, P0, R20, -0x1, RZ ; /* 0xffffffff14147810 */
/* 0x000fc80007f1e0ff */
/*0530*/ IADD3.X R30, R30, -0x1, RZ, P0, !PT ; /* 0xffffffff1e1e7810 */
/* 0x000fe400007fe4ff */
/*0540*/ ISETP.NE.U32.AND P0, PT, R20, RZ, PT ; /* 0x000000ff1400720c */
/* 0x000fc80003f05070 */
/*0550*/ ISETP.NE.AND.EX P0, PT, R30, RZ, PT, P0 ; /* 0x000000ff1e00720c */
/* 0x000fe20003f05300 */
/*0560*/ HMMA.16816.F16 R16, R12, R28, R16 ; /* 0x0000001c0c10723c */
/* 0x001b580000000810 */
/*0570*/ @P0 BRA 0x420 ; /* 0xfffffea000000947 */
/* 0x000fca000383ffff */
/*0580*/ NOP ; /* 0x0000000000007918 */
/* 0x000fcc0000000000 */
/*0590*/ STS [R18+0x300], R16 ; /* 0x0003001012007388 */
/* 0x020fe80000000800 */
/*05a0*/ STS [R18+0x380], R17 ; /* 0x0003801112007388 */
/* 0x000fe80000000800 */
/*05b0*/ BAR.SYNC 0x0 ; /* 0x0000000000007b1d */
/* 0x000fec0000000000 */
/*05c0*/ @P2 EXIT ; /* 0x000000000000294d */
/* 0x000fea0003800000 */
/*05d0*/ LEA R4, R19, 0x300, 0x4 ; /* 0x0000030013047811 */
/* 0x000fe200078e20ff */
/*05e0*/ IMAD.MOV.U32 R3, RZ, RZ, RZ ; /* 0x000000ffff037224 */
/* 0x000fe200078e00ff */
/*05f0*/ IADD3 R19, P0, R0, R19, RZ ; /* 0x0000001300137210 */
/* 0x000fc80007f1e0ff */
/*0600*/ LDS.128 R4, [R4] ; /* 0x0000000004047984 */
/* 0x000e220000000c00 */
/*0610*/ IMAD.X R0, RZ, RZ, RZ, P0 ; /* 0x000000ffff007224 */
/* 0x000fe400000e06ff */
/*0620*/ IMAD.WIDE.U32 R2, R19, c[0x0][0x180], R2 ; /* 0x0000600013027a25 */
/* 0x000fc800078e0002 */
/*0630*/ IMAD R0, R0, c[0x0][0x180], RZ ; /* 0x0000600000007a24 */
/* 0x000fe200078e02ff */
/*0640*/ LEA R8, P0, R2, c[0x0][0x170], 0x1 ; /* 0x00005c0002087a11 */
/* 0x000fc600078008ff */
/*0650*/ IMAD R19, R19, c[0x0][0x184], R0 ; /* 0x0000610013137a24 */
/* 0x000fc800078e0200 */
/*0660*/ IMAD.IADD R3, R3, 0x1, R19 ; /* 0x0000000103037824 */
/* 0x000fca00078e0213 */
/*0670*/ LEA.HI.X R9, R2, c[0x0][0x174], R3, 0x1, P0 ; /* 0x00005d0002097a11 */
/* 0x000fca00000f0c03 */
/*0680*/ STG.E.128 [R8.64], R4 ; /* 0x0000000408007986 */
/* 0x001fe2000c101d04 */
/*0690*/ EXIT ; /* 0x000000000000794d */
/* 0x000fea0003800000 */
/*06a0*/ BRA 0x6a0; /* 0xfffffff000007947 */
/* 0x000fc0000383ffff */
/*06b0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*06c0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*06d0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*06e0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*06f0*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0700*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0710*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0720*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0730*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0740*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0750*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0760*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
/*0770*/ NOP; /* 0x0000000000007918 */
/* 0x000fc00000000000 */
..........

As with the WMMA 16x16x16 API, the underlying implementation of the m16n8k16 MMA PTX instruction is also the HMMA.16816 SASS instruction.

6 Other

6.1 HGEMM Optimization

Similar to the WMMA API, the goal of learning MMA PTX is to call Tensor Cores to optimize HGEMM. How does MMA-based HGEMM perform compared with cuBLAS? You can refer to the open-source code cuda_hgemm.
