# Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures



**Qingchang Han<sup>1,2</sup>,** Yongmin Hu<sup>1</sup>, Fengwei Yu<sup>2</sup>, Hailong Yang<sup>1</sup>, Bing Liu<sup>2</sup>, Peng Hu<sup>1,2</sup>, Ruihao Gong<sup>1,2</sup>, Yanfei Wang<sup>2</sup>, Rui Wang<sup>1</sup>, Zhongzhi Luan<sup>1</sup>, Depei Qian<sup>1</sup>

School of Computer Science and Engineering, Beihang University<sup>1</sup>, Beijing, China; SenseTime Research<sup>2</sup>





### Outline

### Background & Motivation

- CNN & Quantized Neural Network
- Low-bit Computation on Modern Computer Architectures

### Optimization Methods

- Low-bit Convolution on ARM CPU
- Low-bit Convolution on NVIDIA GPU

### Evaluation

- Experiment Setup
- Performance Analysis

### Conclusion

### Convolutional Neural Network



- The computation complexity and memory footprint of CNNs need to be optimized
- Convolution layers take 90% to 99% of computation and runtime [Chen et al., ISSCC'16]

### Model Compression



### Accuracy of Quantized Neural Network

| Network           | Method         | Top-1 @ 2-bit | Top-1 @ 3-bit | Top-1 @ 4-bit | Top-1 @ 8-bit | Top-5 @ 2-bit | Top-5 @ 3-bit | Top-5 @ 4-bit | Top-5 @ 8-bit |
|-------------------|----------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|---------------|
| ResNet-18         | Full precision | Top-1: 70.5   |               |               |               | Top-5: 89.6   |               |               |               |
|                   | LSQ (Ours)     | 67.6          | 70.2          | 71.1          | 71.1          | 87.6          | 89.4          | 90.0          | 90.1          |
|                   | QIL            | 65.7          | 69.2          | 70.1          |               |               |               |               |               |
|                   | FAQ            |               |               | 69.8          | 70.0          |               |               | 89.1          | 89.3          |
|                   | LQ-Nets        | 64.9          | 68.2          | 69.3          |               | 85.9          | 87.9          | 88.8          |               |
|                   | PACT           | 64.4          | 68.1          | 69.2          |               | 85.6          | 88.2          | 89.0          |               |
|                   | NICE           |               | 67.7          | 69.8          |               |               | 87.9          | 89.21         |               |
|                   | Regularization | 61.7          |               | 67.3          | 68.1          | 84.4          |               | 87.9          | 88.2          |
| ResNet-50         | Full precision | Top-1: 76.9   |               |               |               | Top-5: 93.4   |               |               |               |
|                   | LSQ (Ours)     | 73.7          | 75.8          | 76.7          | 76.8          | 91.5          | 92.7          | 93.2          | 93.4          |
|                   | PACT           | 72.2          | 75.3          | 76.5          |               | 90.5          | 92.6          | 93.2          |               |
|                   | NICE           |               | 75.1          | 76.5          |               |               | 92.3          | 93.3          |               |
|                   | FAQ            |               |               | 76.3          | 76.5          |               |               | 92.9          | 93.1          |
|                   | LQ-Nets        | 71.5          | 74.2          | 75.1          |               | 90.3          | 91.6          | 92.4          |               |
| VGG-16bn          | Full precision | Top-1: 73.4   |               |               |               | Top-5: 91.5   |               |               |               |
|                   | LSQ (Ours)     | 71.4          | 73.4          | 74.0          | 73.5          | 90.4          | <b>91.5</b>   | 92.0          | 91.6          |
|                   | FAQ            |               |               | 73.9          | 73.7          |               |               | 91.7          | 91.6          |
| SqueezeNext-23-2x | Full precision | Top-1: 67.3   |               |               |               | Top-5: 87.8   |               |               |               |
|                   | LSQ (Ours)     | 53.3          | 63.7          | 67.4          | 67.0          | 77.5          | 85.4          | 87.8          | 87.7          |

Accuracy Comparison of Low-bit QNNs on ImageNet [Esser et al., ICLR'20]

- Recent work has demonstrated that quantized neural networks can retain high accuracy
  - An 8-bit quantized model can reach almost the same accuracy as the full-precision one
  - Lower-bit quantized models (e.g., 2~4-bit) lose only slightly in accuracy compared to the full-precision ones
- However, achieving optimal performance of QNNs across different computer architectures is challenging and less studied in the literature

## The Target Architectures for Optimization

- Most widely used architectures for CNN inference
  - Edge devices: ARM CPU
  - Cloud accelerators: NVIDIA GPU



- Provide architecture support for low-bit arithmetic instructions
  - ARM CPU: MLA/SMLAL
  - NVIDIA GPU: **dp4a/mma**(Tensor Core)
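
As a rough illustration of what such a low-bit instruction computes, the sketch below emulates the semantics of **dp4a** in Python: a dot product of four signed 8-bit pairs that is accumulated into a 32-bit integer. The function name `dp4a` and the range checks are illustrative assumptions; the real instruction is issued from CUDA/PTX and operates on packed 32-bit registers.

```python
INT8_MIN, INT8_MAX = -128, 127


def dp4a(a4, b4, acc):
    """Emulate a 4-way int8 dot product accumulated into an int32 value."""
    assert len(a4) == len(b4) == 4, "dp4a consumes exactly four 8-bit lanes"
    for v in list(a4) + list(b4):
        assert INT8_MIN <= v <= INT8_MAX, "inputs must fit in signed 8 bits"
    # Each product fits comfortably in 32 bits, so the sum is accumulated wide.
    return acc + sum(x * y for x, y in zip(a4, b4))


result = dp4a([1, -2, 3, 4], [5, 6, -7, 8], 10)
```

One such call replaces four separate multiply-accumulate operations, which is where the throughput advantage of low-bit arithmetic comes from.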



The shipments of ARM-based chips to date

| Company     | Accelerator | March 2019 | April 2019 | May 2019 |
|-------------|-------------|------------|------------|----------|
| NVIDIA      | GPU         | 97.0%      | 97.3%      | 97.4%    |
| AMD         | GPU         | 1.2%       | 1.1%       | 1.0%     |
| Xilinx      | FPGA        | 1.1%       | 1.0%       | 1.0%     |
| Intel       | FPGA        | 0.6%       | 0.6%       | 0.6%     |
| Total Types | All         | 1,852      | 1,990      | 2,003    |

Source: Liftr Cloud Insights, June 2019

#### The share of types with Cloud Accelerators

### Low-bit Computation Support on ARM CPU

Low-bit arithmetic instruction



ARMv8.1 architecture



# Low-bit Computation Support on NVIDIA GPU



### Tensor Core

- Natively support mixed-precision GEMM
  - INT8/INT4/INT1 for Turing Tensor Cores
- Powerful inference performance
  - RTX 2080 Ti delivers up to 215.2 TOPS of INT8 inference performance



- Use of Tensor Core
  - WMMA API
  - PTX mma instructions(e.g. mma.m8n8k16)
  - Vendor libraries: cuBLAS/cuDNN (only fp16 now)

# Existing Framework/Library Supporting Low-bit Conv2d

### ARM CPU

- ncnn: 8-bit Conv2d (GEMM-based & Winograd)
- QNNPACK: 8-bit Conv2d (indirect convolution)
- TFLite: 8-bit Conv2d
- TVM: 1/2-bit Conv2d (popcount) / 8-bit Conv2d (spatial pack)

### NVIDIA GPU

- cuDNN: 8-bit Conv2d (dp4a) / 16-bit Conv2d (Tensor Core)
- TensorRT: 8-bit Conv2d (Tensor Core)
- CUTLASS: 1/4/8-bit GEMM (Tensor Core)

- No publicly available work supports extremely low-bit convolution across a wide range of bit widths on ARM CPU (2~8-bit) and NVIDIA GPU (4-bit/8-bit)
- This missing support motivates us to provide efficient implementations on ARM CPU and NVIDIA GPU

### Re-designing GEMM Computation on ARM CPU

- Re-design GEMM micro-kernel
  1. Load one column of Matrix A into Buffer A
  2. Load one row of Matrix B into Buffer B, and replicate it into each row of Buffer B
  3. Perform element-wise multiplication between Buffer A and each column-vector of Buffer B, and store the results to Buffer C
  4. After all the calculations are done, copy the data of Buffer C into Matrix C
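
The four steps above can be sketched in plain Python as an outer-product style GEMM. This is only a scalar model of the data flow, not the NEON kernel itself: the function name `gemm_micro_kernel` is hypothetical, and the sketch assumes that the products of step 3 are accumulated into Buffer C across the K dimension.

```python
def gemm_micro_kernel(A, B):
    """Compute C = A x B (A is M x K, B is K x N) by accumulating outer products."""
    M, K, N = len(A), len(B), len(B[0])
    buf_c = [[0] * N for _ in range(M)]
    for k in range(K):
        # Step 1: load one column of A into Buffer A.
        buf_a = [A[i][k] for i in range(M)]
        # Step 2: load one row of B and replicate it into each row of Buffer B.
        buf_b = [list(B[k]) for _ in range(M)]
        # Step 3: multiply Buffer A element-wise with each column-vector of Buffer B
        # and accumulate the products into Buffer C.
        for j in range(N):
            for i in range(M):
                buf_c[i][j] += buf_a[i] * buf_b[i][j]
    # Step 4: copy Buffer C into the output matrix C.
    return [row[:] for row in buf_c]


C = gemm_micro_kernel([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this formulation every vector lane of Buffer A meets a replicated scalar of B, so each inner step maps onto one vector multiply-accumulate without any horizontal reduction.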



- Data padding and packing optimization
  - Perform zero-padding when the dimension of data is not a multiple of the required dimension
  - Perform data packing to enable continuous data access
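
A minimal sketch of the two operations, assuming a hypothetical panel layout in which Matrix A is split into panels of `panel_rows` rows and each panel is stored column by column. The function names and the layout are assumptions for illustration; the slides do not specify the exact packing format.

```python
def pad_to_multiple(matrix, row_mult, col_mult):
    """Zero-pad a 2D list so both dimensions become multiples of the given sizes."""
    rows, cols = len(matrix), len(matrix[0])
    pad_rows = -rows % row_mult
    pad_cols = -cols % col_mult
    padded = [list(row) + [0] * pad_cols for row in matrix]
    padded += [[0] * (cols + pad_cols) for _ in range(pad_rows)]
    return padded


def pack_panels(matrix, panel_rows):
    """Flatten the matrix panel by panel so a kernel reads it with unit stride."""
    packed = []
    for r0 in range(0, len(matrix), panel_rows):
        for k in range(len(matrix[0])):
            # One column of the current panel is stored contiguously.
            for r in range(r0, r0 + panel_rows):
                packed.append(matrix[r][k])
    return packed


padded = pad_to_multiple([[1, 2, 3], [4, 5, 6], [7, 8, 9]], row_mult=2, col_mult=1)
packed = pack_panels(padded, panel_rows=2)
```

After packing, the column that the micro-kernel loads in each iteration sits at consecutive addresses, which is the continuous access pattern the bullet above refers to.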



### Instruction and Register Allocation Optimization on ARM CPU

- Optimized instruction schemes for GEMM
  - For 4 to 8-bit GEMM, we choose **SMLAL** and **SADDW** instructions



  - For 2 to 3-bit GEMM, we choose **MLA** and **SADDW** instructions
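
One way to read these pairings is in terms of accumulator width: on 8-bit inputs, SMLAL accumulates products into 16-bit lanes and MLA into 8-bit lanes, and SADDW then widens the partial sums before they overflow. The sketch below estimates how many products a narrow accumulator can absorb. It is a back-of-the-envelope model, assuming signed values over the full range [-2^(b-1), 2^(b-1)-1]; the helper names are hypothetical and the resulting counts are not taken from the slides.

```python
def safe_accumulations(bits, acc_bits):
    """Number of worst-case products that fit in a signed acc_bits accumulator."""
    max_product = (1 << (bits - 1)) ** 2      # (-2^(b-1)) * (-2^(b-1))
    acc_max = (1 << (acc_bits - 1)) - 1       # largest positive accumulator value
    return acc_max // max_product


def dot_with_widening(a, b, bits, acc_bits):
    """Dot product that widens the narrow partial sum every `chunk` products."""
    chunk = safe_accumulations(bits, acc_bits)
    acc_min, acc_max = -(1 << (acc_bits - 1)), (1 << (acc_bits - 1)) - 1
    wide = 0
    for start in range(0, len(a), chunk):
        narrow = 0                            # models the SMLAL/MLA accumulator
        for x, y in zip(a[start:start + chunk], b[start:start + chunk]):
            narrow += x * y
            assert acc_min <= narrow <= acc_max, "narrow accumulator overflowed"
        wide += narrow                        # models the SADDW widening add
    return wide


total = dot_with_widening([-8] * 1000, [-8] * 1000, bits=4, acc_bits=16)
```

The fewer bits the operands use, the more multiply-accumulates can run between two widening adds, which is consistent with lower bit widths achieving higher speedups.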



- Register allocation optimization
  - For 4~8-bit input data



**Algorithm 1** The 4~8-bit GEMM kernel with register allocation optimization

- **Input:** Padding\_and\_Packing { *Matrix A* and *Matrix B* }
- 1: **while** $k > 0$ **do**
  - 2: LD1 { $v_0$ } addr\_Matrix\_A
  - 3: LD4R { $v_2 \sim v_5$ } addr\_Matrix\_B
  - 4: SMLAL(2) { $v_{10} \sim v_{17}$ } { $v_1$ } { $v_6 \sim v_9$ }
  - 5: LD1 { $v_1$ } addr\_Matrix\_A
  - 6: LD4R { $v_6 \sim v_9$ } addr\_Matrix\_B
  - 7: SMLAL(2) { $v_{10} \sim v_{17}$ } { $v_0$ } { $v_2 \sim v_5$ }
  - 9: MOV { $v_0, v_1$ } { { $x_0, x_1$ }, { $x_2, x_3$ } }
  - 10: SADDW(2) { $v_{18} \sim v_{31}$ } { $v_{10} \sim v_{16}$ }
  - 11: SADDW(2) { $v_0, v_1$ } { $v_{17}$ }
  - 12: MOV { { $x_0, x_1$ }, { $x_2, x_3$ } } { $v_0, v_1$ }
  - 13: $k \leftarrow k - unrolling\_factor$
- 15: **end while**
- 16: MOV { $v_0, v_1$ } { { $x_0, x_1$ }, { $x_2, x_3$ } }
- 17: ST1 { { $v_{18} \sim v_{31}$ }, { $v_0, v_1$ } } *addr\_Matrix\_C*

# Winograd Optimization on ARM CPU

#### Winograd method

$$F(m \times m, r \times r): \quad Y = A^T \left[ [G g G^T] \odot [B^T d B] \right] A$$

- Apply F(2x2, 3x3) to 4~6-bit convolution
  - The maximum theoretical speedup of F(2x2, 3x3) is 2.25×; however, the **MLA** instruction is 2× faster than the **SMLAL** instruction
- For more details, please refer to our paper.
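
The formula can be checked numerically with a small sketch of F(2x2, 3x3). The transform matrices below are the commonly used ones for this tile size; the slides do not list them, so they are an assumption here, as are the helper names. Exact fractions are used so that the comparison with direct convolution is not affected by rounding. One output tile needs 16 element-wise multiplications instead of 2 × 2 × 9 = 36, which is the 2.25× bound quoted above.

```python
from fractions import Fraction as Fr

# Commonly used F(2x2, 3x3) transform matrices (assumed, not from the slides).
B_T = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[Fr(1), Fr(0), Fr(0)],
     [Fr(1, 2), Fr(1, 2), Fr(1, 2)],
     [Fr(1, 2), Fr(-1, 2), Fr(1, 2)],
     [Fr(0), Fr(0), Fr(1)]]
A_T = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(r) for r in zip(*X)]


def winograd_f2x2_3x3(d, g):
    """d: 4x4 input tile, g: 3x3 filter, returns the 2x2 output tile."""
    U = matmul(matmul(G, g), transpose(G))          # G g G^T, 4x4
    V = matmul(matmul(B_T, d), transpose(B_T))      # B^T d B, 4x4
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]  # 16 multiplies
    return matmul(matmul(A_T, M), transpose(A_T))   # A^T [...] A, 2x2


def direct_conv(d, g):
    """Reference 3x3 sliding-window correlation producing a 2x2 tile."""
    return [[sum(d[i + u][j + v] * g[u][v] for u in range(3) for v in range(3))
             for j in range(2)] for i in range(2)]


tile = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
filt = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]
same = winograd_f2x2_3x3(tile, filt) == direct_conv(tile, filt)
```

Note that the transformed tiles U and V span a wider value range than the original low-bit data, so the transforms interact with the choice of accumulation instruction.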

### Implicit-precomp GEMM Method on GPU

- Implicit GEMM
  - Avoids the global matrix transformation and reduces the memory footprint
- Precomputed Buffer
  - Store the offsets of elements in precomputed buffer



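GEMM dimensions: M = (N\*OH\*OW), K = (KH\*KW\*IC), N = OC, where the N inside M is the batch size and the standalone N is the GEMM column dimension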
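
A minimal sketch of the idea, assuming an NHWC input stored as a flat list, weights stored as [OC][KH][KW][IC], stride 1 and no padding. The function names and the layout are illustrative assumptions. The point is that the M × K matrix A is never materialized: element (m, k) is fetched from the input through a per-k offset that is computed once.

```python
def precompute_offsets(KH, KW, IC, W):
    """Offset of filter tap (kh, kw, ic) relative to a window's top-left pixel."""
    return [(kh * W + kw) * IC + ic
            for kh in range(KH) for kw in range(KW) for ic in range(IC)]


def implicit_gemm_conv(inp, weight, N, H, W, IC, KH, KW, OC):
    OH, OW = H - KH + 1, W - KW + 1            # stride 1, no padding (assumed)
    offsets = precompute_offsets(KH, KW, IC, W)  # the precomputed buffer
    M, K = N * OH * OW, KH * KW * IC
    out = [0] * (M * OC)
    for m in range(M):
        n, rem = divmod(m, OH * OW)
        oh, ow = divmod(rem, OW)
        base = ((n * H + oh) * W + ow) * IC    # start of the window for row m
        for oc in range(OC):
            acc = 0
            for k in range(K):
                # A[m][k] is read implicitly from the input tensor.
                acc += inp[base + offsets[k]] * weight[oc * K + k]
            out[m * OC + oc] = acc
    return out


result = implicit_gemm_conv(list(range(1, 10)), [1, 0, 0, 1],
                            N=1, H=3, W=3, IC=1, KH=2, KW=2, OC=1)
```

The offsets depend only on the convolution shape, so the buffer has K entries and is shared by all rows of the implicit matrix.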

### Data Partition along with Thread Hierarchy on GPU

#### (a) Grid-Level

- Divide the matrices A, B and C into tiles by *MTile*, *NTile* and *KTile*

**Algorithm 2** Implicit-precomp GEMM-based Conv2D

- **Input:** Shape of convolution and pointers of input, weight and output. The precomputed buffer.
- **Tiling Parameters:** *MTile*, *NTile*, *KTile*, *KStep*, *blockRowWarpNum* and *blockColWarpNum*.
- 1: compute *KTileNum*, *KStepNum*, *MFrag*, *NFrag*, *warpRowNum* and *warpColNum*;
- 2: **for** *k\_outer* in *KTileNum* **do**
  - 3: load **A\_Tile** to shared memory by precomputed buffer;
  - 4: load **B\_Tile** to shared memory;
  - 5: \_\_syncthreads();
  - 6: **for** *k\_inner* in *KStepNum* **do**
    - 7: load **A\_Fragment** to register;
    - 8: load **B\_Fragment** to register;
    - 9: **for** *row* in *warpRowNum* **do**
      - 10: **for** *col* in *warpColNum* **do**
        - 11: compute **C\_Fragment** by mma instruction;
      - 12: **end for**
    - 13: **end for**
  - 14: **end for**
  - 15: add bias and re-quantize on register;
  - 16: store **C\_Fragment** to global memory;
- 17: **end for**

#### (b) Block-Level

- Divide C\_Tile, A\_Tile, B\_Tile into fragments by blockRowWarpNum, blockColWarpNum
- Split the *KTile* loop by *KStep*



#### (c) Warp-Level

- Call Tensor Core through mma instructions to perform the matrix multiplication
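
The loop structure behind the three levels can be modelled with a sequential Python sketch. It is only an analogy under stated assumptions: each (m0, n0) pair stands for one thread block, the tile copies stand for shared memory, `c_frag` stands for the register fragment, and the innermost multiply-accumulate stands for the mma instruction. The split of a tile among warps is omitted, dimensions are assumed to be multiples of the tile sizes, and the write-back is placed after all K tiles have been accumulated, which is a choice of this sketch.

```python
def tiled_gemm(A, B, m_tile, n_tile, k_tile, k_step):
    M, K, N = len(A), len(B), len(B[0])
    assert M % m_tile == 0 and N % n_tile == 0
    assert K % k_tile == 0 and k_tile % k_step == 0
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, m_tile):              # grid level: one block per C tile
        for n0 in range(0, N, n_tile):
            c_frag = [[0] * n_tile for _ in range(m_tile)]  # "registers"
            for k0 in range(0, K, k_tile):      # k_outer loop over K tiles
                # "Shared memory" copies of the current A and B tiles.
                a_tile = [A[m0 + i][k0:k0 + k_tile] for i in range(m_tile)]
                b_tile = [B[k0 + k][n0:n0 + n_tile] for k in range(k_tile)]
                for ks in range(0, k_tile, k_step):   # k_inner loop in KStep chunks
                    for i in range(m_tile):
                        for j in range(n_tile):
                            for k in range(ks, ks + k_step):
                                c_frag[i][j] += a_tile[i][k] * b_tile[k][j]
            for i in range(m_tile):             # store the finished fragment
                for j in range(n_tile):
                    C[m0 + i][n0 + j] = c_frag[i][j]
    return C


A = [[i * 4 + j for j in range(4)] for i in range(4)]
B = [[(i + j) % 3 for j in range(4)] for i in range(4)]
C = tiled_gemm(A, B, m_tile=2, n_tile=2, k_tile=2, k_step=1)
```

Changing the tile sizes does not change the result, only how much data is reused at each level, which is what makes the parameters suitable for tuning.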



#### Auto-tuning of tiling parameters

- Use C++ function template to generate multiple kernels with different combinations of parameters
- Choose the best one through profile runs
- The optimal tiling parameters only need to be determined once per convolution shape with negligible overhead
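
The selection logic can be sketched as follows. In the original work the candidates are kernels generated from C++ templates and the cost comes from profile runs; here `measure` is a hypothetical callback standing in for a profile run, and the cache keyed by convolution shape is an assumption that illustrates the "once per shape" property.

```python
import itertools

_best_params = {}  # convolution shape -> best tiling parameters found so far


def autotune(conv_shape, candidates, measure):
    """Return the candidate with the lowest measured cost, tuning once per shape."""
    if conv_shape not in _best_params:
        _best_params[conv_shape] = min(
            candidates, key=lambda params: measure(conv_shape, params))
    return _best_params[conv_shape]


# Hypothetical candidate grid of (MTile, NTile) pairs.
candidates = list(itertools.product([32, 64, 128], [32, 64]))
calls = []


def fake_measure(shape, params):
    """Stand-in for a timed profile run; pretends (64, 32) is the fastest."""
    calls.append(params)
    return abs(params[0] - 64) + abs(params[1] - 32)


shape = (1, 64, 56, 56, 64, 3, 3)
best = autotune(shape, candidates, fake_measure)
best_again = autotune(shape, candidates, fake_measure)  # served from the cache
```

The second call does not invoke `measure` again, so the tuning cost is paid only the first time a shape is seen.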




### Multi-level Memory Access Optimization on GPU





- 2. Reordering memory access on shared memory
  - Reduce the number of LDS instructions to 1/4 of the original



- 3. Overlapped computation and memory access using registers
  - A temporary buffer on registers to prefetch the data required for the next iteration
  - The processes ① and ④ can be performed simultaneously



- 4. In-place calculation of bias and re-quantization
  - After finishing the **mma** calculation, directly apply bias and re-quantization on the registers
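
A minimal sketch of what step 4 computes on each accumulator, assuming a per-tensor floating-point scale, round-to-nearest and clamping to a signed 4-bit range. The slides do not give the exact re-quantization formula, so the formula, the function name `requantize` and the default range are all assumptions.

```python
def requantize(acc, bias, scale, qmin=-8, qmax=7):
    """Map int32 accumulators to low-bit integers without leaving 'registers'."""
    out = []
    for a, b in zip(acc, bias):
        q = round((a + b) * scale)           # add bias, rescale, round
        out.append(max(qmin, min(qmax, q)))  # truncate to the target range
    return out


quantized = requantize([100, -300, 7], [4, 0, 1], scale=0.05)
```

Because the inputs are the accumulators produced by mma, the 32-bit intermediate results never need to be written to shared or global memory; only the low-bit outputs are stored.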

### Quantization Fusion on GPU

#### 1. Fusion of convolution and dequantization

- Directly transform the results from int32 to float32 in convolution kernel
- Skip storing the intermediate results with int8 data type



#### 2. Fusion of conv…

For more details, please refer to our paper.

- Change the truncated range of re-quantization in convolution kernel
- Eliminate the overhead of unnecessary computation and memory access



### Experiment Setup

- Hardware and software
- Models
  - ResNet-50 (all non-redundant layers)
  - DenseNet-121
- Batch size
  - ARM: 1
  - GPU: 1 & 16
- Methods for comparison
  - ARM:
    - ncnn 8-bit Conv2d (baseline)
    - TVM 2-bit Conv2d
  - GPU:
    - cuDNN 8-bit Conv2d with dp4a instruction (baseline)
    - TensorRT 8-bit Conv2d with Tensor Core

#### Table 1: Hardware and software configurations.

| Platform     | ARM CPU                                                                 | NVIDIA GPU                                                       |
|--------------|-------------------------------------------------------------------------|------------------------------------------------------------------|
| Device       | Raspberry Pi 3B                                                         | RTX 2080Ti                                                       |
| Architecture | ARM Cortex-A53                                                          | NVIDIA Turing TU102                                              |
| Software     | Ubuntu 16.04 LTS for Raspberry Pi, gcc 5.4.0, ncnn with commit 6f2ef19 | Ubuntu 16.04 LTS, gcc 5.4.0, CUDA 10.2, cuDNN 7.6.5, TensorRT 7 |

### Performance Comparison On ARM CPU



The performance of our optimized 2~7-bit convolution kernels exceeds ncnn in most layers for ResNet-50, with average speedup of 1.60×, 1.54×, 1.38×, 1.38×, 1.34× and 1.27×, respectively

### Performance Comparison with TVM On ARM CPU



 Our 2-bit implementation outperforms TVM in most cases (16 out of 19 cases), with the highest speedup of 2.11× and the average speedup of 1.78×

### Performance Comparison On NVIDIA GPU



- With the batch size of 1, our 4-bit and 8-bit convolution kernels outperform TensorRT in most cases, with the average speedup of 1.78× and 1.44×, respectively
- With the batch size of 16, our 4-bit kernels also outperform TensorRT in 12 layers by an average speedup of 1.46×



The average speedup of 4-bit and 8-bit convolution kernels with the profile runs enabled is 2.29× and 2.91×, respectively

### Space Overhead

- GPU: negligible overhead from the precomputed buffer
- ARM: space overhead of the im2col, data padding and packing operations
  - The baseline is the space occupied by the activations and weights of each layer
  - The overhead of im2col is relatively high for some layers (e.g., conv2 and conv6)
  - The space overhead of im2col is determined by the convolution kernel size, stride, and input size
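
The dependence on kernel size, stride and input size can be made concrete with a short calculation. The sketch below compares the number of elements in the im2col matrix (OH × OW × KH × KW per input channel) with the number of input elements (H × W per channel). Measuring against the input activation alone, rather than activation plus weight, and the function names are simplifying assumptions of this example.

```python
def conv_output_size(size, kernel, stride, pad):
    """Standard output-size formula for one spatial dimension."""
    return (size + 2 * pad - kernel) // stride + 1


def im2col_expansion(h, w, kh, kw, stride, pad):
    """Ratio of im2col matrix elements to input elements (channel count cancels)."""
    oh = conv_output_size(h, kh, stride, pad)
    ow = conv_output_size(w, kw, stride, pad)
    return (oh * ow * kh * kw) / (h * w)


ratio_3x3 = im2col_expansion(56, 56, 3, 3, stride=1, pad=1)
ratio_1x1 = im2col_expansion(56, 56, 1, 1, stride=1, pad=0)
ratio_7x7 = im2col_expansion(224, 224, 7, 7, stride=2, pad=3)
```

A 1x1 convolution adds no expansion, while larger kernels with small strides replicate each input element many times, which matches the observation that only some layers show a high im2col overhead.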



### Conclusion

- Explore extremely low-bit convolution optimizations
  - ARM CPU
    - Re-design GEMM computation
    - Instruction and register allocation optimization
    - Winograd optimization
  - NVIDIA GPU
    - Data partition along with thread hierarchy
    - Multi-level memory access optimization
    - Quantization fusion
- Significant speedup compared to existing frameworks/libraries
  - ARM CPU: 1.60× (2-bit) / 1.38× (4-bit)
  - NVIDIA GPU: 5.26× (4-bit) / 4.31× (8-bit)

# **Thanks! Q&A**