# Compressing Deep Neural Networks for Deployment or Training

Hardware Accelerators that Compute Just Right!

Olivier Sentieys Univ. Rennes, Inria, IRISA olivier.sentieys@irisa.fr




Joint work with Silviu Filip, Léo Pradels, Cédric Gernigon, Sami Ben Ali, Mariko Tatsumi, Guy Lemieux


#### **Complexity Issues of Deep Neural Networks**



- Two main tasks
  - training: determine the set of network parameters that solves a *task* (minimize a loss on a *training set*)
  - inference: given an input, compute the output by forward propagation through the trained network



### **Computing power demand of AI**

• is higher than what computer architectures can deliver



#### **Evolution of the number of parameters**

• is much higher than available (on-chip) memory capacity




### **Evolution of bandwidth**

• is much slower than the growth in FLOPS



#### **Memory Bottleneck**



#### Data movement

- move input data & model from memory to compute units
- send partial results back to memory

#### Computations

- vector/matrix manipulations
- done on CPU, GPU, or custom accelerators (e.g., FPGA, ASIC)
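The balance between data movement and computation can be sketched with a roofline-style estimate: attainable throughput is the smaller of peak compute and memory bandwidth times the operations performed per byte moved. The accelerator figures below are hypothetical and chosen only for illustration.

```python
def attainable_gops(peak_gops, bandwidth_gb_s, ops_per_byte):
    """Roofline estimate: throughput is capped by compute or by memory traffic."""
    return min(peak_gops, bandwidth_gb_s * ops_per_byte)

# Hypothetical accelerator: 1000 GOP/s peak compute, 100 GB/s memory bandwidth.
PEAK, BW = 1000.0, 100.0

# A matrix-vector product performs about 2 operations per weight it loads.
# With 4-byte FP32 weights that is 0.5 op/byte; with 1-byte INT8 weights, 2 op/byte.
fp32_perf = attainable_gops(PEAK, BW, 2 / 4)
int8_perf = attainable_gops(PEAK, BW, 2 / 1)

print(fp32_perf, int8_perf)
```

In both cases the result is far below peak compute, so under these assumptions the layer is memory-bound and shrinking the data directly raises throughput.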

#### **Need for DNN Compression**




• From Sensors to Clouds







• For both Inference and Training

## **Computer Architectures**

- Energy consumption is a major issue
- Utilization wall
- End of Moore's law...

- Domain-Specific Architectures are the road ahead
  - Energy efficiency requires deeply specialized hardware
  - which may also come at the cost of pain for the programmer



Dark Silicon



# What is DNN quantization?

- store network parameters in low precision
- store/compute intermediate signals in low precision

- store/compute back-propagated gradients in low precision
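As a minimal sketch of what storing values in low precision means, the snippet below applies symmetric uniform quantization to a few weights. The scheme (a single scale taken from the largest magnitude) and the function names are assumptions for illustration, not necessarily the quantizer used in this work.

```python
def quantize_uniform(values, num_bits):
    """Symmetric uniform quantization of a list of floats to signed integer codes."""
    qmax = 2 ** (num_bits - 1) - 1                  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # real value of one integer step
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate real values."""
    return [c * scale for c in codes]

weights = [0.42, -1.27, 0.05, 0.9]
codes, scale = quantize_uniform(weights, num_bits=8)
restored = dequantize(codes, scale)
print(codes, restored)
```

Each restored value differs from the original by at most half a quantization step, which is the "limited precision" cost paid for the smaller representation.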



24 bits per pixel

#### **Quantization effects: the good**

#### Memory usage

storage needed for weights and activations is proportional to bit width
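A quick worked example of this proportionality, using a hypothetical model with 25 million parameters:

```python
def weight_storage_bytes(num_params, bits_per_param):
    """Storage for the weights alone, ignoring any per-tensor scale factors."""
    return num_params * bits_per_param / 8

NUM_PARAMS = 25_000_000  # hypothetical parameter count

for bits in (32, 16, 8, 4):
    megabytes = weight_storage_bytes(NUM_PARAMS, bits) / 1e6
    print(f"{bits:2d}-bit weights: {megabytes:.1f} MB")
```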

#### Power and energy consumption

energy is significantly reduced for both computations and memory accesses

#### Latency

fewer memory accesses and simpler computations lead to faster execution and reduced latency

#### Silicon area

8-bit arithmetic and below requires much less area



| ADD energy (pJ)          |       |      |      |
|--------------------------|-------|------|------|
| INT8                     | INT32 | FP16 | FP32 |
| 0.03                     | 0.1   | 0.4  | 0.9  |
| **30x** energy reduction |       |      |      |

| MULT energy (pJ)           |       |      |      |
|----------------------------|-------|------|------|
| INT8                       | INT32 | FP16 | FP32 |
| 0.2                        | 3.1   | 1.1  | 3.7  |
| **18.5x** energy reduction |       |      |      |

| Memory access energy (pJ) |           |
|---------------------------|-----------|
| Cache (64-bit), 8KB       | 10        |
| Cache (64-bit), 32KB      | 20        |
| Cache (64-bit), 1MB       | 100       |
| DRAM                      | 1300-2600 |

Up to 4x …

More of **loca**…



| MULT area (µm²)        |       |      |      |
|------------------------|-------|------|------|
| INT8                   | INT32 | FP16 | FP32 |
| 282                    | 3495  | 1640 | 7700 |
| **27x** area reduction |       |      |      |

| ADD area (µm²)          |       |      |      |
|-------------------------|-------|------|------|
| INT8                    | INT32 | FP16 | FP32 |
| 36                      | 137   | 1360 | 4184 |
| **116x** area reduction |       |      |      |
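The reduction factors quoted with the tables can be checked directly from the listed values. The short script below assumes the factors compare FP32 against INT8 and that the energy figures are read as ADD and MULT energy in pJ.

```python
# Values copied from the tables above (energy in pJ, area in µm²).
ADD_ENERGY_PJ = {"INT8": 0.03, "INT32": 0.1, "FP16": 0.4, "FP32": 0.9}
MULT_ENERGY_PJ = {"INT8": 0.2, "INT32": 3.1, "FP16": 1.1, "FP32": 3.7}
ADD_AREA_UM2 = {"INT8": 36, "INT32": 137, "FP16": 1360, "FP32": 4184}
MULT_AREA_UM2 = {"INT8": 282, "INT32": 3495, "FP16": 1640, "FP32": 7700}

def reduction(table, baseline="FP32", target="INT8"):
    """How many times cheaper `target` is than `baseline`."""
    return table[baseline] / table[target]

for name, table in [("ADD energy", ADD_ENERGY_PJ), ("MULT energy", MULT_ENERGY_PJ),
                    ("ADD area", ADD_AREA_UM2), ("MULT area", MULT_AREA_UM2)]:
    print(f"{name}: {reduction(table):.1f}x")
```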

#### Quantization effects: the good and the bad

**Limited precision** 



#### **DNN Quantization Methods**

- PTQ: Post-Training Quantization
- QAT: Quantization-Aware Training
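The difference between the two families can be sketched on a one-weight toy problem. PTQ trains in full precision and rounds afterwards; QAT keeps the quantizer in the forward pass and, in this sketch, uses the straight-through estimator (the rounding is treated as identity in the backward pass). The data, grid step, and function names are all hypothetical.

```python
def quantize(w, step):
    """Round a weight to the nearest multiple of `step` (uniform grid)."""
    return step * round(w / step)

def mse_grad(w, xs, ys):
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def train(xs, ys, step=None, lr=0.05, epochs=200):
    """Gradient descent on one weight; passing `step` enables QAT-style training."""
    w = 0.0  # full-precision (latent) weight
    for _ in range(epochs):
        w_forward = quantize(w, step) if step else w
        # Straight-through estimator: the gradient computed at the quantized
        # weight is applied unchanged to the latent weight.
        w -= lr * mse_grad(w_forward, xs, ys)
    return w

xs = [1.0, 2.0, 3.0]
ys = [0.8 * x for x in xs]  # toy data, true weight 0.8
STEP = 0.25

ptq_weight = quantize(train(xs, ys), STEP)             # train in float, then quantize
qat_weight = quantize(train(xs, ys, step=STEP), STEP)  # quantizer inside the loop
print(ptq_weight, qat_weight)
```

On a problem this small both approaches land on a grid point next to 0.8; the benefit of QAT appears in real networks, where the remaining weights can adapt to the quantization error during training.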



# Adaptive QAT

- Learning the bit-width during training
  - Both weights and activations

#### Hardware Loss:

$N_t \in \mathbb{R}^*_+$: bit-width, a new parameter to optimize

$$L_{\text{Hard}} = \alpha \cdot \sum_{l=1}^{L-1} \left\lceil N_t^{(l)} \right\rceil$$

Total loss: $L_{\text{Total}} = \lambda \cdot L_{\text{Task}}(\lceil N_t \rceil) + (1 - \lambda) \cdot L_{\text{Hard}}(\lceil N_t \rceil)$

Bit-width gradient approximation: $\frac{\partial L_{\text{Total}}}{\partial N_t} \approx \lambda \cdot \left(L_{\text{Task}}(\lceil N_t \rceil) - L_{\text{Task}}(\lceil N_t - 1 \rceil)\right) + (1 - \lambda) \cdot \frac{\partial L_{\text{Hard}}(\lceil N_t \rceil)}{\partial \lceil N_t \rceil}$
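A minimal sketch of how the bit-width update could work for a single layer, following the formulas above. The task loss is replaced by a hypothetical stand-in that halves with every extra bit; in practice it would be the network loss evaluated at the given precision. The learning rate and the values of $\lambda$ and $\alpha$ are arbitrary.

```python
import math

def hardware_loss(bit_widths, alpha):
    """L_Hard = alpha * sum of the rounded-up per-layer bit-widths."""
    return alpha * sum(math.ceil(n) for n in bit_widths)

def bitwidth_gradient(task_loss, n, lam, alpha):
    """Approximate gradient of the total loss w.r.t. one real-valued bit-width n."""
    # Finite difference of the task loss between ceil(n) and ceil(n - 1) bits.
    task_term = task_loss(math.ceil(n)) - task_loss(math.ceil(n - 1))
    hard_term = alpha  # each extra bit adds alpha to L_Hard
    return lam * task_term + (1 - lam) * hard_term

def toy_task_loss(bits):
    """Hypothetical stand-in: the task loss halves with every extra bit."""
    return 2.0 ** -bits

n = 8.0  # start from 8 bits
for _ in range(500):
    n -= 10.0 * bitwidth_gradient(toy_task_loss, n, lam=0.5, alpha=0.01)
final_bits = math.ceil(n)
print(final_bits)
```

With these settings the bit-width shrinks while the hardware term dominates and stops near the point where one fewer bit would cost more task loss than it saves in hardware.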








### Even Worse for Training...

• Carbon footprint of DNN training

Analyzing the carbon footprint of current natural-language processing models shows an alarming trend: **training one huge model for machine translation emits the same amount of CO2 as five cars in their lifetimes** (fuel included) [Strubell et al., ACL 2019]

- Many more operations than inference
- More pressure on memory access
- Much more difficult to accelerate

#### Need for a Significant Reduction of the Carbon Footprint of Neural Network Training Hardware

## **Low-Precision Training of DNNs**

- Explore mixed numerical precision hardware
  - Low-precision floating-point, variable-precision variants, building the accelerator



### **Arithmetic Support in Latest Chips?**

• Various trade-offs in terms of range, precision, performance



- Nvidia Hopper GH100 GPU
  - FP8 support in tensor cores provides up to 4x speedup
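The range versus precision trade-off follows from how bits are split between exponent and mantissa. The sketch below compares a few common formats using an IEEE-style formula; the list of formats is an assumption added for illustration, not taken from the slides.

```python
# name: (exponent bits, mantissa bits)
FORMATS = {
    "FP32": (8, 23),
    "FP16": (5, 10),
    "BF16": (8, 7),
    "FP8-E5M2": (5, 2),
    "FP8-E4M3": (4, 3),
}

def ieee_like_max(exp_bits, man_bits):
    """Largest finite value when the top exponent code is reserved for inf/NaN."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias

def epsilon(man_bits):
    """Relative spacing between 1.0 and the next representable value."""
    return 2.0 ** -man_bits

for name, (e, m) in FORMATS.items():
    print(f"{name:9s} max={ieee_like_max(e, m):.3e}  eps={epsilon(m):.3e}")

# Note: FP8 E4M3 as commonly specified reserves only a single NaN encoding,
# so its actual maximum is 448 rather than the IEEE-style 240 computed here.
```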



## **Energy Gains of Low Precision**

• Multiplier (float)

• Adder (float)



# **DNN Pruning**


#### **Network Pruning**






Structured pruning provides higher efficiency
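One way to see why: unstructured pruning leaves zeros scattered through a matrix of unchanged shape, whereas structured pruning removes whole rows and yields a smaller dense matrix that standard hardware can process directly. The sketch below uses magnitude-based criteria and a hypothetical 4x3 weight matrix.

```python
def prune_unstructured(matrix, sparsity):
    """Zero the individual weights with the smallest magnitude (ties are pruned too)."""
    flat = sorted(abs(w) for row in matrix for w in row)
    k = int(len(flat) * sparsity)
    if k == 0:
        return [row[:] for row in matrix]
    threshold = flat[k - 1]
    return [[0.0 if abs(w) <= threshold else w for w in row] for row in matrix]

def prune_rows(matrix, sparsity):
    """Structured pruning: remove the rows (e.g. output neurons) with smallest L1 norm."""
    norms = [sum(abs(w) for w in row) for row in matrix]
    k = int(len(matrix) * sparsity)
    by_norm = sorted(range(len(matrix)), key=lambda i: norms[i])
    keep = sorted(by_norm[k:])  # keep the strongest rows, in original order
    return [matrix[i][:] for i in keep]

W = [[0.9, -0.1, 0.4],
     [0.05, 0.02, -0.03],
     [-0.7, 0.6, 0.08],
     [0.2, -0.01, 0.3]]

sparse_w = prune_unstructured(W, 0.5)  # same 4x3 shape, half the entries zeroed
small_w = prune_rows(W, 0.5)           # dense 2x3 matrix
print(sparse_w)
print(small_w)
```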



#### Hardware-Aware Pattern Pruning




# Key Takeaways

- Energy efficiency requires deeply specialized hardware
  - Basic tasks of DNNs are easy to accelerate
- Deep knowledge of the hardware is required to propose energy-efficient models and techniques
  - Hardware-aware optimizations are mandatory
  - Structured pruning
  - Scaling quantization to large models, leverage mixed-precision
  - Efficient data-free quantization
  - Low-precision training (e.g., FP8)
- Of course, I am sorry (and not responsible?) for any rebound effect due to this work...

### **TARAN** Team at a Glance

• *Domain-Specific Computers* in the post Moore's law era




- IRISA/INRIA
- ~35 people, Rennes and Lannion campuses
- from INRIA, Univ. Rennes 1, ENS Rennes
- Domain-specific computing architectures and compilers
- Energy efficiency, fault tolerance, security