Author Biographies xi
Preface xiii
Acknowledgments xv
Table of Figures xvii
1 Introduction 1
1.1 Development History 2
1.2 Neural Network Models 4
1.3 Neural Network Classification 4
1.3.1 Supervised Learning 4
1.3.2 Semi-supervised Learning 5
1.3.3 Unsupervised Learning 6
1.4 Neural Network Framework 6
1.5 Neural Network Comparison 10
Exercise 11
References 12
2 Deep Learning 13
2.1 Neural Network Layer 13
2.1.1 Convolutional Layer 13
2.1.2 Activation Layer 17
2.1.3 Pooling Layer 18
2.1.4 Normalization Layer 19
2.1.5 Dropout Layer 20
2.1.6 Fully Connected Layer 20
2.2 Deep Learning Challenges 22
Exercise 22
References 24
3 Parallel Architecture 25
3.1 Intel Central Processing Unit (CPU) 25
3.1.1 Skylake Mesh Architecture 27
3.1.2 Intel Ultra Path Interconnect (UPI) 28
3.1.3 Sub Non-uniform Memory Access Clustering (SNC) 29
3.1.4 Cache Hierarchy Changes 31
3.1.5 Single/Multiple Socket Parallel Processing 32
3.1.6 Advanced Vector Software Extension 33
3.1.7 Math Kernel Library for Deep Neural Network (MKL-DNN) 34
3.2 NVIDIA Graphics Processing Unit (GPU) 39
3.2.1 Tensor Core Architecture 41
3.2.2 Winograd Transform 44
3.2.3 Simultaneous Multithreading (SMT) 45
3.2.4 High Bandwidth Memory (HBM2) 46
3.2.5 NVLink2 Configuration 47
3.3 NVIDIA Deep Learning Accelerator (NVDLA) 49
3.3.1 Convolution Operation 50
3.3.2 Single Data Point Operation 50
3.3.3 Planar Data Operation 50
3.3.4 Multiplane Operation 50
3.3.5 Data Memory and Reshape Operations 51
3.3.6 System Configuration 51
3.3.7 External Interface 52
3.3.8 Software Design 52
3.4 Google Tensor Processing Unit (TPU) 53
3.4.1 System Architecture 53
3.4.2 Multiply-Accumulate (MAC) Systolic Array 55
3.4.3 New Brain Floating-Point Format 55
3.4.4 Performance Comparison 57
3.4.5 Cloud TPU Configuration 58
3.4.6 Cloud Software Architecture 60
3.5 Microsoft Catapult Fabric Accelerator 61
3.5.1 System Configuration 64
3.5.2 Catapult Fabric Architecture 65
3.5.3 Matrix-Vector Multiplier 65
3.5.4 Hierarchical Decode and Dispatch (HDD) 67
3.5.5 Sparse Matrix-Vector Multiplication 68
Exercise 70
References 71
4 Streaming Graph Theory 73
4.1 Blaize Graph Streaming Processor 73
4.1.1 Stream Graph Model 73
4.1.2 Depth First Scheduling Approach 75
4.1.3 Graph Streaming Processor Architecture 76
4.2 Graphcore Intelligence Processing Unit 79
4.2.1 Intelligence Processing Unit Architecture 79
4.2.2 Accumulating Matrix Product (AMP) Unit 79
4.2.3 Memory Architecture 79
4.2.4 Interconnect Architecture 79
4.2.5 Bulk Synchronous Parallel Model 81
Exercise 83
References 84
5 Convolution Optimization 85
5.1 Deep Convolutional Neural Network Accelerator 85
5.1.1 System Architecture 86
5.1.2 Filter Decomposition 87
5.1.3 Streaming Architecture 90
5.1.3.1 Filter Weights Reuse 90
5.1.3.2 Input Channel Reuse 92
5.1.4 Pooling 92
5.1.4.1 Average Pooling 92
5.1.4.2 Max Pooling 93
5.1.5 Convolution Unit (CU) Engine 94
5.1.6 Accumulation (ACCU) Buffer 94
5.1.7 Model Compression 95
5.1.8 System Performance 95
5.2 Eyeriss Accelerator 97
5.2.1 Eyeriss System Architecture 97
5.2.2 2D Convolution to 1D Multiplication 98
5.2.3 Stationary Dataflow 99
5.2.3.1 Output Stationary 99
5.2.3.2 Weight Stationary 101
5.2.3.3 Input Stationary 101
5.2.4 Row Stationary (RS) Dataflow 104
5.2.4.1 Filter Reuse 104
5.2.4.2 Input Feature Maps Reuse 106
5.2.4.3 Partial Sums Reuse 106
5.2.5 Run-Length Compression (RLC) 106
5.2.6 Global Buffer 108
5.2.7 Processing Element Architecture 108
5.2.8 Network-on-Chip (NoC) 108
5.2.9 Eyeriss v2 System Architecture 112
5.2.10 Hierarchical Mesh Network 116
5.2.10.1 Input Activation HM-NoC 118
5.2.10.2 Filter Weight HM-NoC 118
5.2.10.3 Partial Sum HM-NoC 119
5.2.11 Compressed Sparse Column Format 120
5.2.12 Row Stationary Plus (RS+) Dataflow 122
5.2.13 System Performance 123
Exercise 125
References 125
6 In-Memory Computation 127
6.1 Neurocube Architecture 127
6.1.1 Hybrid Memory Cube (HMC) 127
6.1.2 Memory Centric Neural Computing (MCNC) 130
6.1.3 Programmable Neurosequence Generator (PNG) 131
6.1.4 System Performance 132
6.2 Tetris Accelerator 133
6.2.1 Memory Hierarchy 133
6.2.2 In-Memory Accumulation 133
6.2.3 Data Scheduling 135
6.2.4 Neural Network Vaults Partition 136
6.2.5 System Performance 137
6.3 NeuroStream Accelerator 138
6.3.1 System Architecture 138
6.3.2 NeuroStream Coprocessor 140
6.3.3 4D Tiling Mechanism 140
6.3.4 System Performance 141
Exercise 143
References 143
7 Near-Memory Architecture 145
7.1 DaDianNao Supercomputer 145
7.1.1 Memory Configuration 145
7.1.2 Neural Functional Unit (NFU) 146
7.1.3 System Performance 149
7.2 Cnvlutin Accelerator 150
7.2.1 Basic Operation 151
7.2.2 System Architecture 151
7.2.3 Processing Order 154
7.2.4 Zero-Free Neuron Array Format (ZFNAf) 155
7.2.5 The Dispatcher 155
7.2.6 Network Pruning 157
7.2.7 System Performance 157
7.2.8 Raw or Encoded Format (RoE) 158
7.2.9 Vector Ineffectual Activation Identifier Format (VIAI) 159
7.2.10 Ineffectual Activation Skipping 159
7.2.11 Ineffectual Weight Skipping 161
Exercise 161
References 161
8 Network Sparsity 163
8.1 Energy Efficient Inference Engine (EIE) 163
8.1.1 Leading Nonzero Detection (LNZD) Network 163
8.1.2 Central Control Unit (CCU) 164
8.1.3 Processing Element (PE) 164
8.1.4 Deep Compression 166
8.1.5 Sparse Matrix Computation 167
8.1.6 System Performance 169
8.2 Cambricon-X Accelerator 169
8.2.1 Computation Unit 171
8.2.2 Buffer Controller 171
8.2.3 System Performance 174
8.3 SCNN Accelerator 175
8.3.1 SCNN PT-IS-CP-Dense Dataflow 175
8.3.2 SCNN PT-IS-CP-Sparse Dataflow 177
8.3.3 SCNN Tiled Architecture 178
8.3.4 Processing Element Architecture 179
8.3.5 Data Compression 180
8.3.6 System Performance 180
8.4 SeerNet Accelerator 183
8.4.1 Low-Bit Quantization 183
8.4.2 Efficient Quantization 184
8.4.3 Quantized Convolution 185
8.4.4 Inference Acceleration 186
8.4.5 Sparsity-Mask Encoding 186
8.4.6 System Performance 188
Exercise 188
References 188
9 3D Neural Processing 191
9.1 3D Integrated Circuit Architecture 191
9.2 Power Distribution Network 193
9.3 3D Network Bridge 195
9.3.1 3D Network-on-Chip 195
9.3.2 Multiple-Channel High-Speed Link 195
9.4 Power-Saving Techniques 198
9.4.1 Power Gating 198
9.4.2 Clock Gating 199
Exercise 200
References 201
Appendix A: Neural Network Topology 203
Index 205