
Chapter 7 Domain-Specific Architectures

A Quantitative Approach, Sixth Edition

Introduction

• Moore’s Law enabled:
  - Deep memory hierarchy
  - Wide SIMD units
  - Deep pipelines
  - Branch prediction
  - Out-of-order execution
  - Speculative prefetching
  - Multithreading
  - Multiprocessing
• Objective:
  - Extract performance from software that is oblivious to architecture

Introduction

• Need a factor-of-100 improvement in the number of operations per instruction
  - Requires domain-specific architectures
  - For ASICs, NRE (non-recurring engineering) cost cannot be amortized over large volumes
  - FPGAs are less efficient than ASICs

Guidelines for DSAs

• Use dedicated memories to minimize data movement
• Invest resources into more arithmetic units or bigger memories
• Use the easiest form of parallelism that matches the domain
• Reduce data size and type to the simplest needed for the domain
• Use a domain-specific programming language


Example: Deep Neural Networks

• Inspired by the neurons of the brain
• Each neuron computes a non-linear “activation” function of the weighted sum of its input values
• Neurons are arranged in layers
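The neuron described above can be sketched in a few lines. This is a minimal illustration, not a production kernel; the function names (`neuron`, `layer`) and the choice of ReLU as the default activation are assumptions made for the example.

```python
import math


def relu(x):
    """Rectified linear unit: max(0, x)."""
    return max(0.0, x)


def sigmoid(x):
    """Logistic sigmoid, another common activation."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron(inputs, weights, activation=relu):
    """Apply a non-linear activation to the weighted sum of the inputs."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum)


def layer(inputs, weight_rows, activation=relu):
    """A layer is a list of neurons that all see the same input vector."""
    return [neuron(inputs, row, activation) for row in weight_rows]
```

Each row of `weight_rows` holds the weights of one neuron, so a layer is effectively a matrix-vector multiply followed by an element-wise activation.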

Example: Deep Neural Networks

• Most practitioners will choose an existing design
  - Topology
  - Data type
• Training (learning):
  - Calculate weights using the backpropagation algorithm
  - Supervised learning: stochastic gradient descent
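As a rough illustration of supervised training with stochastic gradient descent, the sketch below fits a single linear neuron, for which backpropagation reduces to one chain-rule step. The function name `sgd_fit`, the squared-error loss, and the hyperparameter defaults are assumptions chosen for the example, not values from the slides.

```python
import random


def sgd_fit(samples, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent on squared error."""
    rng = random.Random(seed)
    data = list(samples)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)              # visit the labeled examples in random order
        for x, y in data:
            error = (w * x + b) - y    # forward pass minus the label
            # Gradient of 0.5 * error**2 with respect to w and b.
            w -= learning_rate * error * x
            b -= learning_rate * error
    return w, b
```

A real DNN repeats the same idea layer by layer, propagating the error gradient backward through each activation and weight matrix.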

Multi-Layer Perceptrons

• Parameters:
  - Dim[i]: number of neurons
  - Dim[i-1]: dimension of input vector
  - Number of weights: Dim[i-1] x Dim[i]
  - Operations: 2 x Dim[i-1] x Dim[i]
  - Operations/weight: 2
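The MLP parameter formulas can be checked with a small calculator. This is a sketch; the function name `mlp_layer_costs` and the layer sizes used in the usage note are hypothetical.

```python
def mlp_layer_costs(dim_in, dim_out):
    """Weights and operations for one fully connected layer.

    dim_in  corresponds to Dim[i-1] (input vector dimension).
    dim_out corresponds to Dim[i]   (number of neurons).
    """
    weights = dim_in * dim_out
    operations = 2 * weights  # one multiply and one add per weight
    return {
        "weights": weights,
        "operations": operations,
        "ops_per_weight": operations / weights,
    }
```

For example, `mlp_layer_costs(4096, 2048)` reports 8,388,608 weights and 16,777,216 operations. The ratio stays at 2 regardless of layer size, which is why MLPs have low operational intensity without batching.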

Convolutional Neural Network

• Computer vision
• Each layer raises the level of abstraction
  - First layer recognizes horizontal and vertical lines
  - Second layer recognizes corners
  - Third layer recognizes shapes
  - Fourth layer recognizes features, such as ears of a dog
  - Higher layers recognize different breeds of dogs

Convolutional Neural Network

• Parameters:
  - DimFM[i-1]: Dimension of the (square) input Feature Map
  - DimFM[i]: Dimension of the (square) output Feature Map
  - DimSten[i]: Dimension of the (square) stencil
  - NumFM[i-1]: Number of input Feature Maps
  - NumFM[i]: Number of output Feature Maps
  - Number of neurons: NumFM[i] x DimFM[i]²
  - Number of weights per output Feature Map: NumFM[i-1] x DimSten[i]²
  - Total number of weights per layer: NumFM[i] x Number of weights per output Feature Map
  - Number of operations per output Feature Map: 2 x DimFM[i]² x Number of weights per output Feature Map
  - Total number of operations per layer: NumFM[i] x Number of operations per output Feature Map = 2 x DimFM[i]² x NumFM[i] x Number of weights per output Feature Map = 2 x DimFM[i]² x Total number of weights per layer
  - Operations/Weight: 2 x DimFM[i]²
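The CNN formulas translate directly into code. The sketch below mirrors the parameter names from the slide; the function name `cnn_layer_costs` and the example layer shape in the usage note are hypothetical.

```python
def cnn_layer_costs(dim_fm_out, dim_sten, num_fm_in, num_fm_out):
    """Neuron, weight, and operation counts for one convolutional layer.

    dim_fm_out: DimFM[i],    dimension of the square output feature map
    dim_sten:   DimSten[i],  dimension of the square stencil
    num_fm_in:  NumFM[i-1],  number of input feature maps
    num_fm_out: NumFM[i],    number of output feature maps
    """
    neurons = num_fm_out * dim_fm_out ** 2
    weights_per_out_fm = num_fm_in * dim_sten ** 2
    total_weights = num_fm_out * weights_per_out_fm
    ops_per_out_fm = 2 * dim_fm_out ** 2 * weights_per_out_fm
    total_ops = num_fm_out * ops_per_out_fm
    return {
        "neurons": neurons,
        "weights_per_out_fm": weights_per_out_fm,
        "total_weights": total_weights,
        "ops_per_out_fm": ops_per_out_fm,
        "total_ops": total_ops,
        "ops_per_weight": total_ops // total_weights,  # equals 2 * DimFM[i]^2
    }
```

With a 28 x 28 output feature map, a 3 x 3 stencil, 64 input maps, and 128 output maps, each weight is reused 2 x 28² = 1568 times, far more than the 2 operations per weight of an MLP layer.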

Recurrent Neural Network

• Speech recognition and language translation
• Long short-term memory (LSTM) network

Recurrent Neural Network

• Parameters:
  - Number of weights per cell: 3 x (3 x Dim x Dim) + (2 x Dim x Dim) + (1 x Dim x Dim) = 12 x Dim²
  - Number of operations for the 5 vector-matrix multiplies per cell: 2 x Number of weights per cell = 24 x Dim²
  - Number of operations for the 3 element-wise multiplies and 1 addition (vectors are all the size of the output): 4 x Dim
  - Total number of operations per cell (5 vector-matrix multiplies and the 4 element-wise operations): 24 x Dim² + 4 x Dim
  - Operations/Weight: ~2
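The LSTM cell counts can be reproduced the same way. This is a sketch following the slide's formulas; the function name `lstm_cell_costs` and the value of Dim in the usage note are hypothetical.

```python
def lstm_cell_costs(dim):
    """Weight and operation counts for one LSTM cell of width `dim`."""
    # 3 x (3 x Dim x Dim) + (2 x Dim x Dim) + (1 x Dim x Dim) = 12 x Dim^2
    weights = 3 * (3 * dim * dim) + (2 * dim * dim) + (1 * dim * dim)
    matmul_ops = 2 * weights       # multiply and add per weight: 24 x Dim^2
    elementwise_ops = 4 * dim      # 3 element-wise multiplies and 1 addition
    total_ops = matmul_ops + elementwise_ops
    return {
        "weights": weights,
        "matmul_ops": matmul_ops,
        "elementwise_ops": elementwise_ops,
        "total_ops": total_ops,
        "ops_per_weight": total_ops / weights,
    }
```

For Dim = 1024 the ratio is 2 + 1/(3 x 1024), which is why the slide rounds it to about 2: the element-wise work is negligible next to the vector-matrix multiplies.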

Convolutional Neural Network

• Batches:
  - Reuse weights once fetched from memory across multiple inputs
  - Increases operational intensity
• Quantization
  - Use 8- or 16-bit fixed point
• Summary:
  - Need the following kernels:
    - Matrix-vector multiply
    - Matrix-matrix multiply
    - Stencil
    - ReLU
    - Sigmoid
    - Hyperbolic tangent
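The effect of batching and quantization can be sketched numerically. Two assumptions are made here that the slides do not specify: operational intensity counts only weight traffic (activations are ignored), and quantization uses a symmetric linear scale to signed 8-bit values, which is one common scheme among several. All function names are hypothetical.

```python
def operational_intensity(batch_size, bytes_per_weight=1):
    """Operations per byte of weights fetched for a fully connected layer.

    Each weight contributes one multiply and one add per input in the batch,
    so reusing a fetched weight across the batch multiplies the intensity.
    """
    return 2 * batch_size / bytes_per_weight


def quantize_int8(values):
    """Symmetric linear quantization to the signed range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map quantized integers back to approximate real values."""
    return [q * scale for q in quantized]
```

Under these assumptions, a batch of 64 with 8-bit weights gives 128 operations per weight byte, versus 2 for a single input, and the same batch with 32-bit weights gives only 32.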

Tensor Processing Unit

• Google’s DNN ASIC
• 256 x 256 8-bit matrix multiply unit
• Large software-managed scratchpad
• Coprocessor on the PCIe bus


TPU ISA

• Read_Host_Memory
  - Reads data from the CPU host memory into the Unified Buffer
• Read_Weights
  - Reads weights from the Weight Memory into the Weight FIFO as input to the Matrix Unit
• MatrixMatrixMultiply/Convolve
  - Performs a matrix-matrix multiply, a vector-matrix multiply, an element-wise matrix multiply, an element-wise vector multiply, or a convolution from the Unified Buffer into the accumulators
  - Takes a variable-sized B x 256 input, multiplies it by a 256 x 256 constant input, and produces a B x 256 output, taking B pipelined cycles to complete
• Activate
  - Computes the activation function
• Write_Host_Memory
  - Writes data from the Unified Buffer into host memory
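The way the five instructions chain together can be sketched with a small functional model. This is a toy, not the real instruction encoding or a cycle-accurate simulator: the class name `ToyTPU`, the method names, the dictionary keyed by tile name, and the assumption that Activate writes its results back to the Unified Buffer are all illustrative choices. The matrix size is a parameter so the example can run with small matrices.

```python
def relu(value):
    """Rectified linear unit, used here as the example activation."""
    return max(0, value)


class ToyTPU:
    """Functional toy model of the five instructions; not cycle accurate."""

    def __init__(self, size=256):
        self.size = size          # Matrix Unit is size x size (256 on the TPU)
        self.weight_memory = {}   # tile name -> size x size weights (hypothetical keying)
        self.weight_fifo = []     # tiles staged for the Matrix Unit
        self.unified_buffer = []  # B rows of `size` activations
        self.accumulators = []    # B rows of `size` sums
        self.cycles = 0           # counts only matrix-multiply cycles

    def read_host_memory(self, host_rows):
        """Read_Host_Memory: copy a B x size input into the Unified Buffer."""
        if any(len(row) != self.size for row in host_rows):
            raise ValueError("every input row must have `size` elements")
        self.unified_buffer = [list(row) for row in host_rows]

    def read_weights(self, tile_name):
        """Read_Weights: move one tile from Weight Memory into the Weight FIFO."""
        self.weight_fifo.append(self.weight_memory[tile_name])

    def matrix_multiply(self):
        """Multiply the B x size buffer by the next size x size weight tile."""
        weights = self.weight_fifo.pop(0)
        self.accumulators = [
            [sum(row[k] * weights[k][j] for k in range(self.size))
             for j in range(self.size)]
            for row in self.unified_buffer
        ]
        self.cycles += len(self.unified_buffer)  # B pipelined cycles

    def activate(self, function=relu):
        """Activate: apply the activation to the accumulators."""
        self.unified_buffer = [[function(v) for v in row]
                               for row in self.accumulators]

    def write_host_memory(self):
        """Write_Host_Memory: return the Unified Buffer contents to the host."""
        return [list(row) for row in self.unified_buffer]
```

A layer is then one pass through `read_host_memory`, `read_weights`, `matrix_multiply`, `activate`, and `write_host_memory`, with the multiply charged B cycles for a batch of B input rows.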


Improving the TPU (figure omitted)

The TPU and the Guidelines

• Use dedicated memories
  - 24 MiB dedicated buffer, 4 MiB accumulator buffers
• Invest resources in arithmetic units and dedicated memories
  - 60% of the memory and 250X the arithmetic units of a server-class CPU
• Use the easiest form of parallelism that matches the domain
  - Exploits 2D SIMD parallelism
• Reduce the data size and type needed for the domain
  - Primarily uses 8-bit integers
• Use a domain-specific programming language
  - Uses TensorFlow

Microsoft Catapult

• Needed to be general purpose and power efficient
  - Uses FPGA PCIe board with dedicated 20 Gbps network in 6 x 8 torus
  - Each of the 48 servers in half the rack has a Catapult board
  - Limited to 25 watts
  - 32 MiB Flash memory
  - Two banks of DDR3-1600 (11 GB/s) and 8 GiB DRAM
  - FPGA (unconfigured) has 3926 18-bit ALUs and 5 MiB of on-chip memory
  - Programmed in Verilog RTL
  - Shell is 23% of the FPGA
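To make the 6 x 8 torus concrete, the sketch below computes the four wraparound neighbors of a board and the minimum hop count between two boards. The (row, column) coordinate scheme and the function names are assumptions for illustration; the slides do not say how boards are numbered.

```python
ROWS, COLS = 6, 8  # 48 boards, one per server in half a rack


def torus_neighbors(row, col, rows=ROWS, cols=COLS):
    """Four neighbors of a node; links wrap around at the edges."""
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "west": (row, (col - 1) % cols),
        "east": (row, (col + 1) % cols),
    }


def torus_hops(a, b, rows=ROWS, cols=COLS):
    """Minimum number of links between nodes a and b, given as (row, col)."""
    d_row = abs(a[0] - b[0])
    d_col = abs(a[1] - b[1])
    return min(d_row, rows - d_row) + min(d_col, cols - d_col)
```

Because of the wraparound links, opposite corners such as (0, 0) and (5, 7) are only two hops apart, and no pair of boards is more than 3 + 4 = 7 hops apart.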

Microsoft Catapult: CNN

• CNN accelerator, mapped across multiple FPGAs


Microsoft Catapult: Search Ranking

• Feature extraction (1 FPGA)
  - Extracts 4500 features for every document-query pair, e.g., the frequency with which the query appears in the page
  - Systolic array of FSMs
• Free-form expressions (2 FPGAs)
  - Calculates feature combinations
• Machine-learned scoring (1 FPGA for compression, 3 FPGAs calculate score)
  - Uses results of previous two stages to calculate floating-point score
• One FPGA allocated as a hot spare

Microsoft Catapult: Search Ranking

• Free-form expression evaluation
  - 60-core processor
  - Pipelined cores
  - Each core supports four threads that can hide each other’s latency
  - Threads are statically prioritized according to thread latency

Microsoft Catapult: Search Ranking

• Version 2 of Catapult
  - Placed the FPGA between the CPU and NIC
  - Increased network from 10 Gb/s to 40 Gb/s
  - Also performs network acceleration
  - Shell now consumes 44% of the FPGA
  - Now FPGA performs only feature extraction

Catapult and the Guidelines

• Use dedicated memories
  - 5 MiB dedicated memory
• Invest resources in arithmetic units and dedicated memories
  - 3926 ALUs
• Use the easiest form of parallelism that matches the domain
  - 2D SIMD for CNN, MISD parallelism for search scoring
• Reduce the data size and type needed for the domain
  - Uses mixture of 8-bit integers and 64-bit floating-point
• Use a domain-specific programming language
  - Uses Verilog RTL; Microsoft did not follow this guideline

Intel Crest

• DNN training
• 16-bit fixed point
• Operates on blocks of 32x32 matrices
• SRAM + HBM2

Pixel Visual Core

• Pixel Visual Core
  - Image Processing Unit
  - Performs stencil operations
  - Descended from the Image Signal Processor
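A stencil operation computes each output pixel from a small neighborhood of input pixels. The sketch below is a plain, unoptimized illustration of the idea and says nothing about how the Pixel Visual Core implements it; the function name `apply_stencil` and the choice to drop border pixels are assumptions.

```python
def apply_stencil(image, stencil):
    """Apply a square, odd-sized stencil to a 2D image (list of rows).

    Border pixels without a full neighborhood are dropped, so the output
    is smaller than the input by the stencil radius on each side.
    """
    size = len(stencil)
    radius = size // 2
    height, width = len(image), len(image[0])
    output = [[0] * (width - 2 * radius) for _ in range(height - 2 * radius)]
    for y in range(radius, height - radius):
        for x in range(radius, width - radius):
            total = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += stencil[dy + radius][dx + radius] * image[y + dy][x + dx]
            output[y - radius][x - radius] = total
    return output
```

Every output pixel reads the same fixed pattern of neighbors, which is why hardware with a 2D array of processing elements and a shifting network between them suits this kind of kernel.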

Pixel Visual Core

• Software written in Halide, a DSL
  - Compiled to a virtual ISA (vISA)
  - vISA is lowered to a physical ISA (pISA) using application-specific parameters
  - pISA is VLIW
• Optimized for energy
  - Power budget is 6 to 8 W for bursts of 10-20 seconds, dropping to tens of milliwatts when not in use
  - An 8-bit DRAM access takes as much energy as 12,500 8-bit integer operations or 7 to 100 8-bit SRAM accesses
  - IEEE 754 operations cost 22X to 150X as much as 8-bit integer operations
• Optimized for 2D access


Visual Core and the Guidelines

• Use dedicated memories
  - 128 + 64 MiB dedicated memory per core
• Invest resources in arithmetic units and dedicated memories
  - 16x16 2D array of processing elements per core and 2D shifting network per core
• Use the easiest form of parallelism that matches the domain
  - 2D SIMD and VLIW
• Reduce the data size and type needed for the domain
  - Uses mixture of 8-bit and 16-bit integers
