# 2nd TVM and Deep Learning Compilation Conference

SAMPL, Paul G. Allen School, December 5, 2019













## Welcome to the 1st 2nd TVM and Deep Learning Compilation Conference!




## 2020





### Problem to solve

Data + model templates

Train on *fa\$te\$t* machine

Inference on fast & cheap enough machine

Model size and compute cost growing fast

by Eugenio Culurciello

Training costs growing exponentially
MIT Technology Review, Jun 6, 2019

## **Training a single AI model can emit as much carbon as five cars in their lifetimes**

Deep learning has a terrible carbon footprint.

by Karen Hao

The artificial-intelligence industry is often compared to the oil industry: once mined and refined, data, like oil, can be a highly lucrative commodity. Now it seems the metaphor may extend even further. Like its fossil-fuel counterpart, the process of deep learning …

**Increase in Compute**

by OpenAI

## 5M in EC2 costs! AlphaGo Zero



ind mildew in the pipes sold last month for \$1.23 million.







## It gets more serious...





Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2017 by K. Rupp.

#### 42 Years of Microprocessor Trend Data






# Impact of ML will be limited if we don't squeeze as much efficiency as we can!


## Model, SW and HW optimization are key...

Cambrian explosion of models, workloads, and use cases.

CNN, RNN, DQNN, GAN, MLP

Growing set of requirements: cost, latency, power, security & privacy

Rapidly evolving ML software ecosystem, quickly fragmenting

Silicon scaling limitations (Dennard and Moore):

Cambrian explosion of HW backends. Heterogeneous HW.















# Deep learning "stack" (r?)evolution

Lots of hand-tuning; full automation would be *really* nice...

Figure: timeline of systems by year of introduction, from <=2010 to 2019. Systems shown include Caffe2, ONNX, TensorFlow Lite, NNVM, Deep Graph Library, nGraph, ONNC, PlaidML, Glow, Keras, Relay, PyTorch, taco, Tensor Comprehensions, DLVM, MXNet, NVIDIA TensorRT, Torch, Halide, TVM, Tiramisu, TensorFlow, AMD HCC, Caffe, MLIR, NVIDIA NVCC, GCC, and Theano.

## Current Dominant Deep Learning Systems Landscape

- Orchestrators: Kubeflow
- Frameworks and inference engines
- DL compilers: nGraph, Glow, MLIR
- Kernel libraries (hand optimized): MKL-DNN, NNPack, cuDNN
- Hardware





## TVM

Open source, automated end-to-end optimization framework for deep learning.





## Using ML for better ML systems.

Deal with design complexity and large parameter spaces...




Model optimization strategies and parameters

Efficient operator implementations

Data communication patterns

Model-HW co-tuning

Searching for efficient HW designs



More hardware backends (e.g., Cortex-M, RISC-V, DSPs)

More optimizations (e.g., quantization, data layout)

Usability (tutorials, docs, automation), community development



# TVM Open Source Community Growth and Impact

70% growth from Dec 2018 to 295 contributors from UW, Berkeley, Cornell, UCLA, Amazon, Huawei, NTT, Facebook, Microsoft, Qualcomm, Alibaba, Intel, ...

Used in production at leading vendors:

Deep Learning Compiler Service

Tensor Engine for mobile ASIC

Microsoft

Mobile and Server Optimizations

Cloud-side model optimization

Incubated as Apache TVM recently. Independent governance, allowing competitors to collaborate.









# Jeff Gehlhaar



University of Washington

**Dec 2019** 

# Qualcomm Technologies, Inc. AI Overview

Jeff Gehlhaar, VP Technology Qualcomm Technologies, Inc.






# We're creating a future of distributed intelligence

Our platforms are enabling a world of decentralized computing to realize the true potential of AI at scale. On-device inference processes data closest to the source for maximum speed and security, and low-latency 5G connectivity augments experiences with edge cloud processing for training updates and connected services.





Our process

### We design and develop holistic AI systems

Our process provides a comprehensive approach to AI research and development. We take on hard problems and tackle complexity head on to meticulously design and build systems that deliver complete end-to-end AI solutions, from fundamental research to product execution.

### AI Research





### Our AI software products

Qualcomm Neural Processing SDK, Qualcomm Hexagon and Qualcomm AI Engine are products of Qualcomm Technologies, Inc.



























### TVM is key to ML Access on Hexagon

#### Hexagon NN

- Currently supports ~100 ops
- Handwritten and optimized across 3 different Hexagon architecture variations
- Ops have to be written for both Hexagon Vector Extensions (HVX) and Hexagon Tensor Accelerator (HTA) units
- Incredible demand from customers to add new operators and operator variants
- Hexagon is a flexible and power-efficient but complex IP block to program efficiently. Like Halide for CV applications, TVM gives us an internal development advantage and gives customers a tool to develop custom operators.





### Key Ideas and Innovations

Qualcomm Technologies, Inc. is a leader in silicon for on-device and cloud solutions

Hexagon hardware provides a key power / performance advantage but is complicated to optimize

TVM and domain specific languages are key for per-kernel and whole graph optimization strategies

Our Qualcomm AI Research is advancing hardware aware optimization strategies

Qualcomm AI Research is an initiative of Qualcomm Technologies, Inc




# Thank you


For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Nothing in these materials is an offer to sell any of the components or devices referenced herein.

©2018-2019 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved.

Qualcomm, Snapdragon and Hexagon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners. References in this presentation to "Qualcomm" may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm's licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm's engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.

# Yida Wang



### amazon

### AWS AI

- The broadest and most complete set of machine learning capabilities
  - AI Services
  - Amazon SageMaker
  - ML Frameworks & Infrastructure
- More machine learning happens on AWS than anywhere else
  - 81% of deep learning in cloud runs on AWS

- As a cloud service: Amazon SageMaker Neo
- As a solution
  - Fastest model inference on a number of Amazon EC2 instances
  - Alexa Wakeword model on Amazon Echo
  - Collaborating with a number of external device makers
- As a research project
  - Three accepted peer-reviewed papers
  - More under review and in preparation
- As a compiler
  - AWS Inferentia


- Joined the effort from the very beginning; one of the major contributors
- Major features in the past year
  - Frontend: TF object detection model
  - Relay: pass manager, VM, QNN dialect, graph partitioning
  - Optimization: vision-specific ops, conv2d\_transpose, sparsity, BERT
  - Runtime: bring your own codegen
- Service in the community
  - 2 PMC members, 8 committers, 14 reviewers, and growing
  - Active participation and leadership



# Jason Knight

# OctoML



### Secure and efficient deep learning everywhere













N = number of people building machine learning models

M = number of software developers

N >> M

as t  $\rightarrow \infty$ 















### **Deployment Pain/Complexity**

- Model ingestion
- Performance estimation and comparison
- Cartesian product of models, frameworks, and hardware
- Optimization
  - O0, O1, O2
  - Target settings: march, mtune, mcpu
  - Size reductions
  - Quantization, pruning, distillation
- Custom operators (scheduling, cross hardware support)
- Lack of portability / varying coverage across frameworks
- Model integration
  - Output portability
  - Packaging (Android APK, iOS ipa, Python wheel, Maven artifact, etc)
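To get a feel for how quickly the "Cartesian product of models, frameworks, and hardware" grows, here is a minimal Python sketch. The model, framework, and hardware names are illustrative placeholders chosen for this example, not a list taken from the talk.

```python
from itertools import product

# Illustrative placeholders (assumptions for this sketch, not from the talk).
models = ["resnet50", "bert-base", "lstm-lm"]
frameworks = ["tensorflow", "pytorch", "mxnet", "onnx"]
hardware = ["x86-cpu", "arm-cpu", "nvidia-gpu", "dsp"]

def deployment_matrix(models, frameworks, hardware):
    """Enumerate every (model, framework, hardware) combination to validate."""
    return list(product(models, frameworks, hardware))

combos = deployment_matrix(models, frameworks, hardware)
print(len(combos), "configurations to benchmark")
```

Even with three or four entries per axis, the number of configurations multiplies, which is one way to read the "performance estimation and comparison" pain point.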














TVM is core to making that happen.

... but it's only the first (important!) step





### What are we doing about it?

To make DL deployment easy for everyone:

1. Strengthen the core:
   - Invest in open source TVM for robustness, accessibility, community, and coverage
   - (See next slide)





### OctoML investments into TVM

Talks today:

- Unified IR (Tianqi Chen)
- Dynamic Execution and Virtual Machine (Jared Roesch and Haichen Shen)
- uTVM: TVM on bare-metal devices (Logan Weber)
- TVM at OctoML (Jason Knight)

Not presented today:

- TVM Transformer Improvements (Josh Fromm)
- Automatic Quantization (Ziheng Jiang)





### What are we doing about it?

To make DL deployment easy for everyone:

1. Strengthen the core:
   - Invest in open source TVM for robustness, accessibility, community, and coverage
   - (See next slide)
2. Build additional stepping stones
   - By forming a company! (come see our OctoML talk in the afternoon)









Simple, secure, and efficient deployment of ML models in the edge and the cloud

Drive TVM adoption: Core infrastructure and improvements

Apache TVM ecosystem



# OctoML





Expand the set of users who can deploy ML models: Services, automation, and integrations





### Team - The Octonauts



Luis Ceze, Co-founder, CEO. PhD in Computer Architecture and Compilers. Professor at UW-CSE. Venture Partner, Madrona Ventures.

Jason Knight, Co-founder, CPO. PhD in Computational Biology and Machine Learning.



Logan Weber



An Wang



Josh Fromm



Zachary Tatlock






Tianqi Chen, Co-founder, CTO. PhD in Machine Learning. Professor at CMU-CS.

Thierry Moreau, Co-founder, Architect. PhD in Computer Architecture.

Jared Roesch, Co-founder, Architect. (soon) PhD in Programming Languages.

Advisors

Andrew McHarg, Ziheng Jiang, Amanda Robles



Jay Bartot



Carlos Guestrin



Arvind Krishnamurthy















### Find out more!

Come to our presentation about the Octomizer this afternoon

- Our first SaaS product for making DL deployment easy
  - Push button AutoTVM optimization
  - Perf comparisons/analysis across models, frameworks, and hardware
  - And more!

https://octoml.ai (mailing list signup)

@octoml on Twitter

Email us! (jknight@octoml.ai)















# Let's Get in the Wayback Machine



- We apply the straightforward insight that machine learning models are just programs.
- This generalization enables support for a greater range of programs, new optimizations, and the ability to target a range of devices.
- http://sampl.cs.washington.edu
- http://tvm.ai



# Challenges for Deep Learning IRs

- State-of-the-art models increasingly depend on:
  - Datatypes lists, trees, graphs
  - Control flow branches, loops, recursion
  - Whole-program analyses and optimizations
- Any one feature "easy to bolt on"
- Folklore suggests full, expressive IR will be slow

let encode =  $\lambda$  st. **if(...)**: encode(step(st)) else:
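The `encode` fragment above is the kind of data-dependent recursion that a static dataflow graph cannot express without unrolling. A minimal Python sketch of the same shape follows; the `step` and `done` callables are hypothetical stand-ins, and because the slide elides the else-branch, returning the state there is an assumption.

```python
def encode(state, step, done):
    """Recursively apply `step` until `done(state)` holds.

    The recursion depth depends on the input data, so the number of
    steps is not known when the model is defined.
    """
    if not done(state):
        return encode(step(state), step, done)
    # Assumption: the slide elides the else-branch; here we return the state.
    return state

# Toy usage: halve a number until it drops below 1.
result = encode(40.0, step=lambda s: s / 2, done=lambda s: s < 1.0)
print(result)
```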



- Relay generalizes NNVM
- Retains graph-level optimizations
- Provides more expressive features
  - Datatypes, control flow, code re-use
  - Functional semantics to simplify analysis
  - Automatic differentiation + optimizations

# The Relay IR

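$$
\begin{array}{rcl}
\text{Expr } e & ::= & \%l \\
 & \mid & \mathrm{const}((r \mid b), s, bt) \\
 & \mid & e(\langle \tau, \ldots, \tau \rangle)?(e, \ldots, e) \\
 & \mid & \mathtt{let}\ \%l\,(:\tau)? = e;\ e \\
 & \mid & e;\ e \\
 & \mid & \%\mathit{graph} = e;\ e \\
 & \mid & \mathtt{fn}\ (\langle \mathit{tyParam}, \ldots, \mathit{tyParam} \rangle)?\ (\mathit{param}, \ldots, \mathit{param})\ (\rightarrow \tau)?\ \{e\} \\
 & \mid & (e, \ldots, e) \\
 & \mid & e.n \\
 & \mid & \mathtt{if}\ (e)\ \{e\}\ \mathtt{else}\ \{e\} \\
 & \mid & \mathtt{match}\ (e)\ \{\ p \rightarrow e \quad p \rightarrow e\ \} \\
 & \mid & \mathit{op} \\
 & \mid & \mathtt{ref}(e) \\
 & \mid & !e \\
 & \mid & e := e
\end{array}
$$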

~ "OCaml for ML"





### High-level Relay models match NNVM in traditional vision inference

- Low-cost abstraction enabled by:
  - Tensor shape inference and specialization
  - High-level operator fusion
  - Whole-program partial evaluation
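As a rough illustration of what tensor shape inference has to decide, the sketch below computes the output shape of an elementwise operator under NumPy-style broadcasting. This is plain Python written for this note, with broadcasting chosen as an example rule; it is not Relay's implementation, and the function name is an assumption.

```python
def broadcast_shape(lhs, rhs):
    """Infer the output shape of an elementwise op under NumPy-style broadcasting."""
    out = []
    # Walk both shapes from the trailing dimension, padding the shorter with 1s.
    for i in range(1, max(len(lhs), len(rhs)) + 1):
        l = lhs[-i] if i <= len(lhs) else 1
        r = rhs[-i] if i <= len(rhs) else 1
        if l != r and l != 1 and r != 1:
            raise ValueError(f"incompatible shapes {lhs} and {rhs}")
        out.append(max(l, r))
    return tuple(reversed(out))

print(broadcast_shape((10, 1, 3), (4, 3)))
```

A rule like this relates input and output types: it either holds for a given combination of shapes or rejects it, which is the role the typing relations play in the rules that follow.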

Code fragment (partially recovered from the slide): `%d1 })(%x1); %x2.1 := ones_like(%x2.0) let %x3 = read(%x)(); (%x2.0, (read(%x1.1),)) }`

### **Relation-T**

$$
\frac{\Delta, T_1 : \text{Type}, \ldots, T_n : \text{Type} \vdash (Rel(T_1, T_2, \ldots, T_n) \in \{\top, \bot\})}
     {\Delta; \Gamma \vdash Rel : \text{Relation}}
$$

### **Type-Func-Def**

$$
\frac{\begin{array}{c}
\forall i \in [1, r] \quad \Delta; \Gamma \vdash R_i(T_1, \ldots, T_n, O) \\
\Delta; \Gamma, a_1 : T_1, \ldots, a_n : T_n,\ f : fn(T_1, \ldots, T_n) \to O \text{ where } R_1, \ldots, R_r \vdash body : O
\end{array}}
{\begin{array}{c}
\Delta; \Gamma \vdash \mathsf{def}\ @f(a_1 : T_1, \ldots, a_n : T_n) \to O \text{ where } R_1, \ldots, R_r\ \{ body \} : \\
fn(T_1, \ldots, T_n) \to O \text{ where } R_1, \ldots, R_r
\end{array}}
$$

### **Type-Call**

$$
\frac{\begin{array}{c}
\Delta; \Gamma \vdash f : fn(T_1, \ldots, T_n) \to O \text{ where } R_1, \ldots, R_r \\
\Delta; \Gamma \vdash a_1 : T_1, \ldots, a_n : T_n \qquad \forall i \in [1, r] \quad \Delta; \Gamma \vdash R_i(T_1, \ldots, T_n, O)
\end{array}}
{\Delta; \Gamma \vdash f(a_1, \ldots, a_n) : O}
$$





### Low-cost abstraction enabled by:

### But most of all by extensible, composable optimization framework!






# Relay Win: Support for New Models

### Plus support for new/improved targets via high-level transformations:



### High-level Relay models for RNNs and LSTMs can outperform the rest



## Research Ready Production Ready

### [RELEASE][DRAFT] TVM v0.6 Release candidate #4259

tqchen opened this issue 29 days ago · 38 comments

Dear Community, thanks to everyone's effort in the past few months. This is a proposal to do a v0.6 release.

This release will be managed by the TVM PMC, with @yzhliu and myself as moderators. In the next few days we will be populating the release note in this thread. Most release note content will be derived from our monthly report.

We also encourage everyone in the community to reply to the thread about pending PRs that should be included in the v0.6.

It is our first release after moving to the apache repo. So the main goal is about passing the general reviews to … released product matches the ASF requirements. We hope that we can … the future releases.

### **New Features**

### **Relay in Production**

Relay is a functional, differentiable programming language designed to be an expressive intermediate representation for machine learning systems. Relay supports algebraic data types, closures, control flow, and recursion, allowing it to directly represent more complex models than computation graph-based IRs (e.g., NNVM) can. In TVM v0.6, Relay is in stable phase and is ready for production.

- Algebraic Data Types (ADT) support (#2442, #2575). ADT provides an expressive, efficient, and safe way to realize recursive computation (e.g., RNN). Refer to https://docs.tvm.ai/langref/relay\_adt.html for more information
- Pass manager for Relay (#2546, #3226, #3234, #3191)
- Most frameworks have been supported in Relay, including ONNX, Keras, Tensorflow, Caffe2, CoreML, NNVMv1, MXNet (#2246).
- Explicitly manifest memory and tensor allocations in Relay. (#3560)

### **Relay Virtual Machine**

The Relay Virtual Machine (Relay VM) is the new generation of runtime to strike a balance between performance and flexibility when deploying and executing Relay programs. Previously, the graph runtime is able to utilize the fully static nature of the input graphs to perform aggressive optimization such as fully static allocation, and optimal memory reuse. When we introduce models which make use of control-flow, recursion, dynamic shapes, dynamic allocation we must change how execution works.

## Relay + You!

- Relay merged into TVM mainline
  - Documentation, tutorials, examples
  - Add your own analyses and optimizations
  - Target new accelerators
  - Support new models
  - Tons of community support!





















+ many more amazing folks!






Tianqi Chen







Frameworks

Primitive Tensor operators such as Conv2D

cuDNN (NVIDIA)

Hardware

New operator introduced by operator fusion optimization. Potential benefit: 1.5x speedup

# **Engineering intensive**
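Operator fusion produces operators that a hand-written kernel library may not already cover, which is where the engineering cost comes from. The toy Python sketch below shows the idea only; the three operators (`scale`, `add_bias`, `relu`) are assumptions picked for illustration, and the sketch does not reproduce the 1.5x figure from the slide.

```python
def scale(xs, a):
    return [a * x for x in xs]

def add_bias(xs, b):
    return [x + b for x in xs]

def relu(xs):
    return [max(x, 0.0) for x in xs]

def unfused(xs, a, b):
    # Three separate operators: each reads and writes a full intermediate buffer.
    return relu(add_bias(scale(xs, a), b))

def fused_scale_bias_relu(xs, a, b):
    # One fused operator: a single pass over the data, no intermediate buffers.
    return [max(a * x + b, 0.0) for x in xs]

data = [-2.0, -0.5, 0.0, 1.5]
fused_out = fused_scale_bias_relu(data, 2.0, 1.0)
print(fused_out)
```

The fused version computes the same values, but it is a new operator: a library that ships only `scale`, `add_bias`, and `relu` kernels would need a fourth hand-tuned kernel to benefit.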



























Machine Learning based Program Optimizer

**Directly generate optimized program** for new operator workloads and hardware

Hardware
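The core loop of such an optimizer is "propose a program variant, estimate its cost, keep the best". The sketch below is a deliberately simplified stand-in: it searches tile sizes for a pure-Python matrix multiply and uses direct wall-clock measurement of every candidate instead of a learned cost model. All names here are assumptions for illustration.

```python
import time

def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) with loop tiling."""
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, n)):
                row_a, row_c = a[i], c[i]
                for k in range(k0, min(k0 + tile, n)):
                    aik, row_b = row_a[k], b[k]
                    for j in range(n):
                        row_c[j] += aik * row_b[j]
    return c

def measure(tile, a, b, n):
    """Cost of one candidate: wall-clock time of a single run."""
    start = time.perf_counter()
    tiled_matmul(a, b, n, tile)
    return time.perf_counter() - start

def tune(candidates, a, b, n):
    """Measure every candidate and return (best_tile, timings)."""
    timings = {t: measure(t, a, b, n) for t in candidates}
    return min(timings, key=timings.get), timings

n = 32
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i - j) for j in range(n)] for i in range(n)]
best_tile, timings = tune([4, 8, 16, 32], a, b, n)
print("best tile size:", best_tile)
```

A learning-based optimizer replaces the exhaustive `tune` loop with a model that predicts which candidates are worth measuring, which matters once the search space is far larger than four tile sizes.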











# Why Automation is the Future

Clear winner on emerging models in product

Competitive on benchmark-type models

Quickly enables other optimizations: fusion, layout, parallelization

Portable performance across devices





#### High-Level Differentiable IR

#### Tensor Expression and Optimization Search Space

#### LLVM, CUDA, Metal











# Community Highlights

- More dynamism
- Tiny machine learning
- Better core infra
- More specialized accelerator support

### Model



### static computational graph





### single tensor with known shape





### program with loops and recursions





#### sequence, trees, nested data structure





# Relay Virtual Machine

#### source program



Dynamic shape workloads

More runtime objects: Arrays, Tuples, Trees, ADTs

Minimum runtime for dynamic models

**Credit: Jared Roesch, Haichen Shen et al.**

### VM bytecode and runtime




# Machine Learning is Getting into Tiny Devices

# Challenges: limited resources, OS support









# uTVM: TVM on bare-metal Devices

# Support bare-metal JTAG devices; no OS is needed

### ARM Cortex-M RISC-V



#### **Credit: Logan Weber et al**


# Core Infrastructure

New integer simplification and analysis

Unified runtime object protocol

Objects: Module, AST/IR nodes, NDArray, Tuple/Closure

Easy to add new objects (trees, graphs)

Cross language support






## Tensorization Challenge for Specialized Accelerators

### TPUs

#### Tensor Compute Primitives

#### Explicitly Managed Memory Subsystem

#### Compute primitives

vector

Challenge: Build systems to support emerging tensor instructions



**HW Interface Specification by Tensor Expression** 

### **Computation Specification (Tensor Expression)**

C = tvm.compute((m, n), lambda y, x: tvm.sum(A[k, y] \* B[k, x], axis=k))
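Read as plain loops, the tensor expression above says that `C[y][x]` is the sum over `k` of `A[k][y] * B[k][x]`. The reference implementation below spells that out in pure Python. It assumes `A` has shape `(K, m)` and `B` has shape `(K, n)` with `k` ranging over `K`; the slide does not show those declarations, so the shapes and the function name are assumptions.

```python
def matmul_transposed_a(A, B):
    """Reference for C[y][x] = sum over k of A[k][y] * B[k][x]."""
    K = len(A)        # extent of the reduction axis k
    m = len(A[0])     # extent of y
    n = len(B[0])     # extent of x
    return [[sum(A[k][y] * B[k][x] for k in range(K)) for x in range(n)]
            for y in range(m)]

A = [[1, 2],
     [3, 4],
     [5, 6]]          # shape (K=3, m=2)
B = [[1, 0, 2],
     [0, 1, 2],
     [1, 1, 0]]       # shape (K=3, n=3)
C = matmul_transposed_a(A, B)
print(C)
```

The expression only states what is computed; how the loops are ordered, tiled, or mapped onto a hardware tensor instruction is decided separately, which is what makes tensorization possible.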

**Tensorization** 












**Credit: Siyuan Feng** 

# TVM for TensorCore





# VTA: Open & Flexible Deep Learning Accelerator

Stack layers shown: VTA Runtime & JIT Compiler; VTA Hardware/Software Interface (ISA); VTA MicroArchitecture; VTA Simulator

- Runtime JIT compile accelerator micro code
- Support heterogeneous devices, 10x better than CPU on the same board.
- Move hardware complexity to software

VTA 2.0 release - Chisel compiler, driver, hardware design full stack open source

HW-SW Blueprint for Flexible Deep Learning Acceleration. Moreau et al. IEEE Micro 2019.



# TSIM: Support for Future Hardware

#### Current TVM Stack

### New NPU Runtime

**TSIM Driver**

### **TSIM Binary**

#### New Hardware Design in Verilog

Verilator

#### **Credit: Luis Vega, Thierry Moreau**

# Where are we going: Selected Topics

## **Unified Runtime**

## **Unified IR**

## **Full-stack Automation**


tvm::runtime::Module

**Runtime Module Interface**

GetFunction(string) -> tvm::runtime::PackedFunc

SaveToBinary/LoadFromBinary

#### Subclasses

NPUModule, CUDAModule, TFModule

### Device Drivers

NPU Driver, CUDA Driver

External Runtimes
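The interface on this slide is small: look up a function by name, and serialize the module. The Python sketch below mirrors that shape with an abstract base class and one toy subclass. It is an analogy written for this note, not TVM's C++ API; `ToyNPUModule`, the snake_case method names, and the serialization format are all assumptions.

```python
from abc import ABC, abstractmethod
from typing import Callable

class Module(ABC):
    """Python analogue of the runtime module interface (names adapted)."""

    @abstractmethod
    def get_function(self, name: str) -> Callable:
        """Look up a callable by name (cf. GetFunction -> PackedFunc)."""

    @abstractmethod
    def save_to_binary(self) -> bytes:
        """Serialize the module (cf. SaveToBinary)."""

class ToyNPUModule(Module):
    """Hypothetical backend: 'kernels' are plain Python callables."""

    def __init__(self, kernels):
        self._kernels = dict(kernels)

    def get_function(self, name):
        return self._kernels[name]

    def save_to_binary(self):
        # Toy format: just the sorted kernel names, newline-separated.
        return "\n".join(sorted(self._kernels)).encode("utf-8")

mod = ToyNPUModule({"npufunction0": lambda a, b: a + b})
func = mod.get_function("npufunction0")
print(func(2, 3))
```

Because callers only depend on the base interface, a new backend (an NPU driver, a CUDA driver, or an external runtime) plugs in by subclassing, without changes to packaging or calling code.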







# Unified Runtime Benefit

Unified library packaging

Free API (Py/Java/Go)

Automatic RPC Support

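```
import tvm

# Package the compiled module (built earlier as `mod`) into a shared library.
mod.export_library("mylib.so")
# Load it back and call a function by name; a and b are caller-provided arrays.
lib = tvm.module.load("mylib.so")
func = lib["npufunction0"]
func(a, b)
```

```
from tvm import rpc

# board_url and port identify the RPC server running on the remote device.
remote = rpc.connect(board_url, port)
remote.upload("mylib.so")
remote_mod = remote.load_module("mylib.so")
func = remote_mod["npufunction0"]
func(remote_a, remote_b)  # arrays already allocated on the remote device
```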




# Overview of New IR Infra



Unified module/pass, type system, with function variants support



runtime::Module

High-level optimizations

(Auto) Schedules Low-level optimizations

# Mixed Function Variants in the Same Module

```
def @relay_add_one(%x : Tensor((10,), f32)) {
  call_destination_passing @te_add_one(%x, out=%b)
}

def @te_add_one(%a: NDArray, %b: NDArray) {
  var %n
  %A = decl_buffer(shape=[%n], src=%a)
  %B = decl_buffer(shape=[%n], src=%b)
  for %i = 0 to 10 [data_par] {
    %B[%i] = %A[%i] + 1.0
  }
}
```
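As a rough illustration of the destination-passing call between the two variants, here is a plain-Python sketch. It uses lists instead of tensors and is not TVM code; the function names simply mirror the ones in the IR text. The high-level function allocates the output buffer, and the low-level function writes into the buffer it is handed.

```python
from typing import List


def te_add_one(a: List[float], b: List[float]) -> None:
    """Low-level variant: writes a[i] + 1.0 into the caller-provided buffer b."""
    assert len(a) == len(b)
    for i in range(len(a)):
        b[i] = a[i] + 1.0


def relay_add_one(x: List[float]) -> List[float]:
    """High-level variant: allocates the output, then calls the low-level one."""
    out = [0.0] * len(x)
    te_add_one(x, out)  # destination passing: the callee fills `out`
    return out


result = relay_add_one([0.0, 1.5, 2.0])
print(result)
```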

### Use hybrid script as an alternative text format

### Directly write pass, manipulate IR structures

```
mod = tvm.IRModule([te_add_one])
print(mod["te_add_one"].args)
```

### e.g. use (GA/RL/BayesOpt/your favorite ML method) for AutoSchedule

Accelerate innovation, easy shift to C++ when product ready
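To make the idea of driving schedule search from Python concrete, the sketch below uses plain random search as a stand-in for GA/RL/BayesOpt. Everything in it is an assumption made for illustration: `toy_cost` is a made-up cost model (not a measurement and not a TVM API), and the only schedule knob is a single tile size.

```python
import random


def toy_cost(tile: int, n: int = 1024, cache: int = 64) -> float:
    """Made-up cost: loop overhead plus penalties for cache overflow and remainders."""
    overflow = max(0, tile - cache)
    remainder = n % tile
    loop_overhead = n / tile
    return loop_overhead + 10.0 * overflow + 5.0 * remainder


def random_search(candidates, cost_fn, trials, seed=0):
    """Sample candidates at random and keep the cheapest one seen."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(trials):
        choice = rng.choice(candidates)
        cost = cost_fn(choice)
        if cost < best_cost:
            best, best_cost = choice, cost
    return best, best_cost


candidates = [2 ** k for k in range(9)]  # tile sizes 1, 2, 4, ..., 256
best, best_cost = random_search(candidates, toy_cost, trials=50)
print(best, best_cost)
```

Swapping in a smarter method only changes how the next candidate is proposed; the cost function and the best-so-far bookkeeping stay the same.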





# Rethink Low-level Tensor IR

### IRModule (relay::Function)

### IRModule (te::Function, ExternFunc, ...)

runtime::Module

Function as unit of transformation

Schedule transformation as pass

Better tensorization support
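A minimal sketch of "schedule transformation as pass", assuming a toy IR invented for this example: a module is a dict of functions and a function is just a tuple of loops. `Loop`, `Func`, `split_pass` and `sequential` are hypothetical names, not TVM classes. A loop split, normally a schedule primitive, is written as an ordinary module-to-module pass that rewrites one function.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Loop:
    var: str
    extent: int


@dataclass(frozen=True)
class Func:
    name: str
    loops: Tuple[Loop, ...]


Module = Dict[str, Func]
Pass = Callable[[Module], Module]


def split_pass(func_name: str, var: str, factor: int) -> Pass:
    """Build a pass that splits loop `var` of one function into outer/inner loops."""

    def run(mod: Module) -> Module:
        func = mod[func_name]
        new_loops = []
        for loop in func.loops:
            if loop.var == var:
                if loop.extent % factor != 0:
                    raise ValueError("toy split requires an exact division")
                new_loops.append(Loop(var + "_outer", loop.extent // factor))
                new_loops.append(Loop(var + "_inner", factor))
            else:
                new_loops.append(loop)
        out = dict(mod)  # leave the input module untouched
        out[func_name] = replace(func, loops=tuple(new_loops))
        return out

    return run


def sequential(*passes: Pass) -> Pass:
    """Compose passes left to right into a single pass."""

    def run(mod: Module) -> Module:
        for p in passes:
            mod = p(mod)
        return mod

    return run


mod = {"te_add_one": Func("te_add_one", (Loop("i", 1024),))}
new_mod = sequential(split_pass("te_add_one", "i", 32))(mod)
print(new_mod["te_add_one"].loops)
```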





# Interoperate with Other ML Compiler Infra

TorchScript



**MLIR-TF** Function

#### IRModule

ExternFunc

Function in Other IR

#### te::Function

#### runtime::Module

ExternModule

### DSOModule

### relay::Function

### IR Translation

Custom Packaging

Custom codegen


# Full Stack Automation



#### High-Level Differentiable IR

#### Tensor Expression and Optimization Search Space

#### LLVM, CUDA, Metal













VTA

Cloud FPGA

ASIC



















# 2020 Projected Timeline: Selected Topics

### Non-comprehensive list of ongoing topics

Ultra Low bits Gradient/Training

uTVM Standalone

Unified IR Refactoring

Unified IR Unified Runtime Runtime RFC First Version

Jan



- BERT TSIM AutoSchedule
- Dynamic Shape NPU coverage
- First Release with New IR Infra
- Documentation Benchmarking

Full Stack Automation

Oct

Community



# Open Source Code, Open Development, Open Governance

Incubated as Apache TVM. Independent governance, allowing competitors to collaborate.

### **Growing Developer Community** 22 committers, 47 reviewers, 295 contributors

- ~70% growth since TVM Conf 2018

### **Monthly Statistics** ~50 authors, ~140 PRs, ~1000 discuss forum posts

# **tvm**.ai SAMPL

### **Big THANKS to our sponsors!**













Semiconductor Research Corporation











| Time  | Session                                     |
|-------|---------------------------------------------|
| 9:00  | Keynote & Community Update<br>TVM @ AWS, FB |
| 11:10 | Break                                       |
| 11:30 | Compilers and VMs                           |
| 12:20 | **Boxed lunches - Contributors Meetup**     |
| 13:10 | Lightning talks                             |
| 13:40 | Hardware<br>TVM @ Microsoft, ARM, Xilinx    |
| 15:10 | Break                                       |
| 15:30 | Automation, new Hardware                    |
| 16:50 | Break                                       |
| 17:00 | Lightning talks                             |
| 18:10 | Social (food, drinks)                       |
| 20:00 | Adjourn                                     |

- Keynote (SAMPL, Qualcomm, Amazon, OctoML)
- TVM @ AWS – Yida Wang, Amazon
- TVM @ FB – Andrew Tulloch and Bram Wasti, Facebook

- AI Compilers at Alibaba – Yangqing Jia, Alibaba
- Dynamic Execution and VMs – Jared Roesch and Haichen Shen, UW and AWS

- Building FPGA-Targeted Accelerators with HeteroCL – Zhiru Zhang, Cornell
- TVM @ Microsoft – Jon Soifer and Minjia Zhang
- TVM @ ARM – Ramana Radhakrishnan
- TVM @ Xilinx – Elliott Delaye

- TVM @ OctoML – Jason Knight
- TVM @ Qualcomm – Krzysztof Parzyszek
- TASO: Optimizing Deep Learning Computation with Automated Generation of Graph Substitutions – Zhihao Jia, Stanford
- Talk by Nilesh Jain, Intel Labs





