Oct 13th, 2022, 11:30 am - 12:30 pm PT
TC-GNN: Accelerating Sparse Graph Neural Network Computation via Dense Tensor Core on GPUs
Speaker:
Yuke Wang (UCSB)
Abstract
Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, have demonstrated great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to the highly sparse and irregular graph-based operations involved. To this end, we propose TC-GNN, the first GNN acceleration framework based on the GPU Tensor Core Unit (TCU). The core idea is to reconcile the "Sparse" GNN computation with the "Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to make sparse GNN workloads amenable to TCU processing. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. TC-GNN is fully integrated with the PyTorch framework for ease of programming. Rigorous experiments show an average 1.70X speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.
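To make the core idea concrete, here is a minimal NumPy sketch (illustrative only, not TC-GNN's actual kernels or API; the function name and tile handling are assumptions) of the spirit of sparse graph translation: within a window of adjacency rows, the scattered neighbor columns are condensed into a compact dense tile so that neighbor aggregation becomes a dense matrix multiply, the kind of workload Tensor Cores are designed for.

```python
import numpy as np

def window_aggregate(adj, feat, row_start, row_end):
    """Aggregate neighbor features for the rows in [row_start, row_end)."""
    window = adj[row_start:row_end]             # row window of the adjacency matrix
    nz_cols = np.unique(np.nonzero(window)[1])  # neighbor columns touched by this window
    dense_tile = window[:, nz_cols]             # condensed dense tile: rows x |nz_cols|
    gathered = feat[nz_cols]                    # gather only the feature rows that are needed
    return dense_tile @ gathered                # dense matmul: the TCU-friendly workload

# Toy example: 6-node graph with 4-dimensional features, aggregating rows 0-2.
rng = np.random.default_rng(0)
adj = (rng.random((6, 6)) < 0.3).astype(np.float32)
feat = rng.standard_normal((6, 4)).astype(np.float32)
print(window_aggregate(adj, feat, 0, 3).shape)  # (3, 4)
```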
Speaker bio
Yuke Wang is a fifth-year Doctor of Philosophy (Ph.D.) candidate in the department of computer science at the University of California, Santa Barbara (UCSB). He got his Bachelor of Engineering (B.E.) in software engineering from the University of Electronic Science and Technology of China (UESTC) in 2018.
At UCSB, Yuke works with Prof. Yufei Ding. Yuke's research interests include high-performance computing and deep learning algorithms. His recent ongoing projects cover graph neural network (GNN) optimization and its acceleration on GPUs. Yuke is also a recipient of the NVIDIA Graduate Fellowship (2022-2023).
Oct 18th, 2022, 12:30 pm - 1:30 pm PT
The Sparse Abstract Machine: Sparse Tensor Algebra as Dataflow Graphs
Speaker:
Olivia Hsu (Stanford)
Abstract
This talk presents relatively new work on compiling sparse tensor algebra to dataflow hardware and accelerators, done in collaboration between MIT and Stanford University. We propose the Sparse Abstract Machine (SAM), an intermediate representation and abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming abstraction with sparse primitives that encompass fused sparse tensor algebra expressions for arbitrary dataflow. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate many sparse-iteration and hardware-specific optimizations. We show an automatic compilation technique from a high-level language to SAM. We also show how SAM can be leveraged to develop new hardware for sparse tensor algebra.
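As a rough illustration of this streaming style (a didactic sketch, not SAM's actual primitives or syntax), the snippet below models sparse operands as sorted (coordinate, value) streams and uses an intersection unit to drive an elementwise multiply, as in a sparse dot product; the separation between how the streams are produced (the format) and how they are combined (the algorithm) mirrors the format/algorithm split described above.

```python
def intersect(stream_a, stream_b):
    """Merge two sorted (coordinate, value) streams, yielding values at matching coordinates."""
    a, b = iter(stream_a), iter(stream_b)
    ca, va = next(a, (None, None))
    cb, vb = next(b, (None, None))
    while ca is not None and cb is not None:
        if ca == cb:
            yield ca, va, vb
            ca, va = next(a, (None, None))
            cb, vb = next(b, (None, None))
        elif ca < cb:
            ca, va = next(a, (None, None))
        else:
            cb, vb = next(b, (None, None))

# Sparse vectors as sorted coordinate/value streams; dot product = intersect + multiply + reduce.
x = [(0, 2.0), (3, 1.5), (7, 4.0)]
y = [(3, 2.0), (5, 1.0), (7, 0.5)]
print(sum(va * vb for _, va, vb in intersect(x, y)))  # 1.5*2.0 + 4.0*0.5 = 5.0
```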
Speaker bio
Olivia is a computer science PhD student at Stanford University advised by Professor Kunle Olukotun and Professor Fredrik Kjolstad. She currently works on mapping sparse applications to domain-specific architectures, reconfigurable dataflow hardware, and accelerators through the TACO compiler. Her research interests broadly include computer architecture, computer and programming systems, compilers, programming models and languages, and digital circuits/VLSI.
May 5th, 2022, 12:00 pm - 1:30 pm PT
Programming Abstractions and Efficient Compilation Techniques for Modern FPGAs
Speaker:
Luis Vega (UW)
Abstract
Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable lookup tables. Nowadays, FPGAs are highly heterogeneous architectures that support a wide variety of compute operations such as scalar, vector, fused, and floating-point arithmetic together with different kinds of programmable memories. Unfortunately, the behavioral semantics available in hardware languages today cannot effectively capture these architectural advances, resulting in inefficient programs that are missing all the benefits of specialization. An example of this abstraction gap is that vector operations cannot be described behaviorally for targeting vector hardware (SIMD) available in modern FPGAs.
This thesis describes Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. The design goal of the intermediate language is to describe behavior, while the assembly language aims for layout. Furthermore, I demonstrate how to lower intermediate programs to assembly programs using instruction selection, which can be both faster and deterministic compared to existing technology-mapping approaches. I use Reticle to implement compute-centric benchmarks, such as linear algebra operators and coroutines, and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization. Additionally, I show how using Reticle's memory instructions can lead to a 5.26x performance improvement on an existing encryption application (AES).
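The toy example below (entirely hypothetical; Reticle's real intermediate and assembly languages look nothing like Python) sketches the two-level idea in miniature: a small behavioral IR expression is lowered to target-specific pseudo-assembly by instruction selection, with a multiply feeding an add matched to a single fused DSP primitive rather than generic LUT logic.

```python
from dataclasses import dataclass

@dataclass
class Op:
    name: str         # "add", "mul", or "var"
    args: tuple = ()  # child Ops for add/mul, (identifier,) for var

def select(op):
    """Lower one behavioral IR node to a list of pseudo-assembly instructions."""
    # Pattern: add(mul(a, b), c) -> one fused DSP multiply-add.
    if op.name == "add" and op.args[0].name == "mul":
        a, b = (x.args[0] for x in op.args[0].args)
        c = op.args[1].args[0]
        return [f"dsp.muladd {a}, {b}, {c}"]
    # Fallback: generic LUT implementation of the operation.
    if op.name in ("add", "mul"):
        lhs, rhs = (x.args[0] for x in op.args)
        return [f"lut.{op.name} {lhs}, {rhs}"]
    return []

ir = Op("add", (Op("mul", (Op("var", ("a",)), Op("var", ("b",)))), Op("var", ("c",))))
print(select(ir))  # ['dsp.muladd a, b, c']
```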
May 12th, 2022, 11:30 am - 12:30 pm PT
Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation
Speaker:
Jiawei Liu (UIUC)
Abstract
In the past decade, Deep Learning (DL) systems have been widely deployed in various application domains to facilitate our daily life, e.g., natural language processing, healthcare, activity recognition, and autonomous driving. Meanwhile, it is extremely challenging to ensure the correctness of DL systems (e.g., due to their intrinsic nondeterminism), and bugs in DL systems can cause serious consequences and may even threaten human lives. In the literature, researchers have explored various techniques to test, analyze, and verify DL models, since their quality directly affects the corresponding system behaviors. Recently, researchers have also proposed novel techniques for testing the underlying operator-level DL libraries, which provide general binary implementations for each high-level DL operator and are the foundation for running DL models on different hardware platforms. However, there is still limited work targeting the reliability of the emerging tensor compilers (also known as DL compilers), which aim to automatically compile high-level tensor computation graphs directly into high-performance binaries for better efficiency, portability, and scalability than traditional operator-level libraries. In this talk, I'll introduce Tzer, a practical fuzzing technique for the widely used TVM tensor compiler. Tzer focuses on mutating the low-level Intermediate Representation (IR) for TVM due to the limited mutation space for the high-level IR. Our experimental results show that Tzer substantially outperforms existing fuzzing techniques on tensor compiler testing. To date, Tzer has detected 49 previously unknown bugs for TVM, with 37 bugs confirmed and 25 bugs fixed (PR merged).
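For readers unfamiliar with the general technique, the loop below is a generic coverage-guided fuzzer in Python. It is a didactic stand-in, not Tzer's implementation (which jointly mutates TVM's low-level IR and its optimization passes), and `mutate`, `run_and_measure`, and the seed format are placeholders for the real components.

```python
import random

def fuzz(seeds, mutate, run_and_measure, iterations=1000):
    """Keep mutants that reach new coverage; collect inputs that trigger failures."""
    corpus = list(seeds)
    covered = set()              # accumulated coverage, e.g. branch/edge identifiers
    failures = []
    for _ in range(iterations):
        parent = random.choice(corpus)
        child = mutate(parent)
        new_cov, failed = run_and_measure(child)
        if failed:
            failures.append(child)
        if not new_cov <= covered:   # the mutant exercised new behavior: keep it
            covered |= new_cov
            corpus.append(child)
    return corpus, covered, failures

# Toy usage: "programs" are integer lists; coverage is the set of sign/position buckets hit.
def run_and_measure(prog):
    cov = {("neg" if v < 0 else "nonneg", i % 3) for i, v in enumerate(prog)}
    return cov, sum(prog) == 42      # the "failure" condition stands in for a real bug oracle

corpus, covered, failures = fuzz(
    seeds=[[1, 2, 3]],
    mutate=lambda p: p + [random.randint(-9, 9)],
    run_and_measure=run_and_measure,
)
```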
Speaker bio
Jiawei is a first-year CS PhD student at UIUC advised by Lingming Zhang. His primary research goal is to make future software infrastructure easy to use, high-performance, and reliable. At present, he is developing PLSE techniques to make ML systems reliable and efficient.
Recording:
public
May 16th, 2022, 8:30 am - 10:20 am PT
Exploiting Parallelism in Large Scale Deep Learning Model Training: From Chips to Systems to Algorithms
Speaker:
Saurabh Kulkarni (GraphCore)
Abstract
We live in a world where hyperscale systems for machine intelligence are increasingly being used to solve complex problems ranging from natural language processing and computer vision to molecular modeling, drug discovery, and recommendation systems. A convergence of breakthrough research in machine learning models and algorithms, increased access to cloud-scale hardware systems for research, and thriving software ecosystems is paving the way for an exponential increase in model sizes. Effective parallel processing and model decomposition techniques, along with large clusters of accelerators, will be required to train these future models economically.
Attend this session to learn how Graphcore aims to address the scale challenges associated with training large models. Get to know our Intelligence Processing Unit (IPU) – a purpose-built hardware accelerator with a unique MIMD architecture – designed to address the most demanding compute and memory bandwidth needs of modern ML models. Our network-disaggregated architecture uniquely positions us to build highly scalable systems (IPU-PODs) with thousands of accelerators aimed at exploiting various dimensions of parallelism.
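As a rough, vendor-neutral illustration of two of those parallelism dimensions (a sketch under assumed names, not Graphcore's software stack), the snippet below partitions a batch across replicas for data parallelism and splits a layer stack into contiguous stages for model/pipeline parallelism.

```python
import numpy as np

def partition(batch, layers, n_replicas, n_stages):
    """Split the batch across replicas and the layer stack into contiguous pipeline stages."""
    shards = np.array_split(batch, n_replicas)                     # data-parallel shards
    bounds = np.linspace(0, len(layers), n_stages + 1, dtype=int)  # stage boundaries
    stages = [layers[bounds[i]:bounds[i + 1]] for i in range(n_stages)]
    return shards, stages

batch = np.arange(16).reshape(16, 1)
layers = [f"layer{i}" for i in range(8)]
shards, stages = partition(batch, layers, n_replicas=4, n_stages=2)
print(len(shards), [len(s) for s in stages])  # 4 [4, 4]
```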
Speaker bio
Saurabh Kulkarni is Head of Engineering for North America at Graphcore. Over the last 20 years, prior to his current role at Graphcore, he held various leadership positions at Intel, Microsoft, and Oracle. His roles have spanned a variety of domains, including computer architecture, server platform architecture, cloud infrastructure, and hardware accelerators for AI/ML.
Recording:
public
May 26, 2022, 4:30 pm - 5:30 pm PT
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Speaker:
Yu Tang (National University of Defense Technology (NUDT))
Abstract
With the development of large-scale deep neural networks, significant success has been achieved in various domains. Nevertheless, the further development of deep neural networks is hampered by limited GPU memory, so optimizing the use of GPU memory is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging area, several challenges remain: 1) the efficiency of recomputation is limited for both static and dynamic methods; 2) swapping requires researchers to offload parameters manually, which incurs a great time cost; and 3) there is currently no dynamic, fine-grained method that combines tensor swapping with tensor recomputation. To remedy these issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to provide a reasonable dynamic runtime scheduler that combines tensor swapping and tensor recomputation without user oversight. In DELTA, we first propose a filter algorithm to select the optimal tensors to release from GPU memory and then present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately considered to hide the time cost of swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method by a large margin, but also achieves convergence results comparable to the baseline with an acceptable time delay. DELTA also attains a 2.04× larger maximum batch size when training ResNet-50 and 2.25× when training ResNet-101, compared with the baseline. In addition, comparisons between the swapping cost and recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler for tensor swapping and tensor recomputation, refuting the argument in some related work that swapping should always be the first and best choice.
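The sketch below is an illustrative rendering, not DELTA's actual filter and director algorithms: candidate tensors are selected for release, and each one is directed to swapping or recomputation by comparing an estimated transfer time against an estimated recomputation cost. The bandwidth constant, tensor fields, and selection heuristic are all assumptions made for the example.

```python
from dataclasses import dataclass

PCIE_GBPS = 12.0  # assumed effective host<->device bandwidth in GB/s

@dataclass
class Tensor:
    name: str
    nbytes: int           # memory footprint in bytes
    recompute_ms: float   # estimated cost to rebuild the tensor from saved inputs
    next_use: int         # step at which the tensor will next be needed

def filter_candidates(tensors, bytes_needed):
    """Release the tensors needed furthest in the future until enough memory is freed."""
    chosen, freed = [], 0
    for t in sorted(tensors, key=lambda t: -t.next_use):
        if freed >= bytes_needed:
            break
        chosen.append(t)
        freed += t.nbytes
    return chosen

def direct(tensor):
    """Pick the cheaper action for one tensor that is about to be released."""
    swap_ms = tensor.nbytes / (PCIE_GBPS * 1e6)  # bytes / (bytes per ms) -> milliseconds
    return "swap" if swap_ms < tensor.recompute_ms else "recompute"

pool = [Tensor("act1", 512 << 20, recompute_ms=60.0, next_use=90),
        Tensor("act2", 64 << 20, recompute_ms=40.0, next_use=80)]
plan = {t.name: direct(t) for t in filter_candidates(pool, bytes_needed=256 << 20)}
print(plan)  # {'act1': 'swap'}
```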
Speaker bio
Yu Tang received his B.S. and M.S. degrees in Computer Science from the National University of Defense Technology (NUDT) in 2018 and 2020, respectively, and is currently pursuing a doctoral degree there. His research interests include distributed machine learning, memory optimization, training optimization of large-scale models, and the alternating direction method of multipliers (ADMM). He is currently an intern at Shanghai AI Lab.
Recording:
internal
Apr 14th, 2022
Compiler and Runtime Techniques for Optimizing Deep Learning Applications
Speaker:
Steven Lyubomirsky (UW)
Abstract
As the scaling and performance demands for deep learning systems have grown, system designers have struggled to incorporate innovations at opposite ends of the system stack: more varied and complex deep learning models and specialized hardware accelerators. New models that use data structures and dynamic control flow to address new learning problems cannot immediately benefit from previous system-level optimizations, which are defined over static dataflow graphs. Meanwhile, many novel hardware accelerators for common deep learning operations present unusual computing models and often require manual modification of applications, demanding expertise in both the deep learning domain and in hardware. The challenges of adding support for accelerators in existing compiler stacks slow development cycles and constrain deep learning systems' capabilities and efficiency.
Following earlier work on the Relay IR for the TVM framework, this dissertation demonstrates that system design problems in the deep learning domain can be approached by formalizing deep learning models broadly as programs (rather than assuming a more specific structure like a graph) and applying traditional compiler engineering techniques, simplifying various optimizations and transformations. In particular, this work addresses the use of runtime systems to support optimizations for dynamic deep learning models and the systematic support of accelerators through a formal software/hardware interface. Traditional deep learning model optimizations have been conceived as transformations on static dataflow graphs, but they can be adapted to make no assumptions about control flow by performing similar reasoning dynamically in a runtime system, guided by heuristics that depend on dynamically gathered information. This work details the specific example of Dynamic Tensor Rematerialization, an online approach to the problem of gradient checkpointing (recomputing intermediate activations instead of storing them to reduce the memory required for training) that achieves results comparable to optimal static techniques but generalizes to arbitrarily dynamic models. In addressing the problem of supporting accelerators in deep learning compiler stacks, this work demonstrates that a formal software/hardware interface enables traditional compiler techniques like instruction selection to be adapted for accelerators. Namely, this work presents a methodology for implementing a compiler stack with extensible support for accelerators that uses term rewriting to automatically discover opportunities to apply accelerator operations, and it lays the foundations for extending formal verification to entire compilation stacks with accelerator support.
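The snippet below is a minimal, hypothetical rendering of the runtime idea behind Dynamic Tensor Rematerialization, not the dissertation's implementation: each tensor records the operation that produced it, a resident tensor may be evicted when a memory budget is exceeded, and touching an evicted tensor simply replays its producer, so no static graph or control-flow assumptions are required. The eviction heuristic here is a placeholder; the real system scores candidates by cost, staleness, and size.

```python
class Runtime:
    def __init__(self, budget):
        self.budget = budget  # maximum number of resident tensors (a stand-in for bytes)
        self.resident = {}    # name -> value currently held in "device memory"
        self.producers = {}   # name -> (fn, argument names) needed to replay the op

    def compute(self, name, fn, *args):
        value = fn(*(self.get(a) for a in args))
        self.producers[name] = (fn, args)
        self.resident[name] = value
        while len(self.resident) > self.budget:
            self.evict()
        return value

    def evict(self):
        victim = next(iter(self.resident))  # placeholder heuristic: evict the oldest tensor
        del self.resident[victim]

    def get(self, name):
        if name not in self.resident:       # rematerialize on demand by replaying the producer
            fn, args = self.producers[name]
            self.resident[name] = fn(*(self.get(a) for a in args))
        return self.resident[name]

rt = Runtime(budget=2)
rt.compute("a", lambda: 2.0)
rt.compute("b", lambda a: a * 3, "a")
rt.compute("c", lambda b: b + 1, "b")  # evicts "a"; it will be recomputed if touched again
print(rt.get("a"), rt.get("c"))        # 2.0 7.0
```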
Recording:
public
Note: Steven's Ph.D. defense talk. Congrats!