SAMPL Lunch Talks

Links

Join the following channels for updates and notifications:

Schedule

Upcoming Talks

Past Talks

TC-GNN: Accelerating Sparse Graph Neural Network Computation via Dense Tensor Core on GPUs
Speaker: Yuke Wang (UCSB)
Abstract Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, have demonstrated great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to the highly sparse and irregular graph-based operations. To this end, we propose TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration framework. The core idea is to reconcile the "Sparse" GNN computation with the "Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to facilitate TCU processing of sparse GNN workloads. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the PyTorch framework for ease of programming. Rigorous experiments show an average of 1.70X speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings. (A small illustrative sketch of this sparse-to-dense translation follows this entry.)
Speaker bio Yuke Wang is a fifth-year Ph.D. candidate in the Department of Computer Science at the University of California, Santa Barbara (UCSB). He received his Bachelor of Engineering (B.E.) in software engineering from the University of Electronic Science and Technology of China (UESTC) in 2018. At UCSB, Yuke works with Prof. Yufei Ding. Yuke's research interests include high-performance computing and deep learning algorithms. His ongoing projects cover graph neural network (GNN) optimization and its acceleration on GPUs. Yuke is also a recipient of the NVIDIA Graduate Fellowship (2022-2023).
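The key trick in the abstract above is turning scattered sparse neighbors into compact dense tiles that Tensor Cores can multiply. The NumPy sketch below illustrates that sparse-to-dense translation idea under made-up shapes; it is a conceptual toy, not TC-GNN's actual kernel or API.

```python
# Illustrative sketch of "sparse graph translation": within a window of rows,
# the scattered neighbor columns are condensed into a compact dense tile so a
# dense (Tensor-Core-style) GEMM can perform the SpMM for that window.
import numpy as np

def window_spmm(row_ptr, col_idx, features, row_start, row_end):
    """Compute A[row_start:row_end] @ features for a CSR adjacency window."""
    # 1. Gather the unique neighbor columns touched by this row window.
    cols = np.unique(np.concatenate(
        [col_idx[row_ptr[r]:row_ptr[r + 1]] for r in range(row_start, row_end)]))
    remap = {c: i for i, c in enumerate(cols)}

    # 2. Build a small dense tile whose columns are the condensed neighbors.
    tile = np.zeros((row_end - row_start, len(cols)))
    for r in range(row_start, row_end):
        for c in col_idx[row_ptr[r]:row_ptr[r + 1]]:
            tile[r - row_start, remap[c]] = 1.0

    # 3. A dense GEMM over the condensed tile replaces the sparse gather;
    #    on a GPU this is the part that would be mapped to Tensor Cores.
    return tile @ features[cols]

# Tiny example: 4-node graph, 3-dim features, aggregate rows 0..1.
row_ptr = np.array([0, 2, 3, 5, 6])
col_idx = np.array([1, 3, 2, 0, 3, 2])
X = np.arange(12, dtype=float).reshape(4, 3)
print(window_spmm(row_ptr, col_idx, X, 0, 2))
```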
The Sparse Abstract Machine: Sparse Tensor Algebra as Dataflow Graphs
Speaker: Olivia Hsu (Stanford)
Abstract This talk presents relatively new work on compiling sparse tensor algebra to dataflow hardware and accelerators, a collaboration between MIT and Stanford University. We propose the Sparse Abstract Machine (SAM), an intermediate representation and abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming abstraction with sparse primitives that encompass fused sparse tensor algebra expressions for arbitrary dataflow. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate many sparse-iteration and hardware-specific optimizations. We show an automatic compilation technique from a high-level language to SAM. We also show how SAM can be leveraged to develop new hardware for sparse tensor algebra. (A toy streaming example follows this entry.)
Speaker bio Olivia is a computer science PhD student at Stanford University advised by Professor Kunle Olukotun and Professor Fredrik Kjolstad. She currently works on mapping sparse applications to domain-specific architectures, reconfigurable dataflow hardware, and accelerators through the TACO compiler. Her research interests broadly include computer architecture, computer and programming systems, compilers, programming models and languages, and digital circuits/VLSI.
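The streaming view in the SAM abstract can be pictured with ordinary Python generators: sparse operands become sorted coordinate streams, and an intersection unit merges them, which is the dataflow analogue of an elementwise sparse-sparse multiply. The sketch below is a toy under that assumption, not SAM's actual primitives or token protocol.

```python
# Two sorted coordinate streams are intersected, mimicking a dataflow
# "intersecter" unit used for sparse elementwise multiplication.
def coord_stream(coords, vals):
    """Yield (coordinate, value) pairs in increasing coordinate order."""
    yield from zip(coords, vals)

def intersect(stream_a, stream_b):
    """Emit (coord, val_a, val_b) only where both streams share a coordinate."""
    a = next(stream_a, None)
    b = next(stream_b, None)
    while a is not None and b is not None:
        if a[0] == b[0]:
            yield a[0], a[1], b[1]
            a, b = next(stream_a, None), next(stream_b, None)
        elif a[0] < b[0]:
            a = next(stream_a, None)
        else:
            b = next(stream_b, None)

# x and y are sparse vectors given as (coordinates, values).
x = coord_stream([0, 3, 7, 9], [1.0, 2.0, 3.0, 4.0])
y = coord_stream([3, 5, 9],    [10.0, 20.0, 30.0])
dot = sum(va * vb for _, va, vb in intersect(x, y))
print(dot)  # 2*10 + 4*30 = 140
```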
Programming Abstractions and Efficient Compilation Techniques for Modern FPGAs
Speaker: Luis Vega (UW)
Abstract Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past: they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable lookup tables. Nowadays, FPGAs are highly heterogeneous architectures that support a wide variety of compute operations such as scalar, vector, fused, and floating-point arithmetic together with different kinds of programmable memories. Unfortunately, the behavioral semantics available in hardware languages today cannot effectively capture these architectural advances, resulting in inefficient programs that miss all the benefits of specialization. An example of this abstraction gap is that vector operations cannot be described behaviorally for targeting the vector (SIMD) hardware available in modern FPGAs. This thesis describes Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. The design goal of the intermediate language is to describe behavior, while the assembly language aims for layout. Furthermore, I demonstrate how to lower intermediate programs to assembly programs using instruction selection, which can be both faster and deterministic compared to existing technology-mapping approaches. I use Reticle to implement compute-centric benchmarks, such as linear algebra operators and coroutines, and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization. Additionally, I show how using Reticle's memory instructions can lead to a 5.26x performance improvement on an existing encryption application (AES).
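To make the instruction-selection point concrete, here is a toy pattern matcher that lowers a behavioral multiply-add onto a hypothetical DSP instruction; the IR, instruction names, and matching strategy are invented for illustration and are not Reticle's actual languages.

```python
# Toy instruction selection: behavioral ops are pattern-matched onto a device's
# specialized units (e.g., a DSP block that can do multiply-add) instead of
# being technology-mapped gate by gate.
from dataclasses import dataclass

@dataclass
class Op:
    name: str     # "mul" or "add"
    args: tuple   # operand names or nested Ops

def select(op):
    """Greedy pattern match: add(mul(a, b), c) maps to one 'dsp.muladd'."""
    if op.name == "add" and isinstance(op.args[0], Op) and op.args[0].name == "mul":
        a, b = op.args[0].args
        return [f"dsp.muladd {a}, {b}, {op.args[1]}"]
    # Fallback: emit a generic LUT-mapped instruction (operands assumed flat).
    return [f"lut.{op.name} " + ", ".join(map(str, op.args))]

expr = Op("add", (Op("mul", ("x", "y")), "z"))   # behavioral: x*y + z
print(select(expr))                              # ['dsp.muladd x, y, z']
```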
Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation
Speaker: Jiawei Liu (UIUC)
Abstract In the past decade, Deep Learning (DL) systems have been widely deployed in various application domains to facilitate our daily lives, e.g., natural language processing, healthcare, activity recognition, and autonomous driving. Meanwhile, it is extremely challenging to ensure the correctness of DL systems (e.g., due to their intrinsic nondeterminism), and bugs in DL systems can cause serious consequences and may even threaten human lives. In the literature, researchers have explored various techniques to test, analyze, and verify DL models, since their quality directly affects the corresponding system behaviors. Recently, researchers have also proposed novel techniques for testing the underlying operator-level DL libraries, which provide general binary implementations for each high-level DL operator and are the foundation for running DL models on different hardware platforms. However, there is still limited work targeting the reliability of the emerging tensor compilers (also known as DL compilers), which aim to automatically compile high-level tensor computation graphs directly into high-performance binaries for better efficiency, portability, and scalability than traditional operator-level libraries. In this talk, I'll introduce Tzer, a practical fuzzing technique for the widely used TVM tensor compiler. Tzer focuses on mutating the low-level Intermediate Representation (IR) for TVM due to the limited mutation space for the high-level IR. Our experimental results show that Tzer substantially outperforms existing fuzzing techniques on tensor compiler testing. To date, Tzer has detected 49 previously unknown bugs for TVM, with 37 bugs confirmed and 25 bugs fixed (PR merged). (A generic sketch of the coverage-guided fuzzing loop appears after this entry.)
Speaker bio Jiawei is a first-year CS PhD student at UIUC advised by Lingming Zhang. His primary research goal is to make future software infrastructure easy to use, high-performance, and reliable. At present, he is developing PLSE techniques to make ML systems reliable and efficient.
Recording: public
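For readers unfamiliar with coverage-guided fuzzing, the sketch below shows the generic loop that a system like Tzer builds on, with joint IR and pass-sequence mutation. The mutation and coverage callbacks are hypothetical stand-ins, not TVM or Tzer APIs.

```python
# Generic coverage-guided fuzzing loop with joint "IR + pass sequence" mutation.
import random

def fuzz(initial_seeds, mutate_ir, mutate_passes, compile_and_measure_coverage, budget=1000):
    pool = list(initial_seeds)             # each seed: (ir_module, pass_sequence)
    seen_coverage = set()                  # union of covered branches/edges so far
    bugs = []
    for _ in range(budget):
        ir, passes = random.choice(pool)
        # Joint mutation: sometimes change the IR, sometimes the pass sequence.
        candidate = (mutate_ir(ir), passes) if random.random() < 0.5 else (ir, mutate_passes(passes))
        try:
            coverage = compile_and_measure_coverage(*candidate)
        except Exception as crash:         # a crash is treated as a bug report
            bugs.append((candidate, crash))
            continue
        if not coverage <= seen_coverage:  # keep mutants that reach new coverage
            seen_coverage |= coverage
            pool.append(candidate)
    return pool, bugs

# Toy stand-ins so the loop can run: "IR" is a list of ints, and "coverage" is
# which branches of a fake compiler got exercised.
toy_compile = lambda ir, passes: {("pos" if v > 0 else "neg") for v in ir} | set(passes)
seeds = [([1, -2, 3], ("simplify",))]
pool, bugs = fuzz(seeds, lambda ir: ir + [random.randint(-5, 5)],
                  lambda ps: ps + ("vectorize",), toy_compile, budget=50)
print(len(pool), "seeds kept,", len(bugs), "bugs found")
```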
Exploiting Parallelism in Large Scale Deep Learning Model Training: From Chips to Systems to Algorithms
Speaker: Saurabh Kulkarni (GraphCore)
Abstract We live in a world where hyperscale systems for machine intelligence are increasingly being used to solve complex problems ranging from natural language processing to computer vision to molecular modeling, drug discovery, and recommendation systems. A convergence of breakthrough research in machine learning models and algorithms, increased accessibility to cloud-scale hardware systems for research, and thriving software ecosystems is paving the way for an exponential increase in model sizes. Effective parallel processing and model decomposition techniques and large clusters of accelerators will be required to train these models of the future economically. Attend this session to learn how Graphcore aims to address the scale challenges associated with training large models. Get to know our Intelligence Processing Unit (IPU), a purpose-built hardware accelerator with a unique MIMD architecture designed to address the most demanding compute and memory bandwidth needs of modern ML models. Our network-disaggregated architecture uniquely positions us to build highly scalable systems (IPU-PODs) with thousands of accelerators aimed at exploiting various dimensions of parallelism.
Speaker bio Saurabh Kulkarni is Head of Engineering for North America at Graphcore. Over the last 20 years, he has held various leadership positions at Intel, Microsoft, and Oracle prior to his current role at Graphcore. His roles have spanned a variety of domains, including computer architecture, server platform architecture, cloud infrastructure, and hardware accelerators for AI/ML.
Recording: public
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Speaker: Yu Tang (National University of Defense Technology (NUDT))
Abstract With the development of large-scale deep neural networks, significant success has been achieved in various domains. Nevertheless, the further development of deep neural networks is hampered by limited GPU memory, so optimizing GPU memory usage is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain: 1) The efficiency of recomputation is limited for both static and dynamic methods. 2) Swapping requires researchers to offload parameters manually, which incurs a great time cost. 3) No existing method combines tensor swapping with tensor recomputation dynamically and at a fine granularity. To remedy these issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to build a reasonable dynamic runtime scheduler that combines tensor swapping and tensor recomputation without user oversight. In DELTA, we first propose a filter algorithm to select the optimal tensors to be released from GPU memory and then present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately considered to hide the time cost of swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method to a great extent, but also achieves convergence results comparable to the baseline with an acceptable time delay. DELTA also achieves a 2.04× larger maximum batch size when training ResNet-50 and 2.25× when training ResNet-101 compared with the baseline. In addition, comparisons between the swapping cost and recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler over tensor swapping and tensor recomputation, which refutes the argument in some related work that swapping should be the first and best choice. (A toy version of the filter/director decision follows this entry.)
Speaker bio Yu Tang received his M.S. and B.S. degrees in Computer Science from the National University of Defense Technology (NUDT) in 2020 and 2018, respectively, where he is currently pursuing a Ph.D. His current research interests include distributed machine learning, memory optimization, training optimization of large-scale models, and the alternating direction method of multipliers (ADMM). He is currently an intern at Shanghai AI Lab.
Recording: internal
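The following toy renders the filter/director split described in the abstract: first choose which tensors to release, then decide per tensor whether to swap or recompute. The cost model, scores, and thresholds here are invented for illustration and are not DELTA's actual algorithms.

```python
# Toy two-step decision: filter which tensors to evict, then direct each one
# to swapping or recomputation based on an estimated cost.
from dataclasses import dataclass

@dataclass
class Tensor:
    name: str
    size: int              # bytes
    swap_cost: float       # estimated PCIe copy time (s)
    recompute_cost: float  # estimated time to rerun producing ops (s)
    last_use: int          # timestamp of last access

def filter_candidates(tensors, bytes_needed):
    """Release the coldest (then largest) tensors until enough memory is freed."""
    freed, chosen = 0, []
    for t in sorted(tensors, key=lambda t: (t.last_use, -t.size)):
        if freed >= bytes_needed:
            break
        chosen.append(t)
        freed += t.size
    return chosen

def direct(tensor):
    """Pick the cheaper way to bring the tensor back when it is needed again."""
    return "swap" if tensor.swap_cost < tensor.recompute_cost else "recompute"

pool = [Tensor("act1", 512 << 20, 0.040, 0.010, last_use=3),
        Tensor("act2", 256 << 20, 0.020, 0.055, last_use=1),
        Tensor("act3", 128 << 20, 0.010, 0.004, last_use=7)]
for t in filter_candidates(pool, bytes_needed=600 << 20):
    print(t.name, "->", direct(t))
```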
Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads
Speaker: Chen-Yu Ho (KAUST)
Abstract Deep learning-based solutions have achieved significant advances in tasks such as natural language processing, image classification, and recommendation. As more sophisticated models are developed, the increasing training time and memory footprint force practitioners to use distributed training. State-of-the-art distributed training algorithms use iterative synchronization among participating nodes to ensure model consistency and correctness, putting a heavy burden on network communication. Furthermore, hardware accelerators are improving at a faster rate than network bandwidth. As a result, the communication phase of distributed training is frequently the bottleneck. In this talk, I will discuss three approaches to dealing with communication bottlenecks: application-level, network-level, and co-design solutions. (A generic gradient-compression sketch follows this entry.)
Speaker bio Chen-Yu is a 4th-year Ph.D. student at KAUST. Combining his interests in systems and machine learning, Chen-Yu collaborates with colleagues on efficient distributed machine learning systems; specifically, he works on alleviating the network bandwidth bottleneck by offloading aggregation operations to network devices.
Recording: public
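As one concrete example from the application-level class of solutions mentioned in the abstract, the sketch below shows top-k gradient sparsification with error feedback. It is a generic illustration of the idea, not necessarily one of the specific systems presented in the talk.

```python
# Top-k gradient sparsification with error feedback: only the largest-magnitude
# gradient entries are communicated, and the dropped remainder is carried over.
import numpy as np

def topk_compress(grad, k):
    """Keep only the k largest-magnitude entries; return (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

def decompress(idx, vals, size):
    out = np.zeros(size)
    out[idx] = vals
    return out

rng = np.random.default_rng(0)
grad = rng.normal(size=1000)
residual = np.zeros_like(grad)            # error feedback: remember what we dropped
for step in range(3):
    corrected = grad + residual
    idx, vals = topk_compress(corrected, k=10)   # send ~1% of the gradient
    residual = corrected - decompress(idx, vals, grad.size)
    print(f"step {step}: sent {len(idx)} of {grad.size} values")
```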
Verified Tensor-Program Optimization Via High-Level Scheduling Rewrites
Speaker: Amanda Liu (MIT)
Abstract We present a lightweight Coq framework for optimizing tensor kernels written in a pure, functional array language. Optimizations rely on user scheduling using a series of verified, semantics-preserving rewrites. Unusually for compilation targeting imperative code with arrays and nested loops, all rewrites are source-to-source within a purely functional language. Our language comprises a set of core constructs for expressing high-level computational detail and a set of what we call reshape operators, which can be derived from core constructs but trigger low-level decisions about storage patterns and ordering. We demonstrate that this system is not only capable of deriving the optimizations of existing state-of-the-art languages like Halide and generating comparably performant code, but is also able to schedule a family of useful program transformations beyond what is reachable in Halide. (A small numerical illustration of one such rewrite follows this entry.)
Speaker bio Amanda is a second-year PhD student working with Prof. Adam Chlipala and Prof. Jonathan Ragan-Kelley. Her interests are in using formal methods, programming languages, and types to develop verified, principled methods for writing high-performance systems.
Recording: public
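One way to picture a semantics-preserving scheduling rewrite is a reduction split: the rewritten, tiled form computes the same value as the original. The Python check below is only a numerical sanity test of that single rewrite, not a stand-in for the Coq framework's verified rewrites.

```python
# Splitting a flat reduction into tiles: a scheduling rewrite that changes the
# iteration structure (and eventual storage/ordering) but not the result.
import numpy as np

def sum_flat(xs):
    return sum(xs)

def sum_split(xs, tile):
    """Rewritten form: reshape the iteration space into len(xs)//tile tiles."""
    assert len(xs) % tile == 0
    return sum(sum(xs[i:i + tile]) for i in range(0, len(xs), tile))

xs = list(np.random.default_rng(1).normal(size=64))
assert np.isclose(sum_flat(xs), sum_split(xs, tile=8))
print("split rewrite preserves the result on this input")
```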
Compiler and Runtime Techniques for Optimizing Deep Learning Applications
Speaker: Steven Lyubomirsky (UW)
Abstract As the scaling and performance demands for deep learning systems have grown, system designers have struggled to incorporate innovations at opposite ends of the system stack: more varied and complex deep learning models and specialized hardware accelerators. New models that use data structures and dynamic control flow to address new learning problems cannot immediately benefit from previous system-level optimizations, which are defined over static dataflow graphs. Meanwhile, many novel hardware accelerators for common deep learning operations present unusual computing models and often require manual modification of applications to use them, demanding expertise in both the deep learning domain and in hardware. The challenges in adding support for accelerators in existing compiler stacks slow development cycles and constrain deep learning systems' capabilities and efficiency. Following earlier work on the Relay IR for the TVM framework, this dissertation demonstrates that system design problems in the deep learning domain can be approached by formalizing deep learning models broadly as programs (rather than assuming a more specific structure like a graph) and applying traditional compiler engineering techniques, simplifying various optimizations and transformations. In particular, this work addresses the use of runtime systems to support optimizations for dynamic deep learning models and the systematic support of accelerators through a formal software/hardware interface. Traditional deep learning model optimizations have been conceived as transformations on static dataflow graphs, but they can be adapted to make no assumptions about control flow by performing similar reasoning in a runtime system, guided by heuristics that depend on dynamically gathered information. This work details the specific example of Dynamic Tensor Rematerialization, an online approach to the problem of gradient checkpointing (recomputing intermediate activations instead of storing them to reduce the memory required for training) that achieves results comparable to optimal static techniques but generalizes to arbitrarily dynamic models. In addressing the problem of supporting accelerators in deep learning compiler stacks, this work demonstrates that a formal software/hardware interface enables traditional compiler techniques like instruction selection to be adapted for accelerators. Namely, this work presents a methodology for implementing a compiler stack with extensible support for accelerators that uses term rewriting to automatically discover opportunities to apply accelerator operations, and it lays the foundations for extending formal verification to entire compilation stacks with accelerator support. (A compact sketch of the rematerialization policy follows this entry.)
Recording: public
Note: Steven's PhD defense talk. Congrats!
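To make the Dynamic Tensor Rematerialization idea concrete, here is a compact cache that evicts the materialized tensor with the lowest compute-cost / (size * staleness) score and recomputes it on demand. This is a simplified variant for illustration, not the actual runtime or its exact heuristic.

```python
# Minimal rematerialization cache: evict cheap-to-recompute, large, stale
# tensors first; recompute lazily on the next access.
import time

class DTRCache:
    def __init__(self, budget):
        self.budget = budget      # memory budget in abstract "size" units
        self.entries = {}         # name -> dict(size, cost, last_access, recompute, value)

    def _materialized_size(self):
        return sum(e["size"] for e in self.entries.values() if e["value"] is not None)

    def _evict_one(self):
        now = time.monotonic()
        candidates = [e for e in self.entries.values() if e["value"] is not None]
        if not candidates:
            return False
        score = lambda e: e["cost"] / (e["size"] * (now - e["last_access"] + 1e-9))
        min(candidates, key=score)["value"] = None   # drop the materialized value
        return True

    def put(self, name, value, size, cost, recompute):
        while self._materialized_size() + size > self.budget and self._evict_one():
            pass
        self.entries[name] = dict(size=size, cost=cost, value=value,
                                  recompute=recompute, last_access=time.monotonic())

    def get(self, name):
        e = self.entries[name]
        if e["value"] is None:        # rematerialize on demand
            e["value"] = e["recompute"]()
        e["last_access"] = time.monotonic()
        return e["value"]

cache = DTRCache(budget=2)
cache.put("a", [1.0], size=1, cost=5.0, recompute=lambda: [1.0])
cache.put("b", [2.0], size=1, cost=1.0, recompute=lambda: [2.0])
cache.put("c", [3.0], size=1, cost=3.0, recompute=lambda: [3.0])  # exceeds budget, evicts one tensor
print(cache.get("b"))   # transparently recomputed if it was the victim
```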
Alpa: Automating Inter- and Intra- Operator Parallelism for Distributed Deep Learning
Speaker: Lianmin Zheng (UC Berkeley)
Abstract Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations, which does not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on this view, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive the optimal parallel execution plan in each independent parallelism level and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. (A brute-force toy of the two-level search follows this entry.)
Speaker bio Lianmin is a third-year Ph.D. student in the EECS department at UC Berkeley, advised by Ion Stoica and Joseph E. Gonzalez. His research interests lie in the intersection of machine learning and programming systems, especially domain-specific compilers for accelerated and scalable deep learning.
Recording: public
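The two-level structure in the abstract can be shown with a brute-force toy: enumerate inter-operator stage boundaries and, inside each stage, pick the cheapest intra-operator plan, minimizing the slowest stage (which bounds pipeline throughput). The costs and plan space below are invented; Alpa's real passes solve this with ILP and dynamic programming over profiled costs.

```python
# Toy two-level plan search: outer loop = inter-operator (pipeline stage)
# boundaries, inner choice = intra-operator plan per stage.
from itertools import combinations

layer_cost = [4.0, 3.0, 6.0, 2.0, 5.0]                        # made-up per-layer costs
intra_plans = {"data-parallel": 1.0, "tensor-parallel": 0.6}  # toy scaling factors

def stage_cost(layers):
    base = sum(layer_cost[i] for i in layers)
    # Intra-op level: choose the cheapest parallelization for this stage.
    return min(base * factor for factor in intra_plans.values())

def best_plan(num_layers, num_stages):
    best = None
    # Inter-op level: choose stage boundaries (contiguous layer ranges).
    for cuts in combinations(range(1, num_layers), num_stages - 1):
        bounds = [0, *cuts, num_layers]
        stages = [range(bounds[i], bounds[i + 1]) for i in range(num_stages)]
        bottleneck = max(stage_cost(s) for s in stages)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, [list(s) for s in stages])
    return best

print(best_plan(num_layers=5, num_stages=2))
```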
Efficient Batching Techniques for Dynamic Deep Learning
Speaker: Pratik Fegade (CMU)
Speaker bio Pratik is a PhD student in the Computer Science Department at CMU, where he works with Prof. Todd Mowry, Prof. Phil Gibbons, and Prof. Tianqi Chen. His current research focus is on building better compilation and execution stacks for handling dynamism in deep learning models. In the past, he has worked on compiler analysis techniques that understand and optimize programs written in general-purpose programming languages at a semantically higher level than is currently possible.
Recording: internal
Accessible and Scalable Transformers through 8-bit Matrix Multiplication and 8-bit Optimizers
Speaker: Tim Dettmers (UW)
Speaker bio Tim Dettmers is a PhD student at the University of Washington advised by Luke Zettlemoyer, working on representation learning and neuro-inspired, hardware-optimized deep learning. Previously, he interned at the UCL Machine Reading Group, where he was advised by Sebastian Riedel and worked on information retrieval and link prediction in knowledge graphs. He completed his master's in computer science at the University of Lugano.
Recording: internal
Resource-Efficient Execution of Deep Learning Computations
Speaker: Deepak Narayanan (Microsoft Research)
Abstract Deep Learning models have enabled state-of-the-art results across a broad range of applications; however, training these models is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. In this talk, I will describe two ideas that help improve the resource efficiency of model training. In the first half of the talk, I will discuss how pipelining can be used to accelerate distributed training. Pipeline parallelism facilitates model training with lower communication overhead than previous methods while still ensuring high compute resource utilization. Pipeline parallelism also enables the efficient training of large models that do not fit on a single worker; for example, we used pipeline parallelism at Nvidia to efficiently scale training to language models with a trillion parameters on 3000+ GPUs. In the second half of this talk, I will describe how resources in a shared cluster with heterogeneous compute resources (e.g., different types of hardware accelerators) should be partitioned among different users to optimize objectives specified over one or more training jobs. Heterogeneity-aware scheduling can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x. (A toy pipeline-schedule illustration follows this entry.)
Speaker bio Deepak is a Senior Researcher in the Systems group at Microsoft Research Redmond. His broad research interests include distributed systems and cloud computing. In particular, he is interested in the systems problems associated with learning and deploying machine learning models at scale. He graduated from Stanford with a Ph.D. in Computer Science in September 2021, where he was advised by Prof. Matei Zaharia.
Recording: public
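To visualize why pipeline parallelism keeps workers busy, the toy below prints a forward-only, GPipe-style schedule in which stage s processes microbatch m at step s + m. The schedules used in the systems discussed in the talk (and the backward passes) are more involved; this is only an illustration of the fill/steady-state pattern.

```python
# Simplified forward-only pipeline schedule: which microbatch each stage works
# on at each time step (ignoring backward passes and memory constraints).
def pipeline_schedule(num_stages, num_microbatches):
    steps = num_stages + num_microbatches - 1
    for t in range(steps):
        row = []
        for s in range(num_stages):
            m = t - s
            row.append(f"mb{m}" if 0 <= m < num_microbatches else "idle")
        print(f"t={t}: " + " | ".join(row))

pipeline_schedule(num_stages=4, num_microbatches=6)
# After the initial fill (num_stages - 1 steps), all four stages run concurrently.
```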
Synthesizing Programmable Accelerators: A Compiler’s Perspective
Speaker: Jian Weng (UCLA)
Abstract Because of the waning benefit of transistor scaling, specialized accelerators have emerged and already achieved great success in both industry and academia. However, all these accelerators require intensive human effort to design the hardware itself as well as the ISA and software stack, an effort that can hardly be justified for every domain of interest. Our work makes a first attempt to automate this process. In this talk, I will present an automated, program-behavior-centric paradigm for full-stack programmable accelerator design.
Speaker bio Jian is a 5th-year Ph.D. candidate at UCLA advised by Prof. Tony Nowatzki. His research interests mainly lie in designing and analyzing specialized accelerators and their associated compilation technologies.
Recording: public
Overview of Sparse TIR project
Speaker: Zihao Ye (UW)
Recording: internal
Overview of TIR project
Speaker: Ruihang Lai (SJTU)
Speaker bio Ruihang is an undergraduate student at Shanghai Jiao Tong University, where he has worked on the Apache TVM project with Tianqi Chen. His research interests include machine learning systems and deep learning compilers.
Recording: public
Overview of Relax Project
Speaker: Andrew Liu (UW)
Large-scale GNN training with DGL
Speaker: Da Zheng (AWS AI)
Abstract Graph neural networks (GNN) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billions of edges. To scale graph neural network training on large graphs, we adopt hybrid CPU/GPU mini-batch training, in which we store the graph data and sample nodes and their neighbors on the CPU, and perform mini-batch computation on GPUs. In this talk, I will discuss the optimizations for GNN mini-batch training in two aspects. First, I will discuss our effort to scale GNN training to a cluster of CPUs and GPUs. We develop multiple optimizations to address the challenges in distributed hybrid CPU/GPU training (reducing data movement and balancing the load in mini-batch computation). With these optimizations, we show good speedup without compromising model accuracy and train GNN models on a graph with 100M nodes in less than 1 minute on a cluster of 32 GPUs. In the second part, I will discuss a new neighbor sampling algorithm called global neighbor sampling (GNS) that reduces the data copy from CPU to GPUs. This algorithm preferentially samples neighbor nodes that are already stored in a GPU cache to reduce data copy from CPU to GPU. We show that our neighbor sampling algorithm achieves state-of-the-art model performance while speeding up mini-batch training by a factor of 2 to 14 compared with the previous state-of-the-art algorithms. (A simplified cache-aware sampling sketch follows this entry.)
Speaker bio Da Zheng is a senior applied scientist at AWS AI, where he leads the Deep Graph Library and DGL-KE projects for graph neural networks and knowledge graphs. His research interests cover a wide range of areas, including high-performance computing, large-scale data analysis systems, data mining, and machine learning. He received his PhD from the Department of Computer Science at Johns Hopkins University, where he worked on FlashGraph and FlashR, frameworks for large-scale graph analysis and data analysis on solid-state drives (SSDs).
Recording: public
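The cache-aware flavor of global neighbor sampling can be sketched generically: bias the sampler toward nodes whose features are already resident on the GPU so that fewer features must be copied from the CPU each step. The code below is illustrative Python with made-up weights and cache policy, not the DGL/GNS implementation.

```python
# Cache-aware neighbor sampling: cached neighbors get a higher sampling weight,
# shrinking the set of features that must be copied from CPU to GPU.
import random

def sample_neighbors(adj, seed_nodes, fanout, gpu_cache, cache_bias=4.0):
    """Sample up to `fanout` neighbors per seed, weighting cached nodes higher."""
    sampled = {}
    for u in seed_nodes:
        nbrs = adj[u]
        weights = [cache_bias if v in gpu_cache else 1.0 for v in nbrs]
        k = min(fanout, len(nbrs))
        sampled[u] = random.choices(nbrs, weights=weights, k=k)  # with replacement
    return sampled

adj = {0: [1, 2, 3, 4], 1: [0, 2], 2: [0, 1, 3], 3: [0, 2], 4: [0]}
gpu_cache = {1, 2}                     # e.g., the most frequently accessed nodes
batch = sample_neighbors(adj, seed_nodes=[0, 3], fanout=2, gpu_cache=gpu_cache)
to_copy = {v for nbrs in batch.values() for v in nbrs} - gpu_cache
print(batch, "| features to copy from CPU:", to_copy)
```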
Decoupling Algorithm from Hardware Customizations for Software-Defined Reconfigurable Computing
Speaker: Yi-Hsiang (Sean) Lai (Cornell)
Abstract In the pursuit of improved compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with spatial accelerators such as FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program, especially the FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. In this talk, I will first present SuSy, a programming framework composed of a domain-specific language (DSL) and a compilation flow that enables programmers to productively build high-performance systolic arrays on FPGAs. With SuSy, programmers express the design functionality in the form of uniform recurrence equations (UREs). The URE description in SuSy is followed by a set of decoupled spatial mapping primitives that specify how to map the equations to a spatial architecture. More concretely, programmers can apply space-time transformations and several other memory and I/O optimizations to build a highly efficient systolic architecture productively. After that, I will present HeteroCL, an open-source programming infrastructure composed of a Python-based domain-specific language and an FPGA-targeted compilation flow. Similar to SuSy, HeteroCL cleanly decouples algorithm specifications from three important types of hardware customization: compute, data types, and memory architectures. In addition, HeteroCL produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencils with dataflow architectures. (A framework-agnostic toy of this decoupling follows this entry.)
Speaker bio Yi-Hsiang Lai is currently a 6th-year Ph.D. student at Cornell advised by Prof. Zhiru Zhang. He received both his Master's and Bachelor's degrees in Electrical Engineering from National Taiwan University. His research focuses on high-level synthesis for FPGAs, programming models, and compilers.
Recording: public
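The decoupling described above (write the algorithm once, then apply compute, data-type, and memory customizations separately) can be mimicked in plain Python with composable transformations. The primitive names below are invented for the sketch; HeteroCL's actual API and SuSy's UREs differ.

```python
# Algorithm written once; hardware-flavored customizations applied separately.
def vvadd(a, b):
    """Algorithm only: elementwise addition, no hardware detail."""
    return [x + y for x, y in zip(a, b)]

def quantize(bits):
    """Data-type customization: clamp results to a signed fixed-width range."""
    hi = 2 ** (bits - 1) - 1
    def apply(fn):
        return lambda a, b: [max(-hi - 1, min(hi, v)) for v in fn(a, b)]
    return apply

def tile(factor):
    """Compute customization: process inputs in chunks of `factor`
    (a stand-in for loop tiling / unrolling decisions on the FPGA)."""
    def apply(fn):
        def tiled(a, b):
            out = []
            for i in range(0, len(a), factor):
                out += fn(a[i:i + factor], b[i:i + factor])
            return out
        return tiled
    return apply

# The same algorithm, specialized by composing customizations separately.
kernel = tile(4)(quantize(8)(vvadd))
print(kernel(list(range(10)), list(range(100, 110))))
```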
Autotuning Production Machine Learning Compilers
Speaker: Mangpo Phothilimthana (Google Research)
Abstract Search-based techniques have proven effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by several limitations. In this talk, I will present an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations at multiple compilation stages. The autotuner applies a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. The autotuner tunes tensor layouts, operator fusion decisions, tile sizes, and code generation parameters in XLA, a production ML compiler, using various search strategies. We demonstrate how to incorporate machine learning techniques such as a learned cost model and various learning-based search strategies to reduce autotuning time. Our learned cost model has high accuracy and outperforms a heavily-optimized analytical performance model. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs), the autotuner offers up to 2.4x and an average 5% runtime speedup over the heavily-optimized XLA compiler. The autotuner has been deployed to automatically tune the most heavily-used production models in Google's fleet every day. (A generic sketch of such an autotuning loop follows this entry.)
Speaker bio Mangpo is a research scientist at Google Brain, where she leads the Machine Learning for Machine Learning Compilers effort (one of Google Brain's moonshots in 2020). Her research interests include compilers, machine learning for systems, program synthesis, and efficient computing. Mangpo completed her PhD in Computer Science at UC Berkeley. Her dissertation focuses on synthesis-aided compilation and programming models for emerging architectures, ranging from an ultra-low-power processor to a programmable network card.
Recording: public
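Below is a generic version of the autotuning loop described in the abstract: sample candidate configurations, rank them cheaply with a learned cost model, and measure only the most promising few on real hardware. The flag names, cost model, and measurement hooks are placeholders, not XLA's actual interface.

```python
# Cost-model-guided autotuning: rank many candidates cheaply, measure few.
import random

SEARCH_SPACE = {
    "fusion_threshold": [16, 32, 64, 128],
    "tile_size":        [(8, 128), (16, 64), (32, 32)],
    "layout":           ["row_major", "col_major"],
}

def sample_config():
    return {k: random.choice(v) for k, v in SEARCH_SPACE.items()}

def autotune(learned_cost, measure_on_hardware, num_candidates=200, top_k=5):
    candidates = [sample_config() for _ in range(num_candidates)]
    # Rank cheaply with the learned cost model, then measure only the top-k.
    shortlist = sorted(candidates, key=learned_cost)[:top_k]
    measured = [(measure_on_hardware(c), c) for c in shortlist]
    return min(measured, key=lambda x: x[0])

# Toy stand-ins so the sketch runs end to end.
toy_cost = lambda c: c["fusion_threshold"] * 0.01 + c["tile_size"][0] * 0.1
toy_measure = lambda c: toy_cost(c) * random.uniform(0.9, 1.1)
print(autotune(toy_cost, toy_measure))
```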