SAMPL Lunch Talks

SAMPL organizes a weekly group lunch where we invite speakers from both academia and industry to present their work. The goal is to provide a platform for researchers to share their work and to foster collaborations. The talks are open to everyone in the community. If you are interested in giving a talk, please contact the organizers.

Organizers

Related Seminars

SAMPL talks are sometimes combined with other seminars; please check the following links for more information:

Links

Join the following channels for updates and notifications:

Schedule

Upcoming Talks

Past Talks

Approximate Caching for Efficiently Serving Text-to-Image Diffusion Models
Speaker: Shubham Agarwal (Adobe Research, India)
Abstract Text-to-image generation using diffusion models has gained explosive popularity due to their ability to produce high-quality images adhering to text prompts. However, diffusion models undergo a large number of iterative denoising steps and are resource-intensive, requiring expensive GPUs and incurring considerable latency. In this paper, we introduce a novel approximate-caching technique that reduces these iterative denoising steps by reusing intermediate noise states created during a prior image generation. Based on this idea, we present an end-to-end text-to-image generation system, NIRVANA, that employs approximate-caching with a novel cache management policy to achieve 21% GPU compute savings, 19.8% end-to-end latency reduction, and 19% cost savings on two real production workloads. Additionally, we provide an extensive characterization of real production text-to-image prompts from the perspectives of caching, popularity, and reuse of intermediate states in a large production environment.
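For readers new to the idea, here is a minimal sketch (not NIRVANA's code; the cache structure, similarity threshold, and `denoise_step` hook are illustrative assumptions) of reusing a cached intermediate noise state to skip the first K denoising steps:

```python
# Illustrative sketch: reuse a cached intermediate noise state created while
# serving a similar earlier prompt, so only the remaining denoising steps run.
import numpy as np

class ApproxCache:
    def __init__(self, threshold=0.9):
        self.entries = []           # list of (prompt_embedding, step_k, latent)
        self.threshold = threshold  # minimum cosine similarity to reuse a state

    def lookup(self, emb):
        best = None
        for cached_emb, k, latent in self.entries:
            sim = emb @ cached_emb / (np.linalg.norm(emb) * np.linalg.norm(cached_emb))
            if sim >= self.threshold and (best is None or sim > best[0]):
                best = (sim, k, latent)
        return best                 # None => generate from pure noise

    def insert(self, emb, k, latent):
        self.entries.append((emb, k, latent))

def generate(prompt_emb, denoise_step, total_steps, cache):
    hit = cache.lookup(prompt_emb)
    if hit is None:
        latent, start = np.random.randn(4, 64, 64), 0
    else:
        _, start, latent = hit              # resume from the cached step K
    for t in range(start, total_steps):     # only the remaining steps cost GPU time
        latent = denoise_step(latent, t, prompt_emb)
    return latent
```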
Speaker bio Shubham is currently a pre-doctoral researcher at Adobe Research, India, focusing on inference optimization for text-to-image models and LLMs using approximate caching, efficient scheduling, and resource management. He works on enhancing the efficiency of generative models at both the algorithmic and hardware levels. He has first-author publications in NSDI, ECCV, and FSE, and has co-authored papers in WWW, ASE, and PAKDD. In addition to research, Shubham has contributed code to production systems at Adobe. He completed his Bachelor's degree in Computer Science from BITS Pilani in 2022. He is passionate about efficiency in large AI models and is planning to pursue a PhD next fall. Outside of work, he enjoys long drives and cycling.
Towards Fast, Adaptive, and Hardware-Assisted User-Space Scheduling
Speaker: Lisa Li (Cornell University/MIT)
Abstract Scaling application performance while improving system resource efficiencies has become an increasingly important agenda for both cloud providers and users. With the rise of emerging interactive applications such as LLMs and microservices, users must build applications that satisfy microsecond-scale tail latency service level objectives; and with an emphasis on sustainability, cloud providers aim to improve user experience while reducing their resource footprints. In this talk, I will discuss the design of general-purpose, efficient, and adaptive frameworks for resource allocation and scheduling. First, I will introduce LibPreemptible, a fast, scalable, and hardware-assisted user-space scheduling library that is designed for microsecond-scale workloads. If time permits, I will discuss some ongoing work on efficient LLM serving scheduling systems and frameworks for cloud reliability.
Speaker bio Lisa Li is a CS PhD student at Cornell, currently focusing on sustainability, efficiency, and reliability in scheduling problems for cloud computing. She also works on reinforcement learning and efficient LLM serving. After graduating from SJTU and UM, she worked at Apple in California designing CPUs, deferring her PhD offers. She graduated with the highest honors in ECE and CE, with a minor in Math. She is passionate about mentorship in the CS community and serves on the CASA committee and the CALM committee (a long-term mentorship program).
DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Speaker: Junda Chen (University of California, San Diego)
Speaker bio Junda Chen is a first-year PhD student in the Department of Computer Science and Engineering at UC San Diego.
Intelligent Software in the Era of Deep Learning
Speaker: Yuke Wang (University of California, Santa Barbara)
Abstract With the end of Moore’s Law and the rise of compute- and data-intensive deep-learning (DL) applications, the focus on arduous new processor design has shifted towards a more effective and agile approach -- Intelligent Software to maximize the performance gains of DL hardware like GPUs. In this talk, I will first highlight the importance of software innovation to bridge the gap between the increasingly diverse DL applications and the existing powerful DL hardware platforms. The second part of my talk will recap my research work on DL system software innovation, focusing on bridging the 1) Precision Mismatch between DL applications and high-performance GPU units like Tensor Cores (PPoPP ’21, SC ’21, and ATC’23), and 2) Computing Pattern Mismatch between the sparse and irregular DL applications such as Graph Neural Networks and the dense and regular tailored GPU computing paradigm (OSDI ’21 and OSDI ’23). Finally, I will conclude this talk with my vision and future work for building efficient, scalable, and secure DL systems.
Speaker bio Yuke Wang is an incoming assistant professor in the Department of Computer Science at Rice University, starting in Fall 2025.
Rail-only: A Low-Cost High-Performance Network for Training LLMs with Trillion Parameters
Speaker: Weiyang Wang (MIT)
Abstract This paper presents a low-cost network architecture for training large language models (LLMs) at hyperscale. We study the optimal parallelization strategy of LLMs and propose a novel datacenter network design tailored to LLM's unique communication pattern. We show that LLM training generates sparse communication patterns in the network and, therefore, does not require any-to-any full-bisection network to complete efficiently. As a result, our design eliminates the spine layer in traditional GPU clusters. We name this design a Rail-only network and demonstrate that it achieves the same training performance while reducing the network cost by 38% to 77% and network power consumption by 37% to 75% compared to a conventional GPU datacenter. Our architecture also supports Mixture-of-Expert (MoE) models with all-to-all communication through forwarding, with only 4.1% to 5.6% completion time overhead for all-to-all traffic. We study the failure robustness of Rail-only networks and provide insights into the performance impact of different network and training parameters.
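One way to picture the topology (my reading of the abstract, not the paper's artifact): GPU g of every server attaches to rail switch g, so traffic either stays inside a server's high-bandwidth domain or crosses a single rail switch; anything else needs forwarding.

```python
# Sketch of a rail-only reachability check; (server, local_gpu) identifies a GPU.
def same_server(a, b):
    return a[0] == b[0]

def same_rail(a, b):
    return a[1] == b[1]            # same local GPU index => same rail switch

def reachable_directly(a, b):
    # High-bandwidth domain inside a server, or one hop through a rail switch.
    return same_server(a, b) or same_rail(a, b)

# Parallelization strategies are placed so most traffic satisfies this check.
print(reachable_directly((0, 3), (5, 3)))  # True: both on rail 3
print(reachable_directly((0, 3), (5, 4)))  # False: would need forwarding
```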
Speaker bio Weiyang Wang is a PhD student at MIT CSAIL. He interned at MSR during the summer of 2024.
Optimal Kernel Orchestration for Tensor Programs with Korch
Speaker: Muyan Hu (University of Illinois Urbana-Champaign)
Abstract Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7× on V100 GPUs and up to 1.6× on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
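To see kernel orchestration as an optimization problem, here is a toy sketch; Korch itself applies operator fission and solves a binary linear program over a primitive graph, whereas this brute-forces partitions of a short primitive chain under an invented cost model.

```python
# Toy sketch: pick the grouping of primitives into kernels with minimal cost.
primitives = ["sub_mean", "square", "mean", "rsqrt", "mul", "add_bias"]

def kernel_cost(group):
    # Hypothetical cost model: launch overhead plus per-primitive work; fusion
    # saves intermediate memory traffic, but a quadratic term stands in for
    # register pressure that eventually makes very large fused kernels slower.
    launch, per_op, traffic_saved, pressure = 5.0, 3.0, 1.5, 0.4
    n = len(group)
    return launch + per_op * n - traffic_saved * (n - 1) + pressure * n * n

def best_orchestration(prims):
    n = len(prims)
    best_cost, best_groups = float("inf"), None
    for cuts in range(1 << (n - 1)):            # every way to cut the chain
        groups, start = [], 0
        for i in range(n - 1):
            if cuts & (1 << i):
                groups.append(prims[start:i + 1])
                start = i + 1
        groups.append(prims[start:])
        cost = sum(kernel_cost(g) for g in groups)
        if cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_cost, best_groups

print(best_orchestration(primitives))
```

Under this made-up model the optimum is neither "no fusion" nor "fuse everything", which is the kind of middle ground a solver-based formulation can find.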
Speaker bio Muyan Hu is a PhD student at UIUC advised by Vikram Adve and Charith Mendis.
Recording: public
Accelerating Large Language Model Inference on FPGA with Allo
Speaker: Hongzheng Chen (Cornell University)
Abstract As the benefits of technology scaling diminish, specialized hardware accelerators are crucial for performance in emerging applications. However, designers currently lack effective tools and methodologies to construct complex, high-performance accelerator architectures. Existing high-level synthesis (HLS) tools often require intrusive source-level changes to attain satisfactory quality of results. While new accelerator design languages (ADLs) aim to enhance or replace HLS, they are typically more effective for simple applications with a single kernel, rather than for hierarchical designs with multiple kernels. In the first part of this talk, I will introduce Allo, a composable programming model for efficient hardware accelerator design (to appear in PLDI’24). Allo decouples hardware customizations, including compute, memory, communication, and data types from algorithm specification, and encapsulates them as a set of customization primitives. Allo also preserves the hierarchical structure of an input program by combining customizations from different functions in a bottom-up, type-safe manner. Our evaluation shows that Allo can outperform state-of-the-art HLS tools and ADLs on all test cases in the PolyBench. Furthermore, I will demonstrate how Allo optimizes large-scale designs by composing optimized individual kernels. I will use a spatial architecture for large language models (LLMs) as an example. This accelerator implements a design point of our analytical model presented in FCCM’24, where we introduce a comprehensive analytical framework for estimating the performance of a spatial LLM accelerator. Through this analysis, we can identify the most effective parallelization and buffering schemes for the accelerator and, crucially, determine the scenarios in which FPGA-based spatial acceleration can outperform its GPU-based counterpart. For GPT generative inference, our accelerator attains a 2.2× speedup compared to DFX, an FPGA overlay, in the prefill stage, while achieving a 1.9× speedup and a 5.7× improvement in energy efficiency compared to the NVIDIA A100 GPU in the decode stage.
Speaker bio Hongzheng Chen is a third-year Ph.D. student at Cornell University supervised by Prof. Zhiru Zhang. His research interests broadly lie in domain-specific languages and compilers, efficient runtime systems, and accelerator architecture. He is currently working on compiler optimizations for large-scale heterogeneous computing systems with a special focus on accelerating deep learning applications. He has published several papers on top-tier computer systems & hardware conferences including ASPLOS, PLDI, SC, FPGA, and ICCAD.
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Speaker: Keisuke Kamahori (University of Washington)
Abstract Large Language Models (LLMs) based on Mixture-of-Experts (MoE) architecture are showing promising performance on various tasks. However, running them on resource-constrained settings, where GPU memory resources are not abundant, is challenging due to huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize the data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, which exceeds 90GB in parameters, to generate over 3 tokens per second on a single GPU with 24GB memory, showing an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler
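The core trade-off can be sketched with an invented cost model (not Fiddler's code): for an expert whose weights sit in CPU memory, compare the PCIe copy time against simply running the expert's math on the CPU.

```python
# Sketch of the per-expert decision: copy weights to the GPU, or compute on CPU?
import numpy as np

def expert_ffn(x, w1, w2):
    return np.maximum(x @ w1, 0.0) @ w2          # stand-in for the expert MLP

def run_expert(x, expert, pcie_gbps=25.0, cpu_gflops=200.0):
    w1, w2 = expert["w1"], expert["w2"]
    if expert["on_gpu"]:
        return expert_ffn(x, w1, w2), "gpu"
    copy_ms = (w1.nbytes + w2.nbytes) / (pcie_gbps * 1e9) * 1e3
    flops = 2 * x.shape[0] * (w1.shape[0] * w1.shape[1] + w2.shape[0] * w2.shape[1])
    cpu_ms = flops / (cpu_gflops * 1e9) * 1e3
    # For small token batches, CPU compute is often cheaper than the PCIe copy.
    return expert_ffn(x, w1, w2), ("cpu" if cpu_ms < copy_ms else "copy-to-gpu")

x = np.random.randn(4, 1024).astype(np.float32)   # 4 tokens (toy sizes)
expert = {"w1": np.random.randn(1024, 4096).astype(np.float32),
          "w2": np.random.randn(4096, 1024).astype(np.float32),
          "on_gpu": False}
_, decision = run_expert(x, expert)
print("decision:", decision)
```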
Speaker bio Keisuke Kamahori is a first-year Ph.D. student at the Paul G. Allen School of Computer Science & Engineering, University of Washington, advised by Baris Kasikci. He is broadly interested in computer systems and architecture, with a recent focus on systems for LLMs. Prior to that, he received a B.Sc. in Information Science from the University of Tokyo in 2023, advised by Shinya Takamaeda-Yamazaki. He also worked with James Larus at EPFL in the summer of 2022.
Ray (Data): A scalable and flexible ML toolkit
Speaker: Stephanie Wang (University of Washington & Anyscale)
Abstract Ray is a framework for scaling ML and Python applications. In this talk, I'll explain the motivation and a brief history of Ray. The system consists of a general-purpose distributed execution *Core* and a collection of distributed libraries that are designed for specific domains common to end-to-end ML applications, such as distributed training or inference. I'll also give a deep dive of Ray Data, one of the key libraries that supports data processing for ML. Finally, I'll give an update on our current work in reducing system-level overheads in Ray to enable fine-grained orchestration of distributed accelerators. The goal is to reduce developer burden in building high-performance distributed systems, such as for LLM inference. If we have time, we'll do a short tutorial together on Ray Data. If you'd like to follow along, please come prepared with a Python 3.10 environment and install Ray with `pip install -U ray` (more installation instructions here: https://docs.ray.io/en/latest/ray-overview/installation.html).
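If you want to warm up before the tutorial, a minimal Ray Data example along these lines looks like the sketch below (it assumes `pip install -U ray` as noted above; exact batch formats and APIs vary a bit across Ray versions).

```python
# A tiny Ray Data warm-up: build an in-memory dataset and run a batched
# preprocessing step in parallel tasks.
import ray

ray.init()

ds = ray.data.from_items([{"x": i} for i in range(1000)])

def add_square(batch):
    # Batches typically arrive as dicts of columns; this works for either
    # list- or array-valued columns.
    batch["x_squared"] = [int(v) ** 2 for v in batch["x"]]
    return batch

ds = ds.map_batches(add_square)
print(ds.take(3))
```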
Speaker bio Stephanie is an incoming assistant professor in computer science at the University of Washington, starting in fall 2024. Her interests span distributed systems, systems for machine learning and data processing, programming languages, and how these topics fit together. Previously, she was a PhD student in the RISELab at UC Berkeley, where she was advised by Ion Stoica. She is also a co-creator of and committer for the open-source project Ray, which has been used to train ChatGPT, serve high-performance LLMs, and break the CloudSort 100TB record. At the moment, she is continuing to develop the Ray ecosystem as a software engineer at Anyscale, working primarily on Ray Data, a system for distributed data preprocessing for ML, and some new things.
Privacy-aware universal deployment of LLM fine-tuning
Speaker: Yixin Dong (UW SAMPL & SJTU)
Speaker bio Yixin is a senior undergraduate at Shanghai Jiao Tong University and a visiting student at the University of Washington, advised by Tianqi Chen and Luis Ceze. His research interests include LLM systems and machine learning compilers. His current research project is On-Device LLM Fine-tuning with Machine Learning Compilers.
Scaling up Retrieval-based Language Models
Speaker: Rulin Shao (UWNLP)
Speaker bio Rulin is a first-year PhD student at the University of Washington advised by Prof. Pang Wei Koh and Prof. Luke Zettlemoyer. She worked as an applied scientist at AWS from January 2023 to June 2023, focusing on large-scale pretraining for Amazon Bedrock. Before that, she obtained her master's in Machine Learning at CMU advised by Prof. Eric Xing. Rulin did her undergraduate studies in Mathematics at XJTU. Her current research interest is making LLMs more accessible to academics, with a focus on retrieval-based LMs and efficient system-algorithm codesign.
Efficient Memory Management for Large Language Model Serving with PagedAttention
Speaker: Woosuk Kwon (UC Berkeley)
Abstract High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4× with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
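A stripped-down sketch of the paging idea (not vLLM's implementation): the KV cache lives in fixed-size physical blocks, each request keeps a block table mapping logical token positions to physical blocks, and reference counts allow sharing, e.g., of a common prefix.

```python
# Block-table sketch: memory grows in block-sized increments, so fragmentation
# and duplication are bounded, much like virtual-memory paging.
BLOCK_SIZE = 16

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def share(self, b):
        self.refcount[b] += 1          # e.g., two requests sharing a prefix block

    def release(self, b):
        self.refcount[b] -= 1
        if self.refcount[b] == 0:
            self.free.append(b)

class Request:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []          # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # current block is full (or first)
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

    def slot_for(self, pos):
        return self.block_table[pos // BLOCK_SIZE], pos % BLOCK_SIZE

alloc = BlockAllocator(num_blocks=64)
req = Request(alloc)
for _ in range(40):
    req.append_token()
print(req.block_table, req.slot_for(37))   # 40 tokens occupy only 3 blocks
```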
Speaker bio Woosuk is a Ph.D. student at UC Berkeley, advised by Prof. Ion Stoica. He is interested in building practical, flexible, and high-performance software systems for emerging applications such as large language models.
Recording: internal
FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU
Speaker: Ying Sheng (Stanford)
Abstract The high computational and memory requirements of large language model (LLM) inference traditionally make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. Through a linear programming optimizer, it searches for efficient patterns to store and access tensors. FlexGen further compresses these weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
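A toy sketch of the placement problem FlexGen solves (the real system uses a linear programming optimizer and also compresses weights and the attention cache to 4 bits; the tensor sizes and bandwidths below are made up).

```python
# Greedy placement of tensors across GPU, CPU, and disk, estimating the
# per-pass transfer cost of whatever spills out of fast memory.
tiers = [                      # fastest to slowest
    {"name": "gpu",  "capacity_gb": 16,   "bandwidth_gbs": float("inf")},
    {"name": "cpu",  "capacity_gb": 200,  "bandwidth_gbs": 25.0},   # PCIe
    {"name": "disk", "capacity_gb": 1500, "bandwidth_gbs": 2.0},    # NVMe
]

tensors = [("weights", 325.0), ("kv_cache", 90.0), ("activations", 4.0)]  # GB

def place(tensors, tiers):
    remaining = {t["name"]: t["capacity_gb"] for t in tiers}
    plan, transfer_s = [], 0.0
    for name, size in tensors:
        left = size
        for tier in tiers:                    # fill fast memory first
            take = min(left, remaining[tier["name"]])
            if take > 0:
                plan.append((name, tier["name"], take))
                remaining[tier["name"]] -= take
                if tier["bandwidth_gbs"] != float("inf"):
                    transfer_s += take / tier["bandwidth_gbs"]
                left -= take
            if left == 0:
                break
    return plan, transfer_s

plan, t = place(tensors, tiers)
print(plan)
print(f"estimated transfer per pass: {t:.1f}s")
```

The point of the real optimizer (and of compression) is to shrink exactly this transfer term while keeping batch sizes large.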
Speaker bio Ying is a Ph.D. student in the Computer Science Department at Stanford University, affiliated with the Centaur group. She is advised by Clark Barrett. Prior to that, she received an M.S. in Computer Science from Columbia University in 2017 and a B.E. in Computer Science and Technology from the ACM Honored Class, Shanghai Jiao Tong University, in 2016. She is also a visiting researcher at Sky@UC Berkeley, working with Ion Stoica and Joseph E. Gonzalez. Before that, she was a Ph.D. resident at X, the Moonshot Factory (the team graduated to Labs@Google during her residency) in 2022, working on AI for Code with Michele Catasta; a research intern at Facebook Novi (2021), working on smart contract verification with Prof. David Dill; a quantitative software engineer at Two Sigma (2018); and a research intern at Microsoft Research Asia (2015), working with Chin-Yew Lin.
Recording: public
TC-GNN: Accelerating Sparse Graph Neural Network Computation via Dense Tensor Core on GPUs
Speaker: Yuke Wang (UCSB)
Abstract Recently, graph neural networks (GNNs), as the backbone of graph-based machine learning, demonstrate great success in various domains (e.g., e-commerce). However, the performance of GNNs is usually unsatisfactory due to the highly sparse and irregular graph-based operations. To this end, we propose, TC-GNN, the first GPU Tensor Core Unit (TCU) based GNN acceleration framework. The core idea is to reconcile the "Sparse" GNN computation with "Dense" TCU. Specifically, we conduct an in-depth analysis of the sparse operations in mainstream GNN computing frameworks. We introduce a novel sparse graph translation technique to facilitate TCU processing of sparse GNN workload. We also implement an effective CUDA core and TCU collaboration design to fully utilize GPU resources. We fully integrate TC-GNN with the Pytorch framework for ease of programming. Rigorous experiments show an average of 1.70X speedup over the state-of-the-art Deep Graph Library framework across various GNN models and dataset settings.
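The "sparse graph translation" step can be pictured as follows (illustrative NumPy, not TC-GNN's CUDA kernels): within a window of rows, keep only the columns that actually hold neighbors, yielding a small dense tile that dense Tensor Core math can consume.

```python
# Condense a window of sparse adjacency rows into a dense tile.
import numpy as np

def condense_window(adj_rows):
    """adj_rows: list of neighbor-id lists for a window of rows."""
    cols = sorted({c for row in adj_rows for c in row})    # non-empty columns only
    col_index = {c: j for j, c in enumerate(cols)}
    tile = np.zeros((len(adj_rows), len(cols)), dtype=np.float32)
    for i, row in enumerate(adj_rows):
        for c in row:
            tile[i, col_index[c]] = 1.0
    return tile, cols      # dense tile + mapping back to original column ids

window = [[2, 900, 37], [2, 37], [4096, 900]]
tile, cols = condense_window(window)
print(cols)    # [2, 37, 900, 4096] -> the tile has 4 columns instead of 4097
print(tile)
```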
Speaker bio Yuke Wang is a fifth-year Ph.D. candidate in the Department of Computer Science at the University of California, Santa Barbara (UCSB). He received his Bachelor of Engineering (B.E.) in software engineering from the University of Electronic Science and Technology of China (UESTC) in 2018. At UCSB, Yuke works with Prof. Yufei Ding. Yuke's research interests include high-performance computing and deep learning algorithms. His ongoing projects cover graph neural network (GNN) optimization and its acceleration on GPUs. Yuke is also the recipient of the NVIDIA Graduate Fellowship 2022-2023.
The Sparse Abstract Machine: Sparse Tensor Algebra as Dataflow Graphs
Speaker: Olivia Hsu (Stanford)
Abstract This talk presents relatively new work on compiling sparse tensor algebra to dataflow hardware and accelerators, done in collaboration between MIT and Stanford University. We propose the Sparse Abstract Machine (SAM), an intermediate representation and abstract machine model for targeting sparse tensor algebra to reconfigurable and fixed-function spatial dataflow accelerators. SAM defines a streaming abstraction with sparse primitives that encompass fused sparse tensor algebra expressions for arbitrary dataflow. SAM dataflow graphs naturally separate tensor formats from algorithms and are expressive enough to incorporate many sparse-iteration and hardware-specific optimizations. We show an automatic compilation technique from a high-level language to SAM. We also show how SAM can be leveraged to develop new hardware for sparse tensor algebra.
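The streaming flavor of sparse dataflow can be illustrated with a plain-Python coordinate-stream intersection for an element-wise multiply (a union merge would serve addition); this is only an analogy, not SAM's actual primitives.

```python
# Element-wise multiply of two sparse vectors by merging sorted coordinate streams.
def intersect_multiply(a, b):
    """a, b: sorted lists of (coordinate, value); returns their sparse product."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ca, cb = a[i][0], b[j][0]
        if ca == cb:
            out.append((ca, a[i][1] * b[j][1]))
            i += 1; j += 1
        elif ca < cb:
            i += 1          # coordinate present only in a: contributes nothing
        else:
            j += 1
    return out

x = [(0, 2.0), (3, 1.5), (7, 4.0)]
y = [(3, 2.0), (5, 1.0), (7, 0.5)]
print(intersect_multiply(x, y))   # [(3, 3.0), (7, 2.0)]
```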
Speaker bio Olivia is a computer science PhD student at Stanford University advised by Professor Kunle Olukotun and Professor Fredrik Kjolstad. She currently works on mapping sparse applications to domain-specific architectures, reconfigurable dataflow hardware, and accelerators through the TACO compiler. Her research interests broadly include computer architecture, computer and programming systems, compilers, programming models and languages, and digital circuits/VLSI.
Programming Abstractions and Efficient Compilation Techniques for Modern FPGAs
Speaker: Luis Vega (UW)
Abstract Modern field-programmable gate arrays (FPGAs) have recently powered high-profile efficiency gains in systems from datacenters to embedded devices by offering ensembles of heterogeneous, reconfigurable hardware units. Programming stacks for FPGAs, however, are stuck in the past -- they are based on traditional hardware languages, which were appropriate when FPGAs were simple, homogeneous fabrics of basic programmable lookup tables. Nowadays, FPGAs are highly heterogeneous architectures that support a wide variety of compute operations such as scalar, vector, fused, and floating-point arithmetic together with different kinds of programmable memories. Unfortunately, the behavioral semantics available in hardware languages today cannot effectively capture these architectural advances, resulting in inefficient programs that are missing all the benefits of specialization. An example of this abstraction gap is that vector operations cannot be described behaviorally for targeting vector hardware (SIMD) available in modern FPGAs. This thesis describes Reticle, a new low-level abstraction for FPGA programming that, unlike existing languages, explicitly represents the special-purpose units available on a particular FPGA device. Reticle has two levels: a portable intermediate language and a target-specific assembly language. The design goal of the intermediate language is to describe behavior, while the assembly language aims for layout. Furthermore, I demonstrate how to lower intermediate programs to assembly programs, using instruction selection, which can be both faster and deterministic compared to existing technology mapping approaches. I use Reticle to implement compute-centric benchmarks, such as linear algebra operators and coroutines, and find that Reticle compilation runs up to 100 times faster than current approaches while producing comparable or better run-time and utilization. Additionally, I show how using Reticle’s memory instructions can lead to a 5.26× performance improvement on an existing encryption application (AES).
Coverage-Guided Tensor Compiler Fuzzing with Joint IR-Pass Mutation
Speaker: Jiawei Liu (UIUC)
Abstract In the past decade, Deep Learning (DL) systems have been widely deployed in various application domains to facilitate our daily life, e.g., natural language processing, healthcare, activity recognition, and autonomous driving. Meanwhile, it is extremely challenging to ensure the correctness of DL systems (e.g., due to their intrinsic nondeterminism), and bugs in DL systems can cause serious consequences and may even threaten human lives. In the literature, researchers have explored various techniques to test, analyze, and verify DL models, since their quality directly affects the corresponding system behaviors. Recently, researchers have also proposed novel techniques for testing the underlying operator-level DL libraries, which provide general binary implementations for each high-level DL operator and are the foundation for running DL models on different hardware platforms. However, there is still limited work targeting the reliability of the emerging tensor compilers (also known as DL compilers), which aim to automatically compile high-level tensor computation graphs directly into high-performance binaries for better efficiency, portability, and scalability than traditional operator-level libraries. In this talk, I'll introduce Tzer, a practical fuzzing technique for the widely used TVM tensor compiler. Tzer focuses on mutating the low-level Intermediate Representation (IR) for TVM due to the limited mutation space for the high-level IR. Our experimental results show that Tzer substantially outperforms existing fuzzing techniques on tensor compiler testing. To date, Tzer has detected 49 previously unknown bugs for TVM, with 37 bugs confirmed and 25 bugs fixed (PR merged).
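For context, a generic coverage-guided mutation loop looks like the sketch below; Tzer's contribution is mutating TVM's low-level IR jointly with its passes, which the toy target here does not capture.

```python
# Generic coverage-guided fuzzing loop: keep mutants that reach new coverage,
# record inputs that crash the system under test.
import random

def fuzz(compile_and_run, mutate, initial_seeds, iterations=1000):
    corpus = list(initial_seeds)
    global_coverage = set()
    bugs = []
    for _ in range(iterations):
        seed = random.choice(corpus)
        candidate = mutate(seed)
        try:
            coverage = compile_and_run(candidate)    # set of covered branches
        except Exception as err:                     # crash => potential bug
            bugs.append((candidate, err))
            continue
        if not coverage <= global_coverage:          # new coverage: keep the seed
            global_coverage |= coverage
            corpus.append(candidate)
    return corpus, bugs

# Toy target: "programs" are integer lists; coverage is which branches fire.
def toy_target(prog):
    cov = set()
    if sum(prog) > 10: cov.add("sum>10")
    if any(v < 0 for v in prog): cov.add("has-negative")
    if len(prog) > 5 and prog[5] == 42: raise ValueError("planted bug")
    return cov

corpus, bugs = fuzz(toy_target,
                    mutate=lambda p: p + [random.randint(-5, 50)],
                    initial_seeds=[[1]])
print(len(corpus), "seeds kept,", len(bugs), "crashes found")
```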
Speaker bio Jiawei is a first-year CS PhD student at UIUC advised by Lingming Zhang. His primary research goal is to make future software infrastructure easy to use, high-performance, and reliable. At present, he is developing PLSE techniques to make ML systems reliable and efficient.
Recording: public
Exploiting Parallelism in Large Scale Deep Learning Model Training: From Chips to Systems to Algorithms
Speaker: Saurabh Kulkarni (GraphCore)
Abstract We live in a world where hyperscale systems for machine intelligence are increasingly being used to solve complex problems ranging from natural language processing to computer vision to molecular modeling, drug discovery and recommendation systems. A convergence of breakthrough research in machine learning models and algorithms, increased accessibility to hardware systems at cloud scale for research and thriving software ecosystems are paving the way for an exponential increase in model sizes. Effective parallel processing and model decomposition techniques and large clusters of accelerators will be required to train these models of the future economically. Attend this session to learn about how Graphcore aims to address scale challenges associated with training large models. Get to know our Intelligent Processing Unit (IPU) – a purpose-built hardware accelerator with a unique MIMD architecture – designed to address the most demanding compute and memory bandwidth needs of modern ML models. Our network disaggregated architecture uniquely positions us to build highly scalable systems (IPU-PODs) with thousands of accelerators aimed at exploiting various dimensions of parallelism.
Speaker bio Saurabh Kulkarni is Head of Engineering for North America at Graphcore. Over the last 20 years, he has held various leadership positions at Intel, Microsoft, and Oracle prior to his current role at Graphcore. His roles have spanned a variety of domains, including computer architecture, server platform Architecture, cloud infrastructure, and hardware accelerators for AI/ML.
Recording: public
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Speaker: Yu Tang (National University of Defense Technology (NUDT))
Abstract With the development of large-scale deep neural networks, significant success has been achieved in various domains. Nevertheless, the further development of deep neural networks is hampered by the limited GPU memory resource. Therefore, the optimization of GPU memory resources is highly demanded. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain: 1) The efficiency of recomputation is limited for both static and dynamic methods. 2) Swapping requires researchers to offload parameters manually, which incurs a great time cost. 3) There is no such dynamic and fine-grained method that involves tensor swapping together with tensor recomputation nowadays. To remedy the above issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to make a reasonable dynamic runtime scheduler on the combination of tensor swapping and tensor recomputation without user oversight. In DELTA, we firstly propose a filter algorithm to select the optimal tensors to be released out of GPU memory and secondly present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately considered to overcome the time cost caused by swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, surpassing the state-of-the-art method to a great extent, but also gets comparable convergence results as the baseline with acceptable time delay. Also, DELTA gains 2.04× maximum batchsize when training ResNet-50 and 2.25× when training ResNet-101 compared with the baseline. Besides, comparisons between the swapping cost and recomputation cost in our experiments demonstrate the importance of making a reasonable dynamic scheduler on tensor swapping and tensor recomputation, which refutes the arguments in some related work that swapping should be the first and best choice.
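The swap-versus-recompute decision that DELTA automates can be sketched with an invented cost model (this is not the paper's filter/director algorithms).

```python
# For each tensor to evict, pick whichever of swapping or recomputation the
# cost model says is cheaper, evicting the cheapest-per-byte tensors first.
def plan_eviction(tensors, bytes_to_free, pcie_gbps=12.0):
    """tensors: list of dicts with name, size_bytes, and recompute_ms."""
    def best_action(t):
        swap_ms = t["size_bytes"] / (pcie_gbps * 1e9) * 1e3
        action = "swap" if swap_ms < t["recompute_ms"] else "recompute"
        return action, min(swap_ms, t["recompute_ms"])

    plan, freed = [], 0
    ranked = sorted(tensors, key=lambda t: best_action(t)[1] / t["size_bytes"])
    for t in ranked:
        if freed >= bytes_to_free:
            break
        action, cost_ms = best_action(t)
        plan.append((t["name"], action, round(cost_ms, 2)))
        freed += t["size_bytes"]
    return plan

tensors = [
    {"name": "conv1_act", "size_bytes": 512 << 20,  "recompute_ms": 3.0},
    {"name": "attn_act",  "size_bytes": 256 << 20,  "recompute_ms": 40.0},
    {"name": "embed_out", "size_bytes": 1024 << 20, "recompute_ms": 1.0},
]
print(plan_eviction(tensors, bytes_to_free=1 << 30))
```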
Speaker bio Yu Tang received his M.S. and B.S. degrees in Computer Science from the National University of Defense Technology (NUDT) in 2020 and 2018, respectively, where he is currently pursuing his doctorate. His current research interests include distributed machine learning, memory optimization, training optimization of large-scale models, and the alternating direction method of multipliers (ADMM). He is currently an intern at Shanghai AI Lab.
Recording: internal
Tackling the Communication Bottlenecks of Distributed Deep Learning Training Workloads
Speaker: Chen-Yu Ho (KAUST)
Abstract Deep learning-based solutions achieve significant advancements in tasks such as natural language processing, image classification, and recommendation. As more sophisticated models are developed, the increasing training time and memory footprint force practitioners to use distributed training. State-of-the-art distributed training algorithms use iterative synchronization among participating nodes to ensure model consistency and correctness, putting a heavy burden on network communication. Furthermore, hardware accelerators are improving at a faster rate than network bandwidth growth. As a result, the communication phase of distributed training is frequently the bottleneck. In this talk, I will discuss three approaches to dealing with communication bottlenecks: application-level, network-level, and co-design solutions.
Speaker bio Chen-Yu is a fourth-year Ph.D. student at KAUST. Combining his interest in fundamental systems with the trend of machine learning, Chen-Yu collaborates with colleagues on developing efficient distributed machine learning systems; specifically, he is trying to alleviate the network bandwidth bottleneck by offloading aggregation operations to network devices.
Recording: public
Verified Tensor-Program Optimization Via High-Level Scheduling Rewrites
Speaker: Amanda Liu (MIT)
Abstract We present a lightweight Coq framework for optimizing tensor kernels written in a pure, functional array language. Optimizations rely on user scheduling using a series of verified, semantics-preserving rewrites. Unusually for compilation targeting imperative code with arrays and nested loops, all rewrites are source-to-source within a purely functional language. Our language comprises a set of core constructs for expressing high-level computation detail and a set of what we call reshape operators, which can be derived from core constructs but trigger low-level decisions about storage patterns and ordering. We demonstrate that not only is this system capable of deriving the optimizations of existing state-of-the-art languages like Halide and generating comparably performant code, it is also able to schedule a family of useful program transformations beyond what is reachable in Halide.
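A Python analogy for the kind of semantics-preserving rewrite the framework verifies (the real rewrites are over a pure functional array language and are proven correct in Coq) is map fusion.

```python
# The rewrite  map f . map g  ==>  map (f . g)  trades a temporary array for a
# single fused pass while preserving semantics.
def compose(f, g):
    return lambda x: f(g(x))

def map_fusion_lhs(f, g, xs):
    return [f(y) for y in [g(x) for x in xs]]       # two passes, one temporary

def map_fusion_rhs(f, g, xs):
    return [compose(f, g)(x) for x in xs]           # one fused pass

xs = list(range(10))
assert map_fusion_lhs(lambda v: v + 1, lambda v: v * v, xs) == \
       map_fusion_rhs(lambda v: v + 1, lambda v: v * v, xs)
print("fusion rewrite preserved semantics on the sample input")
```

The framework's point is that such equalities are not just tested on samples but proven once and for all, so user-directed schedules built from them are correct by construction.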
Speaker bio Amanda is a second-year PhD student working with Prof. Adam Chlipala and Prof. Jonathan Ragan-Kelley. Her interests are using formal methods, programming languages, and types to develop verified, principled methods for writing high-performance systems.
Recording: public
Compiler and Runtime Techniques for Optimizing Deep Learning Applications
Speaker: Steven Lyubomirsky (UW)
Abstract As the scaling and performance demands for deep learning systems have grown, system designers have struggled to incorporate innovations at opposite ends of the system stack: more varied and complex deep learning models and specialized hardware accelerators. New models that use data structures and dynamic control flow to address new learning problems cannot immediately benefit from previous system-level optimizations, which are defined over static dataflow graphs. Meanwhile, many novel hardware accelerators for accelerating common deep learning operations present unusual computing models and often require manual modification of applications to use, demanding expertise in both the deep learning domain and in hardware. The challenges in adding support for accelerators in existing compiler stacks slow development cycles and constrain deep learning systems' capabilities and efficiency. Following earlier work on the Relay IR for the TVM framework, this dissertation demonstrates that system design problems in the deep learning domain can be approached by formalizing deep learning models as programs broadly (rather than assuming a more specific structure like a graph) and applying traditional compiler engineering techniques, simplifying various optimizations and transformations. In particular, this work addresses the use of runtime systems to support optimizations for dynamic deep learning models and on systematically supporting accelerators through the use of a formal software/hardware interface. Traditional deep learning model optimizations have been conceived as transformations on static dataflow graphs, but can be adapted to perform similar reasoning dynamically (and hence make no assumptions about control flow) by performing similar reasoning in a runtime system, guided by heuristics that depend on dynamically gathered information. This work details the specific example of Dynamic Tensor Rematerialization, which is an online approach to the problem of gradient checkpointing (recomputing intermediate activations instead of storing them to reduce the memory required for training) that achieves results comparable to optimal static techniques but generalizes to arbitrarily dynamic models. In addressing the problem of supporting accelerators in deep learning compiler stacks, this work demonstrates that a formal software/hardware interface enables traditional compiler techniques like instruction selection to be adapted for accelerators. Namely, this work presents a methodology for implementing a compiler stack with extensible support for accelerators that uses term rewriting to automatically discover opportunities to apply accelerator operations and lays the foundations for extending formal verification to entire compilation stacks with accelerator support.
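The Dynamic Tensor Rematerialization idea can be sketched as a cache of activations, each paired with a closure that can rebuild it; the budget, sizes, and eviction heuristic below are illustrative, not the dissertation's implementation.

```python
# Evict activations under memory pressure and rebuild them lazily on next use.
import time

class Rematerializable:
    """An activation paired with a closure that can recompute it if evicted."""
    def __init__(self, compute, size, cost):
        self.compute, self.size, self.cost = compute, size, cost
        self.value = compute()
        self.last_use = time.monotonic()

    def get(self):
        if self.value is None:                  # evicted earlier: recompute now
            self.value = self.compute()
        self.last_use = time.monotonic()
        return self.value

class Pool:
    def __init__(self, budget):
        self.budget, self.live = budget, []

    def resident(self):
        return [t for t in self.live if t.value is not None]

    def register(self, tensor):
        self.live.append(tensor)
        while sum(t.size for t in self.resident()) > self.budget:
            candidates = [t for t in self.resident() if t is not tensor]
            if not candidates:
                break
            # Prefer evicting tensors that are large, cheap to recompute, and
            # stale (roughly the shape of DTR's heuristic).
            victim = max(candidates,
                         key=lambda t: t.size * (time.monotonic() - t.last_use + 1e-9) / t.cost)
            victim.value = None

pool = Pool(budget=100)
a = Rematerializable(lambda: [0] * 60, size=60, cost=1.0); pool.register(a)
b = Rematerializable(lambda: [1] * 60, size=60, cost=5.0); pool.register(b)  # evicts a
print(a.get()[:3])   # a is transparently rematerialized on use
```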
Recording: public
Note: This was Steven's PhD defense talk. Congrats!
Alpa: Automating Inter- and Intra- Operator Parallelism for Distributed Deep Learning
Speaker: Lianmin Zheng (UC Berkeley)
Abstract Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations, which does not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive the optimal parallel execution plan in each independent parallelism level and implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans.
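The inter-operator level can be pictured as a stage-partitioning problem; the sketch below balances a chain of layers across pipeline stages with a small dynamic program (the layer costs are made up, and Alpa's intra-operator sharding and cost models are not shown).

```python
# Split a chain of layers into pipeline stages minimizing the slowest stage.
from functools import lru_cache

layer_costs = [4, 3, 8, 2, 6, 5, 7, 1]   # hypothetical per-layer runtimes
num_stages = 4

def best_split(costs, stages):
    n = len(costs)
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def solve(start, k):
        """Minimal bottleneck when splitting costs[start:] into k stages."""
        if k == 1:
            return prefix[n] - prefix[start], (n,)
        best = (float("inf"), ())
        for end in range(start + 1, n - k + 2):   # leave >= 1 layer per later stage
            stage_time = prefix[end] - prefix[start]
            rest, cuts = solve(end, k - 1)
            cand = (max(stage_time, rest), (end,) + cuts)
            if cand[0] < best[0]:
                best = cand
        return best

    return solve(0, stages)

bottleneck, cuts = best_split(layer_costs, num_stages)
print("slowest stage time:", bottleneck, "| stage boundaries after layers:", cuts)
```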
Speaker bio Lianmin is a third-year Ph.D. student in the EECS department at UC Berkeley, advised by Ion Stoica and Joseph E. Gonzalez. His research interests lie in the intersection of machine learning and programming systems, especially domain-specific compilers for accelerated and scalable deep learning.
Recording: public
Efficient Batching Techniques for Dynamic Deep Learning
Speaker: Pratik Fegade (CMU)
Speaker bio Pratik is a PhD student in the Computer Science Department at CMU, where he works with Prof. Todd Mowry, Prof. Phil Gibbons, and Prof. Tianqi Chen. His current research focus is on building better compilation and execution stacks for handling dynamism in deep learning models. In the past, he has worked on building compiler analysis techniques that understand and optimize programs written in general-purpose programming languages at semantically higher levels than is currently possible.
Recording: internal
Accessible and Scalable Transformers through 8-bit Matrix Multiplication and 8-bit Optimizers
Speaker: Tim Dettmers (UW)
Speaker bio Tim Dettmers is a PhD student at the University of Washington advised by Luke Zettlemoyer, working on representation learning and neuro-inspired, hardware-optimized deep learning. Previously he interned at the UCL Machine Reading Group, where he was advised by Sebastian Riedel, working on information retrieval and link prediction in knowledge graphs. He did his master's in computer science at the University of Lugano.
Recording: internal
Resource-Efficient Execution of Deep Learning Computations
Speaker: Deepak Narayanan (Microsoft Research)
Abstract Deep Learning models have enabled state-of-the-art results across a broad range of applications; however, training these models is extremely time- and resource-intensive, taking weeks on clusters with thousands of expensive accelerators in the extreme case. In this talk, I will describe two ideas that help improve the resource efficiency of model training. In the first half of the talk, I will discuss how pipelining can be used to accelerate distributed training. Pipeline parallelism facilitates model training with lower communication overhead than previous methods while still ensuring high compute resource utilization. Pipeline parallelism also enables the efficient training of large models that do not fit on a single worker; for example, we used pipeline parallelism at Nvidia to efficiently scale training to language models with a trillion parameters on 3000+ GPUs. In the second half of this talk, I will describe how resources in a shared cluster with heterogeneous compute resources (e.g., different types of hardware accelerators) should be partitioned among different users to optimize objectives specified over one or more training jobs. Heterogeneity-aware scheduling can improve various scheduling objectives, such as average completion time, makespan, or cloud computing resource cost, by up to 3.5x.
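The standard pipelining arithmetic behind the first half of the talk is easy to sketch: with p balanced stages, m micro-batches, and per-stage time t per micro-batch, the pipeline "bubble" shrinks as m grows (this is generic pipelining math, not the specific schedules from the talk).

```python
def pipeline_stats(p, m, t=1.0):
    """p pipeline stages, m micro-batches, t time per micro-batch per stage."""
    no_overlap = m * p * t          # each micro-batch finishes all stages first
    pipelined = (m + p - 1) * t     # fill + steady state + drain
    bubble_fraction = (p - 1) / (m + p - 1)
    return no_overlap, pipelined, bubble_fraction

for m in (4, 16, 64):
    seq, pipe, bubble = pipeline_stats(p=8, m=m)
    print(f"m={m:3d}: {seq:5.0f} -> {pipe:4.0f} time units, pipeline bubble {bubble:.0%}")
```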
Speaker bio Deepak is a Senior Researcher in the Systems group at Microsoft Research Redmond. His broad research interests include distributed systems and cloud computing. In particular, he is interested in the systems problems associated with learning and deploying machine learning models at scale. He graduated from Stanford with a Ph.D. in Computer Science in September 2021, where he was advised by Prof. Matei Zaharia.
Recording: public
Synthesizing Programmable Accelerators: A Compiler’s Perspective
Speaker: Jian Weng (UCLA)
Abstract Because of the waning benefit of transistor scaling, specialized accelerators have emerged and already achieved great success in both industry and academia. However, all these accelerators require intensive human effort to design the hardware itself as well as the ISA and software stack, which can hardly be justified for every domain of interest. Our work makes a first attempt to automate this process. In this talk, I will present an automated, program-behavior-centric paradigm for full-stack programmable accelerator design.
Speaker bio Jian is a 5th-year Ph.D. Candidate from UCLA under the guidance of Prof. Tony Nowatzki. His research interests mainly lie in designing and analyzing specialized accelerators and their associated compilation technologies.
Recording: public
Overview of Sparse TIR project
Speaker: Zihao Ye (UW)
Recording: internal
Overview of TIR project
Speaker: Ruihang Lai (SJTU)
Speaker bio Ruihang is an undergraduate student at Shanghai Jiao Tong University who works on the Apache TVM project with Tianqi Chen. His research interests include machine learning systems and deep learning compilers.
Recording: public
Overview of Relax Project
Speaker: Andrew Liu (UW)
Large-scale GNN training with DGL
Speaker: Da Zheng (AWS AI)
Abstract Graph neural networks (GNNs) have shown great success in learning from graph-structured data. They are widely used in various applications, such as recommendation, fraud detection, and search. In these domains, the graphs are typically large, containing hundreds of millions of nodes and several billions of edges. To scale graph neural network training on large graphs, we adopt hybrid CPU/GPU mini-batch training, in which we store graph data and sample nodes and their neighbors on the CPU, and perform mini-batch computation on GPUs. In this talk, I will discuss optimizations for GNN mini-batch training in two aspects. First, I will discuss our effort to scale GNN training to a cluster of CPUs and GPUs. We develop multiple optimizations to address the challenges in distributed hybrid CPU/GPU training (reducing data movement and balancing the load in mini-batch computation). With these optimizations, we show good speedup without compromising model accuracy and train GNN models on a graph with 100M nodes in less than 1 minute on a cluster of 32 GPUs. In the second part, I will discuss a new neighbor sampling algorithm called global neighbor sampling (GNS) to reduce data copies from CPU to GPUs. This algorithm efficiently samples neighbor nodes that are already stored in a GPU cache to reduce data copy from CPU to GPU. We show that our neighbor sampling algorithm can achieve state-of-the-art model performance while speeding up mini-batch training by a factor of 2 to 14 compared with previous state-of-the-art algorithms.
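The cache-aware sampling idea can be sketched as biasing neighbor selection toward nodes whose features are already resident on the GPU (illustrative only, sampling with replacement for brevity; not DGL's implementation).

```python
# Bias neighbor sampling toward GPU-cached ("popular") nodes so fewer feature
# vectors have to cross PCIe for each mini-batch.
import random

def sample_neighbors(adj, seed, fanout, gpu_cache, cache_bias=4.0):
    neighbors = adj[seed]
    weights = [cache_bias if n in gpu_cache else 1.0 for n in neighbors]
    k = min(fanout, len(neighbors))
    return random.choices(neighbors, weights=weights, k=k)

adj = {0: [1, 2, 3, 4, 5, 6, 7, 8]}
gpu_cache = {2, 5, 7}                     # "popular" nodes resident on the GPU
batch = sample_neighbors(adj, seed=0, fanout=4, gpu_cache=gpu_cache)
needs_copy = [n for n in batch if n not in gpu_cache]
print("sampled:", batch, "| needs CPU->GPU copy:", needs_copy)
```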
Speaker bio Da Zheng is a senior applied scientist at AWS AI, where he leads the Deep Graph Library and DGL-KE projects for graph neural networks and knowledge graphs. His research interests cover a wide range of areas, including high-performance computing, large-scale data analysis systems, data mining, and machine learning. He received a PhD from the Department of Computer Science at Johns Hopkins University. During his PhD, he worked on FlashGraph and FlashR, frameworks for large-scale graph analysis and data analysis on solid-state drives (SSDs).
Recording: public
Decoupling Algorithm from Hardware Customizations for Software-Defined Reconfigurable Computing
Speaker: Yi-Hsiang (Sean) Lai (Cornell)
Abstract With the pursuit of improving compute performance under strict power constraints, there is an increasing need for deploying applications to heterogeneous hardware architectures with spatial accelerators such as FPGAs. However, although these heterogeneous computing platforms are becoming widely available, they are very difficult to program especially with FPGAs. As a result, the use of such platforms has been limited to a small subset of programmers with specialized hardware knowledge. In this talk, I will first present SuSy, a programming framework composed of a domain-specific language (DSL) and a compilation flow that enables programmers to productively build high-performance systolic arrays on FPGAs. With SuSy, programmers express the design functionality in the form of uniform recurrence equations (UREs). The URE description in SuSy is followed by a set of decoupled spatial mapping primitives that specify how to map the equations to a spatial architecture. More concretely, programmers can apply space-time transformations and several other memory and I/O optimizations to build a highly efficient systolic architecture productively. After that, I will present HeteroCL, an open-source programming infrastructure composed of a Python-based domain-specific language and an FPGA-targeted compilation flow. Similar to SuSy, HeteroCL cleanly decouples algorithm specifications from three important types of hardware customization in compute, data types, and memory architectures. In addition, HeteroCL produces highly efficient hardware implementations for a variety of popular workloads by targeting spatial architecture templates such as systolic arrays and stencil with dataflow architectures.
Speaker bio Yi-Hsiang Lai is currently a 6th-year Ph.D. student at Cornell advised by Prof. Zhiru Zhang. He received both his Master's and Bachelor's degrees in Electrical Engineering from National Taiwan University. His research focuses on high-level synthesis for FPGAs, programming models, and compilers.
Recording: public
Autotuning Production Machine Learning Compilers
Speaker: Mangpo Phothilimthana (Google Research)
Abstract Search-based techniques have been demonstrated effective in solving complex optimization problems that arise in domain-specific compilers for machine learning (ML). Unfortunately, deploying such techniques in production compilers is impeded by several limitations. In this talk, I will present an autotuner for production ML compilers that can tune both graph-level and subgraph-level optimizations at multiple compilation stages. The autotuner applies a flexible search methodology that defines a search formulation for joint optimizations by accurately modeling the interactions between different compiler passes. The autotuner tunes tensor layouts, operator fusion decisions, tile sizes, and code generation parameters in XLA, a production ML compiler, using various search strategies. We demonstrate how to incorporate machine learning techniques such as a learned cost model and various learning-based search strategies to reduce autotuning time. Our learned cost model has high accuracy and outperforms a heavily-optimized analytical performance model. In an evaluation across 150 ML training and inference models on Tensor Processing Units (TPUs), the autotuner offers up to 2.4x and an average 5% runtime speedup over the heavily-optimized XLA compiler. The autotuner has been deployed to automatically tune the most heavily-used production models in Google’s fleet every day.
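The overall loop, a cheap cost model for ranking plus a few real measurements, can be sketched as follows (the search space and cost model here are invented stand-ins, not XLA's).

```python
# Rank a configuration space with a cheap (e.g., learned) cost model, then
# spend real measurement time only on the top candidates.
import itertools, random

search_space = {
    "tile_m": [32, 64, 128],
    "tile_n": [32, 64, 128],
    "fuse_elementwise": [True, False],
}

def cost_model(cfg):
    # Stand-in for a learned model: prefers square-ish tiles and fusion.
    imbalance = abs(cfg["tile_m"] - cfg["tile_n"])
    return imbalance + (0 if cfg["fuse_elementwise"] else 50) + random.random()

def measure(cfg):
    # Stand-in for compiling and running the program on real hardware (expensive).
    return cost_model(cfg) + random.uniform(-1, 1)

candidates = [dict(zip(search_space, vals))
              for vals in itertools.product(*search_space.values())]
top_k = sorted(candidates, key=cost_model)[:4]        # cheap ranking
best = min(top_k, key=measure)                        # few real measurements
print("chosen config:", best)
```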
Speaker bio Mangpo is a research scientist at Google Brain, where she leads Machine Learning for Machine Learning Compilers effort (one of Google Brain moonshots in 2020). Her research interests include compilers, machine learning for systems, program synthesis, and efficient computing. Mangpo completed her PhD in Computer Science at UC Berkeley. Her dissertation focuses on synthesis-aided compilation and programming models for emerging architectures, ranging from an ultra-low-power processor to a programmable network card.
Recording: public