SAMPL organizes a weekly group lunch where we invite speakers from both academia and industry to present their work. The goal is to provide a platform for researchers to share their work and to foster collaborations. The talks are open to everyone in the community. If you are interested in giving a talk, please contact the organizers.
SAMPL talks are sometimes combined with other seminars; please check the following links for more information:
Feb 23rd, 2024, 12:00 - 13:00 PST
Location: CSE 505
Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Speaker:
Keisuke Kamahori (University of Washington)
Abstract
Large Language Models (LLMs) based on the Mixture-of-Experts (MoE) architecture are showing promising performance on various tasks. However, running them in resource-constrained settings, where GPU memory is not abundant, is challenging due to their huge model sizes. Existing systems that offload model weights to CPU memory suffer from the significant overhead of frequently moving data between the CPU and GPU. In this paper, we propose Fiddler, a resource-efficient inference engine with CPU-GPU orchestration for MoE models. The key idea of Fiddler is to use the computation ability of the CPU to minimize data movement between the CPU and GPU. Our evaluation shows that Fiddler can run the uncompressed Mixtral-8x7B model, whose parameters exceed 90GB, generating over 3 tokens per second on a single GPU with 24GB of memory, an order of magnitude improvement over existing methods. The code of Fiddler is publicly available at https://github.com/efeslab/fiddler
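To make the trade-off concrete, here is a minimal, hypothetical sketch of the core orchestration decision (not Fiddler's actual implementation; `run_expert` is an illustrative name): when an expert's weights reside in CPU memory, moving the small per-token activation to the CPU and computing there avoids transferring the much larger expert weights to the GPU.

```python
import torch

def run_expert(expert: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run one MoE expert on whichever device already holds its weights."""
    if next(expert.parameters()).device.type == "cpu":
        # Activations are a few KB per token, while expert weights are
        # hundreds of MB, so moving the activation is far cheaper.
        return expert(x.to("cpu")).to(x.device)
    return expert(x)  # weights already on the GPU: compute there
```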
Speaker bio
Keisuke Kamahori is a first-year Ph.D. student at the Paul G. Allen School of Computer Science & Engineering, University of Washington, advised by Baris Kasikci. He is broadly interested in computer systems and architecture, with a recent focus on systems for LLMs. Prior to that, he received a B.Sc. in Information Science from the University of Tokyo in 2023, advised by Shinya Takamaeda-Yamazaki. He also worked with James Larus at EPFL in the summer of 2022.
Add to calendar:
Feb 16th, 2024
Ray (Data): A scalable and flexible ML toolkit
Speaker:
Stephanie Wang (University of Washington & Anyscale)
Abstract
Ray is a framework for scaling ML and Python applications. In this talk, I'll explain the motivation behind Ray and give a brief history of the system. Ray consists of a general-purpose distributed execution *Core* and a collection of distributed libraries designed for specific domains common to end-to-end ML applications, such as distributed training or inference. I'll also give a deep dive into Ray Data, one of the key libraries that supports data processing for ML. Finally, I'll give an update on our current work in reducing system-level overheads in Ray to enable fine-grained orchestration of distributed accelerators. The goal is to reduce developer burden in building high-performance distributed systems, such as for LLM inference.
If we have time, we'll do a short tutorial together on Ray Data. If you'd like to follow along, please come prepared with a Python 3.10 environment and install Ray with `pip install -U ray` (more installation instructions here: https://docs.ray.io/en/latest/ray-overview/installation.html).
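If you want a quick warm-up beforehand, the snippet below runs a small Ray Data pipeline locally (a minimal sketch, assuming a recent Ray 2.x installation; the doubling transform is just a placeholder):

```python
import ray

ds = ray.data.range(1000)  # distributed dataset with an "id" column
ds = ds.map_batches(lambda batch: {"id": batch["id"] * 2})  # parallel transform
print(ds.take(3))  # e.g. [{'id': 0}, {'id': 2}, {'id': 4}]
```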
Speaker bio
Stephanie is an incoming assistant professor in computer science at the University of Washington, starting in fall 2024. She is interested in distributed systems, systems for machine learning and data processing, programming languages, and how these topics fit together. Previously, she was a Ph.D. student in the RISELab at UC Berkeley, where she was advised by Ion Stoica. She is also a co-creator of and committer for the open-source project Ray, which has been used to train ChatGPT, serve high-performance LLMs, and break the CloudSort 100TB record. At the moment, she is continuing to develop the Ray ecosystem as a software engineer at Anyscale, working primarily on Ray Data, a system for distributed data preprocessing for ML, and some new things.
May 26th, 2022
DELTA: Dynamically Optimizing GPU Memory beyond Tensor Recomputation
Speaker:
Yu Tang (National University of Defense Technology (NUDT))
Abstract
With the development of large-scale deep neural networks, significant success has been achieved in various domains. Nevertheless, the further development of deep neural networks is hampered by limited GPU memory, so optimizing the use of GPU memory resources is in high demand. Swapping and recomputation are commonly applied to make better use of GPU memory in deep learning. However, as an emerging domain, several challenges remain: 1) the efficiency of recomputation is limited for both static and dynamic methods; 2) swapping requires researchers to offload parameters manually, which incurs a great time cost; and 3) no existing method combines tensor swapping with tensor recomputation dynamically and at a fine granularity. To remedy these issues, we propose a novel scheduler manager named DELTA (Dynamic tEnsor offLoad and recompuTAtion). To the best of our knowledge, we are the first to build a reasonable dynamic runtime scheduler that combines tensor swapping and tensor recomputation without user oversight. In DELTA, we first propose a filter algorithm to select the optimal tensors to release from GPU memory and then present a director algorithm to select a proper action for each of these tensors. Furthermore, prefetching and overlapping are deliberately employed to hide the time cost of swapping and recomputing tensors. Experimental results show that DELTA not only saves 40%-70% of GPU memory, substantially surpassing the state-of-the-art method, but also achieves convergence results comparable to the baseline with an acceptable time delay. DELTA also achieves a 2.04× larger maximum batch size when training ResNet-50 and 2.25× when training ResNet-101 compared with the baseline. Moreover, comparisons between swapping cost and recomputation cost in our experiments demonstrate the importance of a reasonable dynamic scheduler over tensor swapping and tensor recomputation, refuting the argument in some related work that swapping should always be the first and best choice.
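For intuition, here is a hypothetical sketch of the per-tensor swap-versus-recompute choice that a director-style algorithm faces (the names, cost model, and bandwidth figure are illustrative assumptions, not taken from the paper):

```python
from dataclasses import dataclass

PCIE_BANDWIDTH = 12e9  # assumed effective PCIe bandwidth, bytes/sec

@dataclass
class TensorInfo:
    size_bytes: int        # GPU memory freed if this tensor is evicted
    recompute_secs: float  # estimated time to recompute it on demand

def choose_action(t: TensorInfo) -> str:
    """Pick the cheaper way to free this tensor's GPU memory."""
    swap_secs = t.size_bytes / PCIE_BANDWIDTH  # time to copy back from CPU
    return "swap" if swap_secs < t.recompute_secs else "recompute"
```

A real scheduler must additionally overlap these costs with computation via prefetching, which is exactly the part DELTA automates.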
Speaker bio
Yu Tang received his M.S. and B.S. degrees in Computer Science from the National University of Defense Technology (NUDT) in 2020 and 2018 respectively, where he is currently pursuing a Ph.D. His current research interests include distributed machine learning, memory optimization, training optimization of large-scale models, and the alternating direction method of multipliers (ADMM). He is currently an intern at Shanghai AI Lab.
Recording:
internal
Apr 14th, 2022
Compiler and Runtime Techniques for Optimizing Deep Learning Applications
Speaker:
Steven Lyubomirsky (UW)
Abstract
As the scaling and performance demands for deep learning systems have grown, system designers have struggled to incorporate innovations at opposite ends of the system stack: more varied and complex deep learning models and specialized hardware accelerators. New models that use data structures and dynamic control flow to address new learning problems cannot immediately benefit from previous system-level optimizations, which are defined over static dataflow graphs. Meanwhile, many novel hardware accelerators for common deep learning operations present unusual computing models and often require manual modification of applications, demanding expertise in both the deep learning domain and in hardware. The challenges of adding accelerator support to existing compiler stacks slow development cycles and constrain deep learning systems' capabilities and efficiency.
Following earlier work on the Relay IR for the TVM framework, this dissertation demonstrates that system design problems in the deep learning domain can be approached by formalizing deep learning models broadly as programs (rather than assuming a more specific structure like a graph) and applying traditional compiler engineering techniques, simplifying various optimizations and transformations. In particular, this work addresses the use of runtime systems to support optimizations for dynamic deep learning models, and the systematic support of accelerators through a formal software/hardware interface. Traditional deep learning model optimizations have been conceived as transformations on static dataflow graphs, but they can be adapted to dynamic models (making no assumptions about control flow) by performing similar reasoning in a runtime system, guided by heuristics that depend on dynamically gathered information. This work details the specific example of Dynamic Tensor Rematerialization, an online approach to gradient checkpointing (recomputing intermediate activations instead of storing them to reduce the memory required for training) that achieves results comparable to optimal static techniques but generalizes to arbitrarily dynamic models. In addressing the problem of supporting accelerators in deep learning compiler stacks, this work demonstrates that a formal software/hardware interface enables traditional compiler techniques like instruction selection to be adapted for accelerators. Namely, this work presents a methodology for implementing a compiler stack with extensible accelerator support that uses term rewriting to automatically discover opportunities to apply accelerator operations, and it lays the foundations for extending formal verification to entire compilation stacks with accelerator support.
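For readers unfamiliar with gradient checkpointing, the sketch below shows the static PyTorch primitive that Dynamic Tensor Rematerialization generalizes: activations inside the checkpointed block are discarded after the forward pass and recomputed during backward, trading compute for memory. (DTR instead makes eviction and rematerialization decisions dynamically at runtime; this only illustrates the underlying trade-off.)

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
)
x = torch.randn(32, 1024, requires_grad=True)
# The forward pass stores no intermediate activations for `block`;
# they are recomputed when backward() reaches this region.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```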
Recording:
public
Note: This was Steven's Ph.D. defense talk. Congrats!