Preprint
Zen: Near-Optimal Communications for Sparse and Distributed DNN
, Zhaozhuo Xu*, Anshumali Shrivastava, T. S. Eugene Ng
Under review
Publications
(* = Equal Contribution)
Scaling Deep Learning through Optimizing Data- and Management-Plane Communications
[slides]
PhD Dissertation 2023 | Rice University
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
[slides]
, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, Yida Wang
SOSP '23 | ACM Symposium on Operating Systems Principles
Augmented Queue: A Scalable In-Network Abstraction for Data Center Network Sharing
Xinyu Crystal Wu*, , Weitao Wang, T. S. Eugene Ng
ACM SIGCOMM '23
Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training
[code]
, Xinyu Crystal Wu, Zhaozhuo Xu, T. S. Eugene Ng
MLSys '23 | Conference on Machine Learning and Systems
Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
[code]
[slides]
, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
EuroSys '23 | The European Conference on Computer Systems
DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks
[code]
, Zhaozhuo Xu*, Xinyu Crystal Wu, Anshumali Shrivastava, T. S. Eugene Ng
ICML '22 | International Conference on Machine Learning
Shufflecast: An Optical, Data-rate Agnostic and Low-Power Multicast Architecture for Next-Generation Compute Clusters
Sushovan Das, Afsaneh Rahbar, Xinyu Crystal Wu, , Weitao Wang, Ang Chen, T. S. Eugene Ng
ToN '22 | IEEE/ACM Transactions on Networking
MXDAG: A Hybrid Abstraction for Emerging Applications
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, , Ang Chen, T. S. Eugene Ng
HotNets '21 | ACM Workshop on Hot Topics in Networks
Efficient and Less Centralized Federated Learning
Li Chou, Zichang Liu, , Anshumali Shrivastava
ECML-PKDD '21
Intra-host Rate Control with Centralized Approach
, Ke Liu, Yifan Shen, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
CLUSTER '16 | IEEE International Conference on Cluster Computing
A Novel Approach for All-to-All Routing in All-optical Hypersquare Torus Network
, Ke Liu, Long Li, Weiyi Chen, Mingyu Chen, Lixin Zhang
CF '16 | ACM International Conference on Computing Frontiers
Adaptive Rate Control over Mobile Data Networks with Heuristic Rate Compensations
Ke Liu, , Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
IWQoS '16 | IEEE/ACM International Symposium on Quality of Service
Experience
Amazon Web Services AI 10/2023 - present
Applied Scientist
Amazon Web Services 09/2022 - 06/2023
Applied Scientist Intern, mentored by Yida Wang and Zhen Jia
Designed a general framework for fault tolerance in large model training.
ByteDance 02/2021 - 08/2022
Research Intern, mentored by Yibo Zhu and Haibin Lin
Designed a general framework to accelerate compression-enabled distributed training by searching for the optimal compression strategy.