Publications
(* = Equal Contribution)
Empowering Distributed Training with Sparsity-driven Data Synchronization
, Zhaozhuo Xu*, Jingyi Xi*, Yuke Wang, Anshumali Shrivastava, T. S. Eugene Ng
OSDI '25 | USENIX Symposium on Operating Systems Design and Implementation
Marconi: Prefix Caching for the Era of Hybrid LLMs
Rui Pan, , Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali
MLSys '25 | Conference on Machine Learning and Systems (Outstanding Paper Honorable Mention)
Scaling Deep Learning through Optimizing Data- and Management-Plane Communications
[slides]
PhD Dissertation 2023 | Rice University
GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints
[slides]
, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, Yida Wang
SOSP '23 | ACM Symposium on Operating Systems Principles
Augmented Queue: A Scalable In-Network Abstraction for Data Center Network Sharing
Xinyu Crystal Wu*, , Weitao Wang, T. S. Eugene Ng
ACM SIGCOMM '23
Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training
[code]
, Xinyu Crystal Wu, Zhaozhuo Xu, T. S. Eugene Ng
MLSys '23 | Conference on Machine Learning and Systems
Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
[code]
[slides]
, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
EuroSys '23 | The European Conference on Computer Systems
DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks
[code]
, Zhaozhuo Xu*, Xinyu Crystal Wu, Anshumali Shrivastava, T. S. Eugene Ng
ICML '22 | International Conference on Machine Learning
Shufflecast: An Optical, Data-rate Agnostic and Low-Power Multicast Architecture for Next-Generation Compute Clusters
Sushovan Das, Afsaneh Rahbar, Xinyu Crystal Wu, , Weitao Wang, Ang Chen, T. S. Eugene Ng
ToN '22 | IEEE/ACM Transactions on Networking
DMXDAG: A Hybrid Abstraction for Emerging Applications
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, , Ang Chen, T. S. Eugene Ng
HotNets '21 | ACM Workshop on Hot Topics in Networks
Efficient and Less Centralized Federated Learning
Li Chou, Zichang Liu, , Anshumali Shrivastava
ECML-PKDD '21 | European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Intra-host Rate Control with Centralized Approach
, Ke Liu, Yifan Shen, Jack YB Lee, Mingyu Chen, Lixin Zhang
CLUSTER '16 | IEEE International Conference on Cluster Computing
A Novel Approach for All-to-All Routing in All-optical Hypersquare Torus Network
, Ke Liu, Long Li, Weiyi Chen, Mingyu Chen, Lixin Zhang
CF '16 | ACM International Conference on Computing Frontiers
Adaptive Rate Control over Mobile Data Networks with Heuristic Rate Compensations
Ke Liu, , Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
IWQoS '16 | IEEE/ACM International Symposium on Quality of Service
Experiences
Amazon Web Services AI 10/2023 - present
Applied Scientist
Amazon Web Services 09/2022 - 06/2023
Applied Scientist Intern, mentored by Yida Wang and Zhen Jia
Designed a general framework for fault tolerance in large model training.
ByteDance 02/2021 - 08/2022
Research Intern, mentored by Yibo Zhu and Haibin Lin
Designed a general framework to accelerate compression-enabled distributed training by searching for the optimal compression strategy.
Services
Program Committee Member (Conferences)
USENIX ATC, MLSys (2025)
ICDCS, NeurIPS, ICDM, Bench, AISTATS, ICLR (2024)
Reviewer (Journals)
IEEE/ACM Transactions on Networking, BenchCouncil (2024)