Preprints
Zen: Near-Optimal Communications for Sparse and Distributed DNN
Zhuang Wang, Zhaozhuo Xu, Anshumali Shrivastava, T. S. Eugene Ng
Submitted to SIGCOMM'23
Augmented Queue: A Scalable In-Network Abstraction for Data Center Network Sharing
Xinyu Crystal Wu*, Zhuang Wang*, Weitao Wang, T. S. Eugene Ng
(* = Equal Contribution)
Submitted to SIGCOMM'23
Publications
Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training
Zhuang Wang, Xinyu Crystal Wu, Zhaozhuo Xu, T. S. Eugene Ng
MLSys'23 (to appear) | Conference on Machine Learning and Systems
Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies
Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
EuroSys'23 | The European Conference on Computer Systems
DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks
Zhuang Wang, Zhaozhuo Xu, Xinyu Crystal Wu, Anshumali Shrivastava, T. S. Eugene Ng
ICML'22 | International Conference on Machine Learning
Shufflecast: An Optical, Data-rate Agnostic and Low-Power Multicast Architecture for Next-Generation Compute Clusters
Sushovan Das, Afsaneh Rahbar, Xinyu Crystal Wu, Zhuang Wang, Weitao Wang, Ang Chen, T. S. Eugene Ng
ToN'22 | IEEE/ACM Transactions on Networking
DMXDAG: A Hybrid Abstraction for Emerging Applications
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, Zhuang Wang, Ang Chen, T. S. Eugene Ng
HotNets'21 | ACM Workshop on Hot Topics in Networks
Efficient and Less Centralized Federated Learning
Li Chou, Zichang Liu, Zhuang Wang, Anshumali Shrivastava
ECML-PKDD'21 | European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Intra-host Rate Control with Centralized Approach
Zhuang Wang, Ke Liu, Yifan Shen, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
CLUSTER'16 | IEEE International Conference on Cluster Computing
A Novel Approach for All-to-All Routing in All-optical Hypersquare Torus Network
Zhuang Wang, Ke Liu, Long Li, Weiyi Chen, Mingyu Chen, Lixin Zhang
CF'16 | ACM International Conference on Computing Frontiers
Adaptive Rate Control over Mobile Data Networks with Heuristic Rate Compensations
Ke Liu, Zhuang Wang, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
IWQoS'16 | IEEE/ACM International Symposium on Quality of Service
Experience
Amazon Web Services 09/2022 - Present
Applied Scientist Intern
Designing a general framework for fault tolerance in large-scale model training.
ByteDance 02/2021 - 08/2022
Research Intern, mentored by Yibo Zhu and Haibin Lin
Designed a general framework to accelerate compression-enabled distributed training by searching for the optimal compression strategy.