I am a PhD student in the CS department at Rice University, where I am fortunate to be advised by Prof. T. S. Eugene Ng. My research interests are Machine Learning Systems and Data Center Networking. I am currently working on fault tolerance and elastic training for large-scale model training.

I received my MS degree from the Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS) in 2017 and my BS degree in Computer Science from Huazhong University of Science and Technology (HUST) in 2014.

Preprints

Zen: Near-Optimal Communications for Sparse and Distributed DNN
Zhuang Wang, Zhaozhuo Xu, Anshumali Shrivastava, T. S. Eugene Ng
Submitted to SIGCOMM'23

Augmented Queue: A Scalable In-Network Abstraction for Data Center Network Sharing
Xinyu Crystal Wu*, Zhuang Wang*, Weitao Wang, T. S. Eugene Ng (* = Equal Contribution)
Submitted to SIGCOMM'23

Publications

Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training
Zhuang Wang, Xinyu Crystal Wu, Zhaozhuo Xu, T. S. Eugene Ng
MLSys'23 (to appear) | Conference on Machine Learning and Systems

Hi-Speed DNN Training with Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies [pdf]
Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
EuroSys'23 | The European Conference on Computer Systems

DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks [pdf]
Zhuang Wang, Zhaozhuo Xu, Xinyu Crystal Wu, Anshumali Shrivastava, T. S. Eugene Ng
ICML'22 | International Conference on Machine Learning

Shufflecast: An Optical, Data-rate Agnostic and Low-Power Multicast Architecture for Next-Generation Compute Clusters
Sushovan Das, Afsaneh Rahbar, Xinyu Crystal Wu, Zhuang Wang, Weitao Wang, Ang Chen, T. S. Eugene Ng
ToN'22 | IEEE/ACM Transactions on Networking

MXDAG: A Hybrid Abstraction for Emerging Applications
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, Zhuang Wang, Ang Chen, T. S. Eugene Ng
HotNets'21 | ACM Workshop on Hot Topics in Networks

Efficient and Less Centralized Federated Learning
Li Chou, Zichang Liu, Zhuang Wang, Anshumali Shrivastava
ECML-PKDD'21

Intra-host Rate Control with Centralized Approach
Zhuang Wang, Ke Liu, Yifan Shen, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
CLUSTER'16 | IEEE International Conference on Cluster Computing

A Novel Approach for All-to-All Routing in All-optical Hypersquare Torus Network
Zhuang Wang, Ke Liu, Long Li, Weiyi Chen, Mingyu Chen, Lixin Zhang
CF'16 | ACM International Conference on Computing Frontiers

Adaptive Rate Control over Mobile Data Networks with Heuristic Rate Compensations
Ke Liu, Zhuang Wang, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
IWQoS'16 | IEEE/ACM International Symposium on Quality of Service

Experience

Amazon Web Services 09/2022 - Present
Applied Scientist Intern
Designing a general framework for fault tolerance in large-scale model training.

ByteDance 02/2021 - 08/2022
Research Intern, mentored by Yibo Zhu and Haibin Lin
Designed a general framework to accelerate compression-enabled distributed training by searching for the optimal compression strategy.

HUST: 2010 - 2014
ICT, CAS: 2014 - 2017
ANU: F2017
KAUST: S2019
ByteDance: 2021 - 2022
AWS: 2022 - Present
Rice University: 2019 - Present