I am an Applied Scientist at Amazon Web Services AI. I received my Ph.D. in Computer Science from Rice University, where I was fortunate to be advised by Prof. T. S. Eugene Ng. My research interests focus on Machine Learning Systems and Data Center Networking. I am currently working on fault tolerance and elastic training for large language models.

Previously, I received my MS degree from the Institute of Computing Technology, Chinese Academy of Sciences (ICT, CAS) and my BS degree in Computer Science from Huazhong University of Science and Technology (HUST).

Preprint

Zen: Near-Optimal Communications for Sparse and Distributed DNN
Zhuang Wang*, Zhaozhuo Xu*, Anshumali Shrivastava, T. S. Eugene Ng
Under review

Publications

(* = Equal Contribution)

GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints [slides]
Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, Yida Wang
SOSP '23 | ACM Symposium on Operating Systems Principles

Cupcake: A Compression Scheduler for Scalable Communication-Efficient Distributed Training [code]
Zhuang Wang, Xinyu Crystal Wu, Zhaozhuo Xu, T. S. Eugene Ng
MLSys '23 | Conference on Machine Learning and Systems

Espresso: Unleashing the Full Potential of Gradient Compression with Near-Optimal Usage Strategies [code] [slides]
Zhuang Wang, Haibin Lin, Yibo Zhu, T. S. Eugene Ng
EuroSys '23 | The European Conference on Computer Systems

DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks [code]
Zhuang Wang*, Zhaozhuo Xu*, Xinyu Crystal Wu, Anshumali Shrivastava, T. S. Eugene Ng
ICML '22 | International Conference on Machine Learning

Shufflecast: An Optical, Data-rate Agnostic and Low-Power Multicast Architecture for Next-Generation Compute Clusters
Sushovan Das, Afsaneh Rahbar, Xinyu Crystal Wu, Zhuang Wang, Weitao Wang, Ang Chen, T. S. Eugene Ng
ToN '22 | IEEE/ACM Transactions on Networking

DMXDAG: A Hybrid Abstraction for Emerging Applications
Weitao Wang, Sushovan Das, Xinyu Crystal Wu, Zhuang Wang, Ang Chen, T. S. Eugene Ng
HotNets '21 | ACM Workshop on Hot Topics in Networks

Efficient and Less Centralized Federated Learning
Li Chou, Zichang Liu, Zhuang Wang, Anshumali Shrivastava
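ECML-PKDD '21 | European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases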

Intra-host Rate Control with Centralized Approach
Zhuang Wang, Ke Liu, Yifan Shen, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
CLUSTER '16 | IEEE International Conference on Cluster Computing

A Novel Approach for All-to-All Routing in All-optical Hypersquare Torus Network
Zhuang Wang, Ke Liu, Long Li, Weiyi Chen, Mingyu Chen, Lixin Zhang
CF '16 | ACM International Conference on Computing Frontiers

Adaptive Rate Control over Mobile Data Networks with Heuristic Rate Compensations
Ke Liu, Zhuang Wang, Jack Y. B. Lee, Mingyu Chen, Lixin Zhang
IWQoS '16 | IEEE/ACM International Symposium on Quality of Service

Experiences

Amazon Web Services AI 10/2023 - present  
Applied Scientist

Amazon Web Services 09/2022 - 06/2023
Applied Scientist Intern, mentored by Yida Wang and Zhen Jia
Designed a general framework for fault tolerance in large model training.

ByteDance 02/2021 - 08/2022
Research Intern, mentored by Yibo Zhu and Haibin Lin
Designed a general framework to accelerate compression-enabled distributed training by searching for the optimal compression strategy.

HUST 2010 - 2014
ICT, CAS 2014 - 2017
KAUST S2019
ByteDance 2021 - 2022
AWS 2022 - 2023
Rice University 2019 - 2023
AWS 2023 - present