Wencong Xiao

肖文聪

AI Systems Researcher

About

I am an AI infrastructure engineer and researcher dedicated to building scalable, efficient, and reliable infrastructure to accelerate the arrival of AGI. I currently lead the training infrastructure team at ByteDance Seed, where I oversee research and engineering efforts for large-scale AI infrastructure systems.

Infrastructure is critical for AGI exploration. I enjoy hands-on work building simple, reliable GPU systems with elegant interfaces that support complex and rapidly evolving AI workloads.

Before joining ByteDance, I worked at Alibaba Cloud's PAI machine learning platform team, focusing on large-scale GPU cluster management. I received my Ph.D. from Beihang University in a joint program with Microsoft Research Asia, supervised by Lidong Zhou (MSR) and Prof. Wei Li (Beihang).

35+ Publications
4000+ Citations

Experience

Seed Training Infra Lead
ByteDance Seed
2026 – Present
Leading training infrastructure for large-scale AI systems

AI-Infra Engineer
Alibaba Group, PAI Team
2019 – 2025
Deep learning infrastructure and large-scale cluster management

Research Intern
Microsoft Research Asia & Microsoft Research Redmond
2013 – 2019 (5+ years)
Distributed machine learning, GPU cluster scheduling, graph computing

Education

Doctor of Philosophy in Computer Science
Beihang University (北京航空航天大学)
2014 – 2019
Distributed Systems | Supervisors: Prof. Wei Li, Lidong Zhou (MSR)

Bachelor's degree in Computer Science
Beihang University (北京航空航天大学)
2010 – 2014

Selected Publications

View all publications on Google Scholar

Robust LLM Training Infrastructure at ByteDance
B Wan, G Liu, Z Song, J Wang, Y Zhang, G Sheng, S Wang, H Wei, ..., Wencong Xiao, ...
SOSP 2025

Llumnix: Dynamic Scheduling for Large Language Model Serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin
OSDI 2024

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin
arXiv 2024

MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters
Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, Yu Ding
NSDI 2022

AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, Yangqing Jia
OSDI 2020

An Empirical Study on Program Failures of Deep Learning Jobs 🏆 Distinguished Paper Award
Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, Mao Yang
ICSE 2020

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity
Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxing Liu, Ming Wu, Lintao Zhang
FPGA 2019

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang
USENIX ATC 2019

Gandiva: Introspective Cluster Scheduling for Deep Learning
Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, Lidong Zhou
OSDI 2018

TuX²: Distributed Graph Computation for Machine Learning
Wencong Xiao, Jilong Xue, Youshan Miao, Cheng Chen, Zhen Li, Ming Wu, Wei Li, Lidong Zhou
NSDI 2017