Wencong Xiao

肖文聪

AI Systems Researcher

About

I am an AI infrastructure engineer and researcher dedicated to building scalable, efficient, and reliable infrastructure to accelerate the arrival of AGI. I currently lead the training infrastructure team at ByteDance Seed, where I oversee research and engineering efforts for large-scale AI infrastructure systems.

Infrastructure is critical for AGI exploration. I enjoy hands-on work building simple, reliable GPU systems that provide elegant interfaces for complex and rapidly evolving AI workloads.

Before joining ByteDance, I worked at Alibaba Cloud's PAI machine learning platform team, focusing on large-scale GPU cluster management. I received my Ph.D. from Beihang University in a joint program with Microsoft Research Asia, supervised by Lidong Zhou (MSR) and Prof. Wei Li (Beihang).

35+ Publications

4000+ Citations

Experience

Seed Training Infra Lead

ByteDance Seed

2026 – Present

Leading training infrastructure for large-scale AI systems

AI-Infra Engineer

Alibaba Group, PAI Team

2019 – 2025

Deep learning infrastructure and large-scale cluster management

Research Intern

Microsoft Research Asia & Microsoft Research Redmond

2013 – 2019 (5+ years)

Distributed machine learning, GPU cluster scheduling, graph computing

Education

Doctor of Philosophy in Computer Science

Beihang University (北京航空航天大学)

2014 – 2019

Distributed Systems | Supervisors: Prof. Wei Li, Lidong Zhou (MSR)

Bachelor's Degree in Computer Science

Beihang University (北京航空航天大学)

2010 – 2014

Selected Publications

View all publications on Google Scholar

Robust LLM Training Infrastructure at ByteDance

B Wan, G Liu, Z Song, J Wang, Y Zhang, G Sheng, S Wang, H Wei, ..., Wencong Xiao, ...

SOSP 2025

📄 Paper

Llumnix: Dynamic Scheduling for Large Language Model Serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Wei Lin

OSDI 2024

📄 Paper

Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache

Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Anmin Liu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin

arXiv 2024

📄 Paper

MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters

Qizhen Weng, Wencong Xiao, Yinghao Yu, Wei Wang, Cheng Wang, Jian He, Yong Li, Liping Zhang, Wei Lin, Yu Ding

NSDI 2022

📄 Paper 📊 Dataset

AntMan: Dynamic Scaling on GPU Clusters for Deep Learning

Wencong Xiao, Shiru Ren, Yong Li, Yang Zhang, Pengyang Hou, Zhi Li, Yihui Feng, Wei Lin, Yangqing Jia

OSDI 2020

📄 Paper

An Empirical Study on Program Failures of Deep Learning Jobs 🏆 Distinguished Paper Award

Ru Zhang, Wencong Xiao, Hongyu Zhang, Yu Liu, Haoxiang Lin, Mao Yang

ICSE 2020

📄 Paper

Efficient and Effective Sparse LSTM on FPGA with Bank-Balanced Sparsity

Shijie Cao, Chen Zhang, Zhuliang Yao, Wencong Xiao, Lanshun Nie, Dechen Zhan, Yunxin Liu, Ming Wu, Lintao Zhang

FPGA 2019

📄 Paper

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

Myeongjae Jeon, Shivaram Venkataraman, Amar Phanishayee, Junjie Qian, Wencong Xiao, Fan Yang

USENIX ATC 2019

📄 Paper 📊 Dataset

Gandiva: Introspective Cluster Scheduling for Deep Learning

Wencong Xiao, Romil Bhardwaj, Ramachandran Ramjee, Muthian Sivathanu, Nipun Kwatra, Zhenhua Han, Pratyush Patel, Xuan Peng, Hanyu Zhao, Quanlu Zhang, Fan Yang, Lidong Zhou

OSDI 2018

📄 Paper

TuX²: Distributed Graph Computation for Machine Learning

Wencong Xiao, Jilong Xue, Youshan Miao, Cheng Chen, Zhen Li, Ming Wu, Wei Li, Lidong Zhou

NSDI 2017

📄 Paper