Distributed Load Balancing for Large-Scale AI Systems
In "Seminars and talks"

Speakers

Wenxin Zhang

Columbia Business School

Wenxin Zhang is a PhD candidate in the Decision, Risk, and Operations Division at Columbia Business School. Her research focuses on dynamic resource allocation in large-scale service systems, with emphasis on improving the efficiency of modern AI and large language model serving. She bridges theory and practice through collaborations with Google Research and ParkHub. Her work has been published in Operations Research, ACM EC, and NeurIPS, and was recognized as a finalist for the 2025 Applied Probability Society Best Student Paper Competition. She received her B.E. in Industrial Engineering from Tsinghua University.


Date:
Tuesday, 2 December 2025
Time:
10:00 am - 11:45 am
Venue:
NUS Business School
Mochtar Riady Building BIZ1 0302
15 Kent Ridge Drive
Singapore 119245

Abstract

Modern AI services rely on vast, expensive computational resources sustained by a global network of data centers. A key operational challenge is keeping these systems responsive by routing requests efficiently. This talk introduces the Greatest Marginal Service Rate (GMSR) policy, which routes each request to the data center where it will have the greatest marginal impact on the current service rate. GMSR is fully distributed: routers in different geographic regions make decisions independently using only local information. This design makes the system scalable and resilient while eliminating the need for complex coordination. We prove that, despite its distributed design, the GMSR policy converges to the globally optimal solution that a central coordinator would choose to minimize system-wide latency. The policy is also robust: even when the system is overloaded, it maximizes throughput and minimizes latency for all completed requests. This work provides a practical and provably optimal load balancer for building the next generation of scalable and responsive AI systems.
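The routing rule described in the abstract can be sketched in a few lines. The following is a hypothetical illustration only, not the talk's actual model: it assumes each data center's aggregate service rate is a known concave function of its current load, and estimates the marginal rate by a finite difference. The names `marginal_rate`, `gmsr_route`, and the example service-rate curves are all invented for illustration.

```python
import math

def marginal_rate(service_rate, load, eps=1.0):
    """Finite-difference estimate of the marginal service rate at `load`."""
    return (service_rate(load + eps) - service_rate(load)) / eps

def gmsr_route(data_centers):
    """Route one request to the data center with the greatest marginal service rate.

    `data_centers` maps a name to (service_rate_fn, current_load). Only this
    local information is consulted, so each router can decide independently,
    mirroring the distributed design described in the abstract.
    """
    return max(data_centers, key=lambda dc: marginal_rate(*data_centers[dc]))

# Illustrative concave service-rate curves (diminishing returns with load):
dcs = {
    "us-east": (lambda n: 100 * math.log1p(n), 50.0),  # heavily loaded
    "eu-west": (lambda n: 80 * math.log1p(n), 10.0),   # lightly loaded
}
print(gmsr_route(dcs))  # the lightly loaded center wins: "eu-west"
```

Because the curves are concave, the heavily loaded center offers a smaller marginal gain, so the request is sent to the lightly loaded one; with this greedy local rule, independent routers tend to equalize marginal service rates across centers.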