Massachusetts Institute of Technology
Zijie Zhou is a fourth-year PhD candidate at the MIT Laboratory for Information and Decision Systems (LIDS) and the Operations Research Center (ORC). His advisors are Professor Patrick Jaillet of MIT EECS and Professor Chara Podimata of MIT Sloan. His research expertise lies in online algorithm design, online optimization, experiment design, and large language model inference. In the summer of 2024, he interned at Microsoft Research and Azure, where he focused on developing foundational models and algorithms for LLM inference. In the summer of 2023, he interned at Oracle Labs, where he designed robust booking limit policies and dynamic upgrading mechanisms for the hospitality industry in Europe.
Date: Monday, 11 November 2024
Time: 10:00 am - 11:30 am
Venue: NUS Business School, Mochtar Riady Building, BIZ1 0205, 15 Kent Ridge Drive, Singapore 119245
Large Language Model (LLM) inference involves the computational techniques and strategies used to process input prompts and generate responses through a large language model. This field intersects significantly with online optimization, particularly in areas such as online batching, scheduling, and resource allocation, making it a topic of keen interest to the OR/OM community. In this talk, I will introduce a foundational model for managing computational tasks on a single GPU, focusing on reducing redundant computation through a memory-saving mechanism: the KV (key-value) cache, which temporarily stores information from each word the model processes so that it does not have to be recalculated repeatedly. However, as more words are processed, this storage can quickly reach its limit, and when that happens the system incurs substantial extra cost by reprocessing tasks. We optimize batching and scheduling strategies to manage KV cache memory usage and minimize inference latency, improving efficiency and sustainability.
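To give a concrete feel for the mechanism described above, the following toy sketch (a simplified illustration, not the model analyzed in the talk) simulates several requests decoding against a fixed KV-cache budget: each generated word reserves one cache slot, and when the budget would be exceeded a request's cached entries are dropped and it must later be reprocessed from scratch. The capacity CACHE_LIMIT and the eviction rule are hypothetical choices made only for this example.

    # Toy illustration of KV-cache growth and overflow (hypothetical model).
    # Each active request holds one cache slot per generated word; if total
    # usage would exceed CACHE_LIMIT, a request is evicted and must later be
    # reprocessed from scratch, incurring extra cost.

    CACHE_LIMIT = 8  # hypothetical KV-cache capacity, in word slots

    def run_step(active, evicted, used):
        """Advance every active request by one decoded word."""
        for req in list(active):
            if used + 1 > CACHE_LIMIT:
                # Overflow: drop this request's cached entries; it will have
                # to recompute all of its words when it is rescheduled.
                used -= active[req]
                evicted.append((req, active.pop(req)))
            else:
                active[req] += 1
                used += 1
        return used

    # Three requests decoding concurrently against a small cache.
    active = {"r1": 0, "r2": 0, "r3": 0}
    evicted, used = [], 0
    for step in range(5):
        used = run_step(active, evicted, used)
        print(f"step {step}: cache used = {used}, evicted = {evicted}")

In this toy run, request r3 overflows the cache at the third step and would have to be recomputed from scratch; that reprocessing cost is exactly what the batching and scheduling policies discussed in the talk are designed to control.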
We address this challenge by first analyzing a semi-online model, where all prompts arrive at the outset and must be processed sequentially. For this case, we develop a polynomial-time algorithm that achieves exact optimality. Next, we examine the fully online setting with sequential prompt arrivals. For adversarial arrival sequences, we demonstrate that no algorithm can achieve a constant competitive ratio. For stochastic arrivals, we present a fast algorithm that guarantees constant regret, proved using a novel framework based on compensated coupling. Finally, using the Vidur simulator on a public conversation dataset, we compare our algorithm to benchmark algorithms on two linked A100 GPUs running the Llama-70B model. After optimizing the benchmarks' parameters, we find that in high-demand scenarios our algorithm's average latency grows only one-third as fast as that of the best benchmark, and in low-demand scenarios only one-eighth as fast.
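To make the online setting concrete, here is a minimal sketch of a naive threshold-style admission rule (purely illustrative, and not the algorithm with the guarantees presented in the talk): an arriving prompt joins the current batch only if its projected KV-cache footprint still fits under the memory limit; otherwise it waits in a queue until capacity is released. The class name OnlineBatcher, the budget value, and the footprint estimate are all assumptions made for this example.

    from collections import deque

    class OnlineBatcher:
        """Naive admission policy: admit a prompt only if its projected
        KV footprint (prompt length + expected output length) still fits."""

        def __init__(self, limit):
            self.limit = limit   # hypothetical KV-cache budget (word slots)
            self.queue = deque() # prompts waiting to start
            self.batch = []      # (prompt_len, expected_out) pairs in service
            self.used = 0        # projected cache slots currently reserved

        def arrive(self, prompt_len, expected_out):
            self.queue.append((prompt_len, expected_out))
            self._admit()

        def finish(self, job):
            # A request completed: release its reserved cache slots.
            self.batch.remove(job)
            self.used -= sum(job)
            self._admit()

        def _admit(self):
            while self.queue:
                p, o = self.queue[0]
                if self.used + p + o > self.limit:
                    break        # next prompt would overflow; keep it queued
                self.queue.popleft()
                self.batch.append((p, o))
                self.used += p + o

    # Example: three arrivals against a tight budget.
    b = OnlineBatcher(limit=600)
    b.arrive(200, 100)   # admitted (300 slots reserved)
    b.arrive(150, 100)   # admitted (550 slots reserved)
    b.arrive(100, 50)    # queued: would exceed 600
    b.finish((200, 100)) # frees 300 slots, so the queued prompt is admitted
    print(b.used, len(b.batch), len(b.queue))  # 400 2 0

The sketch only captures the bookkeeping of admission against a cache budget; the algorithms presented in the talk choose batching and ordering far more carefully, which is what yields the optimality and constant-regret guarantees described above.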