Seminar
Fine-tuning LLMs via policy gradient algorithms
Speaker
Shengtong Zhang (Stanford & Cursor)
Date
Wed, May 20 2026, 2:00pm
Location
384H
Starting from first principles, I will derive (a variant of) the GRPO algorithm, one of the most widely used algorithms for post-training large language models. Then I will sketch how this algorithm is implemented at scale. Finally, I will briefly describe an important open problem known as training–inference mismatch.