Research Problems in Pretraining
Pretraining a large language model means solving a single optimization problem, minimizing next-token prediction loss, at a scale where each design choice costs millions of dollars in compute and months of GPU time. The resulting loss curves exhibit remarkably clean power-law structure, yet practitioners navigate this landscape largely by intuition. After a brief practitioner's overview of the problem areas in pretraining, I will take a deeper look at scaling laws, how they are applied to experiment design in practice, and the challenges that arise in these applications. I will then go through areas with sharp mathematical structure, such as principled scaling and optimization methods. I will close by discussing some of the unsolved mysteries in pretraining.