Why do neural models need so many parameters?
One of the striking aspects of modern neural networks is their extreme size, reaching billions or even trillions of parameters.
Why are so many parameters needed? To attempt an answer to this question, I will discuss an algorithm- and distribution-independent, non-asymptotic trade-off between model size, excess test loss, and training loss for linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- their training loss is close to the noise level -- or "modern" -- they have many more parameters than the minimum needed to fit the training data exactly.
Furthermore, I will provide empirical evidence that realistic models typically achieve optimal performance in the "modern" regime, when they are trained below the noise level.
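The two regimes can be illustrated with a minimal sketch (not from the talk, and all parameter choices here are illustrative): minimum-norm regression on random ReLU features of varying width `p`, trained on noisy linear data. With few features the training loss stays near the noise level ("classical"); once `p` far exceeds the number of samples, the predictor interpolates, driving the training loss well below the noise level ("modern") while the test loss can remain low.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 20      # training samples, input dimension
sigma = 0.5         # noise standard deviation; noise level = sigma**2 = 0.25

# Ground-truth linear model with additive Gaussian noise.
w = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w + sigma * rng.normal(size=n)
X_te = rng.normal(size=(1000, d))
y_te = X_te @ w + sigma * rng.normal(size=1000)

def fit_random_features(p):
    """Minimum-norm least squares on p random ReLU features."""
    W = rng.normal(size=(d, p)) / np.sqrt(d)
    Phi = np.maximum(X @ W, 0)
    Phi_te = np.maximum(X_te @ W, 0)
    beta = np.linalg.pinv(Phi) @ y   # pseudoinverse gives the min-norm fit
    train = np.mean((Phi @ beta - y) ** 2)
    test = np.mean((Phi_te @ beta - y_te) ** 2)
    return train, test

for p in [10, 100, 1000]:
    tr, te = fit_random_features(p)
    print(f"p={p:5d}  train={tr:.4f}  test={te:.4f}")
```

For `p` well above `n`, the feature matrix has full row rank almost surely, so the minimum-norm solution fits the training data exactly and the training loss drops far below the noise level of 0.25.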
Collaborators: Nikhil Ghosh and Like Hui.