In this post, I explore machine learning through four fundamental questions, viewed through the lens of manifold learning.

Q1: Why do deep neural networks work so well?

Before deep learning dominated, we relied on traditional methods like SVMs [Cortes & Vapnik, 1995], Random Forests [Breiman, 2001], and simple backpropagation networks [Rumelhart et al., 1986]. These models were small and simple, designed to find direct connections between inputs and outputs.

From a manifold learning perspective, the solutions we are modeling lie on (or near) a low-dimensional manifold embedded in a high-dimensional ambient space. Traditional methods worked well for simpler problems (often convex), but struggled with complex tasks like image interpretation, where inputs have tens of thousands of dimensions.
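To make the manifold intuition concrete, here is a minimal sketch (an illustrative toy example, not tied to any result discussed in this post): points generated on scikit-learn's swiss roll sit in a 3-D ambient space but have only two intrinsic degrees of freedom, which a manifold-learning method such as Isomap can recover. The dataset and hyperparameters are arbitrary choices.

```python
# Toy illustration of the manifold hypothesis: points that live in 3-D
# ambient space but actually lie on a 2-D "swiss roll" manifold can be
# unrolled by a manifold-learning method such as Isomap.
import numpy as np
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import Isomap

X, color = make_swiss_roll(n_samples=2000, noise=0.05, random_state=0)
print("ambient dimension:", X.shape[1])                # 3

embedding = Isomap(n_neighbors=12, n_components=2).fit_transform(X)
print("intrinsic embedding shape:", embedding.shape)   # (2000, 2)
```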

Deep neural networks shift us from convex to non-convex optimization—harder to train, but capable of solving remarkably difficult problems in vision, language, and speech.

Loss landscapes of different neural networks. Even among DNNs, the loss landscape varies. Left: ResNet-101 [He et al., 2016] appears more non-convex; right: DenseNet [Huang et al., 2017] appears more convex. Image Credit: [Li et al., 2018].

So why do deep neural networks work so well? First, the tasks we tackle possess genuine underlying patterns that learning algorithms can capture. Second, their non-linear activation functions, such as ReLU [Nair & Hinton, 2010], GELU [Hendrycks & Gimpel, 2016], SELU [Klambauer et al., 2017], and SwiGLU [Shazeer, 2020], enable the modeling of complex, non-linear relationships that are otherwise difficult for humans or linear models to capture. Third, advances in hardware, notably GPUs, allow for massive parallel computation, making it feasible to train very large models efficiently. Finally, deep networks automatically learn hierarchical representations of data, finding effective structures that might be difficult for humans to articulate explicitly.
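As a concrete reference for the second point, the sketch below implements these activations in plain NumPy. The tanh-based GELU approximation, the published SELU constants, and the SwiGLU gating form Swish(xW + b) ⊗ (xV + c) follow the cited papers; the shapes and demo values are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def gelu(x):
    # tanh approximation of GELU [Hendrycks & Gimpel, 2016]
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # SELU with the published (alpha, scale) constants [Klambauer et al., 2017]
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swiglu(x, W, V, b=0.0, c=0.0):
    # SwiGLU [Shazeer, 2020]: Swish(xW + b) gates a second projection xV + c
    z = x @ W + b
    return (z / (1.0 + np.exp(-z))) * (x @ V + c)

x = np.linspace(-3.0, 3.0, 7)
print("relu:", relu(x))
print("gelu:", gelu(x))
print("selu:", selu(x))

rng = np.random.default_rng(0)
tokens, W, V = rng.normal(size=(4, 8)), rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
print("swiglu output shape:", swiglu(tokens, W, V).shape)   # (4, 16)
```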

Linear regions in neural networks: the number of linear regions \(N_r\) grows exponentially with the depth of the network (left); comparison of the linear regions created by shallow vs. deep networks (right). Image Credit: Welch Labs [Welch Labs, 2025].
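A small experiment makes the depth advantage tangible. The sketch below uses the classic "sawtooth" composition (a standard construction, not taken from the video): each layer spends only two ReLU units to realize a tent map, yet composing it \(L\) times produces \(2^L\) linear pieces, whereas a single hidden layer with the same \(2L\) units can produce at most \(2L + 1\) pieces on a line.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def tent(x):
    # One hidden layer with two ReLU units realizes the "tent" map on [0, 1]:
    # tent(x) = 2x on [0, 0.5] and 2 - 2x on [0.5, 1].
    return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

def sawtooth(x, depth):
    # Composing the tent map `depth` times yields 2**depth linear pieces
    # while spending only two ReLU units per layer.
    for _ in range(depth):
        x = tent(x)
    return x

def count_linear_pieces(f):
    # Sample on a dyadic grid (n - 1 = 2**18) so every kink at a multiple of
    # 2**-depth lands exactly on a grid point, then count slope changes.
    x = np.linspace(0.0, 1.0, 2 ** 18 + 1)
    slopes = np.diff(f(x)) / np.diff(x)
    return 1 + int(np.sum(~np.isclose(slopes[1:], slopes[:-1])))

for depth in range(1, 7):
    pieces = count_linear_pieces(lambda x: sawtooth(x, depth))
    # A single hidden layer with the same budget (2 * depth ReLU units)
    # can produce at most 2 * depth + 1 linear pieces on a line.
    print(f"depth {depth}: {pieces:3d} linear pieces (2**{depth} = {2 ** depth})")
```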

Another perspective on the success of Deep Neural Networks comes from rethinking generalization. The classical view of the bias-variance trade-off suggests a U-shaped risk curve, where increasing model capacity eventually leads to overfitting. However, modern deep learning operates in a regime where high-capacity models continue to generalize well, often described by the “double descent” curve.

Curves for training risk (dashed line) and test risk (solid line). (A) The classical U-shaped risk curve arising from the bias–variance trade-off. (B) The double-descent risk curve, which incorporates the U-shaped risk curve (i.e., the “classical” regime) together with the observed behavior from using high-capacity function classes (i.e., the “modern” interpolating regime), separated by the interpolation threshold. The predictors to the right of the interpolation threshold have zero training risk. Image Credit: [Belkin et al., 2019].
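Double descent can be reproduced with something as simple as random-feature regression. The sketch below is an illustrative setup (not the experiment behind the figure): it fits random ReLU features with minimum-norm least squares and sweeps the number of features \(p\) past the number of training samples; the test error typically spikes near the interpolation threshold \(p \approx n\) and falls again beyond it.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_train, n_test, noise = 8, 50, 2000, 0.2

def target(X):
    # Ground-truth function for this toy experiment (an arbitrary choice).
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

def relu_features(X, W):
    return np.maximum(X @ W, 0.0)

X_tr = rng.normal(size=(n_train, d))
X_te = rng.normal(size=(n_test, d))
y_tr = target(X_tr) + noise * rng.normal(size=n_train)
y_te = target(X_te)

for p in [5, 10, 25, 45, 50, 55, 100, 400, 2000]:
    errs = []
    for seed in range(20):                       # average over random feature draws
        W = np.random.default_rng(seed).normal(size=(d, p)) / np.sqrt(d)
        Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
        # lstsq returns the minimum-norm solution once p > n_train,
        # i.e. once the model interpolates the training data.
        beta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
        errs.append(np.mean((Phi_te @ beta - y_te) ** 2))
    print(f"p = {p:5d}  test MSE = {np.mean(errs):.3f}")
```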

This phenomenon is explored in [Zhang et al., 2017]. In their paper, they prove:

“There exists a two-layer neural network with ReLU activations and 2n + d weights that can represent any function on a sample of size n in d dimensions.”
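The construction behind this statement is short enough to write down. The sketch below follows the idea of the proof: project the data onto a random direction \(a\), place one ReLU "knot" between consecutive sorted projections, and solve the resulting triangular system. The network uses \(d + n + n = 2n + d\) weights and fits arbitrary labels (up to floating-point error); the dataset here is synthetic for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))
y = rng.normal(size=n)                         # arbitrary (even random) labels

# 1) Project onto a random direction a; with probability one the projections differ.
a = rng.normal(size=d)
z = X @ a
order = np.argsort(z)
z_sorted = z[order]

# 2) Put one ReLU "knot" b_j just below the j-th sorted projection, so that
#    M[i, j] = relu(z_(i) - b_j) is lower triangular with a positive diagonal.
b = np.empty(n)
b[0] = z_sorted[0] - 1.0
b[1:] = 0.5 * (z_sorted[:-1] + z_sorted[1:])

M = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(M, y[order])               # second-layer weights

# 3) f(x) = sum_j w_j * relu(a.x - b_j) uses d + n + n = 2n + d weights
#    and fits every label exactly (up to floating-point error).
preds = np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w
print("max |f(x_i) - y_i| =", np.max(np.abs(preds - y)))
```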

Observations from [Zhang et al., 2017]

  1. The effective capacity of neural networks is sufficient for memorizing the entire data set.
  2. Even optimization on random labels remains easy; in fact, training time increases only by a small constant factor compared with training on the true labels (see the sketch after this list).
  3. Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
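Observations 1 and 2 are easy to reproduce at toy scale. The sketch below uses a small Gaussian dataset and MLP chosen for illustration (not the CIFAR-10 setup of the paper): it trains the same over-parameterized network on structured labels and on purely random labels and reports how many full-batch epochs each takes to reach zero training error; the random labels typically take only modestly longer.

```python
import torch
from torch import nn

torch.manual_seed(0)
n, d, width, max_epochs = 256, 32, 512, 2000

X = torch.randn(n, d)
true_labels = (X[:, 0] > 0).long()            # labels with real structure
rand_labels = torch.randint(0, 2, (n,))       # labels with no structure at all

def epochs_to_memorize(y):
    """Full-batch training of an over-parameterized MLP until it reaches
    zero training error on the given labels."""
    model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(1, max_epochs + 1):
        opt.zero_grad()
        logits = model(X)
        loss_fn(logits, y).backward()
        opt.step()
        if (logits.argmax(dim=1) == y).float().mean() == 1.0:
            return epoch
    return max_epochs

print("epochs to fit true labels:  ", epochs_to_memorize(true_labels))
print("epochs to fit random labels:", epochs_to_memorize(rand_labels))
```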

Follow-up theoretical work has sought to explain why simple algorithms like SGD can navigate the highly non-convex loss landscapes of deep neural networks and still find global minima. A significant breakthrough came from [Allen-Zhu et al., 2019], who provided a convergence theory for deep learning based on over-parameterization.

Assumptions of the Convergence Theory for Deep Learning [Allen-Zhu et al., 2019]

  1. Non-degenerate inputs: No two data points are identical.
  2. Over-parameterization: The network width is sufficiently large (polynomial in the number of layers \(L\) and samples \(n\)).

Their analysis reveals two properties of the optimization landscape near random initialization:

  1. Almost-Convexity: In a sufficiently large neighborhood of random initialization, the landscape behaves similarly to a convex function.
  2. Semi-Smoothness: Despite the non-smoothness of ReLU activations, the objective function is “semi-smooth,” allowing gradient-based methods to succeed.

Landscapes of the CIFAR-10 image-classification training objective near the SGD training trajectory. The blue vertical stick marks the current point at the current iteration. The landscape appears ‘almost convex’ in the neighborhood of the trajectory. Image Credit: [Allen-Zhu et al., 2019].
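A crude way to see this "almost convex" neighborhood is to probe the loss along a one-dimensional slice around a point on the training trajectory. The sketch below uses a toy dataset and a heavily over-parameterized two-layer network (standing in for the CIFAR-10 objective in the figure): it takes a few SGD steps and then evaluates the loss along a random, norm-matched direction; near the trajectory the printed slice typically looks bowl-shaped.

```python
import torch
from torch import nn

torch.manual_seed(0)
n, d, width = 512, 20, 2048                    # heavily over-parameterized net

X = torch.randn(n, d)
y = (X[:, :2].sum(dim=1) > 0).long()

model = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Take a few full-batch SGD steps so we sit somewhere on the training trajectory.
for _ in range(100):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

# Probe the loss along a random, norm-matched direction around the current point.
theta0 = [p.detach().clone() for p in model.parameters()]
direction = []
for p in theta0:
    dvec = torch.randn_like(p)
    direction.append(dvec * p.norm() / (dvec.norm() + 1e-12))

def loss_on_slice(alpha):
    with torch.no_grad():
        for p, p0, dvec in zip(model.parameters(), theta0, direction):
            p.copy_(p0 + alpha * dvec)
        return loss_fn(model(X), y).item()

for alpha in (-1.0, -0.5, -0.25, 0.0, 0.25, 0.5, 1.0):
    print(f"alpha = {alpha:+.2f}   loss = {loss_on_slice(alpha):.4f}")
```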

This theory also establishes a formal equivalence between sufficiently over-parameterized neural networks and the Neural Tangent Kernel (NTK), bridging the gap between deep learning and kernel methods in specific regimes.
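The NTK connection rests on "lazy training": as width grows, the parameters of a sufficiently over-parameterized network barely move from their initialization while the network still fits the data, so the model behaves like a fixed kernel machine. The sketch below is an illustrative experiment (an NTK-parameterized two-layer network on a toy regression task, with hyperparameters chosen for speed, not the paper's setting): it trains only the first layer with full-batch gradient descent and reports the relative parameter movement, which shrinks roughly like \(1/\sqrt{\text{width}}\).

```python
import torch

torch.manual_seed(0)
d, n, steps, lr = 10, 20, 2000, 0.2
X = torch.randn(n, d)
X = X / X.norm(dim=1, keepdim=True)            # unit-norm inputs
y = torch.randn(n)                             # arbitrary regression targets

def relative_movement(m):
    """Train only the first layer of f(x) = v . relu(Wx) / sqrt(m) with
    full-batch gradient descent and report how far W moves from its init."""
    W0 = torch.randn(m, d)
    v = torch.randint(0, 2, (m,)).float() * 2.0 - 1.0   # fixed +-1 output layer
    W = W0.clone().requires_grad_(True)
    for _ in range(steps):
        f = torch.relu(X @ W.T) @ v / m ** 0.5
        loss = 0.5 * ((f - y) ** 2).sum()
        (grad,) = torch.autograd.grad(loss, W)
        with torch.no_grad():
            W -= lr * grad
    with torch.no_grad():
        final = 0.5 * ((torch.relu(X @ W.T) @ v / m ** 0.5 - y) ** 2).sum().item()
        rel = ((W - W0).norm() / W0.norm()).item()
    return final, rel

for m in (100, 1000, 10000):
    final_loss, rel = relative_movement(m)
    print(f"width {m:6d}: train loss {final_loss:.2e}, ||W - W0|| / ||W0|| = {rel:.4f}")
```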

Q2: Why do state-of-the-art models still show far less generalizability than human beings?

(Content to be added)

Q3: Is it possible to build deep neural networks that solve real-world inversion problems?

(Content to be added)

Q4: Are we still suffering from the bitter lesson?

(Content to be added)


References

  1. Welch Labs (2025). Why Deep Learning Works Unreasonably Well [How Models Learn Part 3] [YouTube Video].

  2. Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536.

  3. Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML) (pp. 807–814).

  4. Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.

  5. Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 30).

  6. Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.

  7. Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. In Advances in Neural Information Processing Systems (Vol. 31). Curran Associates, Inc.

  8. He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  9. Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2017). Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

  10. Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273–297.

  11. Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.

  12. Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding Deep Learning Requires Rethinking Generalization. In International Conference on Learning Representations (ICLR).

  13. Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A Convergence Theory for Deep Learning via Over-Parameterization. International Conference on Machine Learning (ICML).

  14. Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling Modern Machine Learning Practice and the Classical Bias–Variance Trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–15854.