In this post, I explore machine learning through four fundamental questions, viewed through the lens of manifold learning.
Q1: Why do deep neural networks work so well?
Before deep learning dominated, we relied on traditional methods like SVMs [Cortes & Vapnik, 1995], Random Forests [Breiman, 2001], and simple backpropagation networks [Rumelhart et al., 1986]. These models were small and simple, designed to find direct connections between inputs and outputs.
From a manifold learning perspective, the data and the input–output mappings we want to model lie on (or near) low-dimensional manifolds embedded in high-dimensional spaces. Traditional methods worked well for simpler problems, often ones with convex loss landscapes, but struggled with complex tasks such as image interpretation, where inputs have tens of thousands of dimensions.
Deep neural networks shift us from convex to non-convex optimization—harder to train, but capable of solving remarkably difficult problems in vision, language, and speech.
So why do deep neural networks work so well? First, the tasks we tackle possess genuine underlying patterns that learning algorithms can capture. Second, their non-linear activation functions, such as ReLU [Nair & Hinton, 2010], GELU [Hendrycks & Gimpel, 2016], SELU [Klambauer et al., 2017], and SwiGLU [Shazeer, 2020], enable the modeling of complex, non-linear relationships that are otherwise difficult for humans or linear models to capture. Third, advances in hardware, notably GPUs, allow massive parallel computation, making it feasible to train very large models efficiently. Finally, deep networks automatically learn hierarchical representations of data, finding effective structures that might be difficult for humans to articulate explicitly.
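To make the second point concrete, here is a minimal NumPy sketch of the four activations named above. The GELU uses the tanh approximation from the original paper, the SELU constants are the fixed values derived by Klambauer et al., and the projection matrices `W` and `V` in the SwiGLU example are placeholders I introduce purely for illustration.

```python
# Minimal sketch of the activations mentioned above; W, V are illustrative projections.
import numpy as np

def relu(x):
    # ReLU(x) = max(0, x)  [Nair & Hinton, 2010]
    return np.maximum(0.0, x)

def gelu(x):
    # GELU, tanh approximation  [Hendrycks & Gimpel, 2016]
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def selu(x, alpha=1.6732632423543772, scale=1.0507009873554805):
    # SELU with the fixed constants derived in [Klambauer et al., 2017]
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def swiglu(x, W, V):
    # SwiGLU(x) = Swish(xW) * (xV), with Swish(z) = z * sigmoid(z)  [Shazeer, 2020]
    a, b = x @ W, x @ V
    return (a / (1.0 + np.exp(-a))) * b

x = np.linspace(-3.0, 3.0, 7)
rng = np.random.default_rng(0)
W, V = rng.normal(size=(7, 4)), rng.normal(size=(7, 4))   # placeholder gate/value projections
print(relu(x))
print(gelu(x))
print(selu(x))
print(swiglu(x, W, V).shape)   # (4,)
```

The common thread is that each of these introduces a simple non-linearity that, stacked layer after layer, lets the network bend its decision surface in ways a purely linear model cannot.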
Another perspective on the success of Deep Neural Networks comes from rethinking generalization. The classical view of the bias-variance trade-off suggests a U-shaped risk curve, where increasing model capacity eventually leads to overfitting. However, modern deep learning operates in a regime where high-capacity models continue to generalize well, often described by the “double descent” curve.
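To make the double-descent picture concrete, here is a toy sketch of my own (not taken from any of the cited papers): minimum-norm least squares on random ReLU features. Sweeping the number of random features past the interpolation threshold, roughly the number of training samples, typically reproduces the characteristic spike and second descent in test error.

```python
# Toy double-descent sketch: minimum-norm least squares on random ReLU features.
# The data-generating process and every hyperparameter here are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 40, 500, 5
beta = rng.normal(size=d)                       # ground-truth coefficients for the toy target

def make_data(n, noise=0.1):
    X = rng.normal(size=(n, d))
    y = np.tanh(X @ beta) + noise * rng.normal(size=n)   # mildly non-linear target
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test, noise=0.0)

def relu_features(X, W):
    # Fixed random first layer with ReLU; only the linear readout is fitted.
    return np.maximum(0.0, X @ W)

for width in [5, 10, 20, 40, 80, 160, 640]:     # interpolation threshold near width = n_train
    W = rng.normal(size=(d, width)) / np.sqrt(d)
    Phi_tr, Phi_te = relu_features(X_tr, W), relu_features(X_te, W)
    theta = np.linalg.pinv(Phi_tr) @ y_tr       # minimum-norm least-squares readout
    train_mse = np.mean((Phi_tr @ theta - y_tr) ** 2)
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"width={width:4d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```

In runs like this, the test error typically spikes near the interpolation threshold and then falls again as the model keeps growing, tracing the second descent of the curve.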
This phenomenon is explored in [Zhang et al., 2017]. In their paper, they prove:
“There exists a two-layer neural network with ReLU activations and 2n + d weights that can represent any function on a sample of size n in d dimensions.”
Observations from [Zhang et al., 2017]
- The effective capacity of neural networks is sufficient for memorizing the entire data set.
- Even optimization on random labels remains easy. In fact, training time increases only by a small constant factor compared with training on the true labels (see the sketch after this list).
- Randomizing labels is solely a data transformation, leaving all other properties of the learning problem unchanged.
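As a miniature version of these observations (my own toy setup, nothing like the CIFAR-scale experiments in the paper), the following PyTorch sketch fits a small MLP to completely random labels:

```python
# Miniature memorization experiment (illustrative setup, not the paper's):
# a small MLP fitting labels that carry no information about the inputs.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, k = 256, 32, 10
X = torch.randn(n, d)
y_random = torch.randint(0, k, (n,))            # purely random labels

model = nn.Sequential(nn.Linear(d, 512), nn.ReLU(), nn.Linear(512, k))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(2000):
    opt.zero_grad()
    loss = loss_fn(model(X), y_random)
    loss.backward()
    opt.step()
    if step % 500 == 0 or step == 1999:
        with torch.no_grad():
            acc = (model(X).argmax(dim=1) == y_random).float().mean().item()
        print(f"step {step:4d}  loss {loss.item():.4f}  train acc {acc:.2f}")
# With enough width and steps, training accuracy typically reaches ~1.0
# even though the labels are pure noise and nothing can generalize.
```

The point is the same as in the paper: the optimization problem barely notices that the labels are meaningless, so whatever makes deep networks generalize cannot be explained by limited capacity alone.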
Follow-up theoretical work has sought to explain why simple algorithms like SGD can navigate the highly non-convex loss landscapes of deep neural networks and find global minima. A significant breakthrough came from [Allen-Zhu et al., 2019], who provided a convergence theory for deep learning based on over-parameterization.
Convergence Theory for Deep Learning [Allen-Zhu et al., 2019]
- Non-degenerate inputs: No two data points are identical.
- Over-parameterization: The network width is sufficiently large (polynomial in the number of layers \(L\) and samples \(n\)).
Their analysis reveals two properties of the optimization landscape near random initialization:
- Almost-Convexity: In a sufficiently large neighborhood of random initialization, the landscape behaves similarly to a convex function.
- Semi-Smoothness: Despite the non-smoothness of ReLU activations, the objective function is “semi-smooth,” allowing gradient-based methods to succeed.
This theory also establishes a formal equivalence between sufficiently over-parameterized neural networks and the Neural Tangent Kernel (NTK), bridging the gap between deep learning and kernel methods in specific regimes.
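To make the NTK connection concrete, here is a small sketch of the empirical NTK: the Gram matrix whose entries are inner products of per-example parameter gradients. The network, its width, and the inputs are placeholders chosen only for illustration; in the heavily over-parameterized regime described above, this kernel is predicted to stay nearly constant throughout training.

```python
# Empirical NTK sketch: K[i, j] = <grad_theta f(x_i), grad_theta f(x_j)>.
# The network, its width, and the inputs are placeholders for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 256), nn.ReLU(), nn.Linear(256, 1))
params = list(net.parameters())
X = torch.randn(16, 8)

def param_grad(x):
    # Gradient of the scalar output f(x; theta) w.r.t. all parameters, flattened.
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

J = torch.stack([param_grad(x) for x in X])     # (n_samples, n_params)
K = J @ J.T                                     # empirical NTK Gram matrix
print(K.shape)                                  # torch.Size([16, 16])
print(torch.linalg.eigvalsh(K)[:3])             # its smallest eigenvalues
```

The smallest eigenvalue of this Gram matrix is exactly the kind of quantity that over-parameterized convergence analyses track, and it is where the non-degenerate-inputs assumption enters: well-separated data keeps the kernel well conditioned.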
Q2: Why do state-of-the-art models still show far less generalization ability than human beings?
(Content to be added)
Q3: Is it possible to build deep neural networks that solve real-world inversion problems?
(Content to be added)
Q4: Are we still suffering from the bitter lessons?
(Content to be added)
References
- Welch Labs (2025). Why Deep Learning Works Unreasonably Well [How Models Learn Part 3] [YouTube video].
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
- Nair, V., & Hinton, G. E. (2010). Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on Machine Learning (ICML) (pp. 807–814).
- Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.
- Klambauer, G., Unterthiner, T., Mayr, A., & Hochreiter, S. (2017). Self-Normalizing Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 30).
- Shazeer, N. (2020). GLU Variants Improve Transformer. arXiv preprint arXiv:2002.05202.
- Li, H., Xu, Z., Taylor, G., Studer, C., & Goldstein, T. (2018). Visualizing the Loss Landscape of Neural Nets. In Advances in Neural Information Processing Systems (NeurIPS) (Vol. 31).
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Huang, G., Liu, Z., Weinberger, K. Q., & van der Maaten, L. (2017). Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- Cortes, C., & Vapnik, V. (1995). Support-Vector Networks. Machine Learning, 20(3), 273–297.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
- Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2017). Understanding Deep Learning Requires Rethinking Generalization. In International Conference on Learning Representations (ICLR).
- Allen-Zhu, Z., Li, Y., & Song, Z. (2019). A Convergence Theory for Deep Learning via Over-Parameterization. In International Conference on Machine Learning (ICML).