Deep Linear Networks

Rathindra Nath Karmakar

Session 5: Open Problems

Recall: The Deep Linear Network (DLN) Setup

We study a deep linear network that produces an end-to-end matrix $W$ by multiplying $N$ factor matrices:

$$W = \phi(\mathbf{W}) = W_N W_{N-1} \cdots W_1$$

Training dynamics are modeled by gradient flow of a loss $E(W)$ (e.g., squared error for matrix completion), pulled back through $\phi$ to the factor matrices $\mathbf{W} = (W_N, \dots, W_1)$:

$$ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} E\big(\phi(\mathbf{W}(t))\big) $$

The key insight is that under balanced initialization, this flow is a Riemannian gradient flow on the "downstairs" manifold of end-to-end matrices $(\mathcal{M}_d, g^N)$:

$$ \frac{dW}{dt} = -\mathrm{grad}_{g^N} E(W) $$

The geometry is encoded by the metric $g^N$, which depends on the network depth $N$ and the current matrix $W$.
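As a concrete reference point, here is a minimal numerical sketch of this setup (our own illustration in plain NumPy: forward-Euler gradient descent stands in for the exact flow, the factors are square $d \times d$, and helper names such as `balanced_init` and `train` are ours, not from the literature). It builds a balanced factorization via the SVD, descends the pulled-back loss, and checks that the balancedness defects stay numerically small:

```python
import numpy as np

def balanced_init(W0, N, rng):
    """Balanced factorization W0 = W_N ... W_1 (square case):
    W_p = U_p S^(1/N) U_{p-1}^T with U_0 = V, U_N = U from W0 = U S V^T,
    and random orthogonal U_1, ..., U_{N-1}. Then
    W_{p+1}^T W_{p+1} = W_p W_p^T = U_p S^(2/N) U_p^T, i.e. balanced."""
    U, s, Vt = np.linalg.svd(W0)
    frames = [Vt.T] + [np.linalg.qr(rng.standard_normal(W0.shape))[0]
                       for _ in range(N - 1)] + [U]
    return [frames[p + 1] @ np.diag(s ** (1.0 / N)) @ frames[p].T
            for p in range(N)]

def prod(ws, d):
    """Product W_k ... W_1 of a (possibly empty) list stored as [W_1, ..., W_k]."""
    out = np.eye(d)
    for w in ws:
        out = w @ out
    return out

def train(Ws, grad_E, lr=1e-3, steps=50_000):
    """Forward-Euler discretization of dW_p/dt = -dE/dW_p, where by the chain rule
    dE/dW_p = (W_N ... W_{p+1})^T grad_E(W) (W_{p-1} ... W_1)^T."""
    d, N = Ws[0].shape[0], len(Ws)
    for _ in range(steps):
        g = grad_E(prod(Ws, d))
        Ws = [Ws[p] - lr * prod(Ws[p + 1:], d).T @ g @ prod(Ws[:p], d).T
              for p in range(N)]
    return Ws

rng = np.random.default_rng(0)
d, N = 2, 3
A = rng.standard_normal((d, d))                      # target for E(W) = 0.5*||W - A||^2
Ws = balanced_init(0.1 * rng.standard_normal((d, d)), N, rng)
Ws = train(Ws, grad_E=lambda W: W - A)
print("loss:", 0.5 * np.sum((prod(Ws, d) - A) ** 2))  # ~0
print("defects:", [np.linalg.norm(Ws[p + 1].T @ Ws[p + 1] - Ws[p] @ Ws[p].T)
                   for p in range(N - 1)])            # ~0 up to O(lr) discretization error
```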

A Landscape of Open Questions

The geometric theory of the DLN uncovers deep and challenging questions at the intersection of dynamical systems, geometry, and machine learning. We will explore:

Problem 1: Convergence to a Low-Rank Matrix

The Problem

For matrix completion, even in simple cases, the dynamics show a strong preference for low-rank solutions. Can we prove this rigorously?

Example: Consider a $2 \times 2$ matrix completion task where we observe $W_{11} = 1$ and $W_{22} = 1$. The energy is $E(W) = \frac{1}{2}((W_{11}-1)^2 + (W_{22}-1)^2)$.

The set of zero-energy solutions consists of all matrices of the form $W = \begin{pmatrix} 1 & a \\ b & 1 \end{pmatrix}$.

The rank-1 solutions in this set satisfy $\det W = 1-ab=0$, i.e., they form the hyperbola $b=1/a$.

Question: Can one establish that the solution $W(t)$ to the Riemannian gradient flow converges to a rank-1 minimizer as $t \to \infty$?
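A quick numerical check of this preference (a sketch, not a proof; forward-Euler in place of the exact flow, with the same SVD-based balanced initialization as in the setup sketch above; timescales depend on the initialization scale):

```python
import numpy as np

rng = np.random.default_rng(1)
N, lr, steps = 3, 1e-2, 200_000

# Small balanced initialization: W_p = U_p S^(1/N) U_{p-1}^T with W0 = U S V^T.
W0 = 0.01 * rng.standard_normal((2, 2))
U, s, Vt = np.linalg.svd(W0)
frames = [Vt.T] + [np.linalg.qr(rng.standard_normal((2, 2)))[0]
                   for _ in range(N - 1)] + [U]
Ws = [frames[p + 1] @ np.diag(s ** (1 / N)) @ frames[p].T for p in range(N)]

def prod(ws):
    out = np.eye(2)
    for w in ws:
        out = w @ out
    return out

for _ in range(steps):
    W = prod(Ws)
    g = np.diag(np.diag(W - np.eye(2)))   # grad E: only W11 and W22 are observed
    Ws = [Ws[p] - lr * prod(Ws[p + 1:]).T @ g @ prod(Ws[:p]).T for p in range(N)]

W = prod(Ws)
print("W:\n", W)
print("E:", 0.5 * ((W[0, 0] - 1) ** 2 + (W[1, 1] - 1) ** 2))  # ~0
print("det W:", np.linalg.det(W))   # drifts toward 0: the rank-1 branch
print("a*b:", W[0, 1] * W[1, 0])    # correspondingly drifts toward 1 (b = 1/a)
```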


Recent Progress & Directions

Reference: Baekrok Shin and Chulhee Yun, “Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank Solutions,” ICML 2025 Workshop on High-dimensional Learning Dynamics (HiLD), 2025. https://openreview.net/forum?id=MzGS2mFn8N

Problem 2: Convergence to Other Minimizers

The Problem

The zero-energy set from Problem 1 also contains full-rank matrices (e.g., the identity matrix). Is it possible for the gradient flow to converge to these solutions?

Question: Are there initial conditions $W(0)$ with positive energy ($E(W(0)) > 0$) for which the solution $W(t)$ converges to a full-rank minimizer of $E$? Or is the bias towards low rank absolute?
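One elementary remark (ours, and it covers only a measure-zero set of initializations): if every factor is initialized as $cI$ with $c > 0$, $c \neq 1$, the initialization is both balanced and diagonal, the flow preserves diagonality, and each diagonal entry follows a scalar DLN converging monotonically to 1. So $W(t) \to I$, a full-rank minimizer, from positive initial energy. The open question is therefore really about generic initializations. A sketch:

```python
import numpy as np

N, c, lr, steps = 3, 0.5, 1e-2, 50_000
Ws = [c * np.eye(2) for _ in range(N)]    # balanced (W_{p+1}^T W_{p+1} = W_p W_p^T) and diagonal

def prod(ws):
    out = np.eye(2)
    for w in ws:
        out = w @ out
    return out

for _ in range(steps):
    W = prod(Ws)
    g = np.diag(np.diag(W - np.eye(2)))   # same energy as in Problem 1
    Ws = [Ws[p] - lr * prod(Ws[p + 1:]).T @ g @ prod(Ws[:p]).T for p in range(N)]

print(prod(Ws))   # approaches the identity: a full-rank minimizer, from E(W(0)) > 0
```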


Recent Progress & Directions

Reference: Baekrok Shin and Chulhee Yun, “Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank Solutions,” ICML 2025 Workshop on High-dimensional Learning Dynamics (HiLD), 2025. https://openreview.net/forum?id=MzGS2mFn8N

Problem 3: The Free Energy Landscape

The Problem

Adding entropy to the energy gives the free energy $F_{\beta}(W) = E(W) - \frac{1}{\beta}S(W)$, which governs stochastic dynamics. The entropy $S(W) = \log \mathrm{vol}(O_W)$ depends on the volume of the set $O_W$ of factorizations of $W$. What does this landscape look like?

Question: What is the structure of the set of free energy minimizers, $S_{\beta} = \operatorname{argmin}_W F_{\beta}(W)$? How does this set change as the inverse temperature $\beta$ (noise level) and the depth $N$ vary?
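A one-line heuristic for where $F_\beta$ comes from (ignoring Jacobian factors from the co-area formula): if the stochastic dynamics upstairs equilibrate to the Gibbs measure $\propto e^{-\beta E(\phi(\mathbf{W}))}$ on the factors, the marginal density of $W = \phi(\mathbf{W})$ is obtained by integrating over the fiber $O_W$, on which $E(\phi(\mathbf{W})) = E(W)$ is constant:

$$ \pi_\beta(W) \propto \int_{O_W} e^{-\beta E(\phi(\mathbf{W}))}\, d\mathrm{vol} = e^{-\beta E(W)}\, \mathrm{vol}(O_W) = e^{-\beta F_\beta(W)}. $$

Minimizers of $F_\beta$ are thus the most likely end-to-end matrices at noise level $1/\beta$.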


Problem 4: The Infinite Depth Limit

The Problem

In the limit $N \to \infty$, the Riemannian metric $g^N$ converges to a limit $g^\infty$ that depends only on the "downstairs" matrix $W$. The "upstairs" space of factor matrices, which is central to the finite-depth theory, seems to vanish from the equations.

Question: Is there a limiting mathematical framework for Riemannian submersion as $N \to \infty$? What is the correct way to think about the geometry when the fiber space of factorizations disappears?
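One concrete way to see how the limit arises (a sketch; the finite-$N$ formula is the end-to-end dynamics derived by Arora, Cohen, and Hazan, ICML 2018, under balanced initialization): downstairs,

$$ \frac{dW}{dt} = -\sum_{j=1}^{N} (W W^T)^{\frac{j-1}{N}}\, \nabla E(W)\, (W^T W)^{\frac{N-j}{N}}. $$

After rescaling time by $1/N$, the sum is a Riemann sum, and as $N \to \infty$,

$$ \frac{dW}{d\tau} = -\int_0^1 (W W^T)^{s}\, \nabla E(W)\, (W^T W)^{1-s}\, ds. $$

The right-hand side is expressed entirely in terms of $W$: the factor matrices, and with them the fibers of the submersion, have dropped out of the equation.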


Recent Progress & Directions

Reference: Z. Veraszto, “The Deep Linear Network—Dynamics, Riemannian Geometry and Overparametrization,” Ph.D. thesis, Brown University, 2023.

Problems 5 & 6: The Low-Rank Setting

The Problem

Most of the rigorous results for the DLN (e.g., Riemannian submersion, RLE) rely on the matrix $W$ being full-rank. This ensures the manifold is smooth. However, the dynamics are empirically attracted to low-rank matrices, where these assumptions break down.

Question 5: How can the theories of Riemannian submersion and the Riemannian Langevin Equation (RLE) be extended to handle the singularities at low-rank matrices?

Question 6: What is the geometric structure of the $G$-balanced varieties $M_G$ at the singular points corresponding to low-rank matrices?
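To see concretely what goes wrong at low rank, here is a small numerical sketch (ours, built on the balanced end-to-end dynamics quoted in Problem 4): the operator that converts $\nabla E$ into $\dot W$ annihilates directions $u v^T$ with $u \in \ker(W^T)$ and $v \in \ker(W)$, so the geometry degenerates exactly at rank-deficient $W$:

```python
import numpy as np

def psd_power(S, a):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(S)
    return (Q * np.clip(lam, 0.0, None) ** a) @ Q.T

def A_W(W, M, N):
    """A_W(M) = sum_{j=1}^N (W W^T)^((j-1)/N) M (W^T W)^((N-j)/N):
    the balanced-DLN map taking nabla E to dW/dt (up to sign)."""
    WWt, WtW = W @ W.T, W.T @ W
    return sum(psd_power(WWt, (j - 1) / N) @ M @ psd_power(WtW, (N - j) / N)
               for j in range(1, N + 1))

N = 3
W_rank1 = np.outer([1.0, 0.5], [2.0, 1.0])   # rank-deficient
W_full = W_rank1 + 0.3 * np.eye(2)           # full rank

u = np.array([-0.5, 1.0]); u /= np.linalg.norm(u)   # spans ker(W_rank1^T)
v = np.array([1.0, -2.0]); v /= np.linalg.norm(v)   # spans ker(W_rank1)
M = np.outer(u, v)

print(np.linalg.norm(A_W(W_rank1, M, N)))   # ~1e-16: a null direction at low rank
print(np.linalg.norm(A_W(W_full, M, N)))    # order one: no degeneracy at full rank
```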


Recent Progress & Directions

Reference: T. Yu, S. Zheng, J. Lu, G. Menon, X. Zhang, “Riemannian Langevin Monte Carlo schemes for sampling PSD matrices with fixed rank,” arXiv preprint, 2025. https://arxiv.org/pdf/2309.04072

Problem 7: Motion by Curvature towards Balance

The Problem

The $G$-balanced varieties $M_G$ (where $W_{p+1}^T W_{p+1} - W_p W_p^T = G_p \neq 0$) are invariant under the deterministic gradient flow. However, noise should break this invariance. The "most balanced" state is $M_0$ (where $G_p=0$).

Conjecture: A stochastic flow (RLE) with noise tangential to the fibers, when started on an unbalanced variety $M_G$, will be driven towards the balanced variety $M_0$.


Recent Progress & Directions

Reference: Ching-Peng Huang, Dominik Inauen, and Govind Menon, “Motion by mean curvature and Dyson Brownian Motion,” Electronic Communications in Probability 28 (2023), 1–10. https://doi.org/10.1214/23-ECP540

Problem 8: Gradient and Hamiltonian Structures

The Problem

The DLN training dynamics are a gradient flow, which dissipates energy. However, they also possess conserved quantities (the matrices $G_p$, constant on each $M_G$), a hallmark of energy-preserving Hamiltonian systems.

Question: Does the DLN possess a hidden, complementary Hamiltonian structure? This would place it in a rare class of systems that are both gradient-like and completely integrable.
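For reference, the conservation itself is a two-line computation (standard in the DLN literature, included here for completeness): the flow on the factors is $\dot W_p = -(W_N \cdots W_{p+1})^T\, \nabla E(W)\, (W_{p-1} \cdots W_1)^T$, and hence

$$ W_{p+1}^T \dot W_{p+1} = -\left(W_N \cdots W_{p+1}\right)^T \nabla E(W) \left(W_p \cdots W_1\right)^T = \dot W_p\, W_p^T. $$

Adding each side to its transpose gives $\frac{d}{dt}\big(W_{p+1}^T W_{p+1}\big) = \frac{d}{dt}\big(W_p W_p^T\big)$, so every $G_p$ is constant along the flow.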


Recent Progress & Directions

Reference: A. M. Bloch, R. W. Brockett, T. S. Ratiu, “Completely integrable gradient flows,” Communications in Mathematical Physics 147 (1992), 57-74. https://link.springer.com/article/10.1007/BF02099528

Reference: S. Marcotte, R. Gribonval, G. Peyré, “Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows,” NeurIPS 36 (2024). https://openreview.net/forum?id=kMueEV8Eyy

Problem 9: Rigorous Theory for "Jumps in Rank"

The Problem

Numerical experiments show that the effective rank of $W(t)$ often plateaus for long periods and then drops suddenly. This mirrors the "scale-by-scale" fitting observed in practical deep learning.

Question: Can this phenomenon of "sudden jumps" in rank be rigorously formulated and proven?
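Even stating the phenomenon requires a choice of rank surrogate. A hedged sketch (ours; "effective rank" here is the entropy-based measure $\mathrm{erank}(W) = \exp\big(-\sum_i p_i \log p_i\big)$ with $p_i = \sigma_i / \sum_j \sigma_j$, one common choice among several): train a depth-3 factorization of a rank-2 target from a small balanced start and log the effective rank. In this full-observation variant the effective rank changes in near-discrete steps separated by long plateaus, one singular value at a time; in completion problems the analogous plateaus and sudden drops are exactly what the problem asks to formalize.

```python
import numpy as np

def effective_rank(W):
    """Entropy-based effective rank: exp(H(p)), p_i = sigma_i / sum_j sigma_j."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d, N, lr = 4, 3, 5e-3
U = np.linalg.qr(rng.standard_normal((d, d)))[0]
V = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = U @ np.diag([3.0, 1.0, 0.0, 0.0]) @ V.T   # rank-2 target, sigma = (3, 1)

Ws = [0.01 * np.eye(d) for _ in range(N)]     # small balanced initialization

def prod(ws):
    out = np.eye(d)
    for w in ws:
        out = w @ out
    return out

for step in range(100_001):
    W = prod(Ws)
    g = W - A                                  # grad of E(W) = 0.5 * ||W - A||_F^2
    Ws = [Ws[p] - lr * prod(Ws[p + 1:]).T @ g @ prod(Ws[:p]).T for p in range(N)]
    if step % 5000 == 0:
        print(step, "E =", round(0.5 * np.sum(g ** 2), 5),
              "erank =", round(effective_rank(W), 3))
```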

Thank You!

Any Questions?