Deep Linear Networks

Rathindra Nath Karmakar

Session 5: Open Problems

Recall: The Deep Linear Network (DLN) Setup

We study a deep linear network that produces an end-to-end matrix $W$ by multiplying $N$ factor matrices:

$$W = \phi(\mathbf{W}) = W_N W_{N-1} \cdots W_1$$

Training dynamics are modeled by gradient flow of a loss $E(W)$ (e.g., squared error for matrix completion), pulled back through $\phi$ to the factor matrices $\mathbf{W} = (W_N, \dots, W_1)$:

$$ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} E\big(\phi(\mathbf{W}(t))\big) $$

The key insight is that under balanced initialization, this flow is a Riemannian gradient flow on the "downstairs" manifold of end-to-end matrices $(\mathcal{M}_d, g^N)$:

$$ \frac{dW}{dt} = -\mathrm{grad}_{g^N} E(W) $$

The geometry is encoded by the metric $g^N$, which depends on the network depth $N$ and the current matrix $W$.
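As a concrete reference point, here is a minimal numerical sketch of this setup (our own illustration in plain NumPy: forward-Euler gradient descent stands in for the exact flow, the factors are square $d \times d$, and helper names such as `balanced_init` and `train` are ours, not from the literature). It builds a balanced factorization via the SVD, descends the pulled-back loss, and checks that the balancedness defects stay numerically small:

```python
import numpy as np

def balanced_init(W0, N, rng):
    """Balanced factorization W0 = W_N ... W_1 (square case):
    W_p = U_p S^(1/N) U_{p-1}^T with U_0 = V, U_N = U from W0 = U S V^T,
    and random orthogonal U_1, ..., U_{N-1}. Then
    W_{p+1}^T W_{p+1} = W_p W_p^T = U_p S^(2/N) U_p^T, i.e. balanced."""
    U, s, Vt = np.linalg.svd(W0)
    frames = [Vt.T] + [np.linalg.qr(rng.standard_normal(W0.shape))[0]
                       for _ in range(N - 1)] + [U]
    return [frames[p + 1] @ np.diag(s ** (1.0 / N)) @ frames[p].T
            for p in range(N)]

def prod(ws, d):
    """Product W_k ... W_1 of a (possibly empty) list stored as [W_1, ..., W_k]."""
    out = np.eye(d)
    for w in ws:
        out = w @ out
    return out

def train(Ws, grad_E, lr=1e-3, steps=50_000):
    """Forward-Euler discretization of dW_p/dt = -dE/dW_p, where by the chain rule
    dE/dW_p = (W_N ... W_{p+1})^T grad_E(W) (W_{p-1} ... W_1)^T."""
    d, N = Ws[0].shape[0], len(Ws)
    for _ in range(steps):
        g = grad_E(prod(Ws, d))
        Ws = [Ws[p] - lr * prod(Ws[p + 1:], d).T @ g @ prod(Ws[:p], d).T
              for p in range(N)]
    return Ws

rng = np.random.default_rng(0)
d, N = 2, 3
A = rng.standard_normal((d, d))                      # target for E(W) = 0.5*||W - A||^2
Ws = balanced_init(0.1 * rng.standard_normal((d, d)), N, rng)
Ws = train(Ws, grad_E=lambda W: W - A)
print("loss:", 0.5 * np.sum((prod(Ws, d) - A) ** 2))  # ~0
print("defects:", [np.linalg.norm(Ws[p + 1].T @ Ws[p + 1] - Ws[p] @ Ws[p].T)
                   for p in range(N - 1)])            # ~0 up to O(lr) discretization error
```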

A Landscape of Open Questions

The geometric theory of the DLN uncovers deep and challenging questions at the intersection of dynamical systems, geometry, and machine learning. We will explore:

Problem 1: Convergence to a Low-Rank Matrix

The Problem

For matrix completion, even in simple cases, the dynamics show a strong preference for low-rank solutions. Can we prove this rigorously?

Example: Consider a $2 \times 2$ matrix completion task where we observe $W_{11} = 1$ and $W_{22} = 1$. The energy is $E(W) = \frac{1}{2}((W_{11}-1)^2 + (W_{22}-1)^2)$.

The set of zero-energy solutions consists of all matrices of the form $W = \begin{pmatrix} 1 & a \\ b & 1 \end{pmatrix}$.

The rank-1 solutions in this set satisfy $\det W = 1-ab=0$, i.e., they form the hyperbola $b=1/a$.

Question: Can one establish that the solution $W(t)$ to the Riemannian gradient flow converges to a rank-1 minimizer as $t \to \infty$?
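A quick numerical check of this preference (a sketch, not a proof; forward-Euler in place of the exact flow, with the same SVD-based balanced initialization as in the setup sketch above; timescales depend on the initialization scale):

```python
import numpy as np

rng = np.random.default_rng(1)
N, lr, steps = 3, 1e-2, 200_000

# Small balanced initialization: W_p = U_p S^(1/N) U_{p-1}^T with W0 = U S V^T.
W0 = 0.01 * rng.standard_normal((2, 2))
U, s, Vt = np.linalg.svd(W0)
frames = [Vt.T] + [np.linalg.qr(rng.standard_normal((2, 2)))[0]
                   for _ in range(N - 1)] + [U]
Ws = [frames[p + 1] @ np.diag(s ** (1 / N)) @ frames[p].T for p in range(N)]

def prod(ws):
    out = np.eye(2)
    for w in ws:
        out = w @ out
    return out

for _ in range(steps):
    W = prod(Ws)
    g = np.diag(np.diag(W - np.eye(2)))   # grad E: only W11 and W22 are observed
    Ws = [Ws[p] - lr * prod(Ws[p + 1:]).T @ g @ prod(Ws[:p]).T for p in range(N)]

W = prod(Ws)
print("W:\n", W)
print("E:", 0.5 * ((W[0, 0] - 1) ** 2 + (W[1, 1] - 1) ** 2))  # ~0
print("det W:", np.linalg.det(W))   # drifts toward 0: the rank-1 branch
print("a*b:", W[0, 1] * W[1, 0])    # correspondingly drifts toward 1 (b = 1/a)
```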


Recent Progress & Directions

Reference: Baekrok Shin and Chulhee Yun, “Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank Solutions,” ICML 2025 Workshop on High-dimensional Learning Dynamics (HiLD), 2025. https://openreview.net/forum?id=MzGS2mFn8N

Problem 2: Convergence to Other Minimizers

The Problem

The zero-energy set from Problem 1 also contains full-rank matrices (e.g., the identity matrix). Is it possible for the gradient flow to converge to these solutions?

Question: Are there initial conditions $W(0)$ with positive energy ($E(W(0)) > 0$) for which the solution $W(t)$ converges to a full-rank minimizer of $E$? Or is the bias towards low rank absolute?
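One elementary remark (ours, and it covers only a measure-zero set of initializations): if every factor is initialized as $cI$ with $c > 0$, $c \neq 1$, the initialization is both balanced and diagonal, the flow preserves diagonality, and each diagonal entry follows a scalar DLN converging monotonically to 1. So $W(t) \to I$, a full-rank minimizer, from positive initial energy. The open question is therefore really about generic initializations. A sketch:

```python
import numpy as np

N, c, lr, steps = 3, 0.5, 1e-2, 50_000
Ws = [c * np.eye(2) for _ in range(N)]    # balanced (W_{p+1}^T W_{p+1} = W_p W_p^T) and diagonal

def prod(ws):
    out = np.eye(2)
    for w in ws:
        out = w @ out
    return out

for _ in range(steps):
    W = prod(Ws)
    g = np.diag(np.diag(W - np.eye(2)))   # same energy as in Problem 1
    Ws = [Ws[p] - lr * prod(Ws[p + 1:]).T @ g @ prod(Ws[:p]).T for p in range(N)]

print(prod(Ws))   # approaches the identity: a full-rank minimizer, from E(W(0)) > 0
```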


Recent Progress & Directions

Reference: Baekrok Shin and Chulhee Yun, “Implicit Bias and Loss of Plasticity in Matrix Completion: Depth Promotes Low-Rank Solutions,” ICML 2025 Workshop on High-dimensional Learning Dynamics (HiLD), 2025. https://openreview.net/forum?id=MzGS2mFn8N

Problem 3: The Free Energy Landscape

The Problem

Adding entropy to the energy gives the free energy $F_{\beta}(W) = E(W) - \frac{1}{\beta}S(W)$, which governs stochastic dynamics. The entropy $S(W) = \log \mathrm{vol}(O_W)$ depends on the volume of the set $O_W$ of factorizations of $W$. What does this landscape look like?

Question: What is the structure of the set of free energy minimizers, $S_{\beta} = \operatorname{argmin}_W F_{\beta}(W)$? How does this set change as the inverse temperature $\beta$ (noise level) and the depth $N$ vary?
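A one-line heuristic for where $F_\beta$ comes from (ignoring Jacobian factors from the co-area formula): if the stochastic dynamics upstairs equilibrate to the Gibbs measure $\propto e^{-\beta E(\phi(\mathbf{W}))}$ on the factors, the marginal density of $W = \phi(\mathbf{W})$ is obtained by integrating over the fiber $O_W$, on which $E(\phi(\mathbf{W})) = E(W)$ is constant:

$$ \pi_\beta(W) \propto \int_{O_W} e^{-\beta E(\phi(\mathbf{W}))}\, d\mathrm{vol} = e^{-\beta E(W)}\, \mathrm{vol}(O_W) = e^{-\beta F_\beta(W)}. $$

Minimizers of $F_\beta$ are thus the most likely end-to-end matrices at noise level $1/\beta$.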


Problem 4: The Infinite Depth Limit

The Problem

In the limit $N \to \infty$, the Riemannian metric $g^N$ converges to a limit $g^\infty$ that depends only on the "downstairs" matrix $W$. The "upstairs" space of factor matrices, which is central to the finite-depth theory, seems to vanish from the equations.

Question: Is there a limiting mathematical framework for Riemannian submersion as $N \to \infty$? What is the correct way to think about the geometry when the fiber space of factorizations disappears?
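One concrete way to see how the limit arises (a sketch; the finite-$N$ formula is the end-to-end dynamics derived by Arora, Cohen, and Hazan, ICML 2018, under balanced initialization): downstairs,

$$ \frac{dW}{dt} = -\sum_{j=1}^{N} (W W^T)^{\frac{j-1}{N}}\, \nabla E(W)\, (W^T W)^{\frac{N-j}{N}}. $$

After rescaling time by $1/N$, the sum is a Riemann sum, and as $N \to \infty$,

$$ \frac{dW}{d\tau} = -\int_0^1 (W W^T)^{s}\, \nabla E(W)\, (W^T W)^{1-s}\, ds. $$

The right-hand side is expressed entirely in terms of $W$: the factor matrices, and with them the fibers of the submersion, have dropped out of the equation.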


Recent Progress & Directions

Reference: Z. Veraszto, “The Deep Linear Network—Dynamics, Riemannian Geometry and Overparametrization,” Ph.D. thesis, Brown University, 2023.

Problems 5 & 6: The Low-Rank Setting

The Problem

Most of the rigorous results for the DLN (e.g., Riemannian submersion, RLE) rely on the matrix $W$ being full-rank. This ensures the manifold is smooth. However, the dynamics are empirically attracted to low-rank matrices, where these assumptions break down.

Question 5: How can the theories of Riemannian submersion and the Riemannian Langevin Equation (RLE) be extended to handle the singularities at low-rank matrices?

Question 6: What is the geometric structure of the $G$-balanced varieties $M_G$ at the singular points corresponding to low-rank matrices?
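To see concretely what goes wrong at low rank, here is a small numerical sketch (ours, built on the balanced end-to-end dynamics quoted in Problem 4): the operator that converts $\nabla E$ into $\dot W$ annihilates directions $u v^T$ with $u \in \ker(W^T)$ and $v \in \ker(W)$, so the geometry degenerates exactly at rank-deficient $W$:

```python
import numpy as np

def psd_power(S, a):
    """Fractional power of a symmetric PSD matrix via eigendecomposition."""
    lam, Q = np.linalg.eigh(S)
    return (Q * np.clip(lam, 0.0, None) ** a) @ Q.T

def A_W(W, M, N):
    """A_W(M) = sum_{j=1}^N (W W^T)^((j-1)/N) M (W^T W)^((N-j)/N):
    the balanced-DLN map taking nabla E to dW/dt (up to sign)."""
    WWt, WtW = W @ W.T, W.T @ W
    return sum(psd_power(WWt, (j - 1) / N) @ M @ psd_power(WtW, (N - j) / N)
               for j in range(1, N + 1))

N = 3
W_rank1 = np.outer([1.0, 0.5], [2.0, 1.0])   # rank-deficient
W_full = W_rank1 + 0.3 * np.eye(2)           # full rank

u = np.array([-0.5, 1.0]); u /= np.linalg.norm(u)   # spans ker(W_rank1^T)
v = np.array([1.0, -2.0]); v /= np.linalg.norm(v)   # spans ker(W_rank1)
M = np.outer(u, v)

print(np.linalg.norm(A_W(W_rank1, M, N)))   # ~1e-16: a null direction at low rank
print(np.linalg.norm(A_W(W_full, M, N)))    # order one: no degeneracy at full rank
```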


Recent Progress & Directions

Reference: T. Yu, S. Zheng, J. Lu, G. Menon, X. Zhang, “Riemannian Langevin Monte Carlo schemes for sampling PSD matrices with fixed rank,” arXiv preprint, 2025. https://arxiv.org/pdf/2309.04072

Problem 7: Motion by Curvature towards Balance

The Problem

The $G$-balanced varieties $M_G$ (where $W_{p+1}^T W_{p+1} - W_p W_p^T = G_p \neq 0$) are invariant under the deterministic gradient flow. However, noise should break this invariance. The "most balanced" state is $M_0$ (where $G_p=0$).

Conjecture: A stochastic flow (RLE) with noise tangential to the fibers, when started on an unbalanced variety $M_G$, will be driven towards the balanced variety $M_0$.


Recent Progress & Directions

Reference: Ching-Peng Huang, Dominik Inauen, and Govind Menon, “Motion by mean curvature and Dyson Brownian Motion,” Electronic Communications in Probability 28 (2023), 1–10. https://doi.org/10.1214/23-ECP540

Problem 8: Gradient and Hamiltonian Structures

The Problem

The DLN training dynamics are a gradient flow, which dissipates energy. However, they also possess conserved quantities (the matrices $G_p$, constant on each $M_G$), a hallmark of energy-preserving Hamiltonian systems.

Question: Does the DLN possess a hidden, complementary Hamiltonian structure? This would place it in a rare class of systems that are both gradient-like and completely integrable.
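For reference, the conservation itself is a two-line computation (standard in the DLN literature, included here for completeness): the flow on the factors is $\dot W_p = -(W_N \cdots W_{p+1})^T\, \nabla E(W)\, (W_{p-1} \cdots W_1)^T$, and hence

$$ W_{p+1}^T \dot W_{p+1} = -\left(W_N \cdots W_{p+1}\right)^T \nabla E(W) \left(W_p \cdots W_1\right)^T = \dot W_p\, W_p^T. $$

Adding each side to its transpose gives $\frac{d}{dt}\big(W_{p+1}^T W_{p+1}\big) = \frac{d}{dt}\big(W_p W_p^T\big)$, so every $G_p$ is constant along the flow.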


Recent Progress & Directions

Reference: A. M. Bloch, R. W. Brockett, T. S. Ratiu, “Completely integrable gradient flows,” Communications in Mathematical Physics 147 (1992), 57-74. https://link.springer.com/article/10.1007/BF02099528

Reference: S. Marcotte, R. Gribonval, G. Peyré, “Abide by the Law and Follow the Flow: Conservation Laws for Gradient Flows,” NeurIPS 36 (2024). https://openreview.net/forum?id=kMueEV8Eyy

Problem 9: Rigorous Theory for "Jumps in Rank"

The Problem

Numerical experiments show that the effective rank of $W(t)$ often plateaus for long periods and then drops suddenly. This mirrors the "scale-by-scale" fitting observed in practical deep learning.

Question: Can this phenomenon of "sudden jumps" in rank be rigorously formulated and proven?
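Even stating the phenomenon requires a choice of rank surrogate. A hedged sketch (ours; "effective rank" here is the entropy-based measure $\mathrm{erank}(W) = \exp\big(-\sum_i p_i \log p_i\big)$ with $p_i = \sigma_i / \sum_j \sigma_j$, one common choice among several): train a depth-3 factorization of a rank-2 target from a small balanced start and log the effective rank. In this full-observation variant the effective rank changes in near-discrete steps separated by long plateaus, one singular value at a time; in completion problems the analogous plateaus and sudden drops are exactly what the problem asks to formalize.

```python
import numpy as np

def effective_rank(W):
    """Entropy-based effective rank: exp(H(p)), p_i = sigma_i / sum_j sigma_j."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d, N, lr = 4, 3, 5e-3
U = np.linalg.qr(rng.standard_normal((d, d)))[0]
V = np.linalg.qr(rng.standard_normal((d, d)))[0]
A = U @ np.diag([3.0, 1.0, 0.0, 0.0]) @ V.T   # rank-2 target, sigma = (3, 1)

Ws = [0.01 * np.eye(d) for _ in range(N)]     # small balanced initialization

def prod(ws):
    out = np.eye(d)
    for w in ws:
        out = w @ out
    return out

for step in range(100_001):
    W = prod(Ws)
    g = W - A                                  # grad of E(W) = 0.5 * ||W - A||_F^2
    Ws = [Ws[p] - lr * prod(Ws[p + 1:]).T @ g @ prod(Ws[:p]).T for p in range(N)]
    if step % 5000 == 0:
        print(step, "E =", round(0.5 * np.sum(g ** 2), 5),
              "erank =", round(effective_rank(W), 3))
```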

Thank You!

Any Questions?