Deep Linear Networks
Rathindra Nath Karmakar
Session 3: Characterizing the Minimizer
References
- Deep Linear Networks for Matrix Completion - An Infinite Depth Limit, Nadav Cohen, Govind Menon, and Zsolt Veraszto (2023)
- Implicit Regularization in Deep Learning May Not Be Explainable by Norms, Noam Razin, Nadav Cohen (2020)
- Implicit Regularization in Matrix Factorization, Suriya Gunasekar, Blake Woodworth, Behnam Neyshabur, Srinadh Bhojanapalli, Nathan Srebro (2017)
- The geometry of the deep linear network, Govind Menon (2024)
Recap: Key Questions for Gradient Flow
For the gradient flow dynamics on a loss surface $\mathcal{L}(\mathbf{W})$:
\[ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}(t)) \]
We want to understand:
- Convergence guarantees? (Yes, for balanced cases)
- Convergence rate? (Can be accelerated by depth)
- Characterization of the minimizer reached?
- Effect of noise and discretization?
Today, we address the third question: among all possible solutions, which one does gradient descent choose?
The Central Question of Implicit Regularization
In overparameterized settings, there are many (often infinite) parameter settings that achieve zero training loss. Gradient descent finds one of them. Why that specific one?
What objective is gradient descent implicitly minimizing?
We will investigate three competing hypotheses:
- Norm Minimization
- Rank Minimization
- Volume Maximization
General Problem Formulation
- The problem is to recover a matrix $W \in \mathbb{R}^{d \times d'}$ from a set of observed entries $\{b_{i,j}\}$ at locations $(i,j) \in \Omega$.
- The goal is to find a matrix $W$ that minimizes the squared error loss:
\[
\ell(W) = \frac{1}{2} \sum_{(i,j) \in \Omega} \left(W_{i,j} - b_{i,j}\right)^2
\]
- Instead of optimizing over $W$ directly, the matrix is parameterized as the product of $L$ factor matrices:
\[
W = W_L W_{L-1} \cdots W_1
\]
- The factor matrices $\{W_l\}_{l=1}^L$ are then trained to minimize the loss.
- The optimization dynamics are studied under gradient flow:
\[
\frac{d}{dt}W_l(t) = -\nabla_{W_l} \ell(W(t))
\]
- This process starts from a random initialization close to zero that satisfies the "balancedness" condition: $W_{l+1}^T(0)W_{l+1}(0) = W_l(0)W_l^T(0)$.
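Discretizing the gradient flow above with small-step gradient descent gives a minimal numerical sketch. The sizes, learning rate, and i.i.d. near-zero initialization below are illustrative choices of mine (i.i.d. small factors are only approximately balanced), not the papers' exact setup:

```python
import numpy as np

def train_dln(b, mask, L=2, d=3, lr=0.05, steps=10000, init_scale=0.01, seed=0):
    """Gradient descent on a depth-L linear network W = W_L ... W_1,
    fitting the entries of b where mask == 1."""
    rng = np.random.default_rng(seed)
    Ws = [init_scale * rng.standard_normal((d, d)) for _ in range(L)]

    def product(factors):
        W = factors[0]
        for Wl in factors[1:]:
            W = Wl @ W                      # W = W_L ... W_1
        return W

    for _ in range(steps):
        R = mask * (product(Ws) - b)        # dl/dW, nonzero only on Omega
        grads = []
        for l in range(L):
            left = np.eye(d)
            for Wl in Ws[:l]:               # product of the factors below W_l
                left = Wl @ left
            right = np.eye(d)
            for Wl in Ws[l + 1:]:           # product of the factors above W_l
                right = Wl @ right
            # chain rule: the gradient w.r.t. this factor is (above)^T R (below)^T
            grads.append(right.T @ R @ left.T)
        for l in range(L):
            Ws[l] -= lr * grads[l]
    return product(Ws)
```

For example, on a rank-1 target with one entry masked out, the observed entries are fit to high accuracy and the returned matrix is close to rank 1.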
Hypothesis 1: Norm Minimization
This is the classical explanation, rooted in linear regression where gradient descent with zero initialization famously converges to the minimum $\ell_2$-norm solution.
The hope is that this generalizes to deep learning, perhaps with a different norm (e.g., nuclear norm for matrices).
Conjecture 1 (Gunasekar et al., 2017). For matrix completion, gradient descent with small initialization converges to the solution with the minimum nuclear norm.
Let's test this hypothesis.
A Counterexample: Norms vs. Rank
Razin & Cohen (2020) constructed a simple 2x2 matrix completion problem to test this hypothesis. Given observations $w_{12} = 1, w_{21} = 1, w_{22} = 0$, the set of solutions is:
\[ S = \left\{ W_x = \begin{pmatrix} x & 1 \\ 1 & 0 \end{pmatrix} : x \in \mathbb{R} \right\} \]
This setup creates a direct conflict:
- Minimizing any norm (e.g., the Frobenius norm $\|W_x\|_F = \sqrt{x^2+2}$) requires $x$ to stay bounded (minimum at $x=0$).
- Minimizing the effective rank requires $|x| \to \infty$: since $\sigma_1 \sigma_2 = |\det W_x| = 1$, as $|x| \to \infty$ we get $\sigma_1 \approx |x|$ and $\sigma_2 \approx 1/|x|$, so $\operatorname{erank}(W_x) \to 1$.
Theorem: Norm Minimization is False
The paper proves that for this problem, gradient flow on a deep matrix factorization ($L \ge 2$) drives the magnitude of the unobserved entry to $\infty$.
Theorem 1 (Razin & Cohen, 2020). For the matrix completion problem above, with random near-zero balanced initialization and depth $L \ge 2$, if $\det(W(0)) > 0$ (a 50% chance), then as the loss $\ell(t) \to 0$:
- For any norm or quasi-norm $\|\cdot\|$, the norm of the solution diverges: $\|W(t)\| \to \infty$.
- The effective rank of the solution converges to its minimum possible value: $\operatorname{erank}(W(t)) \to 1$.
This definitively shows that implicit regularization in this setting cannot be explained by the minimization of any norm.
Experimental Verification for Hypothesis 1
The theory predicts that as the loss decreases, the unobserved entry ($w_{11}$) should grow. (Fig 1 from Razin & Cohen, 2020)
The experiments confirm this: as loss decreases (moving right to left on the x-axis), the absolute value of the unobserved entry increases. Since all norms must grow with this entry, this validates the theorem and refutes the norm minimization hypothesis.
Hypothesis 2: Rank Minimization
The failure of the norm hypothesis suggests a new candidate: perhaps the implicit bias is towards minimizing **rank** (or its continuous surrogate, **effective rank**).
Hypothesis 2. Gradient descent converges to the solution with the minimum (effective) rank.
This is consistent with the previous experiment and many empirical observations. But is it the full story?
Experiment: Is Rank/Effective Rank Sufficient?
Consider a 2x2 diagonal matrix completion task where all rank-1 solutions lie on a hyperbola. (Fig 4 from Cohen, Menon, Veraszto, 2023)
Observation: The training outcomes cluster tightly around the corners of the hyperbola, indicating a strong preference for these solutions.
- All points on the red/green curves are rank-1 minimizers. A simple **rank minimization** principle fails to explain the clustering.
- The embedded histograms show the **effective rank** is also tightly clustered around 1.0 for all outcomes. Effective rank also fails to explain the preference for the corners.
Experiment: Is Rank/Effective Rank Sufficient?
Consider a 3x3 diagonal matrix completion task where all minimizers are rank-2. (Fig 8 from Cohen, Menon, Veraszto, 2023)
Observation: The majority of the 500 outputs cluster near one particular rank-two minimizer indicated by the blue dot.
The Volume Element of a Riemannian Manifold
The foundation of the proof is the standard definition of a volume form in Riemannian geometry.
- Let $(M, g)$ be a smooth, orientable $n$-dimensional Riemannian manifold. In our context, the manifold $M$ is the general linear group $GL(d, \mathbb{R})$, which is an open subset of the space of all $d \times d$ real matrices, $M_d$. The dimension is $n = d^2$.
- Let $\{x^1, \dots, x^n\}$ be a local coordinate system on an open set $U \subset M$. The Riemannian metric $g$ is represented in these coordinates by a matrix of its components, $(g_{ij})$, where $g_{ij} = g(\frac{\partial}{\partial x^i}, \frac{\partial}{\partial x^j})$.
- The canonical volume form $d\text{vol}_g$ on $U$ is the unique $n$-form given by:
$
d\text{vol}_g = \sqrt{\det(g_{ij})} \, dx^1 \wedge \dots \wedge dx^n
$
- For this paper, we use the standard coordinates of $M_d$, where $x^k$ corresponds to the matrix entries $W_{ij}$. The standard volume element on $M_d$ is $dW := dW_{11} \wedge \dots \wedge dW_{dd}$. The metric is the tensor $g^N$. The volume form is therefore:
$
d\text{vol}_{g^N} = \sqrt{\det g^N} \, dW
$
where $\det g^N$ is the determinant of the metric tensor's matrix representation in the standard basis of $M_d$.
Assumption: The analysis is primarily restricted to $W \in GL(d, \mathbb{R})$, the space of full-rank (invertible) matrices. This ensures the metric $g^N$ and its associated operators are well-defined.
Hypothesis 3: Bias Towards High State Space Volume
The failure of the rank hypothesis suggests a more refined geometric principle is at play.
Hypothesis 3. Gradient-based methods are biased towards solutions that can be represented in "more ways" in the parameter space. This size is measured by the volume element induced by the Riemannian metric $g^N$.
The volume element tells us how to measure the "size" of a small region of matrices. Regions with higher volume are "bigger" targets for the optimization dynamics.
Volume as the Predictor
Now let's plot the logarithm of the state space volume on the same plane of solutions. (Fig 5 from Cohen, Menon, Veraszto, 2023)
The volume (brighter colors) is highest precisely at the corners of the hyperbola where the training outcomes clustered. This suggests that the dynamics are not just minimizing rank, but are being attracted to the regions of maximal volume within the set of low-rank solutions.
Quantifying Volume on the Solution Space
The volume of a small region of end-to-end matrices $W$ is given by $\sqrt{\det g^N} dW$.
Vandermonde Determinant. For a diagonal matrix $\Lambda = \operatorname{diag}(\lambda_1, \dots, \lambda_d)$: $\text{van}(\Lambda) = \prod_{1 \le i < j \le d} (\lambda_j - \lambda_i)$
Theorem 1.1 (Cohen, Menon, Veraszto, 2023). For a network of depth $N$, the volume element on the space of end-to-end matrices $(M_d, g^N)$ is given in terms of the singular values $\Sigma$ by:
$
\sqrt{\det g^N} dW = N^{\frac{d(d-1)}{2}} \det(\Sigma^2)^{\frac{1-N}{2N}} \text{van}(\Sigma^{2/N}) d\Sigma \, dU \, dV
$
This formula shows that the volume density diverges as any singular value approaches zero (i.e., as the matrix $W$ approaches a lower rank). This divergence indicates a strong geometric bias towards low-rank matrices during the training process.
Proof of Divergence:
Let's analyze the volume density, which is proportional to $\det(\Sigma^2)^{\frac{1-N}{2N}} \text{van}(\Sigma^{2/N})$.
1. The first term can be expanded using the singular values $\sigma_1, \dots, \sigma_d$:
$
\det(\Sigma^2)^{\frac{1-N}{2N}} = \left( \prod_{i=1}^d \sigma_i^2 \right)^{\frac{1-N}{2N}} = \prod_{i=1}^d \sigma_i^{\frac{1-N}{N}} = \prod_{i=1}^d \sigma_i^{\frac{1}{N}-1}
$
2. Let's consider the behavior as one singular value, say $\sigma_d$, approaches zero, while the others remain non-zero. The expression can be written as:
$
\left( \sigma_d^{\frac{1}{N}-1} \right) \cdot \left( \prod_{i=1}^{d-1} \sigma_i^{\frac{1}{N}-1} \right) \cdot \left( \text{van}(\Sigma^{2/N}) \right)
$
3. For a network depth $N > 1$, the exponent $\frac{1}{N}-1$ is negative. We can write $\sigma_d^{\frac{1}{N}-1}$ as:
$
\sigma_d^{\frac{1}{N}-1} = \frac{1}{\sigma_d^{1-\frac{1}{N}}}
$
As $\sigma_d \to 0$, the exponent $1-\frac{1}{N}$ is positive, so $\sigma_d^{1-\frac{1}{N}} \to 0$ and hence $\sigma_d^{\frac{1}{N}-1} \to \infty$. The remaining factors converge to finite, non-zero limits (each Vandermonde factor $\sigma_i^{2/N} - \sigma_d^{2/N} \to \sigma_i^{2/N} \neq 0$), so the volume density diverges as $W$ approaches rank $d-1$.
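The divergence can also be seen numerically. The helper below (an illustrative script of mine, not from the paper) evaluates the singular-value density $\det(\Sigma^2)^{\frac{1-N}{2N}} \text{van}(\Sigma^{2/N})$ from Theorem 1.1 as $\sigma_3 \to 0$:

```python
import numpy as np
from itertools import combinations

def volume_density(sigmas, N):
    """det(Sigma^2)^((1-N)/(2N)) * van(Sigma^(2/N)): the singular-value
    part of the volume form in Theorem 1.1 (constant N-factor omitted)."""
    s = np.asarray(sigmas, dtype=float)      # assumed sorted in decreasing order
    det_term = np.prod(s ** 2) ** ((1 - N) / (2 * N))
    lam = s ** (2 / N)
    van = np.prod([lam[i] - lam[j] for i, j in combinations(range(len(s)), 2)])
    return det_term * van

# Fix sigma_1, sigma_2 and send sigma_3 -> 0: the density blows up
# like sigma_3^(1/N - 1).
N = 3
dens = [volume_density([2.0, 1.0, s3], N) for s3 in (1e-1, 1e-3, 1e-5)]
```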
Connecting Geometries: Riemannian Submersion
How does the "downstairs" geometry on the space of end-to-end matrices relate to the simple "upstairs" Euclidean geometry on the space of factors?
The connection is a Riemannian submersion. This means the projection map $\phi: \mathcal{M}_r \to M_r$ (from the balanced manifold of factors to the space of rank-r matrices) is a geometry-preserving map.
Theorem 13 (Menon & Yu, 2023). For each rank $r$, the metric $g^N$ on $M_r$ is obtained from the map $\phi: \mathcal{M}_r \to M_r$ by Riemannian submersion.
Quantifying Volume of Representations
The "upstairs" space of weights $\mathbf{W} = (W_N, \dots, W_1)$ contains many configurations that map to the same end-to-end matrix $W$. This set is the fiber, or group orbit, $O_W$. Its volume quantifies the degree of overparameterization.
Theorem 10 (Menon & Yu, 2023). The volume of the group orbit $O_W$ corresponding to an end-to-end matrix $W$ with singular values $\Sigma$ is:
\[ \text{vol}(O_W) = c_d^{N-1} \frac{\text{van}(\Sigma^2)}{\text{van}(\Sigma^{2/N})} \]
This provides an "entropic" interpretation: solutions with a larger orbit volume correspond to more weight configurations upstairs, and are thus more likely to be found.
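As a quick numerical illustration (script mine), the ratio $\text{van}(\Sigma^2)/\text{van}(\Sigma^{2/N})$ in Theorem 10 grows with the depth $N$: as $N \to \infty$, $\Sigma^{2/N} \to I$ and its Vandermonde tends to zero.

```python
import numpy as np
from itertools import combinations

def van(vals):
    """Vandermonde product over ordered index pairs, matching van(Sigma^2) above."""
    return np.prod([vals[i] - vals[j] for i, j in combinations(range(len(vals)), 2)])

sig = np.array([3.0, 2.0, 1.0])              # distinct singular values, descending
# Orbit-volume factor van(Sigma^2) / van(Sigma^(2/N)) for increasing depth N:
ratios = [van(sig ** 2) / van(sig ** (2 / N)) for N in (2, 4, 8, 16)]
```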
Proof Sketch: Connecting Geometry and Volume
1. Determinant of the Metric Tensor
The determinant of $g^N$ is computed not by direct calculation, but by exploiting its spectral properties.
- The metric tensor $g^N$ at a point $W = U \Sigma V^T$ can be diagonalized via a similarity transformation:
$
g^N = (V \otimes U) D^N(\Sigma) (V \otimes U)^T
$
where $D^N(\Sigma)$ is a diagonal matrix and $(V \otimes U)$ is an orthogonal matrix.
- The determinant is invariant under similarity transformations by orthogonal matrices. Therefore:
$
\det g^N = \det\left((V \otimes U) D^N(\Sigma) (V \otimes U)^T\right) = \det D^N(\Sigma)
$
- The determinant of the diagonal matrix $D^N(\Sigma)$ is the product of its diagonal entries. These entries are the reciprocals of the eigenvalues, $\lambda_{il}^N$, of the linear operator $A_{N,W}$. Thus, we arrive at the key relation:
$
\det g^N = \prod_{i,l=1}^d \frac{1}{\lambda_{il}^N}
$
2. Eigenvalues and Eigenvectors of $A_{N,W}$
The eigenvalues $\lambda_{il}^N$ are provided by the following rigorous result, which is a restatement of Lemma 2.4.
Lemma 2.4 (Spectral Decomposition of $A_{N,W}$). Let $W \in GL(d, \mathbb{R})$ have the singular value decomposition $W=U\Sigma V^T$. The linear operator $A_{N,W}: M_d \to M_d$ has $d^2$ eigenvalues $\lambda_{il}^N$ for $i, l \in \{1, \dots, d\}$.
For a network of finite depth $N$, these eigenvalues are given by:
$
\lambda_{il}^N = \frac{1}{N} \sum_{j=1}^N (\sigma_i^2)^{\frac{N-j}{N}} (\sigma_l^2)^{\frac{j-1}{N}}
$
In the special case where $i=l$, this simplifies to $\lambda_{ll}^N = (\sigma_l^2)^{(N-1)/N}$.
The corresponding eigenvectors, which are independent of $N$, are given by $T_{il} = U E_{il} V^T$, where $E_{il}$ is the standard basis matrix with a 1 in the $(i,l)$ position and zeros elsewhere.
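Since the sum is a finite geometric series in $\sigma_i^{2/N}$ and $\sigma_l^{2/N}$, for $i \neq l$ it admits the closed form $\lambda_{il}^N = \frac{\sigma_i^2 - \sigma_l^2}{N(\sigma_i^{2/N} - \sigma_l^{2/N})}$ (my restatement of the series sum, easy to check numerically):

```python
import numpy as np

def lam(si, sl, N):
    """lambda_il^N = (1/N) * sum_{j=1}^N (si^2)^((N-j)/N) * (sl^2)^((j-1)/N)."""
    j = np.arange(1, N + 1)
    return np.mean((si ** 2) ** ((N - j) / N) * (sl ** 2) ** ((j - 1) / N))

N, si, sl = 5, 2.0, 0.5
# Geometric-series closed form for i != l:
closed = (si ** 2 - sl ** 2) / (N * (si ** (2 / N) - sl ** (2 / N)))
# Diagonal case i = l should reduce to (sl^2)^((N-1)/N):
diag = lam(sl, sl, N)
```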
3. The Need for a Change of Coordinates
The expression for $\det g^N$ is a function of the singular values $\sigma_i$ of the matrix $W$. However, the volume element $dW$ is expressed in the coordinates of the matrix entries $W_{ij}$.
To derive a coherent and insightful formula for the volume form, we must express all components in a single, consistent coordinate system. The natural choice is the coordinate system provided by the Singular Value Decomposition, $(U, \Sigma, V)$.
This change of variables allows us to directly analyze how the geometric volume changes with the singular values, which is essential for understanding the dynamics of training and the geometric source of implicit regularization.
4. The Jacobian of the SVD Map
Changing from matrix entry coordinates to SVD coordinates requires the introduction of a Jacobian determinant, which accounts for the distortion of volume under this nonlinear map.
Assumption: To define the map rigorously, we consider $W$ in the open set of matrices with distinct, non-zero singular values, $\sigma_1 > \sigma_2 > \dots > \sigma_d > 0$. The results are then extended to the full space by continuity.
Lemma 2.7 (Jacobian of SVD). The change of variables from the volume element $dW$ on $M_d$ to the volume elements on the SVD components is given by:
$$
dW = \text{van}(\Sigma^2) \, d\Sigma \wedge dU \wedge dV
$$
where $d\Sigma = d\sigma_1 \wedge \dots \wedge d\sigma_d$, $dU$ and $dV$ are the Haar measures on the orthogonal group $O(d)$, and $\text{van}(\Sigma^2)$ is the Vandermonde determinant of the squared singular values:
$$
\text{van}(\Sigma^2) = \prod_{1 \le i < j \le d} (\sigma_i^2 - \sigma_j^2)
$$
5. Final Form of the Volume Element
The final step is to assemble the determinant of the metric (in terms of $\lambda_{il}^N$) and the Jacobian of the SVD map.
- Starting with the definition:
$
d\text{vol}_{g^N} = \sqrt{\det g^N} \, dW
$
- We substitute the determinant of the metric from steps 1-2 and the SVD Jacobian from step 4:
$
d\text{vol}_{g^N} = \left( \prod_{i,l=1}^d \frac{1}{\lambda_{il}^N} \right)^{1/2} \text{van}(\Sigma^2) \, d\Sigma \, dU \, dV
$
- After performing the product over the eigenvalues $\lambda_{il}^N$ and simplifying the resulting expression (as detailed in Section 2.3 of the paper), we arrive at the final result stated in Theorem 1.1.
Theorem 1.1 (Volume Form for Finite Depth $N$). The volume form for the metric $g^N$ is:
$$
d\text{vol}_{g^N} = N^{d(d-1)/2} \det(\Sigma^2)^{(1-N)/(2N)} \text{van}(\Sigma^{2/N}) \, d\Sigma \, dU \, dV
$$
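The whole chain of steps can be sanity-checked numerically: assemble $\sqrt{\det g^N}$ from the eigenvalues of Lemma 2.4, multiply by the SVD Jacobian $\text{van}(\Sigma^2)$, and compare with the closed form of Theorem 1.1. This is an illustrative check of mine, not code from the paper:

```python
import numpy as np
from itertools import combinations

def lam(si, sl, N):
    """Eigenvalues lambda_il^N of A_{N,W} from Lemma 2.4."""
    j = np.arange(1, N + 1)
    return np.mean((si ** 2) ** ((N - j) / N) * (sl ** 2) ** ((j - 1) / N))

def van(vals):
    """Vandermonde product over ordered index pairs (descending input)."""
    return np.prod([vals[i] - vals[j] for i, j in combinations(range(len(vals)), 2)])

N = 4
sig = np.array([2.0, 1.3, 0.7])              # distinct singular values, descending
d = len(sig)

# Steps 1-2: sqrt(det g^N) = product over all (i,l) of lambda_il^(-1/2)
sqrt_det_g = np.prod([[lam(si, sl, N) for sl in sig] for si in sig]) ** -0.5
# Step 4: multiply by the SVD Jacobian van(Sigma^2)
lhs = sqrt_det_g * van(sig ** 2)

# Theorem 1.1 closed form
rhs = (N ** (d * (d - 1) / 2)
       * np.prod(sig ** 2) ** ((1 - N) / (2 * N))
       * van(sig ** (2 / N)))
```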
Summary of Findings
- The classical hypothesis that implicit regularization is equivalent to norm minimization is incorrect. There are natural problems where gradient descent drives all norms to infinity.
- A more robust heuristic is **rank minimization**, which correctly predicts the behavior in the counterexample.
- However, rank alone is insufficient to explain why specific low-rank solutions are preferred over others.
- The most fundamental explanation appears to be a bias towards regions of **maximal state space volume**, a concept made precise by the Riemannian geometry of the DLN. The volume is largest near low-rank solutions, and can distinguish between different solutions of the same rank.
Next Time...
Effect of noise and discretization
Thank You!
Any Questions?