Deep Linear Networks

Rathindra Nath Karmakar

Session 4: Effect of Noise and Discretization


Recap: Key Questions for Gradient Flow

For the gradient flow dynamics on a loss surface $\mathcal{L}(\mathbf{W})$:

\[ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}(t)) \]

We want to understand the long-time behavior of this flow; the previous sessions addressed the deterministic questions.

Today, we address the final question: what happens when we add stochastic noise to the dynamics?
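Since this session also concerns discretization, it is worth recalling that plain gradient descent is exactly the forward-Euler discretization of the gradient flow above. Below is a minimal sketch on a hypothetical quadratic loss; the matrices `A`, `B` and the loss itself are illustrative choices, not objects from the lectures:

```python
import numpy as np

# Hypothetical quadratic loss L(W) = 0.5 * ||A W - B||_F^2 (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

def grad(W):
    # Gradient of 0.5 * ||A W - B||_F^2 with respect to W.
    return A.T @ (A @ W - B)

W = rng.standard_normal((3, 3))
dt = 1e-2  # Euler step size: gradient descent = forward-Euler for gradient flow
for _ in range(2000):
    W -= dt * grad(W)

print(np.linalg.norm(A @ W - B))  # the loss decreases along the discretized flow
```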

Stochasticity in Deep Learning

Real-world training is not a clean gradient descent. Noise arises from many sources: mini-batch sampling of the data, the discretization of the continuous-time flow by a finite step size, and floating-point round-off errors.

How do these stochastic effects interact with the Riemannian geometry of the problem? One tool is the Riemannian Langevin Equation (RLE).


In particular, the RLE can also be used to model the noise due to round-off errors.

The Central Idea

Noise in the "upstairs" parameter space (along redundant directions) induces a deterministic, curvature-driven drift in the "downstairs" solution space.

Warm-Up: Dyson Brownian Motion

To build intuition, we study a famous model from random matrix theory.
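For orientation (a standard definition, stated here since the slide does not display it): for $d$ eigenvalues $x_1 > \dots > x_d$ at inverse temperature $\beta$, Dyson Brownian Motion is the system of coupled SDEs

\[ dx_i = \sum_{j \neq i} \frac{dt}{x_i - x_j} + \sqrt{\frac{2}{\beta}}\, dW_i, \qquad i = 1, \dots, d, \]

where the $W_i$ are independent Wiener processes: independent noise on each eigenvalue plus a pairwise repulsion. The $d = 2$ case is derived explicitly in the proof at the end of this session.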

Prerequisites for Theorem 15

We need two ingredients: the **isospectral orbit** $O_x = \{ U \operatorname{diag}(x)\, U^* : U \text{ unitary} \}$ of Hermitian matrices with spectrum $x$, and the orthogonal projections $P_M$ (onto the tangent space of the orbit at $M$) and $P_M^\perp$ (onto its normal space).

Theorem 15: Noise Upstairs, Curvature Downstairs

Consider the "upstairs" SDE for a matrix $M_t$ evolving on the isospectral orbit $O_{x_t}$ subject to isotropic noise:

\[ dM_t = P_{M_t} dH_t + \sqrt{\frac{2}{\beta}} P_{M_t}^\perp dH_t \]

Here $dH_t$ is a Hermitian matrix of independent Brownian motions, and $P_{M_t}$, $P_{M_t}^\perp$ are the tangent and normal projections for the orbit at $M_t$.

Theorem 15 (Huang, Inauen, Menon, informal).

  1. The eigenvalues of the matrix $M_t$ have the same law as the solution $x_t$ to Dyson Brownian Motion.
  2. In the zero-noise limit ($\beta \to \infty$), noise purely tangential to the orbit ($P_{M_t}dH_t$) induces a deterministic drift normal to the orbit, equal to motion by (minus one half) **mean curvature**.

This is a profound result: purely random fluctuations in the "gauge" directions (upstairs) create a deterministic, geometry-driven motion (a drift) in the observable space (downstairs).

General Geometric Framework

General Principles 1: Submersion with Group Action

We formalize the "upstairs/downstairs" picture: a parameter manifold $\mathcal{M}$ (upstairs), a submersion $\phi: \mathcal{M} \to X$ onto the solution space (downstairs), and a group acting on $\mathcal{M}$ whose orbits $O_x = \phi^{-1}(x)$ are the fibers of $\phi$. A loss $E$ on $X$ then lifts to $L = E \circ \phi$ upstairs.

General Principles 2: The "Upstairs" RLE

We define stochastic gradient descent via the RLE for the lifted loss function $L = E \circ \phi$.

Consider the "upstairs" dynamics for $m \in \mathcal{M}$ given by the SDE:

\[ dm^{\beta,\kappa} = -\text{grad}_g L(m)\, dt + P_m\, d\mathbf{M}^\beta + \sqrt{\kappa}\, P_m^\perp\, d\mathbf{M}^\beta \]

Here $\mathbf{M}^\beta$ is Brownian motion on $(\mathcal{M}, g)$ at inverse temperature $\beta$, $P_m$ and $P_m^\perp$ project onto the tangent and normal spaces of the group orbit through $m$, and $\kappa \geq 0$ sets the strength of the noise normal to the orbit.

General Principles 3: The "Downstairs" SDE

The "upstairs" stochastic flow projects to a stochastic flow downstairs.

The flow of $m_t$ projects to the RLE for the **free energy** downstairs:

\[ dx = -\text{grad}_h F_\beta(x) dt + dX^{\beta/\kappa} \]

where the free energy is defined as:

\[ F_\beta(x) = E(x) - \frac{1}{\beta}S(x) \quad \text{and} \quad S(x) = \log\text{vol}(O_x) \]

(The lifted loss $L = E \circ \phi$ is constant on fibers, so downstairs it is simply $E$.)

Noise in the redundant "gauge" directions upstairs manifests as an entropic term that modifies the energy landscape downstairs. In the limit $\kappa \to 0$, the upstairs noise is purely tangential, and the downstairs flow becomes deterministic: $\dot{x} = -\text{grad}_h F_\beta(x)$.
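As a toy illustration of the entropic term (a hypothetical scalar example, not from the paper, taking the downstairs metric $h$ to be flat): if the fibers satisfy $\text{vol}(O_x) \propto x^k$ for a scalar $x > 0$ and some $k > 0$, then $S(x) = k \log x + \text{const}$, and the deterministic $\kappa \to 0$ flow becomes

\[ \dot{x} = -E'(x) + \frac{k}{\beta x}, \]

so the entropy contributes a repulsive force that keeps the solution away from the low-volume region near $x = 0$.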

Application to the Deep Linear Network

RLE for DLN: Upstairs Dynamics

Applying the general principle to the DLN, the "upstairs" RLE on the balanced manifold $\mathcal{M}$ is:

\[ d\mathbf{W}^{\beta,\kappa} = -\nabla_{\mathbf{W}} E(\phi(\mathbf{W})) dt + d\mathbf{M}^{\beta,\kappa} \]

This is standard gradient descent on the lifted loss function $L=E \circ \phi$, plus a noise term $d\mathbf{M}^{\beta,\kappa}$ that represents Brownian motion on the balanced manifold (with the Frobenius metric).
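One can discretize these upstairs dynamics by Euler–Maruyama. The sketch below adds plain isotropic Gaussian noise to each factor, rather than true Brownian motion on the balanced manifold, so it only approximates the RLE; the depth, width, step size, and quadratic loss $E(W) = \frac{1}{2}\|W - \text{target}\|_F^2$ are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 3, 4  # depth and width (illustrative choices)
Ws = [np.eye(d) + 0.1 * rng.standard_normal((d, d)) for _ in range(N)]  # near-balanced init
target = rng.standard_normal((d, d))

def phi(Ws):
    # End-to-end matrix W = W_N ... W_1.
    out = np.eye(d)
    for W in Ws:
        out = W @ out
    return out

dt, beta = 1e-3, 1e4
for _ in range(5000):
    Err = phi(Ws) - target  # E'(W) for the quadratic loss
    new_Ws = []
    for i in range(N):
        # Gradient of the lifted loss wrt factor W_i:
        # (W_N ... W_{i+1})^T E'(W) (W_{i-1} ... W_1)^T
        left = np.eye(d)
        for W_j in Ws[i + 1:]:
            left = W_j @ left
        right = np.eye(d)
        for W_j in Ws[:i]:
            right = W_j @ right
        g = left.T @ Err @ right.T
        noise = np.sqrt(2 * dt / beta) * rng.standard_normal((d, d))
        new_Ws.append(Ws[i] - dt * g + noise)  # Euler-Maruyama step
    Ws = new_Ws

print(np.linalg.norm(phi(Ws) - target))  # decreases until a noise floor set by beta
```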

RLE for DLN: Downstairs Dynamics

The law of the end-to-end matrix $W_t = \phi(\mathbf{W}_t)$ is then given by the "downstairs" RLE:

\[ dW^{\beta,\kappa} = -\text{grad}_{g^N} F_\beta(W^{\beta,\kappa})\, dt + dX^{\beta/\kappa} \]

where $dX^{\beta/\kappa}$ is Brownian motion on $(M_d, g^N)$; Theorem 18 below gives it explicitly.

The Free Energy Gradient in the DLN

The gradient of the free energy, which drives the system's evolution, can be computed explicitly. It balances the drive to minimize loss with an opposing entropic force.

\[ \text{grad}_{g^N} F_\beta(W) = \underbrace{A_{N,W}(E'(W))}_{\text{Loss Term}} - \underbrace{\frac{1}{\beta}\text{grad}_{g^N}S(W)}_{\text{Entropic/Curvature Term}} \]

The second term is the entropic force arising from the geometry of overparameterization. For the DLN, it takes the specific form:

\[ \frac{1}{\beta}\text{grad}_{g^N}S(W) = \frac{1}{\beta} Q_N \Sigma' Q_0^T \]

where $W = Q_N \Sigma Q_0^T$ is the singular value decomposition of the end-to-end matrix and $\Sigma'$ is a diagonal matrix determined by the singular values.


The Explicit Downstairs SDE

The Menon & Yu paper provides an explicit formula for the Brownian motion $dX^{\beta/\kappa}$ on $(M_d, g^N)$.

Theorem 18 (Menon & Yu, 2023). The solution $X_t^\beta$ to the following Itô SDE with initial condition $X_0=W_0$ is Brownian motion on $(M_d, g^N)$:

\[ dX_t^\beta = \sqrt{\frac{2}{\beta}} \begin{pmatrix} \sqrt{N}\,\lambda_1^{N-1}\, dB_{11} & \cdots \\ \vdots & \ddots \end{pmatrix} + \frac{1}{\beta} Q_N \Sigma'' Q_0^T\, dt \]

Here $\Lambda = \Sigma^{1/N}$ with diagonal entries $\lambda_i$, $dB$ is a matrix of standard Wiener processes (only the leading entry of the anisotropic noise matrix is displayed), and $\Sigma''$ is a diagonal matrix of drift terms (related to mean curvature) that arises from the Itô correction.

Proof for 2x2 Matrices: The Goal

Let's prove Theorem 15 for the simple case of $d=2$. We want to show that the eigenvalues $x_1, x_2$ of the matrix $M_t$ evolving by:

\[ dM_t = P_{M_t} dH_t + \sqrt{\frac{2}{\beta}} P_{M_t}^\perp dH_t \]

follow the Dyson Brownian Motion equations:

\[ dx_1 = \frac{1}{x_1 - x_2}\, dt + \sqrt{\frac{2}{\beta}}\, dW_1, \qquad dx_2 = \frac{1}{x_2 - x_1}\, dt + \sqrt{\frac{2}{\beta}}\, dW_2 \]
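A quick Euler–Maruyama simulation of these two equations (the step size, horizon, and initial condition are arbitrary choices) shows the characteristic repulsion of the eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
beta, dt, steps = 2.0, 1e-4, 20000
x1, x2 = 1.0, -1.0  # initial eigenvalues, x1 > x2

for _ in range(steps):
    dW1, dW2 = np.sqrt(dt) * rng.standard_normal(2)
    drift = 1.0 / (x1 - x2)  # pairwise repulsion term
    x1 += drift * dt + np.sqrt(2 / beta) * dW1
    x2 += -drift * dt + np.sqrt(2 / beta) * dW2

print(x1 - x2)  # the gap between eigenvalues; the repulsion keeps it from collapsing
```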

Proof Step 1: Simplification

The dynamics are invariant under unitary transformations ($M \to UMU^*$). This allows us to analyze the process at a point where the matrix $M_t$ is diagonal, without loss of generality.

Proof Step 2: Decomposing the Noise

We express the "upstairs" noise $dH_t$ using an orthonormal basis for 2x2 Hermitian matrices that respects our tangent/normal split.

Normal Basis: $E_a = \begin{pmatrix} 1 & 0 \\ 0 & 0 \end{pmatrix}$, $E_b = \begin{pmatrix} 0 & 0 \\ 0 & 1 \end{pmatrix}$

Tangent Basis: $E_c = \frac{1}{\sqrt{2}}\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$, $E_d = \frac{1}{\sqrt{2}}\begin{pmatrix} 0 & i \\ -i & 0 \end{pmatrix}$

The matrix noise term $dH_t$ can be written with four independent Wiener processes $W_a, W_b, W_c, W_d$:

\[ dH_t = E_a dW_a + E_b dW_b + E_c dW_c + E_d dW_d \]

Projecting this noise onto the tangent ($P_{M_t}$) and normal ($P_{M_t}^\perp$) spaces at the diagonal point $M_t$ gives our SDE for $dM_t$:

\[ dM_t = \underbrace{(E_c dW_c + E_d dW_d)}_{\text{Tangent Part}} + \sqrt{\frac{2}{\beta}} \underbrace{(E_a dW_a + E_b dW_b)}_{\text{Normal Part}} \]
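A quick numerical check (illustrative, using numpy) confirms that $\{E_a, E_b, E_c, E_d\}$ is orthonormal for the Frobenius inner product $\langle A, B \rangle = \mathrm{Re}\,\mathrm{tr}(AB^*)$ on $2 \times 2$ Hermitian matrices:

```python
import numpy as np

Ea = np.array([[1, 0], [0, 0]], dtype=complex)
Eb = np.array([[0, 0], [0, 1]], dtype=complex)
Ec = np.array([[0, 1], [1, 0]], dtype=complex) / np.sqrt(2)
Ed = np.array([[0, 1j], [-1j, 0]], dtype=complex) / np.sqrt(2)

basis = [Ea, Eb, Ec, Ed]
# Gram matrix of the basis under <A, B> = Re tr(A B*).
gram = [[np.trace(A @ B.conj().T).real for B in basis] for A in basis]
print(np.allclose(gram, np.eye(4)))  # True: an orthonormal basis
```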

Proof Step 3: Itô's Formula for Eigenvalues

To find the dynamics of an eigenvalue, say $x_1(M_t)$, we use Itô's formula. This requires the first and second derivatives of the eigenvalue with respect to changes in the matrix. From standard perturbation theory, at a diagonal matrix $M = \text{diag}(x_1, x_2)$ with $x_1 \neq x_2$:

\[ \mathrm{D}x_1(A) = A_{11}, \qquad \mathrm{D}^2 x_1(A, A) = \frac{2|A_{12}|^2}{x_1 - x_2} \]

For a function $f(M_t)$, Itô's formula is: $df = \text{D}f(dM_t) + \frac{1}{2}\text{D}^2f(dM_t, dM_t)$

Now we just need to plug our $dM_t$ into these formulas.

Proof Step 4: The Final Calculation

Let's compute the terms for $dx_1$. At a diagonal point, the tangent part of $dM_t$ is purely off-diagonal, so the first-order term picks up only the normal noise:

\[ \mathrm{D}x_1(dM_t) = (dM_t)_{11} = \sqrt{\frac{2}{\beta}}\, dW_a \]

For the second-order term, $(dM_t)_{12} = \frac{1}{\sqrt{2}}(dW_c + i\, dW_d)$, so by the Itô rules $|(dM_t)_{12}|^2 = \frac{1}{2}(dW_c^2 + dW_d^2) = dt$, giving

\[ \frac{1}{2}\mathrm{D}^2 x_1(dM_t, dM_t) = \frac{dt}{x_1 - x_2} \]

Combining these gives $dx_1 = \frac{1}{x_1 - x_2} dt + \sqrt{\frac{2}{\beta}} dW_1$ (with $dW_1 := dW_a$). The proof for $x_2$ is identical.
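To see Theorem 15 in action numerically, one can discretize the upstairs SDE directly: at each step, split a Hermitian noise increment into tangent (off-diagonal in the current eigenbasis) and normal (diagonal) parts, scale the normal part by $\sqrt{2/\beta}$, and track the eigenvalues. This is a rough Euler sketch, not the construction in the paper:

```python
import numpy as np

rng = np.random.default_rng(3)
beta, dt, steps = 2.0, 1e-4, 10000
M = np.diag([1.0, -1.0]).astype(complex)

def hermitian_increment(scale):
    # Hermitian noise increment matching dH_t: independent Wiener entries
    # in an orthonormal Hermitian basis, over a time step of length scale**2.
    G = rng.standard_normal((2, 2)) + 1j * rng.standard_normal((2, 2))
    return scale * (G + G.conj().T) / 2

for _ in range(steps):
    x, U = np.linalg.eigh(M)             # eigenbasis of the current M_t
    dH = U.conj().T @ hermitian_increment(np.sqrt(dt)) @ U
    tangent = dH - np.diag(np.diag(dH))  # off-diagonal part: tangent to the orbit
    normal = np.diag(np.diag(dH))        # diagonal part: normal to the orbit
    M = M + U @ (tangent + np.sqrt(2 / beta) * normal) @ U.conj().T
    M = (M + M.conj().T) / 2             # re-Hermitize against round-off

print(np.linalg.eigvalsh(M))  # eigenvalues have the law of Dyson Brownian Motion
```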

Summary of Findings

  1. Purely tangential ("gauge") noise upstairs induces a deterministic drift downstairs, given by motion by (minus one half) mean curvature; Dyson Brownian Motion is the model case (Theorem 15).
  2. In general, the RLE on a submersion with group action projects to a downstairs RLE for the free energy $F_\beta = E - \frac{1}{\beta}S$, where the entropy $S(x) = \log\text{vol}(O_x)$ encodes the volume of the fibers.
  3. For the DLN, the entropic force takes the explicit form $\frac{1}{\beta} Q_N \Sigma' Q_0^T$, and Theorem 18 gives an explicit Itô SDE for Brownian motion on $(M_d, g^N)$.

Thank You!

Any Questions?