Deep Linear Networks

Rathindra Nath Karmakar

Session 1: Layers are automatically balanced

References

Simon S. Du, Wei Hu, Jason D. Lee. Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced. Advances in Neural Information Processing Systems (NeurIPS), 2018.

Two Key Problems

Both problems are solved using gradient-based optimization

The Learning Problem: Goal & Model

  1. Goal: Learn a function $f: \mathcal{X} \to \mathcal{Y}$ from a dataset of input-output pairs $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^m$.
  2. Model: We use a deep neural network, parameterized by weights $\mathbf{W}$, as our function approximator, $f(x; \mathbf{W})$.

The Learning Problem: Loss Function

  1. Loss: We quantify prediction error using a loss function:
\[ \mathcal{L}(\mathbf{W}) = \frac{1}{m} \sum_{i=1}^m L(f(x_i; \mathbf{W}), y_i) \]

The Learning Problem: Optimization

  1. Optimization: We find the best weights by minimizing the loss using the continuous-time dynamics of Gradient Flow.
\[ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}(t)) \]

Gradient Flow: Key Equation

\[ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}(t)) \]

Questions

Our Focus: Deep Homogeneous Models

A Key Property: Homogeneity

An activation function $\phi$ is homogeneous of degree 1 if for any scalar $c > 0$:

\[ \phi(cz) = c\phi(z) \]

The Consequence: Scale Invariance

Homogeneity in activations leads to a scale invariance in the entire network. We can scale up one layer's weights and scale down the next, and the overall function remains unchanged.

\[ f\left(x; \dots, cW^{(i)}, \frac{1}{c}W^{(i+1)}, \dots\right) = f\left(x; \dots, W^{(i)}, W^{(i+1)}, \dots\right) \]

This is because the scalar $c$ passes through the homogeneous activation $\phi$ and cancels out.
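As a quick numerical sanity check, here is a minimal numpy sketch (the layer sizes, ReLU activation, and the scalar $c$ are illustrative choices, not from the paper) showing that rescaling one layer by $c$ and the next by $1/c$ leaves the network function unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0.0)  # homogeneous of degree 1

# A small two-layer ReLU network with hypothetical dimensions.
W1 = rng.standard_normal((20, 10))
W2 = rng.standard_normal((5, 20))
x = rng.standard_normal(10)

def f(W1, W2, x):
    return W2 @ relu(W1 @ x)

c = 3.7  # any positive scalar
out_original = f(W1, W2, x)
out_rescaled = f(c * W1, W2 / c, x)

# The two outputs agree up to floating-point error (~1e-15).
print(np.max(np.abs(out_original - out_rescaled)))
```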

The Optimization Puzzle

This scale invariance creates a fundamental problem for optimization: the loss is constant along unbounded directions in weight space, so individual layer norms can be driven to infinity (or to zero) without changing the loss at all.

Why Classical Theory Fails

Many classical theorems that guarantee convergence of optimization algorithms rely on assumptions that are violated here.

Classical Assumption: The loss function is coercive (i.e., $\|\mathbf{W}\| \to \infty \implies \mathcal{L}(\mathbf{W}) \to \infty$) or the parameters are constrained to a compact set.

The "Obvious" Fix: Explicit Regularization

For a deep network, the most common form of explicit regularization is L2 regularization, also known as weight decay:

\[ \mathcal{L}_{reg}(\mathbf{W}) = \mathcal{L}(\mathbf{W}) + \lambda \sum_{h=1}^{N} \|W^{(h)}\|_F^2 \]

Matrix Factorization

The objective is to find factors $U \in \mathbb{R}^{d_1 \times r}$ and $V \in \mathbb{R}^{d_2 \times r}$ that approximate a target matrix $M^* \in \mathbb{R}^{d_1 \times d_2}$ where the rank $r \ll \min(d_1, d_2)$.

\[ \min_{U,V} f(U,V) = \frac{1}{2} \|UV^T - M^*\|_F^2 \]
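For concreteness, here is a minimal sketch of plain gradient descent on this unregularized objective, using the gradients $\nabla_U f = (UV^T - M^*)V$ and $\nabla_V f = (UV^T - M^*)^T U$ (the dimensions, initialization scale, and step size below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, r = 20, 15, 3

# Rank-r target M* built from random factors (illustrative).
M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((d2, r)).T

U = 0.1 * rng.standard_normal((d1, r))
V = 0.1 * rng.standard_normal((d2, r))
eta = 0.01  # illustrative step size

for t in range(5000):
    E = U @ V.T - M_star          # residual U V^T - M*
    grad_U = E @ V                # gradient of f w.r.t. U
    grad_V = E.T @ U              # gradient of f w.r.t. V
    U, V = U - eta * grad_U, V - eta * grad_V

# Final loss; should be close to zero for this small problem.
print(0.5 * np.linalg.norm(U @ V.T - M_star, "fro") ** 2)
```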

The "Obvious" Fix: Explicit Regularization

A common theoretical workaround is to add a penalty term to the loss that forces the norms of the layers to be balanced:

\[ \min_{U,V} \frac{1}{2} \|UV^T - M^*\|_F^2 + \frac{\lambda}{4} \|U^T U - V^T V\|_F^2 \]

But... Is It Necessary?

Let's compare Gradient Descent on the original vs. the regularized problem. (Fig 1a from Du et al. 2018)

Convergence plot

The convergence rates are comparable. The original, unregularized problem converges just fine.

The Empirical Surprise

What happens to the norms of the factors during training? (Fig 1b from Du et al. 2018)

Norm ratio plot

The ratio of the norms remains constant, even for the unregularized objective. The layers stay balanced on their own!
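A hedged sketch of how one might reproduce this observation: run unregularized gradient descent and log both the imbalance $\|U_t^T U_t - V_t^T V_t\|_F$ and the Frobenius-norm ratio (problem sizes, step size, and iteration counts are illustrative, not the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(2)
d1, d2, r = 20, 15, 3
M_star = rng.standard_normal((d1, r)) @ rng.standard_normal((d2, r)).T

U = 0.1 * rng.standard_normal((d1, r))
V = 0.1 * rng.standard_normal((d2, r))
eta = 0.01

for t in range(3001):
    if t % 1000 == 0:
        imbalance = np.linalg.norm(U.T @ U - V.T @ V, "fro")
        ratio = np.linalg.norm(U, "fro") ** 2 / np.linalg.norm(V, "fro") ** 2
        # The imbalance stays small and the ratio stays bounded,
        # even though the norms themselves grow during training.
        print(f"step {t:5d}  imbalance {imbalance:.3e}  norm ratio {ratio:.3f}")
    E = U @ V.T - M_star
    U, V = U - eta * (E @ V), V - eta * (E.T @ U)
```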

The Empirical Surprise

A similar observation holds for deep linear networks (DLNs) (Fig. 2 from Du et al. 2018)

Norm ratio plot

Again, the ratio of the norms remains essentially constant for the unregularized objective. The layers stay balanced on their own!

The Central Question

Why does Gradient Descent automatically balance the layers and converge?

The Answer: Implicit Regularization

Let's Formalize: Gradient Flow

To analyze this, we study the dynamics of Gradient Descent in the continuous-time limit.

The weights $\mathbf{W}$ evolve over time $t$ according to the differential equation:

\[ \frac{d}{dt} \mathbf{W}(t) = - \nabla_{\mathbf{W}} \mathcal{L}(\mathbf{W}(t)) \]

This captures the behavior of GD with an infinitesimally small learning rate.
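As a reminder of the connection, gradient descent is the forward-Euler discretization of this flow. A toy sketch on an arbitrary quadratic loss (chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 5))
A = A @ A.T + np.eye(5)          # SPD matrix for a toy quadratic loss
b = rng.standard_normal(5)

loss = lambda w: 0.5 * w @ A @ w - b @ w
grad = lambda w: A @ w - b

# Forward-Euler discretization of dW/dt = -grad L(W):
# W_{k+1} = W_k - eta * grad L(W_k), i.e. gradient descent with small step eta.
w = np.zeros(5)
eta = 1e-3                       # small step, approximating continuous time
for _ in range(20000):
    w = w - eta * grad(w)

# Distance to the true minimizer A^{-1} b; should be very small.
print(np.linalg.norm(w - np.linalg.solve(A, b)))
```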

Result 1: Exact Autobalancing

For any deep network with homogeneous activations, gradient flow conserves a remarkable quantity at every single neuron.

Theorem 2.1 (Du, Hu, Lee 2018). Under gradient flow, for any neuron $i$ in any hidden layer $h$, the difference between the squared norms of its incoming and outgoing weights is conserved:

\[ \frac{d}{dt} \left( \|W^{(h)}[i,:]\|_2^2 - \|W^{(h+1)}[:,i]\|_2^2 \right) = 0 \]

Interpretation

$W^{(h)}[i,:]$ is the vector of weights feeding into neuron $i$, so $\|W^{(h)}[i,:]\|_2^2$ is the squared norm of its incoming weights.

$W^{(h+1)}[:,i]$ is the vector of weights leaving neuron $i$, so $\|W^{(h+1)}[:,i]\|_2^2$ is the squared norm of its outgoing weights.

The theorem states that the difference between these two quantities is an invariant of motion. It does not change throughout training.
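One can check this invariant numerically by training a small ReLU network with a very small step size, which approximates gradient flow; everything below (widths, data, step size) is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)
d, k, p, n = 8, 12, 4, 50         # hypothetical layer widths and batch size
relu = lambda z: np.maximum(z, 0.0)

X = rng.standard_normal((n, d))
Y = rng.standard_normal((n, p))
W1 = 0.3 * rng.standard_normal((k, d))
W2 = 0.3 * rng.standard_normal((p, k))
eta = 1e-3                         # small step to mimic gradient flow

def per_neuron_balance(W1, W2):
    # incoming squared norms (rows of W1) minus outgoing squared norms (columns of W2)
    return np.sum(W1 ** 2, axis=1) - np.sum(W2 ** 2, axis=0)

balance0 = per_neuron_balance(W1, W2)
for _ in range(5000):
    Z = X @ W1.T
    H = relu(Z)
    G = H @ W2.T - Y               # residual of the network output
    grad_W2 = G.T @ H
    grad_W1 = ((G @ W2) * (Z > 0)).T @ X
    W1, W2 = W1 - eta * grad_W1, W2 - eta * grad_W2

# Maximum drift of the per-neuron invariant: small, and it vanishes as eta -> 0.
print(np.max(np.abs(per_neuron_balance(W1, W2) - balance0)))
```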

Corollary: Layer-wise Balancing

By summing over all neurons in a layer, we get a direct consequence for the entire weight matrices.

Corollary 2.1. For any adjacent layers $h$ and $h+1$, the difference between their squared Frobenius norms is conserved under gradient flow:

\[ \frac{d}{dt} \left( \|W^{(h)}\|_F^2 - \|W^{(h+1)}\|_F^2 \right) = 0 \]

The Implication

The difference in squared norms between adjacent layers is constant over time.

\[ \|W^{(h)}(t)\|_F^2 - \|W^{(h+1)}(t)\|_F^2 = \text{Constant} \]

A Stronger Invariance for Linear Networks

For the special case of linear activations ($\phi(x)=x$), an even stronger, matrix-level quantity is conserved.

Theorem 2.2 (Du, Hu, Lee 2018). If the activation between layers $h$ and $h+1$ is linear, then under gradient flow:

\[ \frac{d}{dt} \left( W^{(h)}(t) (W^{(h)}(t))^T - (W^{(h+1)}(t))^T W^{(h+1)}(t) \right) = 0 \]

This shows that not just the difference of squared norms, but a more structured matrix difference, remains constant throughout training.
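A small numerical illustration of Theorem 2.2: train a two-layer linear network with a small step size and check that $W^{(1)}(W^{(1)})^T - (W^{(2)})^T W^{(2)}$ barely moves (the data, dimensions, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, p, n = 6, 10, 4, 40
X = rng.standard_normal((n, d))
Y = X @ rng.standard_normal((d, p))   # a linear teacher, for illustration

W1 = 0.2 * rng.standard_normal((k, d))
W2 = 0.2 * rng.standard_normal((p, k))
eta = 5e-4

def invariant(W1, W2):
    return W1 @ W1.T - W2.T @ W2       # the matrix Theorem 2.2 says is conserved

D0 = invariant(W1, W2)
for _ in range(10000):
    G = X @ W1.T @ W2.T - Y            # residuals of the linear network W2 W1 x
    grad_W2 = G.T @ (X @ W1.T)
    grad_W1 = W2.T @ G.T @ X
    W1, W2 = W1 - eta * grad_W1, W2 - eta * grad_W2

# Exactly conserved by gradient flow; for small-step gradient descent the
# drift below is small and shrinks as eta -> 0.
print(np.linalg.norm(invariant(W1, W2) - D0, "fro"))
```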

Connecting to Matrix Factorization

The matrix factorization problem for $UV^T \approx M^*$ is equivalent to a 2-layer linear network with weights $W^{(1)} = V^T$ and $W^{(2)} = U$.

Applying the stronger invariance from Theorem 2.2 to this case (with $h=1$) tells us that the quantity

\[ W^{(1)}(W^{(1)})^T - (W^{(2)})^T W^{(2)} = V^T V - U^T U \]

is conserved under gradient flow.

This implies that if we initialize with $U_0^T U_0 = V_0^T V_0$, this balance will be maintained by the dynamics.

From Theory to Practice: Open Questions

Gradient flow with perfectly balanced initialization is an idealization. In practice, we run discrete gradient descent with a finite learning rate $\eta$ and a random, generally unbalanced, initialization:

\[ \mathbf{W}_{k+1} = \mathbf{W}_k - \eta \nabla \mathcal{L}(\mathbf{W}_k) \]

Answers for Matrix Factorization (Rank-r)

While these questions are difficult for general deep networks, for the matrix factorization problem, we have concrete answers.

Lemma 3.1(i) (Du, Hu, Lee 2018). For rank-r matrix factorization and any target accuracy $\epsilon > 0$, assume the entries of $U_0$ and $V_0$ are initialized i.i.d. from $\mathcal{N}(0, \sigma^2)$ for a sufficiently small $\sigma^2$. If we use a decaying learning rate $\eta_t = O(t^{-(1/2+\delta)})$ for $0 < \delta \le 1/2$, then with high probability the imbalance remains bounded for all iterations $t$:

\[ \|U_t^T U_t - V_t^T V_t\|_F < \epsilon \]

This shows that the balancing property of gradient flow approximately holds for gradient descent.

Stronger Results for Rank-1 Case

For the rank-1 problem, even stronger results hold, guaranteeing not just balance but also fast convergence.

Theorem 3.2 (Du, Hu, Lee 2018). For rank-1 matrix factorization, assume $u_0 \sim \mathcal{N}(0, \delta I)$ and $v_0 \sim \mathcal{N}(0, \delta I)$ for a sufficiently small $\delta$. If we use a sufficiently small constant learning rate $\eta$, then with constant probability:

  1. (Approximate Balancing) The ratio of the squared norms remains bounded: $0 < c_0 \le \frac{\|u_t\|_2^2}{\|v_t\|_2^2} \le C_0$.
  2. (Linear Convergence) The iterates reach an $\epsilon$-accurate solution (loss within $\epsilon$ of the global minimum) in $O(\log(1/\epsilon))$ iterations.

This provides a comprehensive answer for the simplest non-trivial homogeneous model.
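A minimal simulation in the spirit of Theorem 3.2 (the target, initialization scale $\delta$, and step size $\eta$ below are illustrative choices, not the constants from the theorem):

```python
import numpy as np

rng = np.random.default_rng(6)
d1, d2 = 25, 15
a = rng.standard_normal(d1); a /= np.linalg.norm(a)
b = rng.standard_normal(d2); b /= np.linalg.norm(b)
M_star = 5.0 * np.outer(a, b)           # rank-1 target with singular value 5

delta = 1e-3                             # small random initialization scale
u = np.sqrt(delta) * rng.standard_normal(d1)
v = np.sqrt(delta) * rng.standard_normal(d2)
eta = 0.05                               # constant step size (illustrative)

for t in range(301):
    E = np.outer(u, v) - M_star
    if t % 50 == 0:
        loss = 0.5 * np.linalg.norm(E, "fro") ** 2
        ratio = np.dot(u, u) / np.dot(v, v)
        # Loss should decrease geometrically once the signal component
        # takes off, while the norm ratio stays bounded (approaching ~1).
        print(f"iter {t:3d}  loss {loss:.3e}  ||u||^2/||v||^2 {ratio:.3f}")
    u, v = u - eta * (E @ v), v - eta * (E.T @ u)
```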

Proof Sketch

Unveiling the Mechanism of Autobalancing

Proof Setup: Two Linear Layers

To see the core idea, we'll prove the autobalancing result for the simplest case: a two-layer linear network, $x_{out} = W_2 W_1 x_{in}$.

The Goal

We want to prove the stronger, layer-wise balancing property that holds for linear networks (Theorem 2.2).

Show that the following holds:

\[ \frac{d}{dt} \left( W_1 W_1^T - W_2^T W_2 \right) = 0 \]

(Note: For matrices, $W W^T$ and $W^T W$ are SPSD matrices that capture the geometry of the rows and columns, respectively.)

Tool: Backpropagation (Chain Rule)

First, we write down the gradients using the chain rule (backpropagation).

Let $G = \nabla_{x_{out}} L$ be the gradient of the loss with respect to the network output for a single data point, and let $x_1 = W_1 x_{in}$ be the hidden activation. The chain rule gives

\[ \nabla_{W_2} L = G x_1^T, \qquad \nabla_{W_1} L = W_2^T G x_{in}^T \]

Calculation (Part 1)

Let's compute the time derivative of $W_1 W_1^T$.

\begin{align*} \frac{d}{dt} (W_1 W_1^T) &= \left(\frac{d W_1}{dt}\right) W_1^T + W_1 \left(\frac{d W_1}{dt}\right)^T \\[10pt] &= (-\nabla_{W_1} L) W_1^T - W_1 (\nabla_{W_1} L)^T \\[10pt] &= -(W_2^T G x_{in}^T) W_1^T - W_1 (W_2^T G x_{in}^T)^T \\[10pt] &= -W_2^T G (x_{in}^T W_1^T) - W_1 (x_{in} G^T W_2) \\[10pt] &= -W_2^T G x_1^T - W_1 x_{in} G^T W_2 \end{align*}

Calculation (Part 2)

Now, let's compute the time derivative of $W_2^T W_2$.

\begin{align*} \frac{d}{dt} (W_2^T W_2) &= \left(\frac{d W_2}{dt}\right)^T W_2 + W_2^T \left(\frac{d W_2}{dt}\right) \\[10pt] &= -(\nabla_{W_2} L)^T W_2 - W_2^T (\nabla_{W_2} L) \\[10pt] &= -(G x_1^T)^T W_2 - W_2^T (G x_1^T) \\[10pt] &= -(x_1 G^T) W_2 - W_2^T G x_1^T \\[10pt] &= -(W_1 x_{in}) G^T W_2 - W_2^T G x_1^T \end{align*}

The Punchline: Perfect Cancellation

Comparing the two results:

Term 1: \[ \frac{d}{dt} (W_1 W_1^T) = -W_2^T G x_1^T - W_1 x_{in} G^T W_2 \]

Term 2: \[ \frac{d}{dt} (W_2^T W_2) = - W_1 x_{in} G^T W_2 - W_2^T G x_1^T \]

They are exactly the same! Therefore:

\[ \frac{d}{dt} \left( W_1 W_1^T - W_2^T W_2 \right) = 0 \]
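The cancellation can also be verified numerically at any point in weight space. The sketch below builds the two gradients from the chain-rule expressions above for a single data point with a squared-error loss (dimensions and data are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(7)
d_in, d_hid, d_out = 5, 7, 3
W1 = rng.standard_normal((d_hid, d_in))
W2 = rng.standard_normal((d_out, d_hid))
x_in = rng.standard_normal(d_in)
y = rng.standard_normal(d_out)

x1 = W1 @ x_in                      # hidden activation (linear network)
G = W2 @ x1 - y                     # gradient of 0.5||x_out - y||^2 w.r.t. x_out

grad_W1 = np.outer(W2.T @ G, x_in)  # chain rule: W2^T G x_in^T
grad_W2 = np.outer(G, x1)           # chain rule: G x_1^T

# Time derivatives under gradient flow, dW/dt = -grad:
dt_W1W1T = -grad_W1 @ W1.T - W1 @ grad_W1.T
dt_W2TW2 = -grad_W2.T @ W2 - W2.T @ grad_W2

# The two matrices coincide, so d/dt (W1 W1^T - W2^T W2) = 0 (up to ~1e-15).
print(np.max(np.abs(dt_W1W1T - dt_W2TW2)))
```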

Summary of Findings

Next Time...

Implicit Acceleration

Prerequisites for the next session

Riemannian Geometry

  • A smooth Manifold $M$ is a space that is locally diffeomorphic to a Euclidean space (e.g., $\mathbb{R}^n$).
  • At each point $p \in M$, the Tangent Space $T_p M$ is the vector space of all velocity vectors of smooth curves on $M$ passing through $p$.

Example:

  • Manifold: $M = M_{d\times n}(\mathbb{R})$, the space of matrices.
  • Tangent Space: For any $W \in M$, the tangent space is just a copy of the space of matrices itself: $T_W M \cong M_{d\times n}(\mathbb{R})$.

The Riemannian Metric & Gradient

A Riemannian metric $g$ equips each tangent space $T_p M$ with an inner product $g_p(\cdot, \cdot)$.

Definition: The Riemannian Gradient. Given a smooth function $E: M \to \mathbb{R}$, its gradient $\grad_p E$ at a point $p$ is the unique vector in $T_p M$ that represents the derivative $dE_p$ via the metric:

\[ g_p(\grad_p E, v) = dE_p(v) \quad \text{for all } v \in T_p M \]

The existence and uniqueness of $\grad_p E$ are guaranteed by the Riesz Representation Theorem on the Hilbert space $(T_p M, g_p)$.

Gradient Example 1: The Space of Matrices

With the standard Frobenius metric $g_W(Z_1, Z_2) = \text{Tr}(Z_1^T Z_2)$, use the identity $\text{Tr}(Z_1^T Z_2) = (\text{vec}(Z_1))^T (\text{vec}(Z_2))$ to identify $M_{d\times n}(\mathbb{R})$ with Euclidean space $\mathbb{R}^{dn}$. The Riemannian gradient is then just the matrix of partial derivatives, i.e. the Euclidean gradient reshaped into a matrix:

\[ \grad_W E = \text{mat}(\nabla_{\text{vec}(W)} E) = \nabla E(W) \]

Gradient Example 2: A Preconditioned Metric

Now, let's change the geometry. Let $A$ be a fixed, symmetric positive-definite (SPD) operator, and define the preconditioned metric

\[ g_W^A(Z_1, Z_2) = \text{Tr}(Z_1^T A^{-1} Z_2) \]

What is the Riemannian gradient, $\grad_W^A E$, under this new metric?

We must again satisfy the duality condition, but now with the new metric:

\[ g_W^A(\grad_W^A E, Z) = dE_W(Z) \]

Deriving the Preconditioned Gradient

Let's solve for the new gradient, $\grad_W^A E$. Let $\nabla E(W)$ be the standard Euclidean gradient matrix.

  1. Start with the defining property: \[ g_W^A(\grad_W^A E, Z) = dE_W(Z) \]
  2. Substitute the definitions for the metric and the directional derivative: \[ \text{Tr}((\grad_W^A E)^T A^{-1} Z) = \text{Tr}((\nabla E(W))^T Z) \]
  3. Using the cyclic property of the trace, $\text{Tr}(XYZ) = \text{Tr}(ZXY)$, on the left side: \[ \text{Tr}(Z (\grad_W^A E)^T A^{-1}) = \text{Tr}(Z (\nabla E(W))^T) \]
  4. Since this must hold for all tangent vectors $Z$, the terms multiplying $Z$ must be equal: \[ (\grad_W^A E)^T A^{-1} = (\nabla E(W))^T \implies A^{-1} \grad_W^A E = \nabla E(W) \]
\[ \grad_W^A E = A \nabla E(W) \]
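A quick numerical check of this formula: for a toy smooth function $E$ and a random SPD matrix $A$ (both illustrative assumptions), verify that $A \nabla E(W)$ satisfies the defining duality $g_W^A(\grad_W^A E, Z) = dE_W(Z)$ against a finite-difference directional derivative:

```python
import numpy as np

rng = np.random.default_rng(8)
d, n, m = 4, 6, 5

# Toy smooth function E(W) = 0.5 ||W X - Y||_F^2, Euclidean gradient (W X - Y) X^T.
X = rng.standard_normal((n, m))
Y = rng.standard_normal((d, m))
E = lambda W: 0.5 * np.linalg.norm(W @ X - Y, "fro") ** 2
grad_E = lambda W: (W @ X - Y) @ X.T          # Euclidean gradient matrix

# A fixed SPD "preconditioner" acting on the left of W.
A = rng.standard_normal((d, d))
A = A @ A.T + d * np.eye(d)

W = rng.standard_normal((d, n))
riem_grad = A @ grad_E(W)                     # claimed Riemannian gradient under g^A

# Check g^A(riem_grad, Z) = dE_W(Z) on random directions Z,
# using a central finite difference for the directional derivative.
eps = 1e-6
for _ in range(3):
    Z = rng.standard_normal((d, n))
    lhs = np.trace(riem_grad.T @ np.linalg.solve(A, Z))   # Tr(grad^T A^{-1} Z)
    rhs = (E(W + eps * Z) - E(W - eps * Z)) / (2 * eps)
    print(abs(lhs - rhs))  # agreement up to finite-difference / roundoff error
```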

The Takeaway

The gradient of a function is not absolute; it depends on the chosen geometry (metric).

By changing the metric from the standard Euclidean one to a "preconditioned" one, the gradient itself is transformed:

\[ \nabla E(W) \;\longmapsto\; \grad_W^A E = A \nabla E(W) \]

Matrix Decompositions

We will also need two fundamental tools from linear algebra: the Singular Value Decomposition (SVD) and the Polar Decomposition.

Singular Value Decomposition (SVD)

Theorem. For any real matrix $M \in \mathbb{R}^{m \times n}$, there exists a decomposition:

\[ M = U \Sigma V^T \]
  • $U \in \mathbb{R}^{m \times m}$ is an orthogonal matrix whose columns are the left singular vectors.
  • $V \in \mathbb{R}^{n \times n}$ is an orthogonal matrix whose columns are the right singular vectors.
  • $\Sigma \in \mathbb{R}^{m \times n}$ is a rectangular diagonal matrix. Its diagonal entries $\sigma_i$ are the non-negative, ordered singular values.

Polar Decompositions

Analogous to a complex number $z = r e^{i\theta}$, any square matrix can be decomposed into a "stretch" and a "rotation."

Any square matrix $W \in M_d$ admits both a left and a right polar decomposition:

\[ W = QP = RQ \]

Where:

  • $Q \in O_d$ is an orthogonal matrix (the "rotation").
  • $P$ and $R$ are symmetric positive semidefinite "stretch" matrices.

SVD & Polar Decompositions: The Connection

The two decompositions are intrinsically linked. Each can be derived from the other.

Let the SVD of $W$ be $W = U \Sigma V^T$, where $U, V \in O_d$.

SVD $\implies$ Polar: We can construct the polar factors directly:

  • Stretch matrices: $P = V \Sigma V^T$ and $R = U \Sigma U^T$.
  • Orthogonal matrix: $Q = U V^T$, the same factor in both forms $W = QP = RQ$.

Polar $\implies$ SVD: Start with the left polar form $W = QP$.

  1. Eigendecompose the SPSD matrix $P$: $P = V \Sigma V^T$.
  2. Substitute into the polar form: $W = Q (V \Sigma V^T)$.
  3. Group the orthogonal matrices: $W = (Q V) \Sigma V^T$.
  4. This is the SVD of $W$, where the left singular vector matrix is $U = Q V$.
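A short numpy check of both constructions, using np.linalg.svd (the matrix is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
d = 5
W = rng.standard_normal((d, d))

# SVD: W = U diag(s) V^T
U, s, Vt = np.linalg.svd(W)
V = Vt.T
Sigma = np.diag(s)

# Polar factors built from the SVD factors.
Q = U @ Vt                      # orthogonal factor
P = V @ Sigma @ Vt              # SPSD "stretch", left polar form  W = Q P
R = U @ Sigma @ U.T             # SPSD "stretch", right polar form W = R Q

print(np.linalg.norm(W - Q @ P))            # ~1e-15
print(np.linalg.norm(W - R @ Q))            # ~1e-15
print(np.linalg.norm(Q.T @ Q - np.eye(d)))  # Q is orthogonal
```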

Thank You!

Any Questions?