For the gradient flow dynamics on a loss surface $\mathcal{L}(\mathbf{W})$:
\[ \dot{\mathbf{W}}(t) = -\eta\, \nabla\mathcal{L}(\mathbf{W}(t)) \]
We want to understand two questions: which solutions the dynamics select, and how fast they converge.
Today, we address the second question: can overparameterization improve the **rate of convergence**?
The conventional wisdom is that increasing depth improves expressiveness but complicates optimization.
Today's message, based on Arora, Cohen, & Hazan (2018), is that sometimes:
Increasing depth can accelerate optimization.
We will see that overparameterization via depth can act as an implicit preconditioner, combining the effects of momentum and adaptive learning rates.
[Figure: comparison of gradient descent on a linear regression task with $\ell_p$ loss. (Fig. 3 from Arora et al. 2018)]
Let $W_e = W_N W_{N-1} \cdots W_1$ be the end-to-end matrix. The paper shows that its gradient flow dynamics can be written purely in terms of $W_e$ itself.
Theorem 1 (Arora et al. 2018). Under the balanced initialization assumption, $W_{j+1}(t_0)^T W_{j+1}(t_0) = W_j(t_0)W_j(t_0)^T$, the gradient flow for $W_e$ is:
\[ \dot{W}_e(t) = - \eta \sum_{j=1}^{N} \left[ W_e(t) W_e(t)^T \right]^{\frac{N-j}{N}} \cdot \nabla\mathcal{L}(W_e(t)) \cdot \left[ W_e(t)^T W_e(t) \right]^{\frac{j-1}{N}} \]
The standard gradient $\nabla\mathcal{L}(W_e)$ is thus pre-multiplied and post-multiplied by fractional matrix powers of $W_e W_e^T$ and $W_e^T W_e$, respectively.
The complex update rule from Theorem 1 seems arbitrary. But it has a beautiful geometric meaning, unifying it with the concepts from our last lecture.
Let's define the linear preconditioning operator from Theorem 1 as $A_{N,W_e}$: \[ A_{N,W_e}(Z) = \sum_{k=1}^{N} (W_e W_e^T)^{\frac{N-k}{N}} Z (W_e^T W_e)^{\frac{k-1}{N}} \] So the dynamics are $\dot{W}_e = -\eta\, A_{N,W_e}(E'(W_e))$, where $E'(W_e) = \nabla\mathcal{L}(W_e)$ is the standard Euclidean gradient.
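As a sanity check, the identity behind Theorem 1 can be verified numerically. Below is a minimal sketch (our own construction; the depth, width, seeds, and the helper `chain` are illustrative choices, not from the paper's code) that builds a balanced deep linear network and compares the layerwise dynamics of $W_e$ with the preconditioned form:

```python
# Numerical check of Theorem 1 for a balanced deep linear network.
import numpy as np
from scipy.linalg import fractional_matrix_power
from scipy.stats import ortho_group

N, d = 3, 4                              # depth and width, chosen arbitrarily
rng = np.random.default_rng(0)

def chain(mats):
    """Multiply a list of matrices left to right; empty product = identity."""
    out = np.eye(d)
    for M in mats:
        out = out @ M
    return out

# Balanced layers: W_j = A_j Sigma^{1/N} A_{j-1}^T with orthogonal A_j,
# so adjacent layers satisfy W_{j+1}^T W_{j+1} = W_j W_j^T.
A = [ortho_group.rvs(d, random_state=j) for j in range(N + 1)]
S = np.diag(rng.uniform(0.5, 2.0, d) ** (1.0 / N))
W = [A[j + 1] @ S @ A[j].T for j in range(N)]   # W[0] = W_1, ..., W[N-1] = W_N

We = chain(W[::-1])                      # end-to-end matrix W_N ... W_1
G = rng.standard_normal((d, d))          # stand-in for grad L(W_e)

# Left side: per-layer gradient-flow contributions, mapped to W_e.
lhs = np.zeros((d, d))
for j in range(N):
    above = chain(W[j + 1:][::-1])       # W_N ... W_{j+2}
    below = chain(W[:j][::-1])           # W_j ... W_1
    lhs += above @ above.T @ G @ below.T @ below

# Right side: Theorem 1's preconditioned form with fractional powers.
rhs = np.zeros((d, d))
for j in range(1, N + 1):
    left = fractional_matrix_power(We @ We.T, (N - j) / N).real
    right = fractional_matrix_power(We.T @ We, (j - 1) / N).real
    rhs += left @ G @ right

print(np.allclose(lhs, rhs))             # expect True
```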
Definition: A New Metric.
Following Menon (Sec 4.3), we define a position-dependent Riemannian metric $g^N$ on the tangent space at $W_e$ via the inner product:
\[ g^N(W_e)(Z_1, Z_2) := \text{Tr}\left(Z_1^T (A_{N,W_e})^{-1} Z_2\right) \]
(Assuming $W_e$ has full rank, $A_{N,W_e}$ is invertible.)
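Why is $A_{N,W_e}$ invertible, and why does $g^N$ define a legitimate inner product? A short argument, added here for completeness: after vectorization, each term of the operator acts as a Kronecker product,
\[ \operatorname{vec}\left(A_{N,W_e}(Z)\right) = \sum_{k=1}^{N} \left[ (W_e^T W_e)^{\frac{k-1}{N}} \otimes (W_e W_e^T)^{\frac{N-k}{N}} \right] \operatorname{vec}(Z). \]
Every term is symmetric positive semidefinite, and when $W_e$ has full rank at least one term is positive definite (the $k=1$ term when $W_e W_e^T$ is invertible, the $k=N$ term when $W_e^T W_e$ is). Hence $A_{N,W_e}$ is symmetric positive definite, its inverse exists, and $g^N$ is a genuine Riemannian metric.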
Under this specific metric, the complicated dynamics of Theorem 1 become a simple and natural Riemannian gradient flow:
\[ \dot{W}_e = -\eta\, \operatorname{grad}^{g^N}\! \mathcal{L}(W_e), \]
where $\operatorname{grad}^{g^N}\mathcal{L} = A_{N,W_e}(\nabla\mathcal{L})$ is the gradient of $\mathcal{L}$ with respect to $g^N$: steepest descent measured in this geometry rather than in the Euclidean one.
For the special case of a single output ($k=1$), the preconditioning scheme simplifies to a more intuitive form.
Claim 2 (Arora et al. 2018). For a single output network, the end-to-end dynamics are equivalent to:
\[ \dot{W}_e = -\eta \underbrace{\|W_e\|^{2 - \frac{2}{N}}}_{\text{Adaptive LR}} \left( \nabla\mathcal{L}(W_e) + \underbrace{(N-1)\Pr_{W_e}(\nabla\mathcal{L}(W_e))}_{\text{Momentum-like term}} \right) \]where $\Pr_{W_e}(V)$ is the orthogonal projection of vector $V$ onto the direction of vector $W_e$.
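To connect this with Theorem 1 (a verification step we spell out here): for a single output, $W_e$ is a row vector $\mathbf{w}^T$, so $W_e W_e^T = \|\mathbf{w}\|^2$ is a scalar, while $W_e^T W_e = \mathbf{w}\mathbf{w}^T$ is rank one with spectral fractional powers $(\mathbf{w}\mathbf{w}^T)^p = \|\mathbf{w}\|^{2(p-1)}\, \mathbf{w}\mathbf{w}^T$ for $p > 0$ and $(\mathbf{w}\mathbf{w}^T)^0 = I$. Substituting into Theorem 1, the $j=1$ term contributes the plain gradient and each of the remaining $N-1$ terms contributes one copy of the projection:
\[ \sum_{j=1}^{N} \|\mathbf{w}\|^{\frac{2(N-j)}{N}}\, \nabla\mathcal{L}\, (\mathbf{w}\mathbf{w}^T)^{\frac{j-1}{N}} = \|\mathbf{w}\|^{2-\frac{2}{N}} \left( \nabla\mathcal{L} + (N-1)\, \frac{\nabla\mathcal{L}\, \mathbf{w}\mathbf{w}^T}{\|\mathbf{w}\|^2} \right), \]
and $\nabla\mathcal{L}\, \mathbf{w}\mathbf{w}^T / \|\mathbf{w}\|^2$ is exactly $\Pr_{\mathbf{w}}(\nabla\mathcal{L})$.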
Let's derive the end-to-end dynamics for $W_e = W_2 W_1$. The time derivative is $\dot{W}_e = \dot{W}_2 W_1 + W_2 \dot{W}_1$. By the chain rule, gradient flow on the two layers gives $\dot{W}_1 = -\eta\, W_2^T \nabla\mathcal{L}(W_e)$ and $\dot{W}_2 = -\eta\, \nabla\mathcal{L}(W_e) W_1^T$. Substituting,
\[ \dot{W}_e = -\eta \left( W_2 W_2^T\, \nabla\mathcal{L}(W_e) + \nabla\mathcal{L}(W_e)\, W_1^T W_1 \right). \]
Under the balance condition $W_2^T W_2 = W_1 W_1^T$, we have $W_e W_e^T = (W_2 W_2^T)^2$ and $W_e^T W_e = (W_1^T W_1)^2$, so $W_2 W_2^T = (W_e W_e^T)^{1/2}$ and $W_1^T W_1 = (W_e^T W_e)^{1/2}$.
This matches the general formula from Theorem 1 for $N=2$.
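The two identities used in the last step are easy to check numerically. A minimal sketch (our own snippet; the balanced split below is one convenient construction):

```python
# Verify W_2 W_2^T = (W_e W_e^T)^{1/2} and W_1^T W_1 = (W_e^T W_e)^{1/2}
# for a balanced split W_1 = Sigma^{1/2} V^T, W_2 = U Sigma^{1/2}.
import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
We = rng.standard_normal((3, 3))
U, s, Vt = np.linalg.svd(We)
W1 = np.diag(np.sqrt(s)) @ Vt          # balanced: W2^T W2 = W1 W1^T = Sigma
W2 = U @ np.diag(np.sqrt(s))
G = rng.standard_normal((3, 3))        # stand-in gradient

lhs = W2 @ W2.T @ G + G @ W1.T @ W1    # layerwise dynamics for N = 2
rhs = sqrtm(We @ We.T).real @ G + G @ sqrtm(We.T @ We).real  # Theorem 1
print(np.allclose(lhs, rhs))           # expect True
```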
How do we derive the fractional powers for $N>2$? Let's analyze $N=3$. Let the SVD of each layer be $W_j = U_j \Sigma_j V_j^T$. The balance condition $W_{j+1}^T W_{j+1} = W_j W_j^T$ reads $V_{j+1} \Sigma_{j+1}^2 V_{j+1}^T = U_j \Sigma_j^2 U_j^T$, so all layers share the same singular values, $\Sigma_j = \Sigma$, and we may choose the SVD factors so that $V_{j+1} = U_j$.
Let's express the end-to-end matrix $W_e = W_3 W_2 W_1$ using the SVDs and the relationships we found:
\[ W_e = U_3 \Sigma V_3^T\, U_2 \Sigma V_2^T\, U_1 \Sigma V_1^T = U_3 \Sigma^3 V_1^T, \]
since $V_3 = U_2$ and $V_2 = U_1$ make the inner factors cancel.
This is precisely an SVD of $W_e$, with singular value matrix $\Sigma^3$ and orthogonal matrices $U_3$ and $V_1$.
From $W_e = U_3 \Sigma^3 V_1^T$, we can identify the SVD components of $W_e = U_e \Sigma_e V_e^T$ as:
\[ U_e = U_3 \quad , \quad \Sigma_e = \Sigma^3 \quad , \quad V_e = V_1 \]
Now we can express the individual layer terms using the end-to-end SVD components.
In particular, the product of the layers above layer $j$ satisfies
\[ (W_N \cdots W_{j+1})(W_N \cdots W_{j+1})^T = U_e\, \Sigma_e^{\frac{2(N-j)}{N}}\, U_e^T = \left( W_e W_e^T \right)^{\frac{N-j}{N}}, \]
and similarly the layers below give $(W_{j-1} \cdots W_1)^T (W_{j-1} \cdots W_1) = \left( W_e^T W_e \right)^{\frac{j-1}{N}}$. This is where the fractional powers in Theorem 1 come from: the gradient is pre-multiplied by $(W_e W_e^T)^{\frac{N-j}{N}}$ and post-multiplied by $(W_e^T W_e)^{\frac{j-1}{N}}$.
We analyze a simple, ill-conditioned linear regression problem ($y_1 \gg y_2$) with $\ell_4$ loss and $N=2$ overparameterization.
For a general convex loss $\mathcal{L}(\mathbf{w})$ whose gradient is Lipschitz continuous with constant $L$, the following classical bounds provide the context for our analysis.
To guarantee that the loss decreases at every step ($\mathcal{L}(\mathbf{w}_{t+1}) < \mathcal{L}(\mathbf{w}_t)$), the learning rate $\eta$ must lie in the stable range
\[ \eta < \frac{2}{L}, \]
where $L = \max_{\mathbf{w}} \|\nabla^2 \mathcal{L}(\mathbf{w})\|$ measures the maximum curvature of the loss landscape.
Under this stability condition (e.g., for $\eta \le 1/L$), the suboptimality decreases at a sublinear rate of $O(1/T)$:
\[ \mathcal{L}(\mathbf{w}_T) - \mathcal{L}(\mathbf{w}^*) \le \frac{\|\mathbf{w}_0 - \mathbf{w}^*\|^2}{2\eta T} \]
This bound shows that to cut the error in half, you may need to double the number of iterations $T$.
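The two regimes around $\eta = 2/L$ are easy to see on a toy quadratic. A minimal illustration (our own example, with curvature $L = c$):

```python
# GD on L(w) = (c/2) w^2, whose curvature (Lipschitz constant) is L = c.
def gd(eta, c=4.0, w=1.0, T=50):
    for _ in range(T):
        w -= eta * c * w      # gradient of (c/2) w^2 is c w
    return abs(w)

print(gd(eta=0.49))  # eta < 2/L = 0.5: |w_T| shrinks (about 0.13)
print(gd(eta=0.51))  # eta > 2/L: |w_T| blows up (about 7.1)
```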
For the $\ell_4$ loss $\mathcal{L}(\mathbf{w}) = \frac{1}{4}\sum_i (w_i - y_i)^4$, the standard gradient descent update for each coordinate is decoupled:
\[ w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta (w_i^{(t)} - y_i)^3 \]
For $N=2$, the discrete version of the update rule from Claim 2 for a single output network is:
\[ \mathbf{w}^{(t+1)} \leftarrow \mathbf{w}^{(t)} - \eta\, \|\mathbf{w}^{(t)}\| \left( \nabla\mathcal{L}(\mathbf{w}^{(t)}) + \Pr_{\mathbf{w}^{(t)}}(\nabla\mathcal{L}(\mathbf{w}^{(t)})) \right) \]
Let's analyze the first iteration ($t=0$), starting from a small initialization $\mathbf{w}^{(0)} = (\epsilon_1, \epsilon_2)$. The gradients are $\nabla_1 \approx -y_1^3$ and $\nabla_2 \approx -y_2^3$.
The update for each component is:
\begin{align*} w_1^{(1)} &\approx \epsilon_1 - \eta \sqrt{\epsilon_1^2+\epsilon_2^2} \left( (-y_1^3) + \frac{\epsilon_1(-y_1^3)+\epsilon_2(-y_2^3)}{\epsilon_1^2+\epsilon_2^2}\epsilon_1 \right) \\ w_2^{(1)} &\approx \epsilon_2 - \eta \sqrt{\epsilon_1^2+\epsilon_2^2} \left( (-y_2^3) + \frac{\epsilon_1(-y_1^3)+\epsilon_2(-y_2^3)}{\epsilon_1^2+\epsilon_2^2}\epsilon_2 \right) \end{align*}
Using the condition $\epsilon_2/\epsilon_1 \approx y_2/y_1 \ll 1$, we have $\sqrt{\epsilon_1^2+\epsilon_2^2} \approx \epsilon_1$ and $\epsilon_1^2+\epsilon_2^2 \approx \epsilon_1^2$. Dropping higher-order terms simplifies the updates to:
\[ w_1^{(1)} \approx \epsilon_1 + \eta(2\epsilon_1 y_1^3) \quad , \quad w_2^{(1)} \approx \epsilon_2 + \eta(\epsilon_1 y_2^3 + \epsilon_2 y_1^3) \]
Note the cross term $\eta\, \epsilon_1 y_2^3$ in the update for $w_2$: the small coordinate's step is driven by the large coordinate's scale $\epsilon_1$, rather than by its own tiny $\epsilon_2$.
Overparameterization allows the large coordinate to "lend" its scale to accelerate the small coordinate.
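To see this effect concretely, here is a small simulation in the spirit of this example (our own parameters and iteration counts; the depth-2 run uses the emulated end-to-end update from Claim 2 rather than an explicit two-layer network):

```python
# Plain GD vs. the emulated depth-2 update on an ill-conditioned l4 problem.
import numpy as np

y = np.array([5.0, 0.1])                 # y1 >> y2: ill-conditioned targets
w0 = 1e-3 * y / np.linalg.norm(y)        # small init with eps2/eps1 = y2/y1

def grad(w):                             # gradient of 1/4 sum (w_i - y_i)^4
    return (w - y) ** 3

def loss(w):
    return 0.25 * np.sum((w - y) ** 4)

def run(depth_two, eta=0.005, T=5000):
    w = w0.copy()
    for _ in range(T):
        g = grad(w)
        if depth_two:                    # N = 2 update: eta ||w|| (g + Pr_w g)
            w = w - eta * np.linalg.norm(w) * (g + (w @ g) / (w @ w) * w)
        else:                            # standard gradient descent
            w = w - eta * g
    return loss(w)

print(run(depth_two=False))  # plain GD: the small coordinate lags behind
print(run(depth_two=True))   # depth-2 update typically reaches a lower loss
```

On this toy problem, the emulated depth-2 update typically ends at a substantially lower loss than plain gradient descent with the same $\eta$ and $T$, consistent with the acceleration visible in Fig. 3 of Arora et al. (2018).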
Characterizing the Minimizer