Given random variables $(X, Y) \in \mathcal{X}\times \mathcal{Y}$, a loss function $l: \mathcal{Y}\times\mathcal{Y} \rightarrow \reel$ and a class $\mathcal{M}$ of models $m : \mathcal{X} \rightarrow \mathcal{Y}$: \[ \min_{m \in \mathcal{M}} \espr{X, Y}{l(m(X), Y)} \]
In practice, $\mathcal{M} := \{ m_\theta | \theta \in \reel^d \}$.
\[ \min_{\theta \in \reel^d} F(\theta):= \espr{X, Y}{l(m_\theta(X), Y)} \]
In practice, we have no access to the true distribution of $(X, Y)$.
Using a finite training dataset:
\[ \min_{\theta \in \reel^d} F_N(\theta):= \frac{1}{N}\sum_{i=1}^N{l(m_\theta(X_i), Y_i)} \]
If $N$ is large enough, $F \approx F_N$. A hand-wavy application of the Central Limit Theorem gives $F - F_N \approx \frac{1}{\sqrt{N}}$. Better bounds can be obtained under extra conditions.
See The Tradeoffs of Large Scale Learning, Bottou and Bousquet, NIPS 2008.
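A minimal numerical sketch (not from the lecture) of the $\frac{1}{\sqrt{N}}$ behaviour, using a fixed linear model with squared loss on synthetic Gaussian data; all names and values below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
theta = np.array([1.0, -2.0])                     # a fixed model m_theta(x) = theta^T x

def losses(n):
    X = rng.normal(size=(n, 2))
    Y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=n)
    return (X @ theta - Y) ** 2                   # l(m_theta(X_i), Y_i)

F_true = losses(10**7).mean()                     # Monte Carlo proxy for the true risk F
for N in [100, 1_000, 10_000]:
    gap = np.mean([abs(losses(N).mean() - F_true) for _ in range(200)])
    print(N, gap * np.sqrt(N))                    # roughly constant, so F - F_N ~ 1/sqrt(N)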
Let $F : \reel^d \rightarrow \reel$; we assume $F$ is twice differentiable.
$F$ is convex iff $\forall x, y \in \reel^d$, $t \in [0, 1]$, \[ F(t x + (1 - t) y) \leq t F(x) + (1-t) F(y) \]
or equivalently as a first order condition $\forall u \in \reel^d$, \[ F(x + u) \geq F(x) + F'(x)^T u. \]
More details: nice convex optimization class from UCLA, especially for all the basics on convex functions and convex inequalities.
We say that $F$ is $L$-smooth when its gradient $F'$ is $L$-Lipschitz, i.e. $\forall x, y \in \reel^d$, \[ \norm{F'(x) - F'(y)} \leq L \norm{x - y}, \]
then \[ \forall x,y \in \reel^d, F(y) \leq F(x) + F'(x)^T (y-x) + \frac{L}{2} \norm{y-x}^2 \]
We say that $F$ is strictly convex when for all $x \neq y$ and $t \in ]0, 1[$, \[ F(t x + (1 - t) y) < t F(x) + (1-t) F(y) \]
or equivalently $\forall u \in \reel^d$, $u\neq 0$, \[ F(x + u) > F(x) + F'(x)^T u. \]
and $\mu$-strongly convex when \[ \forall x,y \in \reel^d, F(y) \geq F(x) + F'(x)^T (y-x) + \frac{\mu}{2} \norm{y - x}^2 \]
$F$ convex iff \[ F'' \succeq 0, \]
$F$ $\mu$-strongly convex iff \[ F'' \succeq \mu I, \]
$F$ $L$-smooth iff \[ F'' \preceq L I. \]
For any position $x_0$, $u \mapsto F(x_0 + u)$ is always contained between two quadratic functions: a lower bound given by $\mu$-strong convexity and an upper bound given by $L$-smoothness.
$\kappa := \frac{L}{\mu}$ is called the condition number. The closer it is to one, the faster the convergence will be.
$\kappa$ depends on the parametrization of the problem.
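A standard example (not from the slides): for a quadratic $F(\theta) = \frac{1}{2} \theta^T A \theta$ with $A$ symmetric positive definite, $F'' = A$, so that \[ \mu = \lambda_{\min}(A), \quad L = \lambda_{\max}(A), \quad \kappa = \frac{\lambda_{\max}(A)}{\lambda_{\min}(A)}. \]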
We want to minimize $F(\theta)$, starting from $\theta_0$ and taking \[ \theta_{n+1} := \theta_{n} - \gamma F'(\theta_{n}). \] Let $F_*$ be the optimal value and $\theta_*$ any minimizer of $F$.
$L$-smooth case. For $\gamma = \frac{1}{L}$ then, \[ F(\theta_n) - F_* \leq \frac{L}{2 n} \norm{\theta_0 - \theta_*}^2 \]
$L$-smooth, $\mu$-strongly convex case. For $\gamma = \frac{1}{L}$ then, \[ \norm{\theta_n - \theta_*}^2 \leq \left(1 - \frac{1}{\kappa} \right)^n \norm{\theta_0 - \theta_*}^2 \]
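A minimal numerical sketch (not from the slides) checking the strongly convex rate above on a 2-d quadratic with $\mu = 1$, $L = 10$; the matrix, starting point and number of iterations are illustrative.
import numpy as np

A = np.diag([1.0, 10.0])        # F(theta) = 1/2 theta^T A theta, so F'(theta) = A theta
L, kappa = 10.0, 10.0           # mu = 1, L = 10
theta_star = np.zeros(2)        # minimizer of F

theta = np.array([5.0, 3.0])
err0 = np.linalg.norm(theta - theta_star) ** 2
for n in range(1, 51):
    theta = theta - (A @ theta) / L      # theta_{n+1} = theta_n - F'(theta_n) / L
    # the bound from the slide holds along the whole trajectory
    assert np.linalg.norm(theta - theta_star) ** 2 <= (1 - 1 / kappa) ** n * err0 + 1e-12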
Proofs available in Convex Optimization: Algorithms and Complexity by Sébastien Bubeck, section 3.2 (smooth) and section 3.4.2 (smooth and strongly convex).
Let us take $\theta, y \in \reel^d$ and $\theta_+ := \theta - \gamma F'(\theta)$, with $\gamma \leq \frac{1}{L}$. Combining $L$-smoothness (applied between $\theta$ and $\theta_+$) with $\mu$-strong convexity (applied between $\theta$ and $y$) gives \[ F(\theta_+) - F(y) \leq F'(\theta)^T (\theta - y) - \gamma \left(1 - \frac{\gamma L}{2}\right) \norm{F'(\theta)}^2 - \frac{\mu}{2} \norm{\theta - y}^2. \]
Now coming back to gradient descent with $\gamma = 1 / L$, \[ \begin{aligned} \norm{\theta_{n+1} - \theta_*}^2 &= \norm{\theta_n - \theta_*}^2 - \frac{2}{L} F'(\theta_n)^T (\theta_n - \theta_*) + \frac{1}{L^2} \norm{F'(\theta_n)}^2. \end{aligned} \]
Taking the previous result with $y = \theta_*$, $\theta = \theta_n$, and using $F(\theta_{n+1}) - F_* \geq 0$, we have \[ \begin{aligned} 0 \leq F'(\theta_n)^T(\theta_n - \theta_*) - \frac{1}{L} \left(1 - \frac{1}{2} \right) \norm{F'(\theta_n)}^2 - \frac{\mu}{2}\norm{\theta_n - \theta_*}^2, \end{aligned} \]
or equivalently, \[ \begin{aligned} &- \frac{2}{L} F'(\theta_n)^T(\theta_n - \theta_*) + \frac{1}{L^2} \norm{F'(\theta_n)}^2 \leq - \frac{\mu}{L}\norm{\theta_n - \theta_*}^2 \end{aligned} \]
Plugging this into the previous equality, we obtain \[ \begin{aligned} \norm{\theta_{n+1} - \theta_*}^2 &\leq \left( 1 - \frac{\mu}{L}\right)\norm{\theta_n - \theta_*}^2 \end{aligned} \]
and iterating \[ \begin{aligned} \norm{\theta_{n+1} - \theta_*}^2 &\leq \left( 1 - \frac{1}{\kappa}\right)^{n+1} \norm{\theta_0 - \theta_*}^2 \end{aligned} \]
For any algorithm that selects $\theta_{n+1}$ in the set \[ \theta_0 + \mathrm{span}\left\{F'(\theta_0), \ldots, F'(\theta_n)\right\}, \]
and for any $n \leq (d - 1) / 2$, there exists an $L$-smooth convex function $F$ such that \[ F(\theta_n)- F_* \geq \frac{3}{32}\frac{L\norm{\theta_0 - \theta_*}^2}{(n+1)^2}. \]
Introductory Lectures on Convex Programming, Yuri Nesterov, theorem 2.1.6.
Nesterov's accelerated gradient method matches this lower bound: \[ F(\theta_n)- F_* \leq \mathcal{O}\left(\frac{L\norm{\theta_0 - \theta_*}^2}{n^2} \right) \]
Works very well for batch Gradient Descent but very sensitive to noise -> does not work in the stochastic setting.
Smooth minimization of non-smooth functions, Yuri Nesterov, 2004, theorem 2.
Equivalent to improving the condition number $\kappa$. Not very popular for batch optimisation, but used a lot for stochastic optimisation in deep learning.
Origin: Some methods of speeding up the convergence of iteration methods, Polyak, 1964 (restricted access, should work from an ENS IP address).
Importance in deep learning: On the importance of initialization and momentum in deep learning, Sutskever et al, ICML 2013.
Newton's method is very efficient for small-scale problems ($d$ small): \[ \theta_{n+1} = \theta_n - \gamma F''(\theta_n)^{-1} F'(\theta_n) \]
Equivalent to having $\kappa := \frac{L}{\mu} = 1$.
For large $d$, computing $F''$ is prohibitive. A DenseNet has 8 million parameters; storing $F''$ requires 232 TB of memory!
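As an order-of-magnitude check (assuming the Hessian is stored in 32-bit floats): \[ d^2 \times 4\ \text{bytes} = (8 \times 10^6)^2 \times 4\ \text{bytes} \approx 2.6 \times 10^{14}\ \text{bytes}, \] which is indeed a few hundred terabytes.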
Possibility to approximate $F''$ for large-scale problems -> L-BFGS.
Like Nesterov acceleration, very sensitive to noise, not suitable for the stochastic setting.
Let $f$ be a random convex function, with \[ F(\theta) = \esp{f(\theta)}. \]
Remember our initial application: \[ F(\theta) := \espr{X, Y}{l(m_\theta(X), Y)}, \] \[ f(\theta) := l(m_\theta(X), Y) \]
At each iteration $n$, sample $f_n$ and take \[ \theta_{n+1} = \theta_n - \gamma_n f_n'(\theta_n). \]
For a finite dataset of size $N$, $f_n(\theta) = l(m_\theta(X_{I_n}), Y_{I_n})$ with $(I_n)_{n \geq 0}$ i.i.d. uniform integers in $\{1, \ldots, N\}$ (sampling with replacement).
$f_n'(\theta_n)$ is an unbiased estimator of $F'(\theta_n)$, i.e. \[ \esp{f_n'(\theta_n) | \theta_n} = F'(\theta_n). \]
$f_n'(\theta_n)$ points in the right direction on average and is much cheaper to compute: a trade-off between speed of computation and accuracy (see the sketch below).
Extra term compared to batch GD: the expected squared norm of the stochastic gradient decomposes as \[ \esp{\norm{f_n'(\theta_n)}^2 | \theta_n} = \norm{F'(\theta_n)}^2 + \variance{f_n'(\theta_n) | \theta_n}. \]
Far from the optimum, $\norm{F'(\theta)}^2$ dominates -> similar to GD.
Close to the optimum, $\variance{f_n'(\theta_*) | \theta_n}$ dominates -> stochastic noise.
However, one SGD iteration is $N$ times faster than a GD iteration! ($N$ being the size of the dataset)
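A minimal sketch (not from the slides) of the SGD recursion above on a least-squares problem; the data, step size and number of iterations are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 5
X = rng.normal(size=(N, d))
Y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)   # synthetic regression data

theta = np.zeros(d)
gamma = 0.01
for n in range(10_000):
    i = rng.integers(N)                      # I_n uniform in {1, ..., N}, with replacement
    grad = (X[i] @ theta - Y[i]) * X[i]      # f_n'(theta_n): unbiased estimate of F_N'(theta_n)
    theta = theta - gamma * grad

print(0.5 * np.mean((X @ theta - Y) ** 2))   # empirical risk F_N(theta) after training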
Let us assume $\gamma_n := \gamma$, $\mu \preceq F'' \preceq L$ and a.s. $f'' \preceq R^2$. If $\gamma \leq \frac{1}{2 R^2}$ then, \[ \esp{F(\theta_n)} - F_* \leq (1 - \gamma \mu)^n \gamma^{-1}\delta_0^2 + 2 \gamma \frac{\sigma^2}{\mu}, \] where $\delta_0=\norm{\theta_0 - \theta_*}$ and $\sigma^2=\esp{\norm{f'(\theta_*)}^2}$.
Far from the optimum, error decays exponentially fast.
Close to the optimum, error proportional to $\gamma$.
Reducing $\gamma$ -> better solution. Repeat until you are happy.
Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm, Needell et al, 2015. Theorem 2.1.
For the strongly convex case, with $\gamma_n \propto \frac{1}{n}$, \[ \esp{F(\theta_n)} - F_* \leq \mathcal{O}\left(\frac{1}{n}\right) \]
For the non-strongly convex case, with $\gamma_n \propto \frac{1}{\sqrt{n}}$, \[ \esp{F(\theta_n)} - F_* \leq \mathcal{O}\left(\frac{1}{\sqrt{n}}\right) \]
Reference: Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning, Bach and Moulines, NIPS 2011.
GD has better convergence rate but prohibitive iteration cost.
Far from the optimum, each iteration of SGD will give as much improvement as GD.
In practice, $F_N$ is an approximation of $F$. More important to handle larger $N$ than to optimize perfectly for a small $N$.
See The Tradeoffs of Large Scale Learning, Bottou and Bousquet, NIPS 2008.
Over-parametrization -> $f'(\theta_*) = 0$ almost surely, so that $\sigma^2 = 0$ and SGD converges as fast as GD.
See The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning, Ma et al, 2017.
In SGD, we sample one function $f$ at each iteration.
Mini-batch SGD averages $B$ stochastic gradients per step (see the sketch below): \[ \theta_{n+1} = \theta_n - \frac{\gamma}{B} \sum_{i=1}^B f'_{i, n}(\theta_n) \]
Lower variance of the gradient estimate, in particular $\sigma^2_B = \frac{\sigma^2}{B}$.
Easy to parallelize (better efficiency on GPUs in particular).
If $B$ too large, same as GD -> sweet spot to find for each problem.
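A minimal sketch (not from the slides) of one mini-batch step with $B = 32$ on a synthetic least-squares problem; all names and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
N, d, B, gamma = 1000, 5, 32, 0.01
X, Y = rng.normal(size=(N, d)), rng.normal(size=N)
theta = np.zeros(d)

idx = rng.integers(N, size=B)                                        # sample B indices with replacement
grad = ((X[idx] @ theta - Y[idx])[:, None] * X[idx]).mean(axis=0)    # average of B per-sample gradients
theta = theta - gamma * grad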
SGD with momentum: \[ \begin{cases} m_{n+1} &= \beta m_n + f'_n(\theta_n),\\ \theta_{n+1} &= \theta_n - \gamma m_{n+1}. \end{cases} \]
Almost always used with SGD for deep learning. Why it helps so much for deep learning is not entirely understood.
See On the importance of initialization and momentum in deep learning, Sutskever et al, ICML 2013.
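A hedged usage example (not from the slides): heavy-ball momentum as exposed by torch.optim.SGD; the toy model and hyperparameter values are illustrative.
import torch
from torch import optim

model = torch.nn.Linear(10, 1)
# momentum plays the role of beta in the recursion above, lr the role of gamma
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)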
A single global learning rate is hard to tune and may not suit all coordinates. Solution? Automatically pick a per-coordinate learning rate!
For any vector $x$, $x^{(k)}$ denotes its $k$-th component. With parameters $\alpha, \epsilon$, taking $g_0^{(k)} = 0$, \[ \begin{cases} g_{n+1}^{(k)} &= g_{n}^{(k)} + \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\sqrt{ \epsilon + g_{n+1}^{(k)}}}f_n^{'(k)}(\theta_n). \end{cases} \]
See Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Duchi et al, JMLR 2011.
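A minimal NumPy sketch (not from the slides) of one Adagrad step as written above; the function name and the values of alpha and eps are illustrative.
import numpy as np

def adagrad_step(theta, g_acc, grad, alpha=0.1, eps=1e-10):
    # accumulate squared gradients per coordinate, then scale the step accordingly
    g_acc = g_acc + grad ** 2
    theta = theta - alpha / np.sqrt(eps + g_acc) * grad
    return theta, g_acc

theta, g_acc = adagrad_step(np.zeros(3), np.zeros(3), np.array([1.0, 0.1, 0.01]))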
Much less sensitive to the choice of $\alpha$ than SGD.
Scale invariance, i.e. reparameterizing the space by a scalar factor does not affect the trajectory.
Downside: learning rate goes to zero. Never forgets about the past.
How to forget about the past? Using an exponential average.
With parameters $\alpha, \beta, \epsilon$, taking $g_0^{(k)} = 0$, \[ \begin{cases} g_{n+1}^{(k)} &= \beta g_{n}^{(k)} + (1 - \beta) \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\sqrt{ \epsilon + g_{n+1}^{(k)}}}f_n^{'(k)}(\theta_n). \end{cases} \]
Key idea for RMSProp and optimizers in the next slides -> updates of size roughly $\alpha$ for all coordinates.
No paper for this one. It was taught by Hinton on Coursera, you can look at his slides.
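A hedged usage example (not from the slides) of torch.optim.RMSprop; the values shown are the library defaults and the toy model is illustrative.
import torch
from torch import optim

model = torch.nn.Linear(10, 1)
# in PyTorch's naming, alpha is the smoothing constant (beta above) and lr plays the role of alpha above
optimizer = optim.RMSprop(model.parameters(), lr=0.01, alpha=0.99, eps=1e-8)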
Adam = RMSProp + momentum. With parameters $\alpha, \beta_1, \beta_2, \epsilon$, taking $v_0^{(k)} = m_0^{(k)} = 0$,
\[ \begin{cases} m_{n+1}^{(k)} &= \beta_1 m_n^{(k)} + (1 - \beta_1) f_n^{'(k)}(\theta_n),\\ v_{n+1}^{(k)} &= \beta_2 v_n^{(k)} + (1 - \beta_2) \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \hat{m}_{n+1}^{(k)} &= \frac{m_{n+1}^{(k)}}{1 - \beta_1^{n+1}},\quad \hat{v}_{n+1}^{(k)} = \frac{v_{n+1}^{(k)}}{1 - \beta_2^{n+1}},\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\epsilon + \sqrt{\hat{v}_{n+1}^{(k)}}} \hat{m}_{n+1}^{(k)}. \end{cases} \]
$\hat{m}$ and $\hat{v}$ are corrections for the initial scale of the gradients: $m_1 = (1 - \beta_1) f'_0(\theta_0)$ while $\hat{m}_1 = f'_0(\theta_0)$.
See Adam: A method for stochastic optimization, Kingma and Ba, ICLR 2015.
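A hedged usage example (not from the slides) of torch.optim.Adam with the library defaults, mapped onto the notation above; the toy model is illustrative.
import torch
from torch import optim

model = torch.nn.Linear(10, 1)
# lr = alpha, betas = (beta_1, beta_2), eps = epsilon in the notation above
optimizer = optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999), eps=1e-8)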
AMSGrad: Adam with a non-increasing adaptive step size (the $\max$ in red below). With parameters $\alpha, \beta_1, \beta_2, \epsilon$, taking $v_0^{(k)} = m_0^{(k)} = 0$,
\[ \begin{cases} m_{n+1}^{(k)} &= \beta_1 m_n^{(k)} + (1 - \beta_1) f_n^{'(k)}(\theta_n),\\ v_{n+1}^{(k)} &= \beta_2 v_n^{(k)} + (1 - \beta_2) \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \hat{m}_{n+1}^{(k)} &= \frac{m_{n+1}^{(k)}}{1 - \beta_1^{n+1}},\\ \hat{v}_{n+1}^{(k)} &= \color{red}{\max}\left(\hat{v}_n^{(k)}, \frac{v_{n+1}^{(k)}}{1 - \beta_2^{n+1}}\right),\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\epsilon + \sqrt{\hat{v}_{n+1}^{(k)}}} \hat{m}_{n+1}^{(k)}. \end{cases} \]
See On the convergence of Adam and Beyond, Reddi et al, ICLR 2018.
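A hedged usage example (not from the slides): in PyTorch this variant is available through the amsgrad flag of torch.optim.Adam; the toy model and learning rate are illustrative.
import torch
from torch import optim

model = torch.nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)  # enables the max-based variant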
from torch import nn
from torch import optim
from torch.nn import functional as F
from torch.utils.data import DataLoader

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()                 # reset the gradients of all parameters
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()                       # backpropagation
        optimizer.step()                      # one optimizer update

train_loader = DataLoader(my_dataset, shuffle=True, batch_size=64)
device = "cpu"  # or "cuda"
model = MyModel().to(device)
optimizer = optim.Adam(model.parameters())
train(model, device, train_loader, optimizer)
All optimizers are located in torch.optim.*.
All have a similar constructor: torch.optim.*(params, lr=..., momentum=..., weight_decay=0).
Default values differ for each optimizer, check the doc.
params should be an iterable of all tensors to optimize over; it can be obtained from any module with module.parameters().
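A hedged illustration of this constructor pattern (not from the slides); the choice of SGD, the toy model and the hyperparameter values are arbitrary.
import torch

model = torch.nn.Linear(10, 1)   # any nn.Module works here
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=0)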
What is weight_decay? If we denote by $\lambda$ its value, it is equivalent to optimizing
\[
\min_{\theta\in\reel^d} \espr{X, Y}{l(m_\theta(X), Y) + \frac{\lambda}{2} \norm{\theta}_2^2}
\]
Acts as a regularizer. Very useful in vision, much less in NLP (Natural Language Processing). When working on a specific application, check the best practices in papers.
Different tensors can require different optimization parameters.
model_part1 = ...
model_part2 = ...
optimizer = torch.optim.SGD([
    {"params": model_part1.parameters()},
    {"params": model_part2.parameters(),
     "lr": 0.1, "weight_decay": 1e-4}],
    lr=0.01)
# model_part1 will use lr=0.01 and no weight decay
# model_part2 will use lr=0.1 and weight decay of 1e-4
Once the loss is no longer decreasing, reducing the learning rate can help.
Typical approach in vision: divide the learning rate by $D$ every $T$ epochs.
model = ...
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(
    optimizer, step_size=20, gamma=0.1)

for epoch in range(epochs):
    ...  # train for one epoch
    scheduler.step()  # decay the learning rate once per epoch
(You can check GPU usage with nvidia-smi.)
Try different values of weight decay, e.g. weight_decays = [1e-5, 1e-4, 1e-3].
wget http://ai.honu.io/presentations/notebooks/diy_2018_optim.ipynb
wget http://ai.honu.io/presentations/notebooks/environment.yml
conda env update
conda activate diy
jupyter notebook