\[ \newcommand{\reel}{\mathbb{R}} \newcommand{\norm}[1]{\left\|#1\right\|} \newcommand{\esp}[1]{\mathbb{E}\left[#1\right]} \newcommand{\espr}[2]{\mathbb{E}_{#1}\left[#2\right]} \newcommand{\prox}[1]{\mathrm{prox}_{ #1}} \newcommand{\variance}[1]{\mathbb{V}\left[#1\right]} \]

Deep Learning: Do-It-Yourself!

Optimisation methods

Slides available at ai.honu.io.

Plan for today:

Introduction

Motivation

Given random variables $(X, Y) \in \mathcal{X}\times \mathcal{Y}$, a loss function $l: \mathcal{Y}\times\mathcal{Y} \rightarrow \reel$ and a class $\mathcal{M}$ of models $m : \mathcal{X} \rightarrow \mathcal{Y}$: \[ \min_{m \in \mathcal{M}} \espr{X, Y}{l(m(X), Y)} \]

In practice, $\mathcal{M} := \{ m_\theta | \theta \in \reel^d \}$.

\[ \min_{\theta \in \reel^d} F(\theta):= \espr{X, Y}{l(m_\theta(X), Y)} \]

Empirical risk minimization

In practice, we have no access to the true distribution of $(X, Y)$.

Using a finite training dataset:

\[ \min_{\theta \in \reel^d} F_N(\theta):= \frac{1}{N}\sum_{i=1}^N{l(m_\theta(X_i), Y_i)} \]
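As a concrete illustration, here is a minimal NumPy sketch of $F_N$ for a linear model with squared loss (the model, loss and toy data below are illustrative placeholders, not the ones from this course):

import numpy as np

def empirical_risk(theta, X, Y):
    # F_N(theta) = 1/N * sum_i l(m_theta(X_i), Y_i), here with the linear
    # model m_theta(x) = x @ theta and the squared loss l(p, y) = (p - y)^2.
    predictions = X @ theta
    return np.mean((predictions - Y) ** 2)

# Toy dataset: N = 100 samples in dimension d = 5.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Y = X @ np.ones(5) + 0.1 * rng.normal(size=100)
print(empirical_risk(np.zeros(5), X, Y))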

If $N$ large enough, $F \approx F_N$. Hand-wavy application of Central Limit Theorem gives $F - F_N \approx \frac{1}{\sqrt{N}}$. Better bounds can be obtained under extra conditions.

See The Tradeoffs of Large Scale Learning, Bottou and Bousquet, NIPS 2008.

Plan for today:

Refresher on convex functions

Convex functions

Let $F : \reel^d \rightarrow \reel$; we assume $F$ twice differentiable.

$F$ is convex iff $\forall x, y \in \reel^d$, $t \in [0, 1]$, \[ F(t x + (1 - t) y) \leq t F(x) + (1-t) F(y) \]

or equivalently, as a first-order condition, $\forall x, u \in \reel^d$, \[ F(x + u) \geq F(x) + F'(x)^T u. \]

More details: nice convex optimization class from UCLA, especially for all the basics on convex functions and convex inequalities.

$F(x) = x^2$

Smoothness

We say that $F$ is $L$-smooth when its gradient $F'$ is $L$-Lipschitz, \[ \norm{F'(x) - F'(y)} \leq L \norm{x - y}, \]

then \[ \forall x,y \in \reel^d, F(y) \leq F(x) + F'(x)^T (y-x) + \frac{L}{2} \norm{y-x}^2 \]

Strong convexity

We say that $F$ is strictly convex when for all $x \neq y$ and $t \in ]0, 1[$, \[ F(t x + (1 - t) y) < t F(x) + (1-t) F(y) \]

or equivalently $\forall x, u \in \reel^d$, $u\neq 0$, \[ F(x + u) > F(x) + F'(x)^T u. \]

and $\mu$-strongly convex when \[ \forall x,y \in \reel^d, F(y) \geq F(x) + F'(x)^T (y-x) + \frac{\mu}{2} \norm{y - x}^2 \]

Convexity and Hessian

$F$ is convex iff \[ F'' \succeq 0, \]

$F$ is $\mu$-strongly convex iff \[ F'' \succeq \mu I, \]

$F$ is $L$-smooth iff \[ F'' \preceq L I. \]

$F(x) = \log(1 + \exp(-x)) + \frac{0.1}{2}\norm{x}^2$

For any point $x_0$, the function $u \mapsto F(x_0 + u)$ is always contained between two quadratic functions.

$\kappa := \frac{L}{\mu}$ is called the condition number. The closer it is to one, the faster the convergence will be.
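For the example above, these constants can be read off the second derivative: \[ F''(x) = \sigma(x)\left(1 - \sigma(x)\right) + 0.1, \qquad \sigma(x) := \frac{1}{1 + e^{-x}}, \] so that $0.1 \preceq F'' \preceq 0.35$, i.e. $\mu = 0.1$, $L = 0.35$ and $\kappa = L / \mu = 3.5$.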

$\kappa$ depends on the parametrization of the problem.

Plan for today:

Batch optimization

Batch gradient descent

We want to minimize $F(\theta)$, starting from $\theta_0$ and taking \[ \theta_{n+1} := \theta_{n} - \gamma F'(\theta_{n}). \] Let $F_*$ be the optimal value and $\theta_*$ any minimizer of $F$.
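As a reference point, a minimal NumPy sketch of this iteration on a toy quadratic (the function, step size and iteration count are illustrative choices):

import numpy as np

def gradient_descent(grad_F, theta0, gamma, steps):
    # theta_{n+1} = theta_n - gamma * F'(theta_n)
    theta = theta0.copy()
    for _ in range(steps):
        theta = theta - gamma * grad_F(theta)
    return theta

# Toy quadratic F(theta) = 1/2 theta^T A theta with A = diag(1, 10), so L = 10.
A = np.diag([1.0, 10.0])
grad_F = lambda theta: A @ theta
theta = gradient_descent(grad_F, np.array([1.0, 1.0]), gamma=1.0 / 10.0, steps=100)
# theta is now close to the minimizer (0, 0).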

Convergence

$L$-smooth case. With $\gamma = \frac{1}{L}$, \[ F(\theta_n) - F_* \leq \frac{L}{2 n} \norm{\theta_0 - \theta_*}^2 \]

$L$-smooth, $\mu$-strongly convex case. With $\gamma = \frac{1}{L}$, \[ \norm{\theta_n - \theta_*}^2 \leq \left(1 - \frac{1}{\kappa} \right)^n \norm{\theta_0 - \theta_*}^2 \]

Proofs available in Convex Optimization: Algorithms and Complexity by Sébastien Bubeck, section 3.2 (smooth) and section 3.4.2 (smooth and strongly convex).

Sketch of proof

Let us take $\theta,y \in \reel^d$, $\theta_+ := \theta - \gamma F'(\theta)$, with $\gamma \leq \frac{1}{L}$,

Lemma:
\[ \begin{aligned} F(\theta_+) - F(y) \leq F'(\theta)^T (\theta - y) &- \gamma \left(1 - \frac{\gamma L}{2} \right) \norm{F'(\theta)}^2 \\&- \frac{\mu}{2} \norm{\theta - y}^2 \end{aligned} \]
\[ \begin{aligned} F(\theta_+)- F(y) &= F(\theta_+) - F(\theta) + F(\theta) - F(y)\\ &\leq F'(\theta)^T (\theta_+ - \theta) + \frac{L}{2}\norm{\theta_+ - \theta}^2 +F'(\theta)^T(\theta - y) - \frac{\mu}{2}\norm{\theta - y}^2\\ &= F'(\theta)^T (\theta_+ - y) + \frac{L \gamma^2}{2}\norm{F'(\theta)}^2 - \frac{\mu}{2}\norm{\theta - y}^2\\ &= F'(\theta)^T(\theta - y) - \gamma \left(1 - \frac{\gamma L}{2} \right) \norm{F'(\theta)}^2 - \frac{\mu}{2}\norm{\theta - y}^2 \end{aligned} \]

Now coming back to gradient descent with $\gamma = 1 / L$, \[ \begin{aligned} \norm{\theta_{n+1} - \theta_*}^2 &= \norm{\theta_n - \theta_*}^2 - \frac{2}{L} F'(\theta_n)^T (\theta_n - \theta_*) + \frac{1}{L^2} \norm{F'(\theta_n)}^2. \end{aligned} \]

Taking the previous result with $y = \theta_*$, $\theta = \theta_n$, we have \[ \begin{aligned} 0 \leq F'(\theta_n)^T(\theta_n - \theta_*) - \frac{1}{L} \left(1 - \frac{1}{2} \right) \norm{F'(\theta_n)}^2 - \frac{\mu}{2}\norm{\theta_n - \theta_*}^2, \end{aligned} \]

or equivalently, \[ \begin{aligned} &- \frac{2}{L} F'(\theta_n)^T(\theta_n - \theta_*) + \frac{1}{L^2} \norm{F'(\theta_n)}^2 \leq - \frac{\mu}{L}\norm{\theta_n - \theta_*}^2 \end{aligned} \]

Injecting, we obtain \[ \begin{aligned} \norm{\theta_{n+1} - \theta_*}^2 &\leq \left( 1 - \frac{\mu}{L}\right)\norm{\theta_n - \theta_*}^2 \end{aligned} \]

and iterating \[ \begin{aligned} \norm{\theta_{n+1} - \theta_*}^2 &\leq \left( 1 - \frac{1}{\kappa}\right)^{n+1} \norm{\theta_0 - \theta_*}^2 \end{aligned} \]

Optimal rate for first-order methods

For any algorithm that selects $\theta_{n+1}$ in the set \[ \theta_0 + \mathrm{span}\left\{F'(\theta_0), \ldots, F'(\theta_n)\right\}, \]

and for any $n \leq (d - 1) / 2$, there exists an $L$-smooth convex function $F$ such that \[ F(\theta_n)- F_* \geq \frac{3}{32}\frac{L\norm{\theta_0 - \theta_*}^2}{(n+1)^2}. \]

Introductory Lectures on Convex Programming, Yuri Nesterov, theorem 2.1.6.

Nesterov accelerated gradient

$\theta_{-1} = \theta_0 \in \reel^d$, \[ \begin{cases} y_n &= \theta_n + \frac{n}{n+3} (\theta_n - \theta_{n-1})\\ \theta_{n+1} &= y_n - \gamma F'(y_n). \end{cases} \]

\[ F(\theta_n)- F_* \leq \mathcal{O}\left(\frac{L\norm{\theta_0 - \theta_*}^2}{n^2} \right) \]

Works very well for batch gradient descent but is very sensitive to noise -> does not work in the stochastic setting.

Smooth minimization of non-smooth functions, Yuri Nesterov, 2004, theorem 2.
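A direct transcription of the two updates above as a NumPy sketch (grad_F, the step size and the starting point are left as placeholders):

import numpy as np

def nesterov(grad_F, theta0, gamma, steps):
    # y_n = theta_n + n/(n+3) * (theta_n - theta_{n-1})
    # theta_{n+1} = y_n - gamma * F'(y_n)
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for n in range(steps):
        y = theta + n / (n + 3) * (theta - theta_prev)
        theta_prev, theta = theta, y - gamma * grad_F(y)
    return theta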

Momentum GD (a.k.a Heavy Ball GD)

$\theta_0 \in \reel^d$, $m_0 = 0 \in \reel^d$, $\beta \in [0, 1]$, \[ \begin{cases} m_{n+1} &= \beta m_n + F'(\theta_n),\\ \theta_{n+1} &= \theta_n - \gamma m_{n+1}. \end{cases} \]

Equivalent to improving the condition number $\kappa$. Not very popular for batch optimisation, but used a lot for stochastic optimisation in deep learning.

Origin: Some methods of speeding up the convergence of iteration methods, Polyak, 1964 (restricted access, should work from an ENS IP address).
Importance in deep learning: On the importance of initialization and momentum in deep learning, Sutskever et al, ICML 2013.
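The corresponding NumPy sketch, mirroring the gradient descent one above (again with grad_F and the hyperparameters as placeholders):

import numpy as np

def momentum_gd(grad_F, theta0, gamma, beta, steps):
    # m_{n+1} = beta * m_n + F'(theta_n); theta_{n+1} = theta_n - gamma * m_{n+1}
    theta = theta0.copy()
    m = np.zeros_like(theta)
    for _ in range(steps):
        m = beta * m + grad_F(theta)
        theta = theta - gamma * m
    return theta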

Second order methods

Very efficient for small-scale problems ($d$ small): \[ \theta_{n+1} = \theta_n - \gamma F''(\theta_n)^{-1} F'(\theta_n) \]

Equivalent to having $\kappa := \frac{L}{\mu} = 1$.

For large $d$, computing $F''$ is prohibitive. A DenseNet has 8 million parameters: storing $F''$ requires 232 TB of memory!
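The figure comes from storing a dense $d \times d$ Hessian in single precision (4 bytes per entry): \[ (8 \times 10^6)^2 \times 4 \ \text{bytes} \approx 2.6 \times 10^{14} \ \text{bytes}, \] i.e. roughly 232 TB (in binary units).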

Possibility to approximate $F''$ for large-scale problems -> L-BFGS.

Like Nesterov acceleration, very sensitive to noise, not suitable for the stochastic setting.

Plan for today:

Stochastic Optimization

Stochastic gradient descent

$f$ convex random function, \[ F(\theta) = \esp{f(\theta)}. \]

Remember our initial application: \[ F(\theta) := \espr{X, Y}{l(m_\theta(X), Y)}, \] \[ f(\theta) := l(m_\theta(X), Y) \]

Stochastic gradient descent

At each iteration $n$, sample $f_n$ \[ \theta_{n+1} = \theta_n - \gamma_n f_n'(\theta_n). \]

For a finite dataset of size $N$, $f_n(\theta) = l(m_\theta(X_{I_n}), Y_{I_n})$ with $(I_n)_{n \geq 0}$ i.i.d. uniform integers in $\{1, \ldots, N\}$ (sampling with replacement).

$f_n'(\theta_n)$ is an unbiased estimator of $F'(\theta_n)$, i.e. \[ \esp{f_n'(\theta_n) | \theta_n} = F'(\theta_n). \]

$f_n'(\theta_n)$ points in the right direction on average and much cheaper to compute: trade-off between speed of computation and accuracy.

\begin{align} \esp{\norm{\theta_{n+1} - \theta_*}^2 | \theta_n} &= \norm{\theta_n - \theta_*}^2 - 2 \gamma \esp{f_n'(\theta_n)^T (\theta_n - \theta_*) | \theta_n}\\ &\qquad+ \gamma^2 \esp{\norm{f_n'(\theta_n)}^2 | \theta_n}\\ &=\norm{\theta_n - \theta_*}^2 - 2 \gamma F'(\theta_n)^T (\theta_n - \theta_*)\\ &\qquad+ \gamma^2 \esp{\norm{f_n'(\theta_n)}^2 | \theta_n}\\ &=\norm{\theta_n - \theta_*}^2 - 2 \gamma F'(\theta_n)^T (\theta_n - \theta_*)\\ &\qquad+ \gamma^2 \norm{F'(\theta_n)}^2 + \gamma^2 \color{red}{\variance{f_n'(\theta_n) | \theta_n}}. \end{align}

Extra term compared to batch GD!

Far from the optimum, $\norm{F'(\theta)}^2$ dominates -> similar to GD.

Close to the optimum, $\variance{f_n'(\theta_*) | \theta_n}$ dominates -> stochastic noise.

However, one SGD iteration is $N$ times faster than one GD iteration! ($N$ = size of the dataset)
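A minimal sketch of SGD over a finite dataset, where grad_f_i(theta, i) stands for the per-sample gradient of $l(m_\theta(X_i), Y_i)$ (the names and signature are illustrative placeholders):

import numpy as np

def sgd(grad_f_i, theta0, N, gamma, steps, seed=0):
    # theta_{n+1} = theta_n - gamma * f_n'(theta_n), where f_n is obtained by
    # sampling an index uniformly in the dataset (with replacement).
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(steps):
        i = rng.integers(N)  # 0-based index here
        theta = theta - gamma * grad_f_i(theta, i)
    return theta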

Convergence of fixed step-size SGD

Let us assume $\gamma_n := \gamma$, $\mu I \preceq F'' \preceq L I$ and a.s. $f'' \preceq R^2 I$. If $\gamma \leq \frac{1}{2 R^2}$ then, \[ \esp{F(\theta_n)} - F_* \leq (1 - \gamma \mu)^n \gamma^{-1}\delta_0^2 + 2 \gamma \frac{\sigma^2}{\mu}, \] where $\delta_0=\norm{\theta_0 - \theta_*}$ and $\sigma^2=\esp{\norm{f'(\theta_*)}^2}$.

Far from the optimum, error decays exponentially fast.

Close to the optimum, error proportional to $\gamma$.

Reducing $\gamma$ -> better solution. Repeat until you are happy.

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm, Needell et al, 2015. Theorem 2.1.

Convergence of decaying step size SGD

For the strongly convex case, with $\gamma_n \propto \frac{1}{n}$, \[ \esp{F(\theta_n)} - F_* \leq \mathcal{O}\left(\frac{1}{n}\right) \]

For the non-strongly convex case, with $\gamma_n \propto \frac{1}{\sqrt{n}}$, \[ \esp{F(\theta_n)} - F_* \leq \mathcal{O}\left(\frac{1}{\sqrt{n}}\right) \]

Reference: Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning Bach and Moulines, NIPS 2011.

Why SGD beats GD?

GD has better convergence rate but prohibitive iteration cost.

Far from the optimum, each iteration of SGD will give as much improvement as GD.

In practice, $F_N$ is an approximation of $F$. More important to handle larger $N$ than to optimize perfectly for a small $N$.

See The Tradeoffs of Large Scale Learning, Bottou and Bousquet, NIPS 2008.

Over-parametrization -> $f'(\theta_*) = 0$ almost surely, so that $\sigma^2 = 0$ and SGD converges as fast as GD.

See The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning, Ma et al, 2017.

Mini-batch SGD

In SGD, we sample one function $f$ at each iteration.

Mini-batch SGD: \[ \theta_{n+1} = \theta_n - \frac{\gamma}{B} \sum_{i=1}^B f'_{i, n}(\theta_n) \]

Less variance of the gradient estimate, in particular $\sigma^2_B = \frac{\sigma^2}{B}$.

Easy to parallelize (better efficiency on GPUs in particular).

If $B$ too large, same as GD -> sweet spot to find for each problem.
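A sketch of the corresponding update, averaging the per-sample gradients over a batch of $B$ indices (same placeholder grad_f_i as before):

import numpy as np

def minibatch_sgd(grad_f_i, theta0, N, gamma, B, steps, seed=0):
    rng = np.random.default_rng(seed)
    theta = theta0.copy()
    for _ in range(steps):
        batch = rng.integers(N, size=B)  # B indices sampled with replacement
        grad = np.mean([grad_f_i(theta, i) for i in batch], axis=0)
        theta = theta - gamma * grad
    return theta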

Momentum SGD

SGD with momentum: \[ \begin{cases} m_{n+1} &= \beta m_n + f'_n(\theta_n),\\ \theta_{n+1} &= \theta_n - \gamma m_{n+1}. \end{cases} \]

Almost always used with SGD for deep learning. Why it helps so much for deep learning is not entirely understood.

See On the importance of initialization and momentum in deep learning, Sutskever et al, ICML 2013.
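In PyTorch this is simply the momentum argument of the SGD optimizer; for instance (the lr and momentum values here are only illustrative):

import torch

# model is any torch.nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)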

Plan for today:

Adaptive methods

Limitation of (S)GD?

  • Requires knowledge of L (otherwise can diverge)
  • Requires proper parametrisation (remember the condition number!): inadequate scaling of the problem will slow convergence.
  • Fixed learning rate through training. What if $L$ changes locally?

Solution? Automatically pick a per coordinate learning rate!

Adagrad

For any vector $x$, $x^{(k)}$ represents its $k$-th component. With parameters $\alpha, \epsilon$, taking $g_0^{(k)} = 0$, \[ \begin{cases} g_{n+1}^{(k)} &= g_{n}^{(k)} + \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\sqrt{ \epsilon + g_{n+1}^{(k)}}}f_n^{'(k)}(\theta_n). \end{cases} \]

See Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, Duchi et al, JMLR 2011.
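Written coordinate-wise in NumPy, the update above looks as follows (a sketch, with grad_f standing for a stochastic gradient oracle):

import numpy as np

def adagrad(grad_f, theta0, alpha, eps, steps):
    # Accumulate squared gradients per coordinate; each coordinate then gets
    # its own effective learning rate alpha / sqrt(eps + g).
    theta = theta0.copy()
    g = np.zeros_like(theta)
    for _ in range(steps):
        grad = grad_f(theta)  # stochastic gradient f_n'(theta_n)
        g = g + grad ** 2
        theta = theta - alpha / np.sqrt(eps + g) * grad
    return theta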

Adagrad properties

Much less sensitive to the choice of $\alpha$ than SGD.

Scale invariance i.e. reparameterizing the space by a scalar factor does not affect the trajectory.

Downside: learning rate goes to zero. Never forgets about the past.

RMS Prop

How to forget about the past? Use an exponential moving average.

With parameters $\alpha, \beta, \epsilon$, taking $g_0^{(k)} = 0$, \[ \begin{cases} g_{n+1}^{(k)} &= \beta g_{n}^{(k)} + (1 - \beta) \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\sqrt{ \epsilon + g_{n+1}^{(k)}}}f_n^{'(k)}(\theta_n). \end{cases} \]

Key idea for RMSProp and optimizers in the next slides -> updates of size roughly $\alpha$ for all coordinates.

No paper for this one. It was taught by Hinton on Coursera; you can look at his slides.

Adam

Adam = RMS Prop + momentum. With parameters $\alpha, \beta_1, \beta_2, \epsilon$, taking $v_0^{(k)} = m_0^{(k)} = 0$,

\[ \begin{cases} m_{n+1}^{(k)} &= \beta_1 m_n^{(k)} + (1 - \beta_1) f_n^{'(k)}(\theta_n),\\ v_{n+1}^{(k)} &= \beta_2 v_n^{(k)} + (1 - \beta_2) \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \hat{m}_{n+1} &= \frac{m_{n+1}}{1 - \beta_1^{n+1}},\quad \hat{v}_{n+1} = \frac{v_{n+1}}{1 - \beta_2^{n+1}},\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\epsilon + \sqrt{\hat{v}_{n+1}^{(k)}}} \hat{m}_{n+1}^{(k)}. \end{cases} \]

$\hat{m}$ and $\hat{v}$ are corrections for the initial scale of the gradients: $m_1 = (1 - \beta_1) f'_0(\theta_0)$ while $\hat{m}_1 = f'_0(\theta_0)$.

See Adam: A method for stochastic optimization, Kingma and Ba, ICLR 2015.
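A single Adam update as a NumPy sketch, using the default hyperparameters from the paper (the function signature itself is an illustrative choice):

import numpy as np

def adam_step(theta, m, v, grad, n, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # One Adam update; n >= 1 is the iteration counter used for the bias
    # corrections of the two exponential moving averages.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** n)
    v_hat = v / (1 - beta2 ** n)
    theta = theta - alpha / (eps + np.sqrt(v_hat)) * m_hat
    return theta, m, v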

AMSGrad

Adam with a non-increasing adaptive step size. With parameters $\alpha, \beta_1, \beta_2, \epsilon$, taking $v_0^{(k)} = m_0^{(k)} = 0$,

\[ \begin{cases} m_{n+1}^{(k)} &= \beta_1 m_n^{(k)} + (1 - \beta_1) f_n^{'(k)}(\theta_n),\\ v_{n+1}^{(k)} &= \beta_2 v_n^{(k)} + (1 - \beta_2) \left(f_n^{'(k)}(\theta_n)\right)^2,\\ \hat{m}_{n+1} &= \frac{m_{n+1}}{1 - \beta_1^{n+1}},\\ \hat{v}_{n+1} &= \color{red}{\max}\left(\hat{v}_n, \frac{v_{n+1}}{1 - \beta_2^{n+1}}\right),\\ \theta_{n+1}^{(k)} &= \theta_{n}^{(k)} - \frac{\alpha}{\epsilon + \sqrt{\hat{v}_{n+1}^{(k)}}} \hat{m}_{n+1}^{(k)}. \end{cases} \]

See On the convergence of Adam and Beyond, Reddi et al, ICLR 2018.

Plan for today:

Practical optimisation

Some boilerplate for optimising in Pytorch


from torch import nn
from torch import optim
import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(model, device, train_loader, optimizer):
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.to(device), target.to(device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, target)
        loss.backward()
        optimizer.step()

train_loader = DataLoader(my_dataset, shuffle=True, batch_size=64)
device = "cpu" # or "cuda"
model = MyModel().to(device)
optimizer = optim.Adam(model.parameters())
train(model, device, train_loader, optimizer)

Optimizers in Pytorch

All located in torch.optim.*.

All have a similar constructor: torch.optim.*(params, lr=..., momentum=..., weight_decay=0). Default values differ between optimizers, check the doc.

params should be a list of all tensors to optimize over. Can be obtained from any module with module.parameters().

What is weight_decay? If we denote $\lambda$ its value, then equivalent to optimizing \[ \min_{\theta\in\reel^d} \espr{X, Y}{l(m_\theta(X), Y) + \frac{\lambda}{2} \norm{\theta}_2^2} \]

Acts as a regularizer. Very useful in vision, much less so in NLP (Natural Language Processing). When working on a specific application, check the best practices in papers.

See Pytorch documentation.
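For instance, to add $\ell_2$ regularization with $\lambda = 10^{-4}$ (the values below are only illustrative):

import torch

# model is any torch.nn.Module
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)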

Parameter groups

Different tensors can require different optimization parameters.


model_part1 = ...
model_part2 = ...
optimizer = torch.optim.SGD([
    {"params": model_part1.parameters()},
    {"params": model_part2.parameters(),
        "lr": 0.1, "weight_decay":1e-4}],
    lr=0.01)
# model_part1 will use lr=0.01 and no weight decay
# model_part2 will use lr=0.1 and weight decay of 1e-4
                    

Learning rate schedule

Once the loss is no longer decreasing, reducing the learning rate can help.

Typical approach in vision: divide the learning rate by D every T epochs.


model = ...
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9)
scheduler = optim.lr_scheduler.StepLR(
    optimizer, step_size=20, gamma=0.1)
for epoch in range(epochs):
    ...  # train for one epoch
    scheduler.step()  # divide the lr by 10 every 20 epochs

A recipe for quick results

  • Start with Adam, default parameters.
  • Unless you are doing vision; in that case, start from the current state-of-the-art schedule for your dataset.
  • Set the batch size to get 100% GPU usage or when GPU memory is full (check with nvidia-smi).
  • To improve results, grid search in log scale, e.g. weight_decays = [1e-5, 1e-4, 1e-3] (a minimal loop sketch follows this list).
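A minimal sketch of such a grid search, reusing the train boilerplate from above (evaluate, valid_loader and epochs are hypothetical placeholders for your own validation code):

weight_decays = [1e-5, 1e-4, 1e-3]
results = {}
for wd in weight_decays:
    model = MyModel().to(device)
    optimizer = optim.Adam(model.parameters(), weight_decay=wd)
    for epoch in range(epochs):
        train(model, device, train_loader, optimizer)
    results[wd] = evaluate(model, device, valid_loader)  # hypothetical helper
print(results)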

Notebook with comparison of optimizers on MNIST

diy_2018_optim notebook


wget http://ai.honu.io/presentations/notebooks/diy_2018_optim.ipynb
wget http://ai.honu.io/presentations/notebooks/environment.yml
conda env update
conda activate diy
jupyter notebook