Nesterov Accelerated Gradient

This algorithm implements an accelerated method inspired by Nesterov's momentum and the FISTA algorithm. The struct stores only the chosen step size.

pub struct NAG {
    step_size: f64,
}

It has a constructor:

impl NAG {
    /// Create a new instance of NAG with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
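
For example, assuming one has an estimate of the Lipschitz constant of the gradient (the value below is made up for illustration):

// Hypothetical: the gradient is known to be L-Lipschitz with L = 10.0.
let optimizer = NAG::new(1.0 / 10.0);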

The key idea is to introduce an extrapolation step between iterates, controlled by a sequence t_k. This lets the optimizer "look ahead" and converge faster on smooth convex problems.

Update steps:

  • Compute a temporary point y_{k+1} by taking a gradient step from x_k.
  • Update the extrapolation coefficient t_{k+1}.
  • Combine y_{k+1} and y_k using a weighted average to get the new iterate x_{k+1}, as written out below.
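
In formulas, with α denoting the step size, one iteration of the loop below performs:

\[
\begin{aligned}
y_{k+1} &= x_k - \alpha \, \nabla f(x_k), \\
t_{k+1} &= \frac{1 + \sqrt{1 + 4 t_k^2}}{2}, \\
x_{k+1} &= y_{k+1} + \frac{t_k - 1}{t_{k+1}} \, \bigl(y_{k+1} - y_k\bigr).
\end{aligned}
\]
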
use ndarray::{Array1, Zip};

impl Optimizer for NAG {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();

        for _ in 0..n_steps {
            // Gradient step: y_{k+1} = x_k - α ∇f(x_k).
            let grad = grad_fn(weights);
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });

            // Extrapolation coefficient: t_{k+1} = (1 + √(1 + 4 t_k²)) / 2.
            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());

            // New iterate: x_{k+1} = y_{k+1} + ((t_k - 1) / t_{k+1}) (y_{k+1} - y_k).
            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });

            y_k = y_next;
            t_k = t_next;
        }
    }
}

Some notes:

  • We deliberately re-allocate within the for loop (the arrays grad and y_next are created afresh at every iteration), but we could have pre-allocated reusable buffers before the loop; see the sketch after these notes.

  • The algorithm keeps track of two sequences: the main iterate (weights) and the extrapolated one (y_k). Before starting the for loop, we initialize y_k by cloning the weights: let mut y_k = weights.clone();.

  • The gradient is evaluated at the current weights, as in standard gradient descent: let grad = grad_fn(weights);. Since weights is a mutable reference, it reborrows as the shared reference grad_fn expects, so we can pass it straight in.

  • A temporary variable stores the new extrapolated point; this is again a full allocation and clone, kept for clarity: let mut y_next = weights.clone();.

  • We next compute y_{k+1} = x_k - α ∇f(x_k) using an element-wise operation: Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });. This time, we rely on ndarray's Zip, which traverses several arrays in lock step.

  • The new weights are obtained by combining y_{k+1} and y_k. The triple zip walks over the current weights and both extrapolation points: Zip::from(&mut *weights)...; the reborrow &mut *weights hands Zip a mutable view of the weights.
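
As an aside, here is a minimal sketch of the pre-allocated variant mentioned in the first note. The free function run_preallocated is hypothetical and not part of the book's NAG implementation; the grad allocation remains, since the grad_fn signature returns a fresh array, but y_next is allocated once and recycled with std::mem::swap.

use ndarray::{Array1, Zip};

// Hypothetical buffer-reusing variant of `run` (a sketch, not the book's code).
fn run_preallocated(
    step_size: f64,
    weights: &mut Array1<f64>,
    grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
    n_steps: usize,
) {
    let mut t_k: f64 = 1.0;
    let mut y_k = weights.clone();
    let mut y_next = weights.clone(); // allocated once, overwritten each iteration

    for _ in 0..n_steps {
        // grad_fn returns a fresh array by signature, so this allocation remains;
        // removing it would require a closure that writes into a caller-provided buffer.
        let grad = grad_fn(weights);

        // y_{k+1} = x_k - α ∇f(x_k), written into the reused buffer.
        Zip::from(&mut y_next)
            .and(&*weights)
            .and(&grad)
            .for_each(|y, &x, &g| *y = x - step_size * g);

        let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());

        // x_{k+1} = y_{k+1} + ((t_k - 1) / t_{k+1}) (y_{k+1} - y_k).
        Zip::from(&mut *weights)
            .and(&y_next)
            .and(&y_k)
            .for_each(|x, &y1, &y0| *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0));

        // Reuse the old y_k allocation as next iteration's scratch buffer.
        std::mem::swap(&mut y_k, &mut y_next);
        t_k = t_next;
    }
}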

This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with ndarray.
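
As a quick, illustrative usage example (the objective and values are ours, not the book's): minimizing f(w) = ||w||², whose gradient 2w is 2-Lipschitz, so the step size is 1 / L = 0.5.

use ndarray::Array1;

fn main() {
    // Gradient of f(w) = ||w||^2 is 2w, with Lipschitz constant L = 2.
    let grad_fn = |w: &Array1<f64>| 2.0 * w;

    let mut weights = Array1::from(vec![1.0, -2.0, 3.0]);
    let nag = NAG::new(0.5); // step size 1 / L

    nag.run(&mut weights, grad_fn, 50);

    // The unique minimizer is the zero vector; weights should now be close to it.
    println!("{}", weights);
}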