Nesterov Accelerated Gradient
This algorithm implements an accelerated method inspired by Nesterov’s momentum and the FISTA algorithm. The struct only stores the chosen step size:
```rust
pub struct NAG {
    step_size: f64,
}
```
It has a constructor:
```rust
impl NAG {
    /// Create a new instance of NAG with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
```
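As a quick illustration of this rule: for the quadratic \(f(x) = \tfrac{1}{2}\lVert x - b\rVert^2\), the gradient \(\nabla f(x) = x - b\) is 1-Lipschitz, so a step size of \(1 / L = 1.0\) is the natural choice (the variable name below is just for illustration):

```rust
// The gradient of f(x) = ½‖x − b‖² has Lipschitz constant L = 1,
// so we pass step_size = 1 / L = 1.0.
let nag = NAG::new(1.0);
```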
The key idea is to introduce an extrapolation step between iterates, controlled by a sequence \(t_k\). This helps the optimizer "look ahead" and converge faster on smooth convex problems.
Update steps:
- Compute a temporary point \(y_{k+1}\) by taking a gradient step from \(x_k\).
- Update the extrapolation coefficient \(t_{k+1}\).
- Combine \(y_{k+1}\) and \(y_k\) using a weighted average to get the new iterate \(x_{k+1}\).
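Written out in full, and matching the implementation below (with \(\alpha\) denoting the step size and \(t_1 = 1\)), the updates are:

\[
\begin{aligned}
y_{k+1} &= x_k - \alpha\,\nabla f(x_k),\\
t_{k+1} &= \frac{1 + \sqrt{1 + 4\,t_k^2}}{2},\\
x_{k+1} &= y_{k+1} + \frac{t_k - 1}{t_{k+1}}\,(y_{k+1} - y_k).
\end{aligned}
\]

The momentum coefficient \((t_k - 1)/t_{k+1}\) starts at \(0\) and grows toward \(1\) (roughly \(0,\ 0.28,\ 0.43,\ \dots\)), which is what lets the method accelerate over plain gradient descent.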
```rust
impl Optimizer for NAG {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();
        for _ in 0..n_steps {
            let grad = grad_fn(weights);
            // Gradient step from the current iterate: y_{k+1} = x_k - α ∇f(x_k).
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });
            // Update the extrapolation coefficient t_{k+1}.
            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());
            // Combine y_{k+1} and y_k to form the new iterate x_{k+1}.
            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });
            y_k = y_next;
            t_k = t_next;
        }
    }
}
```
Some notes:
- We deliberately re-allocate several variables inside the for loop (`grad`, `y_next`, `t_next`), but we could have pre-allocated buffers before the loop (a sketch of that variant follows this list).
- The algorithm keeps track of two sequences: the main iterate (`weights`) and the extrapolated one (`y_k`). Before starting the for loop, we initialize `y_k` by cloning the weights: `let mut y_k = weights.clone();`.
- The gradient is evaluated at the current weights, as in standard gradient descent: `let grad = grad_fn(weights);`. Since `weights` is a mutable reference, we can pass it straight away to our `grad_fn`.
- A temporary variable stores the new extrapolated point; this is again a full allocation and clone, for clarity: `let mut y_next = weights.clone();`.
- We next compute \(y_{k+1} = x_k - \alpha\,\nabla f(x_k)\) with an element-wise operation: `Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });`. This time, we rely on `Zip`, provided by `ndarray`.
- The new weights are obtained by combining \(y_{k+1}\) and \(y_k\). The triple zip walks over the current weights and both extrapolation points: `Zip::from(&mut *weights)...`.
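As mentioned in the first note, the allocations inside the loop could be hoisted out. A minimal sketch of that pre-allocated variant of the loop body might look like this (it is not the version used in this chapter, and the closure `grad_fn` still allocates its own output):

```rust
// Hypothetical sketch: buffers are created once and overwritten in place.
let mut t_k: f64 = 1.0;
let mut y_k = weights.clone();
let mut y_next = weights.clone(); // allocated once, reused every iteration
for _ in 0..n_steps {
    let grad = grad_fn(weights);
    y_next.assign(weights); // overwrite the buffer instead of cloning
    Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
        *y -= self.step_size * g;
    });
    let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());
    Zip::from(&mut *weights)
        .and(&y_next)
        .and(&y_k)
        .for_each(|x, &y1, &y0| {
            *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
        });
    y_k.assign(&y_next); // copy into the existing buffer instead of moving it
    t_k = t_next;
}
```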
This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with `ndarray`.
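To close the section, here is a minimal usage sketch, assuming the `Optimizer` trait and the `NAG` type above are in scope; the quadratic objective and the constants are made up purely for illustration:

```rust
use ndarray::Array1;

fn main() {
    // Minimize f(x) = ½‖x − b‖² for an illustrative target b.
    let b = Array1::from(vec![1.0, -2.0, 3.0]);

    // The gradient x − b is 1-Lipschitz, so step_size = 1 / L = 1.0.
    let grad_fn = |x: &Array1<f64>| x - &b;

    let mut weights = Array1::<f64>::zeros(3);
    let nag = NAG::new(1.0);
    nag.run(&mut weights, grad_fn, 100);

    // After the run, weights should be numerically close to b.
    println!("{}", weights);
}
```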