Nesterov Accelerated Gradient
This algorithm implements an accelerated method inspired by Nesterov’s momentum and the FISTA algorithm. The struct only stores the chosen step size:
```rust
pub struct NAG {
    step_size: f64,
}
```
It has a constructor:
```rust
impl NAG {
    /// Create a new instance of NAG with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
```
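As a quick illustration of this rule: for the quadratic \(f(x) = \tfrac{1}{2}\lVert x - b\rVert^2\), the gradient \(\nabla f(x) = x - b\) is 1-Lipschitz, so a step size of \(1 / L = 1.0\) is the natural choice (the variable name below is just for illustration):

```rust
// The gradient of f(x) = ½‖x − b‖² has Lipschitz constant L = 1,
// so we pass step_size = 1 / L = 1.0.
let nag = NAG::new(1.0);
```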
The key idea is to introduce an extrapolation step between iterates, controlled by a sequence \(t_k\). This helps the optimizer "look ahead" and converge faster on smooth convex problems.
Update steps:
- Compute a temporary point \(y_{k+1}\) by taking a gradient step from \(x_k\).
- Update the extrapolation coefficient \(t_{k+1}\).
- Combine \(y_{k+1}\) and \(y_k\) using a weighted average to get the new iterate \(x_{k+1}\).
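Written out in full, and matching the implementation below (with \(\alpha\) denoting the step size and \(t_1 = 1\)), the updates are:

\[
\begin{aligned}
y_{k+1} &= x_k - \alpha\,\nabla f(x_k),\\
t_{k+1} &= \frac{1 + \sqrt{1 + 4\,t_k^2}}{2},\\
x_{k+1} &= y_{k+1} + \frac{t_k - 1}{t_{k+1}}\,(y_{k+1} - y_k).
\end{aligned}
\]

The momentum coefficient \((t_k - 1)/t_{k+1}\) starts at \(0\) and grows toward \(1\) (roughly \(0,\ 0.28,\ 0.43,\ \dots\)), which is what lets the method accelerate over plain gradient descent.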
```rust
impl Optimizer for NAG {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();
        for _ in 0..n_steps {
            let grad = grad_fn(weights);
            // Gradient step from the current iterate: y_{k+1} = x_k - α ∇f(x_k).
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });
            // Update the extrapolation coefficient t_{k+1}.
            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());
            // Combine y_{k+1} and y_k to form the new iterate x_{k+1}.
            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });
            y_k = y_next;
            t_k = t_next;
        }
    }
}
```
Some notes:
- We deliberately re-allocate several variables inside the for loop (`grad`, `y_next`, `t_next`), but we could have pre-allocated buffers before the loop (a sketch of that variant follows this list).
- The algorithm keeps track of two sequences: the main iterate (`weights`) and the extrapolated one (`y_k`). Before starting the for loop, we initialize `y_k` by cloning the weights: `let mut y_k = weights.clone();`.
- The gradient is evaluated at the current weights, as in standard gradient descent: `let grad = grad_fn(weights);`. Since `weights` is a mutable reference, we can pass it straight away to our `grad_fn`.
- A temporary variable stores the new extrapolated point; this is again a full allocation and clone, for clarity: `let mut y_next = weights.clone();`.
- We next compute \(y_{k+1} = x_k - \alpha\,\nabla f(x_k)\) with an element-wise operation: `Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });`. This time, we rely on `Zip`, provided by `ndarray`.
- The new weights are obtained by combining \(y_{k+1}\) and \(y_k\). The triple zip walks over the current weights and both extrapolation points: `Zip::from(&mut *weights)...`.
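As mentioned in the first note, the allocations inside the loop could be hoisted out. A minimal sketch of that pre-allocated variant of the loop body might look like this (it is not the version used in this chapter, and the closure `grad_fn` still allocates its own output):

```rust
// Hypothetical sketch: buffers are created once and overwritten in place.
let mut t_k: f64 = 1.0;
let mut y_k = weights.clone();
let mut y_next = weights.clone(); // allocated once, reused every iteration
for _ in 0..n_steps {
    let grad = grad_fn(weights);
    y_next.assign(weights); // overwrite the buffer instead of cloning
    Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
        *y -= self.step_size * g;
    });
    let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());
    Zip::from(&mut *weights)
        .and(&y_next)
        .and(&y_k)
        .for_each(|x, &y1, &y0| {
            *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
        });
    y_k.assign(&y_next); // copy into the existing buffer instead of moving it
    t_k = t_next;
}
```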
This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with `ndarray`.
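To close the section, here is a minimal usage sketch, assuming the `Optimizer` trait and the `NAG` type above are in scope; the quadratic objective and the constants are made up purely for illustration:

```rust
use ndarray::Array1;

fn main() {
    // Minimize f(x) = ½‖x − b‖² for an illustrative target b.
    let b = Array1::from(vec![1.0, -2.0, 3.0]);

    // The gradient x − b is 1-Lipschitz, so step_size = 1 / L = 1.0.
    let grad_fn = |x: &Array1<f64>| x - &b;

    let mut weights = Array1::<f64>::zeros(3);
    let nag = NAG::new(1.0);
    nag.run(&mut weights, grad_fn, 100);

    // After the run, weights should be numerically close to b.
    println!("{}", weights);
}
```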