Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Optimizers using ndarray

This section introduces a modular and idiomatic way to implement optimization algorithms in Rust using ndarray and traits. It is intended for readers who are already comfortable with basic Rust syntax and want to learn how to build reusable, extensible components in numerical computing.

In this example, we need to import the following types and traits:

#![allow(unused)]
fn main() {
use ndarray::Array;
use ndarray::Array1;
use ndarray::Zip;
}

You can inspect the full module that we're about the break down over here:

Click to view optimizers.rs
#![allow(unused)]
fn main() {
use ndarray::Array;
use ndarray::Array1;
use ndarray::Zip;

/// Trait for optimizers that update parameters using gradients.
///
/// Implementors must define a `run` method that takes mutable weights,
/// a gradient function, and the number of iterations to run.
// ANCHOR: trait
pub trait Optimizer {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    );
}
// ANCHOR_END: trait

/// Basic Gradient Descent (GD) optimizer.
///
/// Updates parameters in the direction of the negative gradient scaled
/// by a fixed step size.
// ANCHOR: struct_gd
pub struct GD {
    step_size: f64,
}
// ANCHOR_END: struct_gd

/// Create a new gradient descent optimizer.
///
/// # Arguments
/// - `step_size`: the learning rate.
// ANCHOR: impl_gd_new
impl GD {
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
// ANCHOR_END: impl_gd_new

/// Run the gradient descent optimizer.
///
/// For each step: `w ← w - step_size * grad(w)`
// ANCHOR: impl_gd_run
impl Optimizer for GD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            weights.zip_mut_with(&grads, |w, &g| {
                *w -= self.step_size * g;
            });
        }
    }
}
// ANCHOR_END: impl_gd_run

/// Accelerated Gradient Descent (AGD) with classical momentum.
///
/// Combines the previous velocity with the current gradient
/// to speed up convergence in convex problems.
// ANCHOR: struct_agd
pub struct AGD {
    step_size: f64,
    momentum: f64,
}
// ANCHOR_END: struct_agd

/// Create a new AGD optimizer.
///
/// # Arguments
/// - `step_size`: the learning rate.
/// - `momentum`: the momentum coefficient (typically between 0.8 and 0.99).
// ANCHOR: impl_agd_new
impl AGD {
    pub fn new(step_size: f64, momentum: f64) -> Self {
        Self {
            step_size,
            momentum,
        }
    }
}
// ANCHOR_END: impl_agd_new

/// Run AGD with momentum.
///
/// For each step:
/// ```text
/// v ← momentum * v - step_size * grad(w)
/// w ← w + v
/// ```
// ANCHOR: impl_agd_run
impl Optimizer for AGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let n: usize = weights.len();
        let mut velocity: Array1<f64> = Array::zeros(n);

        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            for ((w, g), v) in weights
                .iter_mut()
                .zip(grads.iter())
                .zip(velocity.iter_mut())
            {
                *v = self.momentum * *v - self.step_size * g;
                *w += *v;
            }
        }
    }
}
// ANCHOR_END: impl_agd_run

/// Adaptive Accelerated Gradient Descent using Nesterov's method.
///
/// This optimizer implements the variant from smooth convex optimization literature,
/// where extrapolation is based on the difference between consecutive y iterates.
///
/// References:
/// - Beck & Teboulle (2009), FISTA (but without proximal operator)
/// - Nesterov's accelerated gradient (original formulation)
// ANCHOR: AdaptiveAGD_struct
pub struct AdaptiveAGD {
    step_size: f64,
}
// ANCHOR_END: AdaptiveAGD_struct

// ANCHOR: AdaptiveAGD_impl_new
impl AdaptiveAGD {
    /// Create a new instance of AdaptiveAGD with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
// ANCHOR_END: AdaptiveAGD_impl_new

/// Run the optimizer for `n_steps` iterations.
///
/// # Arguments
/// - `weights`: mutable reference to the parameter vector (x₀), will be updated in-place.
/// - `grad_fn`: a function that computes ∇f(x) for a given x.
/// - `n_steps`: number of optimization steps to perform.
///
/// This implementation follows:
///
/// ```
/// y_{k+1} = x_k - α ∇f(x_k)
/// t_{k+1} = (1 + sqrt(1 + 4 t_k²)) / 2
/// x_{k+1} = y_{k+1} + ((t_k - 1)/t_{k+1}) * (y_{k+1} - y_k)
/// ```
// ANCHOR: AdaptiveAGD_impl_run
impl Optimizer for AdaptiveAGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();

        for _ in 0..n_steps {
            let grad = grad_fn(weights);
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });

            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());

            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });

            y_k = y_next;
            t_k = t_next;
        }
    }
}
// ANCHOR_END: AdaptiveAGD_impl_run
}

1. Trait-based design

We define a trait called Optimizer to represent any optimizer that can update model weights based on gradients. In contrast to the previous sections where we mostly implemented stepfunctions, here the trait requires implementors to define a run method with the following signature:

#![allow(unused)]
fn main() {
pub trait Optimizer {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    );
}
}

This method takes:

  • A mutable reference to a vector of weights (Array1<f64>).
  • A function that computes the gradient of the loss with respect to the weights. This grad_fn function takes itself a borrowed reference to the weights &Array1<f64> and outputs a new array Array1<f64>.
  • The number of iterations to perform.

This trait run defines the whole optimization algorithm.

2. Gradient descent (GD)

The GD struct implements basic gradient descent with a fixed step size:

#![allow(unused)]
fn main() {
pub struct GD {
    step_size: f64,
}
}

It has a constructor:

#![allow(unused)]
fn main() {
impl GD {
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
}

And implements Optimizer by subtracting the gradient scaled by the step size from the weights at each iteration.

#![allow(unused)]
fn main() {
impl Optimizer for GD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            weights.zip_mut_with(&grads, |w, &g| {
                *w -= self.step_size * g;
            });
        }
    }
}
}

Some notes:

  • At each iteration, we compute the gradient with let grads = grad_fn(weights), which is fine but it reallocates a new vector at each call. If we wanted to optimize the gradient computation, we could pre-allocate a buffer outside the loop and pass a mutable reference into the gradient function to avoid repeated allocations. This would require to change the signature of the grad_fn.

  • weights.zip_mut_with(&grads, |w, &g| {{ ... }}): This is a mutable zip operation from the ndarray crate. It walks over weights and grads, applying the closure to each pair.

  • zip_mut_with is a method defined by the Zip trait, which is implemented for ArrayBase, and in particular for Array1. That’s why we can call it directly on weights.

  • In the closure statement we wrote: |w, &g| *w -= self.step_size * g;. Here, w is a mutable reference to each weight element, so we dereference it using *w to update its value. The &g in the closure means we’re pattern-matching by reference to avoid cloning or copying each f64.

3. Accelerated gradient descent (AGD)

The AGD struct extends gradient descent with a classical momentum term:

#![allow(unused)]
fn main() {
pub struct AGD {
    step_size: f64,
    momentum: f64,
}
}

It has a constructor:

#![allow(unused)]
fn main() {
impl AGD {
    pub fn new(step_size: f64, momentum: f64) -> Self {
        Self {
            step_size,
            momentum,
        }
    }
}
}

This algorithm adds momentum to classical gradient descent. Instead of updating weights using just the current gradient, it maintains a velocity vector that accumulates the influence of past gradients. This helps smooth the trajectory and accelerates convergence on convex problems.

#![allow(unused)]
fn main() {
impl Optimizer for AGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let n: usize = weights.len();
        let mut velocity: Array1<f64> = Array::zeros(n);

        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            for ((w, g), v) in weights
                .iter_mut()
                .zip(grads.iter())
                .zip(velocity.iter_mut())
            {
                *v = self.momentum * *v - self.step_size * g;
                *w += *v;
            }
        }
    }
}
}

Some notes:

  • We initialize a vector of zeros to track the momentum (velocity) across steps. It has the same length as the weights. This is achieved with: let mut velocity: Array1<f64> = Array::zeros(n). Note that we could have defined the velocity as an internal state variable within the struct defintion.

  • We use a triple nested zip to unpack the values of the weights, gradients, and velocity: for ((w, g), v) in weights.iter_mut().zip(grads.iter()).zip(velocity.iter_mut()). Here,

    • weights.iter_mut() gives a mutable reference to each weight,
    • grads.iter() provides read-only access to each gradient,
    • velocity.iter_mut() allows in-place updates of the velocity vector.

    This pattern allows us to update everything in one pass, element-wise.

  • Within the nested zip closure, we update the velocity using the momentum term and current gradient: *v = self.momentum * *v - self.step_size * g;

  • The weight is updated using the new velocity: *w += *v;. Again, we dereference w because it's a mutable reference.

4. Nesterov-based adaptive AGD

The AdaptiveAGD struct implements a more advanced optimizer based on Nesterov's method and FISTA:

#![allow(unused)]
fn main() {
pub struct AdaptiveAGD {
    step_size: f64,
}
}

It has a constructor:

#![allow(unused)]
fn main() {
impl AdaptiveAGD {
    /// Create a new instance of AdaptiveAGD with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
}

This algorithm implements an accelerated method inspired by Nesterov’s momentum and the FISTA algorithm. The key idea is to introduce an extrapolation step between iterates, controlled by a sequence t_k. This helps the optimizer "look ahead" and converge faster in smooth convex problems.

Update steps:

  • Compute a temporary point y_{k+1} by taking a gradient step from x_k.
  • Update the extrapolation coefficient t_{k+1}.
  • Combine y_{k+1} and y_k using a weighted average to get the new iterate x_{k+1}.
#![allow(unused)]
fn main() {
impl Optimizer for AdaptiveAGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();

        for _ in 0..n_steps {
            let grad = grad_fn(weights);
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });

            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());

            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });

            y_k = y_next;
            t_k = t_next;
        }
    }
}
}

Some notes:

  • We deliberately re-allocate multiple variables within the for loop (grad, y_next, t_next) but we could have pre-allocated buffers before the for loop.

  • The algorithm keeps track of two sequences: the main iterate (weights) and the extrapolated one (y_k). Before starting the for loop, we initialize y_k by cloning the weights: let mut y_k = weights.clone();.

  • The gradient is evaluated at the current weights, as in standard gradient descent: let grad = grad_fn(weights);. Since weights is a mutable reference, we can pass it straightaway to our grad_fn.

  • A temporary variable to store the new extrapolated point. This is again a full allocation and clone for clarity. let mut y_next = weights.clone();.

  • We next compute: y_{k+1} = x_k - α ∇f(x_k) using an element-wise operation: Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });. This time, we rely on the Zip::from trait implement by ndarray.

  • The new weights are obtained by combining y_{k+1} and y_k. The triple zip walks over the current weights and both extrapolation points: Zip::from(&mut *weights)....

This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with ndarray.

Summary

This design demonstrates a few Rust programming techniques:

  • Traits for abstraction and polymorphism
  • Structs to encapsulate algorithm-specific state
  • Use of the ndarray crate for numerical data
  • Generic functions using closures for computing gradients