Optimizers using ndarray
This section introduces a modular and idiomatic way to implement optimization algorithms in Rust using ndarray
and traits. It is intended for readers who are already comfortable with basic Rust syntax and want to learn how to build reusable, extensible components in numerical computing.
In this example, we need to import the following types and traits:
#![allow(unused)] fn main() { use ndarray::Array; use ndarray::Array1; use ndarray::Zip; }
You can inspect the full module that we're about the break down over here:
Click to view optimizers.rs
#![allow(unused)] fn main() { use ndarray::Array; use ndarray::Array1; use ndarray::Zip; /// Trait for optimizers that update parameters using gradients. /// /// Implementors must define a `run` method that takes mutable weights, /// a gradient function, and the number of iterations to run. // ANCHOR: trait pub trait Optimizer { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ); } // ANCHOR_END: trait /// Basic Gradient Descent (GD) optimizer. /// /// Updates parameters in the direction of the negative gradient scaled /// by a fixed step size. // ANCHOR: struct_gd pub struct GD { step_size: f64, } // ANCHOR_END: struct_gd /// Create a new gradient descent optimizer. /// /// # Arguments /// - `step_size`: the learning rate. // ANCHOR: impl_gd_new impl GD { pub fn new(step_size: f64) -> Self { Self { step_size } } } // ANCHOR_END: impl_gd_new /// Run the gradient descent optimizer. /// /// For each step: `w ← w - step_size * grad(w)` // ANCHOR: impl_gd_run impl Optimizer for GD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { for _ in 0..n_steps { let grads = grad_fn(weights); weights.zip_mut_with(&grads, |w, &g| { *w -= self.step_size * g; }); } } } // ANCHOR_END: impl_gd_run /// Accelerated Gradient Descent (AGD) with classical momentum. /// /// Combines the previous velocity with the current gradient /// to speed up convergence in convex problems. // ANCHOR: struct_agd pub struct AGD { step_size: f64, momentum: f64, } // ANCHOR_END: struct_agd /// Create a new AGD optimizer. /// /// # Arguments /// - `step_size`: the learning rate. /// - `momentum`: the momentum coefficient (typically between 0.8 and 0.99). // ANCHOR: impl_agd_new impl AGD { pub fn new(step_size: f64, momentum: f64) -> Self { Self { step_size, momentum, } } } // ANCHOR_END: impl_agd_new /// Run AGD with momentum. /// /// For each step: /// ```text /// v ← momentum * v - step_size * grad(w) /// w ← w + v /// ``` // ANCHOR: impl_agd_run impl Optimizer for AGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let n: usize = weights.len(); let mut velocity: Array1<f64> = Array::zeros(n); for _ in 0..n_steps { let grads = grad_fn(weights); for ((w, g), v) in weights .iter_mut() .zip(grads.iter()) .zip(velocity.iter_mut()) { *v = self.momentum * *v - self.step_size * g; *w += *v; } } } } // ANCHOR_END: impl_agd_run /// Adaptive Accelerated Gradient Descent using Nesterov's method. /// /// This optimizer implements the variant from smooth convex optimization literature, /// where extrapolation is based on the difference between consecutive y iterates. /// /// References: /// - Beck & Teboulle (2009), FISTA (but without proximal operator) /// - Nesterov's accelerated gradient (original formulation) // ANCHOR: AdaptiveAGD_struct pub struct AdaptiveAGD { step_size: f64, } // ANCHOR_END: AdaptiveAGD_struct // ANCHOR: AdaptiveAGD_impl_new impl AdaptiveAGD { /// Create a new instance of AdaptiveAGD with a given step size. /// /// The step size should be 1 / L, where L is the Lipschitz constant /// of the gradient of the objective function. pub fn new(step_size: f64) -> Self { Self { step_size } } } // ANCHOR_END: AdaptiveAGD_impl_new /// Run the optimizer for `n_steps` iterations. /// /// # Arguments /// - `weights`: mutable reference to the parameter vector (x₀), will be updated in-place. /// - `grad_fn`: a function that computes ∇f(x) for a given x. /// - `n_steps`: number of optimization steps to perform. /// /// This implementation follows: /// /// ``` /// y_{k+1} = x_k - α ∇f(x_k) /// t_{k+1} = (1 + sqrt(1 + 4 t_k²)) / 2 /// x_{k+1} = y_{k+1} + ((t_k - 1)/t_{k+1}) * (y_{k+1} - y_k) /// ``` // ANCHOR: AdaptiveAGD_impl_run impl Optimizer for AdaptiveAGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let mut t_k: f64 = 1.0; let mut y_k = weights.clone(); for _ in 0..n_steps { let grad = grad_fn(weights); let mut y_next = weights.clone(); Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; }); let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt()); Zip::from(&mut *weights) .and(&y_next) .and(&y_k) .for_each(|x, &y1, &y0| { *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0); }); y_k = y_next; t_k = t_next; } } } // ANCHOR_END: AdaptiveAGD_impl_run }
1. Trait-based design
We define a trait called Optimizer
to represent any optimizer that can update model weights based on gradients. In contrast to the previous sections where we mostly implemented step
functions, here the trait requires implementors to define a run
method with the following signature:
#![allow(unused)] fn main() { pub trait Optimizer { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ); } }
This method takes:
- A mutable reference to a vector of weights (
Array1<f64>
). - A function that computes the gradient of the loss with respect to the weights. This
grad_fn
function takes itself a borrowed reference to the weights&Array1<f64>
and outputs a new arrayArray1<f64>
. - The number of iterations to perform.
This trait run
defines the whole optimization algorithm.
2. Gradient descent (GD)
The GD
struct implements basic gradient descent with a fixed step size:
#![allow(unused)] fn main() { pub struct GD { step_size: f64, } }
It has a constructor:
#![allow(unused)] fn main() { impl GD { pub fn new(step_size: f64) -> Self { Self { step_size } } } }
And implements Optimizer
by subtracting the gradient scaled by the step size from the weights at each iteration.
#![allow(unused)] fn main() { impl Optimizer for GD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { for _ in 0..n_steps { let grads = grad_fn(weights); weights.zip_mut_with(&grads, |w, &g| { *w -= self.step_size * g; }); } } } }
Some notes:
-
At each iteration, we compute the gradient with
let grads = grad_fn(weights)
, which is fine but it reallocates a new vector at each call. If we wanted to optimize the gradient computation, we could pre-allocate a buffer outside the loop and pass a mutable reference into the gradient function to avoid repeated allocations. This would require to change the signature of thegrad_fn
. -
weights.zip_mut_with(&grads, |w, &g| {{ ... }})
: This is a mutable zip operation from thendarray
crate. It walks overweights
andgrads
, applying the closure to each pair. -
zip_mut_with is a method defined by the
Zip
trait, which is implemented for ArrayBase, and in particular for Array1. That’s why we can call it directly on weights
. -
In the closure statement we wrote:
|w, &g| *w -= self.step_size * g;
. Here,w
is a mutable reference to each weight element, so we dereference it using*w
to update its value. The&g
in the closure means we’re pattern-matching by reference to avoid cloning or copying eachf64
.
3. Accelerated gradient descent (AGD)
The AGD
struct extends gradient descent with a classical momentum term:
#![allow(unused)] fn main() { pub struct AGD { step_size: f64, momentum: f64, } }
It has a constructor:
#![allow(unused)] fn main() { impl AGD { pub fn new(step_size: f64, momentum: f64) -> Self { Self { step_size, momentum, } } } }
This algorithm adds momentum to classical gradient descent. Instead of updating weights using just the current gradient, it maintains a velocity vector that accumulates the influence of past gradients. This helps smooth the trajectory and accelerates convergence on convex problems.
#![allow(unused)] fn main() { impl Optimizer for AGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let n: usize = weights.len(); let mut velocity: Array1<f64> = Array::zeros(n); for _ in 0..n_steps { let grads = grad_fn(weights); for ((w, g), v) in weights .iter_mut() .zip(grads.iter()) .zip(velocity.iter_mut()) { *v = self.momentum * *v - self.step_size * g; *w += *v; } } } } }
Some notes:
-
We initialize a vector of zeros to track the momentum (velocity) across steps. It has the same length as the weights. This is achieved with:
let mut velocity: Array1<f64> = Array::zeros(n)
. Note that we could have defined the velocity as an internal state variable within the struct defintion. -
We use a triple nested zip to unpack the values of the weights, gradients, and velocity:
for ((w, g), v) in weights.iter_mut().zip(grads.iter()).zip(velocity.iter_mut())
. Here,weights.iter_mut()
gives a mutable reference to each weight,grads.iter()
provides read-only access to each gradient,velocity.iter_mut()
allows in-place updates of the velocity vector.
This pattern allows us to update everything in one pass, element-wise.
-
Within the nested zip closure, we update the velocity using the momentum term and current gradient:
*v = self.momentum * *v - self.step_size * g;
-
The weight is updated using the new velocity:
*w += *v;
. Again, we dereferencew
because it's a mutable reference.
4. Nesterov-based adaptive AGD
The AdaptiveAGD
struct implements a more advanced optimizer based on Nesterov's method and FISTA:
#![allow(unused)] fn main() { pub struct AdaptiveAGD { step_size: f64, } }
It has a constructor:
#![allow(unused)] fn main() { impl AdaptiveAGD { /// Create a new instance of AdaptiveAGD with a given step size. /// /// The step size should be 1 / L, where L is the Lipschitz constant /// of the gradient of the objective function. pub fn new(step_size: f64) -> Self { Self { step_size } } } }
This algorithm implements an accelerated method inspired by Nesterov’s momentum and the FISTA algorithm. The key idea is to introduce an extrapolation step between iterates, controlled by a sequence t_k
. This helps the optimizer "look ahead" and converge faster in smooth convex problems.
Update steps:
- Compute a temporary point
y_{k+1}
by taking a gradient step fromx_k
. - Update the extrapolation coefficient
t_{k+1}
. - Combine
y_{k+1}
andy_k
using a weighted average to get the new iteratex_{k+1}
.
#![allow(unused)] fn main() { impl Optimizer for AdaptiveAGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let mut t_k: f64 = 1.0; let mut y_k = weights.clone(); for _ in 0..n_steps { let grad = grad_fn(weights); let mut y_next = weights.clone(); Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; }); let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt()); Zip::from(&mut *weights) .and(&y_next) .and(&y_k) .for_each(|x, &y1, &y0| { *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0); }); y_k = y_next; t_k = t_next; } } } }
Some notes:
-
We deliberately re-allocate multiple variables within the for loop (
grad
,y_next
,t_next
) but we could have pre-allocated buffers before the for loop. -
The algorithm keeps track of two sequences: the main iterate (
weights
) and the extrapolated one (y_k
). Before starting the for loop, we initializey_k
by cloning the weights:let mut y_k = weights.clone();
. -
The gradient is evaluated at the current weights, as in standard gradient descent:
let grad = grad_fn(weights);
. Sinceweights
is a mutable reference, we can pass it straightaway to ourgrad_fn
. -
A temporary variable to store the new extrapolated point. This is again a full allocation and clone for clarity.
let mut y_next = weights.clone();
. -
We next compute:
y_{k+1} = x_k - α ∇f(x_k)
using an element-wise operation:Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });
. This time, we rely on theZip::from
trait implement byndarray
. -
The new weights are obtained by combining
y_{k+1}
andy_k
. The triple zip walks over the current weights and both extrapolation points:Zip::from(&mut *weights)...
.
This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with ndarray
.
Summary
This design demonstrates a few Rust programming techniques:
- Traits for abstraction and polymorphism
- Structs to encapsulate algorithm-specific state
- Use of the
ndarray
crate for numerical data - Generic functions using closures for computing gradients