Keyboard shortcuts

Press or to navigate between chapters

Press S or / to search in the book

Press ? to show this help

Press Esc to hide this help

Welcome!

Welcome to Rustineers, a dive into the Rust programming language through the lens of applied mathematics and science. There are already several high-quality resources available for learning Rust:

You can find even more learning material at rust-lang.org.

This book is meant to be complementary to those great resources. Our goal is to learn Rust by implementing practical examples drawn from applied mathematics, including:

  • Machine Learning
  • Statistics and Probability
  • Optimization
  • Ordinary Differential Equations (ODEs)
  • Partial Differential Equations (PDEs)
  • And other topics from engineering and physics

Each chapter centers around a specific scientific algorithm or computational problem. We explore how to implement it idiomatically in Rust and sometimes in multiple styles.

Hopefully, we manage to go through the core concepts of Rust, namely:

  • Ownership / borrowing
  • Data types
  • Traits
  • Modules
  • Error handling
  • Macros

Most examples being with Rust's standard library which seems to be a solid foundation for learning both the language and its ecosystem.

Difficulty Levels

To help you navigate the material, each chapter is marked with a difficulty level using 🦀 emojis:

  • 🦀 — Beginner-friendly
  • 🦀🦀 — Intermediate
  • 🦀🦀🦀 — Advanced

As this is a work in progress, the difficulty levels might not always be well chosen.

Roadmap

Here's an unordered list of examples of topics that could be added to the book:

  • 1D Ridge regression.
  • Simple first-order gradient descent algorithms.
  • Multivariate regression algorithms (Ridge, Lasso, Elastic-net).
  • Classification algorithms: logistic regression.
  • Some clustering algorithms: K-means, Gaussian mixtures.
  • Some MCMC algorithms: MH, LMC, MALA.
  • Numerical methods for solving ODEs: Euler, Runge-Kutta.
  • Numerical methods for solving PDEs: 1D heat equation, 2D Poisson equation.
  • Optimization algorithms: gradient-based, derivative-free.
  • Kernel methods: kernel Ridge, Gaussian processes, etc.
  • Divergences and distances for probability distributions: KL divergence, total variation, Wasserstein.

Let us know if you have other ideas or if you want to improve any existing chapter.

Cargo 101

This chapter gives you everything you need to compile and run the examples in this book using Cargo, Rust’s official package manager and build tool.

Creating a new project

To create a new Rust library project:

cargo new my_project --lib

To create a binary project (i.e., one with a main.rs):

cargo new my_project

Building your project

Navigate into the project directory:

cd my_project

Then build it:

cargo build

This compiles your code in debug mode (faster builds, less optimization). You’ll find the output in target/debug/.

Running your code

If it’s a binary crate (with a main.rs), you can run it:

cargo run

This compiles and runs your code in one go.

Testing your code

To run tests in lib.rs or in any #[cfg(test)] module:

cargo test

Cleaning build artifacts

Remove the target/ directory and everything in it:

cargo clean

Checking your code (without building)

cargo check

This quickly verifies your code compiles without generating the binary.

Adding dependencies

To add dependencies, open Cargo.toml and add them under [dependencies]:

[dependencies]
ndarray = "0.15"

Or use the command line:

cargo add ndarray

Ridge regression 1D

Here, we implement one-dimensional Ridge Regression*in several styles, using increasing levels of abstraction. It's designed as a learning path for beginners, and focuses on writing idiomatic, clear, and type-safe Rust code. We focus on minimizing this loss function:

where: is an input covariate, is the associated output, is the Ridge coefficient, is the regularization strength.

What you'll learn

This module is perfect if you're just starting with Rust and want to:

  • Write beginner-friendly numerical code
  • Understand how to manipulate vectors (Vec<f64>) and slices (&[f64])
  • Write clean and safe code using the standard library only
  • Structure your functions with clear responsibilities
  • Learn good Rust patterns: immutability, iterators, ownership
  • How to make a simple crate out of it
  • How to expose a public API

If you want to implement and run this example while you read but are not familiar with Cargo yet, have a look at Cargo 101 for how to set up your project.

Functional · std only 🦀

🧠 This implementation uses:

  • Functional style
  • Rust standard library only
  • No struct, enum, traits, or external crates

Ridge loss

In this example, we implement one-dimensional Ridge Regression loss using only the Rust standard library, without any external crates. This lets us focus on core Rust features such as slices, iterators, and type safety.

Although the loss function by itself isn't really useful to solve the Ridge problem, implementing it provides a simple and focused introduction to Rust.

Naive implementation

We now present a straightforward implementation of the Ridge regression loss function:

#![allow(unused)]
fn main() {
pub fn loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");

    let n: usize = x.len();
    let y_hat: Vec<f64> = mul_scalar_vec(beta, x);
    let residuals: Vec<f64> = subtract_vectors(y, &y_hat);
    let mse: f64 = residuals.iter().map(|x| x * x).sum::<f64>() / (n as f64);
    mse + lambda2 * beta * beta
}
}

In this example, we use two helper functions that we implement ourselves. A helper function for multiplying a vector by a scalar:

#![allow(unused)]
fn main() {
pub fn mul_scalar_vec(scalar: f64, vector: &[f64]) -> Vec<f64> {
    vector.iter().map(|x| x * scalar).collect()
}
}

We also defined a helper that subtracts two slices element-wise:

#![allow(unused)]
fn main() {
pub fn subtract_vectors(a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "Input vectors must have the same length");
    a.iter().zip(b.iter()).map(|(x, y)| x - y).collect()
}
}

Rather than using explicit loops, this implementation uses Rust’s iterator combinators, which the compiler optimizes into efficient code. This zero-cost abstraction keeps the code both readable and fast.

Ownership and borrowing

In Rust, every value has a single owner. When you assign a value to a new variable or pass it to a function by value, ownership is transferred (moved).

Borrowing allows you to use a value without taking ownership of it. Borrowing is done using references:

  • &T is a shared (read-only) reference.
  • &mut T is a mutable reference.

These references allow access to data without moving it.

A function like this:

#![allow(unused)]
fn main() {
fn mul_scalar_vec(scalar: f64, vector: &[f64]) -> Vec<f64> {
    vector.iter().map(|x| x * scalar).collect()
}
}

does not take ownership of the input vector. Instead, it borrows it for the duration of the function call. This makes it easier to reuse the input vector later.

If we instead defined:

#![allow(unused)]
fn main() {
fn mul_scalar_vec(scalar: f64, vector: Vec<f64>) -> Vec<f64> { ... }
}

then passing a vector would move ownership:

#![allow(unused)]
fn main() {
let v = vec![1.0, 2.0, 3.0];
let result = mul_scalar_vec(2.0, v); // v is moved here
let v2 = v; // error: value borrowed after move
}

Why use &[f64] instead of Vec<f64>?

The type &[f64] seems to be commonly used in function signatures because it works with both arrays and vectors.

Finally, note that:

  • Vec<f64> is an owned, growable vector on the heap. The only time we return a Vec<f64> is when we allocate a new output vector, like in mul_scalar_vec.
  • &Vec<f64> is a shared reference to a Vec<f64>.
  • &[f64] is a slice, i.e., a borrowed view into an array or vector.

In this chapter, we will mostly use these types but things can easily get more tricky.

Inlined iterator-based implementation

Let's implement the loss function in a more compact way. Instead of breaking the computation into multiple intermediate steps—like computing y_hat, residuals, and then squaring each residual—here we inline all computations into a single expression using iterators and closures.

This is ideal for demonstrating the expressive power of Rust's iterator API, especially once you're comfortable with basic slice handling and .map() chaining.

#![allow(unused)]
fn main() {
pub fn loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    let n: usize = y.len();
    let factor = n as f64;
    let mean_squared_error = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| {
            let residual = yi - beta * xi;
            residual * residual
        })
        .sum::<f64>()
        / factor;
    mean_squared_error + lambda2 * beta * beta
}
}

This implementation computes the mean squared error in a single iteration, minimizing allocations and abstraction overhead. In particular:

  • We use .iter().zip() to iterate over two slices.
  • We define a full code block inside the .map() closure, which makes it easier to write intermediate expressions like let residual = yi - beta * xi; before returning the squared value.

Closed-form solution

The one-dimensional Ridge regression problem admits a simple closed-form solution.
Given a dataset for , and a regularization parameter , the Ridge estimator is:

This form assumes that the data has no intercept term, i.e., the model is , or equivalently, that the data is centered around zero. In practice, it is common to subtract the means of both and before computing the estimator. This removes the intercept and gives:

We now implement this solution in Rust, using only the standard library.

#![allow(unused)]
fn main() {
/// Computes the one-dimensional Ridge regression estimator using centered data.
///
/// This version centers the input data `x` and `y` before applying the closed-form formula.
///
/// # Arguments
///
/// * `x` - A slice of input features.
/// * `y` - A slice of target values (same length as `x`).
/// * `lambda2` - The regularization parameter.
///
/// # Returns
///
/// * `f64` - The estimated Ridge regression coefficient.
///
/// # Panics
///
/// Panics if `x` and `y` do not have the same length.
pub fn ridge_estimator(x: &[f64], y: &[f64], lambda2: f64) -> f64 {
    let n: usize = x.len();
    assert_eq!(n, y.len(), "x and y must have the same length");

    let x_mean: f64 = x.iter().sum::<f64>() / n as f64;
    let y_mean: f64 = y.iter().sum::<f64>() / n as f64;

    let num: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean))
        .sum::<f64>();

    let denom: f64 = x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64);

    num / denom
}
}

You could also express it as a single iterator chain, similar to how we implemented the loss function earlier.

Gradient descent

As an exercise, we now implement the gradient descent algorithm to optimize the Ridge regression loss. The gradient has a closed-form expression and can be efficiently computed. We implement it in two different ways as we did for the loss function.

Gradient descent implementation

Gradient descent iteratively updates the parameter β using the gradient of the loss function:

Where η is the learning rate, and ∇βL(β) is the gradient of the loss.

We allow flexible experimentation by passing the gradient function as parameters:

#![allow(unused)]
fn main() {
pub fn gradient_descent(
    grad_fn: impl Fn(&[f64], &[f64], f64, f64) -> f64,
    x: &[f64],
    y: &[f64],
    lambda2: f64,
    lr: f64,
    n_iters: usize,
    init_beta: f64,
) -> f64 {
    let mut beta = init_beta;

    for _ in 0..n_iters {
        let grad = grad_fn(x, y, beta, lambda2);
        beta -= lr * grad;
    }

    beta
}
}

This version is generic, letting us plug in any valid grad_fn.

Gradient function: naive implementation

This version breaks the computation into two separate steps:

  • Compute the residuals
  • Compute the dot product between the residuals and the inputs:
  • Then assemble the get the gradient value

We first start by implementing our own dot function by relying on iterators, map chaining, and summing the results.

#![allow(unused)]
fn main() {
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "Input vectors must have the same length");
    a.iter().zip(b.iter()).map(|(xi, yi)| xi * yi).sum()
}
}

Our first implementation takes the following form:

#![allow(unused)]
fn main() {
pub fn grad_loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");

    let n: usize = x.len();
    let residuals: Vec<f64> = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| yi - beta * xi)
        .collect();
    let residuals_dot_x = dot(&residuals, x);

    -2.0 * residuals_dot_x / (n as f64) + 2.0 * lambda2 * beta
}
}

Gradient function: inlined iterator-based implementation

In this version, we fuse the residual and gradient computation into a single iterator chain. This avoids intermediate memory allocations and takes full advantage of Rust’s zero-cost abstraction model.

#![allow(unused)]
fn main() {
pub fn grad_loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");

    let n: usize = x.len();
    let grad_mse: f64 = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| 2.0 * (yi - beta * xi) * xi)
        .sum::<f64>()
        / (n as f64);

    -grad_mse + 2.0 * lambda2 * beta
}
}

Key differences:

  • The naive version allocates a temporary vectosr for the residuals and the dot product.
  • The inlined version is more idiomatic Rust: it avoids allocation and achieves better performance through iterator fusion.

Putting things together

To wrap up our 1D Ridge Regression example, let's see how all the parts fit together into a real Rust crate.

Project layout

Here’s the directory structure for our ridge_regression_1d crate:

crates/ridge_regression_1d/
├── Cargo.toml
└── src
    ├── analytical.rs          # Closed-form solution of the Ridge estimator
    ├── grad_functions.rs      # Gradient of the loss
    ├── lib.rs                 # Main entry point for the library
    ├── loss_functions.rs      # Loss function implementations
    ├── main.rs                # Binary entry point    
    ├── optimizer.rs           # Gradient descent
    └── utils.rs               # Utility functions (e.g., dot product)

All the functions discussed in the previous sections are implemented in analytical.rs, loss_functions.rs, grad_functions.rs, utils.rs, and optimizer.rs. You can inspect each of these files below.

Click to view analytical.rs
#![allow(unused)]
fn main() {
/// Computes the one-dimensional Ridge regression estimator using centered data.
///
/// This version centers the input data `x` and `y` before applying the closed-form formula.
///
/// # Arguments
///
/// * `x` - A slice of input features.
/// * `y` - A slice of target values (same length as `x`).
/// * `lambda2` - The regularization parameter.
///
/// # Returns
///
/// * `f64` - The estimated Ridge regression coefficient.
///
/// # Panics
///
/// Panics if `x` and `y` do not have the same length.
pub fn ridge_estimator(x: &[f64], y: &[f64], lambda2: f64) -> f64 {
    let n: usize = x.len();
    assert_eq!(n, y.len(), "x and y must have the same length");

    let x_mean: f64 = x.iter().sum::<f64>() / n as f64;
    let y_mean: f64 = y.iter().sum::<f64>() / n as f64;

    let num: f64 = x
        .iter()
        .zip(y)
        .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean))
        .sum::<f64>();

    let denom: f64 = x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64);

    num / denom
}
}
Click to view loss_functions.rs
#![allow(unused)]
fn main() {
use crate::utils::{mul_scalar_vec, subtract_vectors};

/// Computes the loss function for Ridge regression (naive version).
///
/// It implements it in a simple fashion by computing the mean squared error in multiple steps.
///
/// # Arguments
///
/// * `x` - The array of input observations
/// * `y` - The array of output observations
/// * `beta` - The coefficients of the linear regression
/// * `lambda2` - The regularization parameter
///
/// # Returns
///
/// The value of the loss function
/// Computes the Ridge regression loss function.
///
/// This function calculates the following expression:
///
/// $$
/// \mathcal{L}(\beta) = \frac{1}{2n} \sum_i (y_i - \beta x_i)^2 + \lambda \beta^2
/// $$
///
/// where:
/// - `x` and `y` are the input/output observations,
/// - `beta` is the linear coefficient,
/// - `lambda2` is the regularization strength.
///
/// # Arguments
///
/// * `x` - Input features as a slice (`&[f64]`)
/// * `y` - Target values as a slice (`&[f64]`)
/// * `beta` - Coefficient of the regression model
/// * `lambda2` - L2 regularization strength
///
/// # Returns
///
/// The Ridge regression loss value as `f64`.
///
/// # Panics
///
/// Panics if `x` and `y` do not have the same length.
// ANCHOR: loss_function_naive
pub fn loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");

    let n: usize = x.len();
    let y_hat: Vec<f64> = mul_scalar_vec(beta, x);
    let residuals: Vec<f64> = subtract_vectors(y, &y_hat);
    let mse: f64 = residuals.iter().map(|x| x * x).sum::<f64>() / (n as f64);
    mse + lambda2 * beta * beta
}
// ANCHOR_END: loss_function_naive

/// Computes the loss function for Ridge regression (inlined version).
///
/// It implements it as a one-liner by computing the mean squared error in a single expression.
///
/// # Arguments
///
/// * `x` - The array of input observations
/// * `y` - The array of output observations
/// * `beta` - The coefficients of the linear regression
/// * `lambda2` - The regularization parameter
///
/// # Returns
///
/// The value of the loss function
// ANCHOR: loss_function_line
pub fn loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    let n: usize = y.len();
    let factor = n as f64;
    let mean_squared_error = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| {
            let residual = yi - beta * xi;
            residual * residual
        })
        .sum::<f64>()
        / factor;
    mean_squared_error + lambda2 * beta * beta
}
// ANCHOR_END: loss_function_line
}
Click to view grad_functions.rs
#![allow(unused)]
fn main() {
use crate::utils::dot;

/// Computes the gradient of the Ridge regression loss function (naive version).
///
/// This implementation first explicitly computes the residuals and then performs
/// a dot product between the residuals and the inputs.
///
/// # Arguments
///
/// * `x` - Slice of input features
/// * `y` - Slice of target outputs
/// * `beta` - Coefficient of the regression model
/// * `lambda2` - L2 regularization strength
///
/// # Returns
///
/// The gradient of the loss with respect to `beta`.
///
/// # Panics
///
/// Panics if `x` and `y` do not have the same length.
// ANCHOR: grad_loss_function_naive
pub fn grad_loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");

    let n: usize = x.len();
    let residuals: Vec<f64> = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| yi - beta * xi)
        .collect();
    let residuals_dot_x = dot(&residuals, x);

    -2.0 * residuals_dot_x / (n as f64) + 2.0 * lambda2 * beta
}
// ANCHOR_END: grad_loss_function_naive

/// Computes the gradient of the Ridge regression loss function (inlined version).
///
/// This version fuses the residual and gradient computation into a single pass
/// using iterators, minimizing allocations and improving efficiency.
///
/// # Arguments
///
/// * `x` - Slice of input features
/// * `y` - Slice of target outputs
/// * `beta` - Coefficient of the regression model
/// * `lambda2` - L2 regularization strength
///
/// # Returns
///
/// The gradient of the loss with respect to `beta`.
///
/// # Panics
///
/// Panics if `x` and `y` do not have the same length.
// ANCHOR: grad_loss_function_inline
pub fn grad_loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");

    let n: usize = x.len();
    let grad_mse: f64 = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| 2.0 * (yi - beta * xi) * xi)
        .sum::<f64>()
        / (n as f64);

    -grad_mse + 2.0 * lambda2 * beta
}
// ANCHOR_END: grad_loss_function_inline
}
Click to view optimizer.rs
#![allow(unused)]
fn main() {
/// Performs gradient descent to minimize the Ridge regression loss function.
///
/// # Arguments
///
/// * `grad_fn` - A function that computes the gradient of the Ridge loss
/// * `x` - Input features as a slice (`&[f64]`)
/// * `y` - Target values as a slice (`&[f64]`)
/// * `lambda2` - Regularization parameter
/// * `lr` - Learning rate
/// * `n_iters` - Number of gradient descent iterations
/// * `init_beta` - Initial value of the regression coefficient
///
/// # Returns
///
/// The optimized regression coefficient `beta` after `n_iters` updates
// ANCHOR: gradient_descent
pub fn gradient_descent(
    grad_fn: impl Fn(&[f64], &[f64], f64, f64) -> f64,
    x: &[f64],
    y: &[f64],
    lambda2: f64,
    lr: f64,
    n_iters: usize,
    init_beta: f64,
) -> f64 {
    let mut beta = init_beta;

    for _ in 0..n_iters {
        let grad = grad_fn(x, y, beta, lambda2);
        beta -= lr * grad;
    }

    beta
}
// ANCHOR_END: gradient_descent
}
Click to view utils.rs
#![allow(unused)]
fn main() {
/// Multiplies a vector by a scalar.
///
/// # Arguments
///
/// * `scalar` - A scalar multiplier
/// * `vector` - A slice of f64 values
///
/// # Returns
///
/// A new vector containing the result of element-wise multiplication
///
/// # Why `&[f64]` instead of `Vec<f64]`?
///
/// We use a slice (`&[f64]`) because:
/// - It's more general: works with both arrays and vectors
/// - It avoids unnecessary ownership
/// - It's idiomatic and Clippy-compliant
// ANCHOR: mul_scalar_vec
pub fn mul_scalar_vec(scalar: f64, vector: &[f64]) -> Vec<f64> {
    vector.iter().map(|x| x * scalar).collect()
}
// ANCHOR_END: mul_scalar_vec

/// Subtracts two vectors element-wise.
///
/// # Arguments
///
/// * `a` - First input slice
/// * `b` - Second input slice
///
/// # Returns
///
/// A new `Vec<f64>` containing the element-wise difference `a[i] - b[i]`.
///
/// # Panics
///
/// Panics if `a` and `b` do not have the same length.
// ANCHOR: subtract_vectors
pub fn subtract_vectors(a: &[f64], b: &[f64]) -> Vec<f64> {
    assert_eq!(a.len(), b.len(), "Input vectors must have the same length");
    a.iter().zip(b.iter()).map(|(x, y)| x - y).collect()
}
// ANCHOR_END: subtract_vectors

/// Dot product between two vectors.
///
/// # Arguments
/// * `a` - First input vector
/// * `b` - Second input vector
///
/// # Returns
///
/// The float value of the dot product.
///
/// # Panics
///
/// Panics if `a` and `b` do have the same length.
// ANCHOR: dot
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "Input vectors must have the same length");
    a.iter().zip(b.iter()).map(|(xi, yi)| xi * yi).sum()
}
// ANCHOR_END: dot
}

Note that the layout can be more complicated by introducing modules and submodules. This will be covered in the next chapter when we implement a structured-oriented version of the 1D Ridge regression.

What's lib.rs?

The lib.rs file is the entry point for the crate as a library. This is where we declare which modules (i.e., other .rs files) are exposed to the outside world.

#![allow(unused)]
fn main() {
pub mod grad_functions;
pub mod loss_functions;
pub mod optimizer;
pub mod utils;
}

Each line tells Rust:

“There is a file called X.rs that defines a module X. Please include it in the crate.”

By default, items inside a module are private. That’s where pub comes in.

We will dive deeper into lib.rs in the 2.1.5 Exposing API chapter.

Why pub?

If you want to use a function from another module or crate, you must declare it pub (public). For example:

#![allow(unused)]
fn main() {
// In utils.rs
pub fn dot(a: &[f64], b: &[f64]) -> f64 { ... }
}

If dot is not marked as pub, you can’t use it outside utils.rs, even from optimizer.rs.

Importing between modules

Rust requires explicit imports between modules. For example, in optimizer.rs, we want to use the dot function from utils.rs:

#![allow(unused)]
fn main() {
use crate::utils::dot;
}

Here, crate refers to the root of this library crate lib.rs. If you check out one of the modules again (e.g., loss_functions.rs), you'll notice that's exactly what we are doing to import functions from other modules.

Example of usage in main.rs

Now let's see how you could use the library from a binary crate:

fn main() {
    ridge_regression_1d::run_demo();
}

You can run this with cargo run.

Summary

This chapter demonstrated how to:

  • Implement the 1D Ridge regression in sample ways by relying on Rust standard library only.
  • Organize a crate into multiple source files (modules)
  • Use pub to expose functions
  • Import functions from other modules
  • Call everything together from a main.rs

This is idiomatic Rust structure and prepares you to scale beyond toy examples while staying modular and readable.

Exposing a clean API

Until now, we've manually chained together the loss, gradient, and optimization steps. This is great for learning, but in real projects, we often want a simplified and reusable API.

Rust gives us a clean way to do this by leveraging the lib.rs file as the public interface to our crate.

lib.rs as a public API

In your crate, lib.rs is responsible for organizing and exposing the components we want users to interact with.

We can re-export key functions and define top-level utilities like fit and predict. The complete lib.rs file now looks like this:

#![allow(unused)]
fn main() {
/// This module exposes the main components of the Ridge Regression 1D crate.
///
/// It re-exports internal modules and defines high-level helper functions
/// for training (`fit`) and predicting (`predict`) using Ridge regression.
pub mod grad_functions;
pub mod loss_functions;
pub mod optimizer;
pub mod utils;

pub use grad_functions::grad_loss_function_inline;
pub use optimizer::gradient_descent;

/// Fits a Ridge regression model using gradient descent.
///
/// # Arguments
///
/// * `x` - Input features (`&[f64]`)
/// * `y` - Target values (`&[f64]`)
/// * `lambda2` - Regularization strength
/// * `lr` - Learning rate
/// * `n_iters` - Number of gradient descent iterations
/// * `init_beta` - Initial value of the coefficient
///
/// # Returns
///
/// The optimized coefficient `beta` as `f64`.
pub fn fit(x: &[f64], y: &[f64], lambda2: f64, lr: f64, n_iters: usize, init_beta: f64) -> f64 {
    gradient_descent(
        grad_loss_function_inline,
        x,
        y,
        lambda2,
        lr,
        n_iters,
        init_beta,
    )
}

/// Predicts output values using a trained Ridge regression coefficient.
///
/// # Arguments
///
/// * `x` - Input features (`&[f64]`)
/// * `beta` - Trained coefficient
///
/// # Returns
///
/// A `Vec<f64>` with predicted values.
pub fn predict(x: &[f64], beta: f64) -> Vec<f64> {
    x.iter().map(|xi| xi * beta).collect()
}
}

Everything declared pub is available to the user.

Example of usage

You can update your binary entry point to try out the public API.

use ridge_regression_1d::{fit, predict};

fn main() {
    let x = vec![1.0, 2.0, 3.0];
    let y = vec![2.0, 4.0, 6.0];

    let beta = fit(&x, &y, 0.1, 0.01, 1000, 0.0);
    let preds = predict(&x, beta);

    println!("Learned beta: {}", beta);
    println!("Predictions: {:?}", preds);
}

Structured · std only 🦀

🦀 This implementation uses:

  • Structs and impl blocks
  • Rust standard library only
  • No external crates
  • Optional use of traits and encapsulation

Closed-form solution

We now present a structured implementation of the 1D Ridge estimator using a dedicated RidgeEstimator struct. We implement the same methods, i.e.,

  • fit: the method to compute the optimal from data and the regularization parameter ,
  • predict: the method to compute predictions from new data,

but rely on Rust's struct and impl to define a new type. We also an additional method new, a constructor to initialize the estimator with an initial value of .

Struct definition

This simple struct stores the estimated coefficient as a field.

#![allow(unused)]
fn main() {
pub struct RidgeEstimator {
    beta: f64,
}
}

Constructor and methods

Once the struct is defined, we can implement the constructor new, and the methods fit and predict.

#![allow(unused)]
fn main() {
impl RidgeEstimator {
    pub fn new(init_beta: f64) -> Self {
        Self { beta: init_beta }
    }

    fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) {
        let n: usize = x.len();
        assert_eq!(n, y.len(), "x and y must have the same length");

        let x_mean: f64 = x.iter().sum::<f64>() / n as f64;
        let y_mean: f64 = y.iter().sum::<f64>() / n as f64;

        let num: f64 = x
            .iter()
            .zip(y)
            .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean))
            .sum::<f64>();

        let denom: f64 =
            x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64);

        self.beta = num / denom;
    }

    fn predict(&self, x: &[f64]) -> Vec<f64> {
        x.iter().map(|xi| self.beta * xi).collect()
    }
}
}

Note that we can decompose the implementation into as many blocks as we want:

#![allow(unused)]
fn main() {
impl RidgeEstimator {
    pub fn new(init_beta: f64) -> Self {
        Self { beta: init_beta }
    }
}

impl RidgeEstimator {
    fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) {
        ...
    }
}

impl RidgeEstimator {
    fn predict(&self, x: &[f64]) -> Vec<f64> {
        ...
    }
}
}

This can be useful when dealing with complex methods.

Example of usage

Here is how we can use our new Ridge estimator:

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0];
    let y = vec![2.1, 4.1, 6.2, 8.3];
    let lambda = 0.1;

    let mut model = RidgeEstimator::new(0.0);
    model.fit(&x, &y, lambda);

    let predictions = model.predict(&x);
    println!("Predictions: {:?}", predictions);
}

Gradient descent

As an another illustration of struct and impl, let's tackle the gradient descent method for the Ridge regression again. We use the following structure:

#![allow(unused)]
fn main() {
pub struct RidgeGradientDescent {
    beta: f64,
    n_iters: usize,
    lr: f64,
}
}

This struct stores the current coefficient , the number of iterations to run, and the learning rate. We can subsequently implement the constructor and all the methods required to perform gradient descent.

#![allow(unused)]
fn main() {
impl RidgeGradientDescent {
    pub fn new(n_iters: usize, lr: f64, init_beta: f64) -> Self {
        Self {
            beta: init_beta,
            n_iters,
            lr,
        }
    }

    fn grad_function(&self, x: &[f64], y: &[f64], lambda2: f64) -> f64 {
        assert_eq!(x.len(), y.len(), "x and y must have the same length");
        let n: usize = x.len();
        let grad_mse: f64 = x
            .iter()
            .zip(y.iter())
            .map(|(xi, yi)| {
                let error = yi - self.beta * xi;
                2.0 * error * xi
            })
            .sum::<f64>()
            / (n as f64);

        -grad_mse + 2.0 * lambda2 * self.beta
    }

    fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) {
        for _ in 0..self.n_iters {
            let grad = self.grad_function(x, y, lambda2);
            self.beta -= self.lr * grad;
        }
    }

    fn predict(&self, x: &[f64]) -> Vec<f64> {
        x.iter().map(|xi| self.beta * xi).collect()
    }
}
}

Example of usage

Here is how we can use our new Ridge estimator:

fn main() {
    let x = vec![1.0, 2.0, 3.0, 4.0];
    let y = vec![2.1, 4.1, 6.2, 8.3];

    let mut model = RidgeGradientDescent::new(1000, 0.01, 0.0);
    model.fit(&x, &y, 0.1);

    let predictions = model.predict(&x);
    println!("Predictions: {:?}", predictions);
}

Base Ridge model using traits

We implement two different Ridge estimators in the previous sections. While these implementations solve the same problem, their fit logic is different. To unify their interface and promote code reuse, we can leverage Rust's trait mechanism.

We define a common trait RidgeModel, which describes the shared behavior of any Ridge estimator:

#![allow(unused)]
fn main() {
pub trait RidgeModel {
    fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64);
    fn predict(&self, x: &[f64]) -> Vec<f64>;
}
}

Any type that implements this trait must provide a fit method to train the model and a predict method to make predictions.

Both implementations use the same logic to produce predictions from a scalar . We can move this logic to a shared helper function:

#![allow(unused)]
fn main() {
fn predict_from_beta(beta: f64, x: &[f64]) -> Vec<f64> {
    x.iter().map(|xi| beta * xi).collect()
}
}

Trait implementation: gradient descent method

Recall that our RidgeGradientDescent type is defined as follows:

#![allow(unused)]
fn main() {
pub struct RidgeGradientDescent {
    beta: f64,
    n_iters: usize,
    lr: f64,
}
}

We still need to define the constructor and the gradient function:

#![allow(unused)]
fn main() {
impl RidgeGradientDescent {
    pub fn new(n_iters: usize, lr: f64, init_beta: f64) -> Self {
        Self {
            beta: init_beta,
            n_iters,
            lr,
        }
    }

    fn grad_function(&self, x: &[f64], y: &[f64], lambda2: f64) -> f64 {
        assert_eq!(x.len(), y.len(), "x and y must have the same length");
        let n: usize = x.len();
        let grad_mse: f64 = x
            .iter()
            .zip(y.iter())
            .map(|(xi, yi)| {
                let error = yi - self.beta * xi;
                2.0 * error * xi
            })
            .sum::<f64>()
            / (n as f64);

        -grad_mse + 2.0 * lambda2 * self.beta
    }
}
}

Once this is done, we need to implement the required methods to be a RidgeModel:

#![allow(unused)]
fn main() {
impl RidgeModel for RidgeGradientDescent {
    fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) {
        for _ in 0..self.n_iters {
            let grad = self.grad_function(x, y, lambda2);
            self.beta -= self.lr * grad;
        }
    }

    fn predict(&self, x: &[f64]) -> Vec<f64> {
        predict_from_beta(self.beta, x)
    }
}
}

Trait implementation: closed-form estimator

We do the same for the RidgeEstimator that uses the analytical formula:

#![allow(unused)]
fn main() {
pub struct RidgeEstimator {
    beta: f64,
}

impl RidgeEstimator {
    pub fn new(init_beta: f64) -> Self {
        Self { beta: init_beta }
    }
}

impl RidgeModel for RidgeEstimator {
    fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) {
        let n: usize = x.len();
        assert_eq!(n, y.len(), "x and y must have the same length");

        let x_mean: f64 = x.iter().sum::<f64>() / n as f64;
        let y_mean: f64 = y.iter().sum::<f64>() / n as f64;

        let num: f64 = x
            .iter()
            .zip(y)
            .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean))
            .sum::<f64>();

        let denom: f64 =
            x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64);

        self.beta = num / denom;
    }

    fn predict(&self, x: &[f64]) -> Vec<f64> {
        predict_from_beta(self.beta, x)
    }
}
}

That's it ! The usage remains the same but we slighly refactored our code.

Closed-form estimator with generic types and trait bounds

What are generics?

Generics let you write code that works with many types, not just one.

Instead of writing:

#![allow(unused)]
fn main() {
struct RidgeEstimator {
    beta: f64,
}
}

You can write:

#![allow(unused)]
fn main() {
struct RidgeEstimator<F> {
    beta: F,
}
}

Here, F is a type parameter — it could be f32, f64, or another type. In Rust, generic types have no behavior by default.

Bug

#![allow(unused)]
fn main() {
fn sum(xs: &[F]) -> F {
    xs.iter().sum() // This will not compile
}
}

The compiler gives an error: "F might not implement Sum, so I don’t know how to .sum() over it."

Trait bounds

To fix that, we must tell the compiler which traits F should implement.

For example:

#![allow(unused)]
fn main() {
impl<F: Float + Sum> RidgeModel<F> for GenRidgeEstimator<F> {
    ...
}
}

This means:

  • F must implement Float (it must behave like a floating point number: support powi, abs, etc.)
  • F must implement Sum (so we can sum an iterator of F)

This allows code like:

#![allow(unused)]
fn main() {
let mean = xs.iter().copied().sum::<F>() / F::from(xs.len()).unwrap();
}

Using generic bounds allows the estimator to work with f32, f64, or any numeric type implementing Float. The compiler can generate specialized code for each concrete type.

Generic Ridge estimator

Our Ridge estimator that works with f32 and f64 takes this form:

#![allow(unused)]
fn main() {
use num_traits::Float;
use std::iter::Sum;

/// A trait representing a generic Ridge regression model.
///
/// The model must support fitting to training data and predicting new outputs.
// ANCHOR: ridge_model_trait
pub trait RidgeModel<F: Float + Sum> {
    /// Fits the model to the given data using Ridge regression.
    fn fit(&mut self, x: &[F], y: &[F], lambda2: F);

    /// Predicts output values for a slice of new input features.
    fn predict(&self, x: &[F]) -> Vec<F>;
}
// ANCHOR_END: ridge_model_trait

/// A generic Ridge regression estimator using a single coefficient `beta`.
///
/// This implementation assumes a linear relationship between `x` and `y`
/// and performs scalar Ridge regression (1D).
// ANCHOR: gen_ridge_estimator
pub struct GenRidgeEstimator<F: Float + Sum> {
    beta: F,
}
// ANCHOR_END: gen_ridge_estimator

// ANCHOR: gen_ridge_estimator_impl
impl<F: Float + Sum> GenRidgeEstimator<F> {
    /// Creates a new estimator with the given initial beta coefficient.
    pub fn new(init_beta: F) -> Self {
        Self { beta: init_beta }
    }
}
// ANCHOR_END: gen_ridge_estimator_impl

// ANCHOR: gen_ridge_estimator_trait_impl
impl<F: Float + Sum> RidgeModel<F> for GenRidgeEstimator<F> {
    /// Fits the Ridge regression model to 1D data using closed-form solution.
    ///
    /// This method computes the regression coefficient `beta` by minimizing
    /// the Ridge-regularized least squares loss.
    ///
    /// # Arguments
    /// - `x`: Input features.
    /// - `y`: Target values.
    /// - `lambda2`: The regularization parameter (λ²).
    fn fit(&mut self, x: &[F], y: &[F], lambda2: F) {
        let n: usize = x.len();
        let n_f: F = F::from(n).unwrap();
        assert_eq!(x.len(), y.len(), "x and y must have the same length");

        let x_mean: F = x.iter().copied().sum::<F>() / n_f;
        let y_mean: F = y.iter().copied().sum::<F>() / n_f;

        let num: F = x
            .iter()
            .zip(y.iter())
            .map(|(xi, yi)| (*xi - x_mean) * (*yi - y_mean))
            .sum::<F>();

        let denom: F = x.iter().map(|xi| (*xi - x_mean).powi(2)).sum::<F>() + lambda2 * n_f;

        self.beta = num / denom;
    }

    /// Applies the trained model to input features to generate predictions.
    ///
    /// # Arguments
    /// - `x`: Input features to predict from.
    ///
    /// # Returns
    /// A vector of predicted values, one for each input in `x`.
    fn predict(&self, x: &[F]) -> Vec<F> {
        x.iter().map(|xi| *xi * self.beta).collect()
    }
}
// ANCHOR_END: gen_ridge_estimator_trait_impl
}

Notice that the trait bounds <F: Float + Sum> RidgeModel<F> are defined after the name of a trait or struct, or right next to an impl.

Summary

  • Generics support type-flexible code.
  • Trait bounds like <F: Float + Sum> constrain what operations are valid.
  • Without Sum, the compiler does not allow .sum() on iterators of F.

Try removing Sum from the bound:

#![allow(unused)]
fn main() {
impl<F: Float> RidgeModel<F> for GenRidgeEstimator<F>
}

And keep a call to .sum(). The compiler should complain:

error[E0599]: the method `sum` exists for iterator `std::slice::Iter<'_, F>`,
              but its trait bounds were not satisfied

To resolve this, add + Sum to the bound.

Closed-form solution with ndarray, Option, and error handling

This section introduces several important features of the language:

  • Using the ndarray crate for numerical arrays
  • Representing optional values with Option, Some, and None
  • Using pattern matching with match
  • Handling errors using Box<dyn Error> and .into()
  • Automatically deriving common trait implementations using #[derive(...)]

Motivation

In previous sections, we worked with Vec<f64> and returned plain values. In practice, we might need:

  • Efficient linear algebra tools, provided by external crates such as ndarray and nalgebra
  • A way to represent "fitted" or "not fitted" states, using Option<f64>
  • A way to return errors when something goes wrong, using Result<_, Box<dyn Error>>
  • Automatically implementing traits like Debug, Clone, and Default to simplify testing, debugging, and construction

We combine these in the implementation of the analytical RidgeEstimator.

The full code

#![allow(unused)]
fn main() {
use ndarray::Array1;
use std::error::Error;

/// A Ridge regression estimator using `ndarray` for vectorized operations.
///
/// This version supports fitting and predicting using `Array1<f64>` arrays.
/// The coefficient `beta` is stored as an `Option<f64>`, allowing the model
/// to represent both fitted and unfitted states.
// ANCHOR: struct
#[derive(Debug, Clone, Default)]
pub struct RidgeEstimator {
    beta: Option<f64>,
}
// ANCHOR_END: struct

// ANCHOR: ridge_estimator_impl_new_fit
impl RidgeEstimator {
    /// Creates a new, unfitted Ridge estimator.
    ///
    /// # Returns
    /// A `RidgeEstimator` with `beta` set to `None`.
    pub fn new() -> Self {
        Self { beta: None }
    }

    /// Fits the Ridge regression model using 1D input and output arrays.
    ///
    /// This function computes the coefficient `beta` using the closed-form
    /// solution with L2 regularization.
    ///
    /// # Arguments
    /// - `x`: Input features as a 1D `Array1<f64>`.
    /// - `y`: Target values as a 1D `Array1<f64>`.
    /// - `lambda2`: The regularization strength (λ²).
    pub fn fit(&mut self, x: &Array1<f64>, y: &Array1<f64>, lambda2: f64) {
        let n: usize = x.len();
        assert!(n > 0);
        assert_eq!(x.len(), y.len(), "x and y must have the same length");

        // mean returns None if the array is empty, so we need to unwrap it
        let x_mean: f64 = x.mean().unwrap();
        let y_mean: f64 = y.mean().unwrap();

        let num: f64 = (x - x_mean).dot(&(y - y_mean));
        let denom: f64 = (x - x_mean).mapv(|z| z.powi(2)).sum() + lambda2 * (n as f64);

        self.beta = Some(num / denom);
    }
}
// ANCHOR_END: ridge_estimator_impl_new_fit

// ANCHOR: ridge_estimator_impl_predict
impl RidgeEstimator {
    /// Predicts target values given input features.
    ///
    /// # Arguments
    /// - `x`: Input features as a 1D array.
    ///
    /// # Returns
    /// A `Result` containing the predicted values, or an error if the model
    /// has not been fitted.
    pub fn predict(&self, x: &Array1<f64>) -> Result<Array1<f64>, Box<dyn Error>> {
        match &self.beta {
            Some(beta) => Ok(*beta * x),
            None => Err("Model not fitted".into()),
        }
    }
}
// ANCHOR_END: ridge_estimator_impl_predict
}

1. ndarray instead of Vec<f64>

Rust’s standard library does not include built-in numerical computing tools. The ndarray crate provides efficient n-dimensional arrays and vectorized operations.

This example uses:

  • Array1<f64> for 1D arrays
  • .mean(), .dot(), and .mapv() for basic mathematical operations
  • Broadcasting (x * beta) for scalar–array multiplication

2. Representing model state with Option<f64>

The model's coefficient beta is only available after fitting. To represent this, we use:

#![allow(unused)]
fn main() {
beta: Option<f64>
}

This means beta can be either:

  • Some(value): if the model is trained
  • None: if the model has not been fitted yet

This eliminates the possibility of using an uninitialized value.

3. Pattern matching with match

To check whether the model has been fitted, we use pattern matching:

#![allow(unused)]
fn main() {
match self.beta {
    Some(beta) => Ok(x * beta),
    None => Err("Model not fitted".into()),
}
}

Pattern matching ensures that all possible cases of the Option type are handled explicitly. In this case, the prediction will only be computed if beta is not None, and an error is thrown otherwise.

The error handling is explain hereafter.

4. Error handling with Box<dyn Error> and .into()

Rust requires functions to return a single concrete error type. In practice, this can be achieved in several ways. Here we use a trait object:

#![allow(unused)]
fn main() {
Result<Array1<f64>, Box<dyn Error>>
}

If the function succeeds, it must return a Array1<f64>.

If it doesn't succeed, we allow the function to return any error type that implements the Error trait. The .into() method converts a string literal into a Box<dyn Error>. Internally, Rust converts:

#![allow(unused)]
fn main() {
"Model not fitted"
}

into:

#![allow(unused)]
fn main() {
Box::new(String::from("Model not fitted"))
}

It is worth emphasizing that Box<dyn Error> means that the error is heap-allocated.

5. Using #[derive(...)] for common traits

Rust allows us to automatically implement certain traits using the #[derive(...)] attribute. In this example, we write:

#![allow(unused)]
fn main() {
#[derive(Debug, Clone, Default)]
pub struct RidgeEstimator {
    beta: Option<f64>,
}
}

This provides the following implementations:

  • Debug: Enables printing the struct with {:?}, useful for debugging.
  • Clone: Allows duplicating the struct with .clone().
  • Default: Provides a default constructor (RidgeEstimator::default()), which sets beta to None.

By deriving these traits, we avoid writing repetitive code and ensure that the model is compatible with common Rust conventions, such as default initialization and copy-on-write semantics.

Advanced error handling

We could have gone even further by defining a custom ModelError type as follows.

#![allow(unused)]
fn main() {
use thiserror::Error;

#[derive(Debug, Error)]
pub enum ModelError {
    #[error("Model is not fitted yet")]
    NotFitted,
    #[error("Dimension mismatch")]
    DimensionMismatch,
}
}

This approach uses the thiserror crate to simplify the implementation of the standard Error trait.

By deriving #[derive(Debug, Error)] and annotating each variant with #[error("...")], we define error messages rightaway.

The predict function would be rewritten as:

#![allow(unused)]
fn main() {
pub fn predict(&self, x: &Array1<f64>) -> Result<f64, ModelError> {
    match &self.beta {
        Some(beta) => {
            if beta.len() != x.len() {
                return Err(ModelError::DimensionMismatch);
            }
            Ok(beta.dot(x))
        }
        None => Err(ModelError::NotFitted),
    }
}
}

Simple optimizers

This section explores how to implement a small module of optimization algorithms in Rust.

We begin by defining a common interface for optimizers and show how different strategies like gradient descent and momentum-based methods can be implemented using Rust's trait system.

Then, we explore an alternative design using enums, which can be helpful when working with simpler control flow or dynamic dispatch.

Finally, we demonstrate how to replace Vec<f64> with ndarray structures, which allows for more expressive and efficient numerical code, especially for larger-scale or matrix-based computations.

The goal is to gradually expose the design space for writing numerical algorithms idiomatically in Rust.

Optimizers using traits

This chapter illustrates how to use traits for implementing a module of optimizers. This approach is useful when you want polymorphism or when each optimizer requires its own state and logic.

It's similar to what you might do in other languages such as Python or C++, and it's a good fit for applications that involve multiple algorithm variants.

Trait definition

We define a common trait Optimizer, which describes the shared behavior of any optimizer. Let's assume that our optimizers only need a step function.

#![allow(unused)]
fn main() {
/// A trait representing an optimization algorithm that can update weights using gradients.
///
/// Optimizers must implement the `step` method, which modifies weights in place.
pub trait Optimizer {
    /// Performs a single optimization step.
    ///
    /// # Arguments
    /// - `weights`: Mutable slice of parameters to be updated.
    /// - `grads`: Slice of gradients corresponding to the weights.
    fn step(&mut self, weights: &mut [f64], grads: &[f64]);
}
}

Any type that implements this trait must provide a step method. Let's illustrate how to use this by implementing two optimizers: gradient descent with and without momentum.

Gradient descent

We first define the structure for the gradient descent algorithm. It only stores the learning rate as a f64.

#![allow(unused)]
fn main() {
/// Basic gradient descent optimizer.
///
/// Updates each weight by subtracting the gradient scaled by a learning rate.
pub struct GradientDescent {
    pub learning_rate: f64,
}
}

We then implement a constructor. In this case, it simply consists of choosing the learning rate.

#![allow(unused)]
fn main() {
impl GradientDescent {
    /// Creates a new gradient descent optimizer.
    ///
    /// # Arguments
    /// - `learning_rate`: Step size used to update weights.
    pub fn new(learning_rate: f64) -> Self {
        Self { learning_rate }
    }
}
}

Next, we implement the step method required by the Optimizer trait:

#![allow(unused)]
fn main() {
impl Optimizer for GradientDescent {
    /// Applies the gradient descent step to each weight.
    ///
    /// Each weight is updated as: `w ← w - learning_rate * grad`
    fn step(&mut self, weights: &mut [f64], grads: &[f64]) {
        for (w, g) in weights.iter_mut().zip(grads.iter()) {
            *w -= self.learning_rate * g;
        }
    }
}
}

This function updates each entry of weights by looping over the elements and applying the gradient descent update. We use elementwise operations because Vec doesn't provide built-in arithmetic methods. External crates such as ndarray or nalgebra could help write this more expressively.

Gradient descent with momentum

Now let’s implement gradient descent with momentum. The structure stores the learning rate, the momentum factor, and an internal velocity buffer:

#![allow(unused)]
fn main() {
/// Momentum-based gradient descent optimizer.
///
/// Combines current gradients with a velocity term to smooth updates.
pub struct Momentum {
    pub learning_rate: f64,
    pub momentum: f64,
    pub velocity: Vec<f64>,
}
}

We define the constructor by taking the required parameters, and we initialize the velocity to a zero vector:

#![allow(unused)]
fn main() {
impl Momentum {
    /// Creates a new momentum optimizer.
    ///
    /// # Arguments
    /// - `learning_rate`: Step size used to update weights.
    /// - `momentum`: Momentum coefficient (typically between 0.8 and 0.99).
    /// - `dim`: Dimension of the parameter vector, used to initialize velocity.
    pub fn new(learning_rate: f64, momentum: f64, dim: usize) -> Self {
        Self {
            learning_rate,
            momentum,
            velocity: vec![0.0; dim],
        }
    }
}
}

The step function is slightly more complex, as it performs elementwise operations over the weights, velocity, and gradients:

#![allow(unused)]
fn main() {
impl Optimizer for Momentum {
    /// Applies the momentum update step.
    ///
    /// Each step uses the update rule:
    /// ```text
    /// v ← momentum * v + learning_rate * grad
    /// w ← w - v
    /// ```
    fn step(&mut self, weights: &mut [f64], grads: &[f64]) {
        for ((w, g), v) in weights
            .iter_mut()
            .zip(grads.iter())
            .zip(self.velocity.iter_mut())
        {
            *v = self.momentum * *v + self.learning_rate * *g;
            *w -= *v;
        }
    }
}
}

At this point, we've defined two optimizers using structs and a shared trait. To complete the module, we define a training loop that uses any optimizer implementing the trait.

Public API

We expose the training loop in lib.rs as the public API. The function run_optimization takes a generic Optimizer, a gradient function, an initial weight vector, and a maximum number of iterations.

#![allow(unused)]
fn main() {
pub mod enum_based;
pub mod traits_and_ndarray;
pub mod traits_based;
use traits_based::optimizers::Optimizer;

pub fn run_optimization<O: Optimizer>(
    optimizer: &mut O,
    weights: &mut [f64],
    grad_fn: impl Fn(&[f64]) -> Vec<f64>,
    num_steps: usize,
) {
    for _ in 0..num_steps {
        let grads = grad_fn(weights);
        optimizer.step(weights, &grads);
    }
}
}

Example of usage

Here’s a simple example where we minimize the function in :

#![allow(unused)]
fn main() {
use traits_based::optimizers::Momentum;

fn grad_fn(w: &[f64]) -> Vec<f64> {
    w.iter().map(|wi| 2.0 * wi.powi(2)).collect()
}

let n: usize = 10;
let mut weights = vec![1.0; n];
let mut optimizer = Momentum::new(0.01, 0.9, n);

run_optimization(&mut optimizer, &mut weights, grad_fn, 100);

// Final weights after optimization
println!("{:?}", weights); 
}

Some notes:

  • Both weights and optimizer must be mutable, because we perform in-place updates.
  • We pass mutable references into run_optimization, matching its function signature.
  • The example uses a closure-based gradient function, which you can easily replace.

Optimizers as enums with internal state and methods

This chapter builds on the previous enum-based optimizer design. We now give each variant its own internal state and encapsulate behavior using methods. This pattern is useful when you want enum-based control flow with encapsulated logic.

Defining the optimizer enum

Each optimizer variant includes its own parameters and, when needed, its internal state.

#![allow(unused)]
fn main() {
/// An enum representing different optimizers with built-in state and update rules.
///
/// Supports both gradient descent and momentum-based methods.
#[derive(Debug, Clone)]
pub enum Optimizer {
    /// Gradient Descent optimizer with a fixed learning rate.
    GradientDescent { learning_rate: f64 },
    /// Momentum-based optimizer with velocity tracking.
    Momentum {
        learning_rate: f64,
        momentum: f64,
        velocity: Vec<f64>,
    },
}
}

Here, GradientDescent stores only the learning rate, while Momentum additionally stores its velocity vector.

Constructors

We define convenience constructors for each optimizer. These make usage simpler and avoid manually writing match arms.

#![allow(unused)]
fn main() {
impl Optimizer {
    /// Creates a new Gradient Descent optimizer.
    ///
    /// # Arguments
    /// - `learning_rate`: Step size for the parameter updates.
    pub fn gradient_descent(learning_rate: f64) -> Self {
        Self::GradientDescent { learning_rate }
    }

    /// Creates a new Momentum optimizer.
    ///
    /// # Arguments
    /// - `learning_rate`: Step size for the updates.
    /// - `momentum`: Momentum coefficient.
    /// - `dim`: Number of parameters (used to initialize velocity vector).
    pub fn momentum(learning_rate: f64, momentum: f64, dim: usize) -> Self {
        Self::Momentum {
            learning_rate,
            momentum,
            velocity: vec![0.0; dim],
        }
    }
}
}

This helps create optimizers in a more idiomatic and clean way.

Implementing the step method

The step method applies one optimization update depending on the variant. The method uses pattern matching to extract variant-specific behavior.

#![allow(unused)]
fn main() {
impl Optimizer {
    /// Applies a single optimization step, depending on the variant.
    ///
    /// # Arguments
    /// - `weights`: Mutable slice of model parameters to be updated.
    /// - `grads`: Gradient slice with same shape as `weights`.
    pub fn step(&mut self, weights: &mut [f64], grads: &[f64]) {
        match self {
            Optimizer::GradientDescent { learning_rate } => {
                for (w, g) in weights.iter_mut().zip(grads.iter()) {
                    *w -= *learning_rate * *g;
                }
            }
            Optimizer::Momentum {
                learning_rate,
                momentum,
                velocity,
            } => {
                for ((w, g), v) in weights
                    .iter_mut()
                    .zip(grads.iter())
                    .zip(velocity.iter_mut())
                {
                    *v = *momentum * *v + *learning_rate * *g;
                    *w -= *v;
                }
            }
        }
    }
}
}

Here, GradientDescent simply applies the learning rate times the gradient. The Momentum variant updates and stores the velocity vector before updating the weights.

Summary

This pattern can be a clean alternative to trait-based designs when you want:

  • A small number of well-known variants
  • Built-in state encapsulation
  • Exhaustive handling via pattern matching

It keeps related logic grouped under one type and can be extended easily with new optimizers.

Adding tests

In order to test our optimizers, we propose to have a look at how to implement tests and run them.

How to write tests in Rust

Tests can be included in the same file as the code using the #[cfg(test)] module. Each test function is annotated with #[test]. Inside a test, you can use assert_eq!, assert!, or similar macros to validate expected behavior.

What we test

We implemented a few tests to check:

  • That the constructors return the expected variant with the correct parameters
  • That the step method modifies weights as expected
  • That repeated calls to step update the internal state correctly (e.g., momentum's velocity)
#![allow(unused)]
fn main() {
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_gradient_descent_constructor() {
        let opt = Optimizer::gradient_descent(0.01);
        match opt {
            Optimizer::GradientDescent { learning_rate } => {
                assert_eq!(learning_rate, 0.01);
            }
            _ => panic!("Expected GradientDescent optimizer"),
        }
    }

    #[test]
    fn test_momentum_constructor() {
        let opt = Optimizer::momentum(0.01, 0.9, 10);
        match opt {
            Optimizer::Momentum {
                learning_rate,
                momentum,
                velocity,
            } => {
                assert_eq!(learning_rate, 0.01);
                assert_eq!(momentum, 0.9);
                assert_eq!(velocity.len(), 10);
            }
            _ => panic!("Expected Momentum optimizer"),
        }
    }

    #[test]
    fn test_step_gradient_descent() {
        let mut opt = Optimizer::gradient_descent(0.1);
        let mut weights = vec![1.0, 2.0, 3.0];
        let grads = vec![0.5, 0.5, 0.5];

        opt.step(&mut weights, &grads);

        assert_eq!(weights, vec![0.95, 1.95, 2.95])
    }

    #[test]
    fn test_step_momentum() {
        let mut opt = Optimizer::momentum(0.1, 0.9, 3);
        let mut weights = vec![1.0, 2.0, 3.0];
        let grads = vec![0.5, 0.5, 0.5];

        opt.step(&mut weights, &grads);
        assert_eq!(weights, vec![0.95, 1.95, 2.95]);

        opt.step(&mut weights, &grads);
        assert!(
            weights
                .iter()
                .zip(vec![0.855, 1.855, 2.855])
                .all(|(a, b)| (*a - b).abs() < 1e-6)
        );
    }
}
}

Some notes:

  • This module is added in the same file where the optimizers are implemented.
  • The line use super::*; tells the compiler to import all the stuff available in the module.

How to run the tests

To run the tests from the command line, use:

cargo test

This will automatically find and execute all test functions in the project. You should see output like:

running 4 tests
test tests::test_gradient_descent_constructor ... ok
test tests::test_momentum_constructor ... ok
test tests::test_step_gradient_descent ... ok
test tests::test_step_momentum ... ok

If any test fails, Cargo will show which assertion failed and why.

Optimizers using ndarray

This section introduces a modular and idiomatic way to implement optimization algorithms in Rust using ndarray and traits. It is intended for readers who are already comfortable with basic Rust syntax and want to learn how to build reusable, extensible components in numerical computing.

In this example, we need to import the following types and traits:

#![allow(unused)]
fn main() {
use ndarray::Array;
use ndarray::Array1;
use ndarray::Zip;
}

You can inspect the full module that we're about the break down over here:

Click to view optimizers.rs
#![allow(unused)]
fn main() {
use ndarray::Array;
use ndarray::Array1;
use ndarray::Zip;

/// Trait for optimizers that update parameters using gradients.
///
/// Implementors must define a `run` method that takes mutable weights,
/// a gradient function, and the number of iterations to run.
// ANCHOR: trait
pub trait Optimizer {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    );
}
// ANCHOR_END: trait

/// Basic Gradient Descent (GD) optimizer.
///
/// Updates parameters in the direction of the negative gradient scaled
/// by a fixed step size.
// ANCHOR: struct_gd
pub struct GD {
    step_size: f64,
}
// ANCHOR_END: struct_gd

/// Create a new gradient descent optimizer.
///
/// # Arguments
/// - `step_size`: the learning rate.
// ANCHOR: impl_gd_new
impl GD {
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
// ANCHOR_END: impl_gd_new

/// Run the gradient descent optimizer.
///
/// For each step: `w ← w - step_size * grad(w)`
// ANCHOR: impl_gd_run
impl Optimizer for GD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            weights.zip_mut_with(&grads, |w, &g| {
                *w -= self.step_size * g;
            });
        }
    }
}
// ANCHOR_END: impl_gd_run

/// Accelerated Gradient Descent (AGD) with classical momentum.
///
/// Combines the previous velocity with the current gradient
/// to speed up convergence in convex problems.
// ANCHOR: struct_agd
pub struct AGD {
    step_size: f64,
    momentum: f64,
}
// ANCHOR_END: struct_agd

/// Create a new AGD optimizer.
///
/// # Arguments
/// - `step_size`: the learning rate.
/// - `momentum`: the momentum coefficient (typically between 0.8 and 0.99).
// ANCHOR: impl_agd_new
impl AGD {
    pub fn new(step_size: f64, momentum: f64) -> Self {
        Self {
            step_size,
            momentum,
        }
    }
}
// ANCHOR_END: impl_agd_new

/// Run AGD with momentum.
///
/// For each step:
/// ```text
/// v ← momentum * v - step_size * grad(w)
/// w ← w + v
/// ```
// ANCHOR: impl_agd_run
impl Optimizer for AGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let n: usize = weights.len();
        let mut velocity: Array1<f64> = Array::zeros(n);

        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            for ((w, g), v) in weights
                .iter_mut()
                .zip(grads.iter())
                .zip(velocity.iter_mut())
            {
                *v = self.momentum * *v - self.step_size * g;
                *w += *v;
            }
        }
    }
}
// ANCHOR_END: impl_agd_run

/// Adaptive Accelerated Gradient Descent using Nesterov's method.
///
/// This optimizer implements the variant from smooth convex optimization literature,
/// where extrapolation is based on the difference between consecutive y iterates.
///
/// References:
/// - Beck & Teboulle (2009), FISTA (but without proximal operator)
/// - Nesterov's accelerated gradient (original formulation)
// ANCHOR: AdaptiveAGD_struct
pub struct AdaptiveAGD {
    step_size: f64,
}
// ANCHOR_END: AdaptiveAGD_struct

// ANCHOR: AdaptiveAGD_impl_new
impl AdaptiveAGD {
    /// Create a new instance of AdaptiveAGD with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
// ANCHOR_END: AdaptiveAGD_impl_new

/// Run the optimizer for `n_steps` iterations.
///
/// # Arguments
/// - `weights`: mutable reference to the parameter vector (x₀), will be updated in-place.
/// - `grad_fn`: a function that computes ∇f(x) for a given x.
/// - `n_steps`: number of optimization steps to perform.
///
/// This implementation follows:
///
/// ```
/// y_{k+1} = x_k - α ∇f(x_k)
/// t_{k+1} = (1 + sqrt(1 + 4 t_k²)) / 2
/// x_{k+1} = y_{k+1} + ((t_k - 1)/t_{k+1}) * (y_{k+1} - y_k)
/// ```
// ANCHOR: AdaptiveAGD_impl_run
impl Optimizer for AdaptiveAGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();

        for _ in 0..n_steps {
            let grad = grad_fn(weights);
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });

            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());

            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });

            y_k = y_next;
            t_k = t_next;
        }
    }
}
// ANCHOR_END: AdaptiveAGD_impl_run
}

1. Trait-based design

We define a trait called Optimizer to represent any optimizer that can update model weights based on gradients. In contrast to the previous sections where we mostly implemented stepfunctions, here the trait requires implementors to define a run method with the following signature:

#![allow(unused)]
fn main() {
pub trait Optimizer {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    );
}
}

This method takes:

  • A mutable reference to a vector of weights (Array1<f64>).
  • A function that computes the gradient of the loss with respect to the weights. This grad_fn function takes itself a borrowed reference to the weights &Array1<f64> and outputs a new array Array1<f64>.
  • The number of iterations to perform.

This trait run defines the whole optimization algorithm.

2. Gradient descent (GD)

The GD struct implements basic gradient descent with a fixed step size:

#![allow(unused)]
fn main() {
pub struct GD {
    step_size: f64,
}
}

It has a constructor:

#![allow(unused)]
fn main() {
impl GD {
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
}

And implements Optimizer by subtracting the gradient scaled by the step size from the weights at each iteration.

#![allow(unused)]
fn main() {
impl Optimizer for GD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            weights.zip_mut_with(&grads, |w, &g| {
                *w -= self.step_size * g;
            });
        }
    }
}
}

Some notes:

  • At each iteration, we compute the gradient with let grads = grad_fn(weights), which is fine but it reallocates a new vector at each call. If we wanted to optimize the gradient computation, we could pre-allocate a buffer outside the loop and pass a mutable reference into the gradient function to avoid repeated allocations. This would require to change the signature of the grad_fn.

  • weights.zip_mut_with(&grads, |w, &g| {{ ... }}): This is a mutable zip operation from the ndarray crate. It walks over weights and grads, applying the closure to each pair.

  • zip_mut_with is a method defined by the Zip trait, which is implemented for ArrayBase, and in particular for Array1. That’s why we can call it directly on weights.

  • In the closure statement we wrote: |w, &g| *w -= self.step_size * g;. Here, w is a mutable reference to each weight element, so we dereference it using *w to update its value. The &g in the closure means we’re pattern-matching by reference to avoid cloning or copying each f64.

3. Accelerated gradient descent (AGD)

The AGD struct extends gradient descent with a classical momentum term:

#![allow(unused)]
fn main() {
pub struct AGD {
    step_size: f64,
    momentum: f64,
}
}

It has a constructor:

#![allow(unused)]
fn main() {
impl AGD {
    pub fn new(step_size: f64, momentum: f64) -> Self {
        Self {
            step_size,
            momentum,
        }
    }
}
}

This algorithm adds momentum to classical gradient descent. Instead of updating weights using just the current gradient, it maintains a velocity vector that accumulates the influence of past gradients. This helps smooth the trajectory and accelerates convergence on convex problems.

#![allow(unused)]
fn main() {
impl Optimizer for AGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let n: usize = weights.len();
        let mut velocity: Array1<f64> = Array::zeros(n);

        for _ in 0..n_steps {
            let grads = grad_fn(weights);
            for ((w, g), v) in weights
                .iter_mut()
                .zip(grads.iter())
                .zip(velocity.iter_mut())
            {
                *v = self.momentum * *v - self.step_size * g;
                *w += *v;
            }
        }
    }
}
}

Some notes:

  • We initialize a vector of zeros to track the momentum (velocity) across steps. It has the same length as the weights. This is achieved with: let mut velocity: Array1<f64> = Array::zeros(n). Note that we could have defined the velocity as an internal state variable within the struct defintion.

  • We use a triple nested zip to unpack the values of the weights, gradients, and velocity: for ((w, g), v) in weights.iter_mut().zip(grads.iter()).zip(velocity.iter_mut()). Here,

    • weights.iter_mut() gives a mutable reference to each weight,
    • grads.iter() provides read-only access to each gradient,
    • velocity.iter_mut() allows in-place updates of the velocity vector.

    This pattern allows us to update everything in one pass, element-wise.

  • Within the nested zip closure, we update the velocity using the momentum term and current gradient: *v = self.momentum * *v - self.step_size * g;

  • The weight is updated using the new velocity: *w += *v;. Again, we dereference w because it's a mutable reference.

4. Nesterov-based adaptive AGD

The AdaptiveAGD struct implements a more advanced optimizer based on Nesterov's method and FISTA:

#![allow(unused)]
fn main() {
pub struct AdaptiveAGD {
    step_size: f64,
}
}

It has a constructor:

#![allow(unused)]
fn main() {
impl AdaptiveAGD {
    /// Create a new instance of AdaptiveAGD with a given step size.
    ///
    /// The step size should be 1 / L, where L is the Lipschitz constant
    /// of the gradient of the objective function.
    pub fn new(step_size: f64) -> Self {
        Self { step_size }
    }
}
}

This algorithm implements an accelerated method inspired by Nesterov’s momentum and the FISTA algorithm. The key idea is to introduce an extrapolation step between iterates, controlled by a sequence t_k. This helps the optimizer "look ahead" and converge faster in smooth convex problems.

Update steps:

  • Compute a temporary point y_{k+1} by taking a gradient step from x_k.
  • Update the extrapolation coefficient t_{k+1}.
  • Combine y_{k+1} and y_k using a weighted average to get the new iterate x_{k+1}.
#![allow(unused)]
fn main() {
impl Optimizer for AdaptiveAGD {
    fn run(
        &self,
        weights: &mut Array1<f64>,
        grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>,
        n_steps: usize,
    ) {
        let mut t_k: f64 = 1.0;
        let mut y_k = weights.clone();

        for _ in 0..n_steps {
            let grad = grad_fn(weights);
            let mut y_next = weights.clone();
            Zip::from(&mut y_next).and(&grad).for_each(|y, &g| {
                *y -= self.step_size * g;
            });

            let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt());

            Zip::from(&mut *weights)
                .and(&y_next)
                .and(&y_k)
                .for_each(|x, &y1, &y0| {
                    *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0);
                });

            y_k = y_next;
            t_k = t_next;
        }
    }
}
}

Some notes:

  • We deliberately re-allocate multiple variables within the for loop (grad, y_next, t_next) but we could have pre-allocated buffers before the for loop.

  • The algorithm keeps track of two sequences: the main iterate (weights) and the extrapolated one (y_k). Before starting the for loop, we initialize y_k by cloning the weights: let mut y_k = weights.clone();.

  • The gradient is evaluated at the current weights, as in standard gradient descent: let grad = grad_fn(weights);. Since weights is a mutable reference, we can pass it straightaway to our grad_fn.

  • A temporary variable to store the new extrapolated point. This is again a full allocation and clone for clarity. let mut y_next = weights.clone();.

  • We next compute: y_{k+1} = x_k - α ∇f(x_k) using an element-wise operation: Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });. This time, we rely on the Zip::from trait implement by ndarray.

  • The new weights are obtained by combining y_{k+1} and y_k. The triple zip walks over the current weights and both extrapolation points: Zip::from(&mut *weights)....

This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with ndarray.

Summary

This design demonstrates a few Rust programming techniques:

  • Traits for abstraction and polymorphism
  • Structs to encapsulate algorithm-specific state
  • Use of the ndarray crate for numerical data
  • Generic functions using closures for computing gradients