Welcome!
Welcome to Rustineers, a dive into the Rust programming language through the lens of applied mathematics and science. There are already several high-quality resources available for learning Rust:
- The Book – a comprehensive introduction to Rust.
- Rustlings – hands-on exercises to reinforce learning.
- Rust by Example – learn by studying runnable examples.
- Rust Language Cheat Sheet - really useful when you're trying to remember something without asking your favorite LLM.
You can find even more learning material at rust-lang.org.
This book is meant to be complementary to those great resources. Our goal is to learn Rust by implementing practical examples drawn from applied mathematics, including:
- Machine Learning
- Statistics and Probability
- Optimization
- Ordinary Differential Equations (ODEs)
- Partial Differential Equations (PDEs)
- And other topics from engineering and physics
Each chapter centers around a specific scientific algorithm or computational problem. We explore how to implement it idiomatically in Rust and sometimes in multiple styles.
Hopefully, we manage to go through the core concepts of Rust, namely:
- Ownership / borrowing
- Data types
- Traits
- Modules
- Error handling
- Macros
Most examples being with Rust's standard library which seems to be a solid foundation for learning both the language and its ecosystem.
Difficulty Levels
To help you navigate the material, each chapter is marked with a difficulty level using 🦀 emojis:
- 🦀 — Beginner-friendly
- 🦀🦀 — Intermediate
- 🦀🦀🦀 — Advanced
As this is a work in progress, the difficulty levels might not always be well chosen.
Roadmap
Here's an unordered list of examples of topics that could be added to the book:
- 1D Ridge regression.
- Simple first-order gradient descent algorithms.
- Multivariate regression algorithms (Ridge, Lasso, Elastic-net).
- Classification algorithms: logistic regression.
- Some clustering algorithms: K-means, Gaussian mixtures.
- Some MCMC algorithms: MH, LMC, MALA.
- Numerical methods for solving ODEs: Euler, Runge-Kutta.
- Numerical methods for solving PDEs: 1D heat equation, 2D Poisson equation.
- Optimization algorithms: gradient-based, derivative-free.
- Kernel methods: kernel Ridge, Gaussian processes, etc.
- Divergences and distances for probability distributions: KL divergence, total variation, Wasserstein.
Let us know if you have other ideas or if you want to improve any existing chapter.
Cargo 101
This chapter gives you everything you need to compile and run the examples in this book using Cargo, Rust’s official package manager and build tool.
Creating a new project
To create a new Rust library project:
cargo new my_project --lib
To create a binary project (i.e., one with a main.rs
):
cargo new my_project
Building your project
Navigate into the project directory:
cd my_project
Then build it:
cargo build
This compiles your code in debug mode (faster builds, less optimization). You’ll find the output in target/debug/
.
Running your code
If it’s a binary crate (with a main.rs
), you can run it:
cargo run
This compiles and runs your code in one go.
Testing your code
To run tests in lib.rs
or in any #[cfg(test)]
module:
cargo test
Cleaning build artifacts
Remove the target/
directory and everything in it:
cargo clean
Checking your code (without building)
cargo check
This quickly verifies your code compiles without generating the binary.
Adding dependencies
To add dependencies, open Cargo.toml
and add them under [dependencies]
:
[dependencies]
ndarray = "0.15"
Or use the command line:
cargo add ndarray
Ridge regression 1D
Here, we implement one-dimensional Ridge Regression*in several styles, using increasing levels of abstraction. It's designed as a learning path for beginners, and focuses on writing idiomatic, clear, and type-safe Rust code. We focus on minimizing this loss function:
where: is an input covariate, is the associated output, is the Ridge coefficient, is the regularization strength.
What you'll learn
This module is perfect if you're just starting with Rust and want to:
- Write beginner-friendly numerical code
- Understand how to manipulate vectors (
Vec<f64>
) and slices (&[f64]
) - Write clean and safe code using the standard library only
- Structure your functions with clear responsibilities
- Learn good Rust patterns: immutability, iterators, ownership
- How to make a simple crate out of it
- How to expose a public API
If you want to implement and run this example while you read but are not familiar with Cargo yet, have a look at Cargo 101 for how to set up your project.
Functional · std only 🦀
🧠 This implementation uses:
- Functional style
- Rust standard library only
- No
struct
,enum
, traits, or external crates
Ridge loss
In this example, we implement one-dimensional Ridge Regression loss using only the Rust standard library, without any external crates. This lets us focus on core Rust features such as slices, iterators, and type safety.
Although the loss function by itself isn't really useful to solve the Ridge problem, implementing it provides a simple and focused introduction to Rust.
Naive implementation
We now present a straightforward implementation of the Ridge regression loss function:
#![allow(unused)] fn main() { pub fn loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let y_hat: Vec<f64> = mul_scalar_vec(beta, x); let residuals: Vec<f64> = subtract_vectors(y, &y_hat); let mse: f64 = residuals.iter().map(|x| x * x).sum::<f64>() / (n as f64); mse + lambda2 * beta * beta } }
In this example, we use two helper functions that we implement ourselves. A helper function for multiplying a vector by a scalar:
#![allow(unused)] fn main() { pub fn mul_scalar_vec(scalar: f64, vector: &[f64]) -> Vec<f64> { vector.iter().map(|x| x * scalar).collect() } }
We also defined a helper that subtracts two slices element-wise:
#![allow(unused)] fn main() { pub fn subtract_vectors(a: &[f64], b: &[f64]) -> Vec<f64> { assert_eq!(a.len(), b.len(), "Input vectors must have the same length"); a.iter().zip(b.iter()).map(|(x, y)| x - y).collect() } }
Rather than using explicit loops, this implementation uses Rust’s iterator combinators, which the compiler optimizes into efficient code. This zero-cost abstraction keeps the code both readable and fast.
Ownership and borrowing
In Rust, every value has a single owner. When you assign a value to a new variable or pass it to a function by value, ownership is transferred (moved).
Borrowing allows you to use a value without taking ownership of it. Borrowing is done using references:
&T
is a shared (read-only) reference.&mut T
is a mutable reference.
These references allow access to data without moving it.
A function like this:
#![allow(unused)] fn main() { fn mul_scalar_vec(scalar: f64, vector: &[f64]) -> Vec<f64> { vector.iter().map(|x| x * scalar).collect() } }
does not take ownership of the input vector
. Instead, it borrows it for the duration of the function call. This makes it easier to reuse the input vector later.
If we instead defined:
#![allow(unused)] fn main() { fn mul_scalar_vec(scalar: f64, vector: Vec<f64>) -> Vec<f64> { ... } }
then passing a vector would move ownership:
#![allow(unused)] fn main() { let v = vec![1.0, 2.0, 3.0]; let result = mul_scalar_vec(2.0, v); // v is moved here let v2 = v; // error: value borrowed after move }
Why use &[f64]
instead of Vec<f64>
?
The type &[f64]
seems to be commonly used in function signatures because it works with both arrays and vectors.
Finally, note that:
Vec<f64>
is an owned, growable vector on the heap. The only time we return aVec<f64>
is when we allocate a new output vector, like inmul_scalar_vec
.&Vec<f64>
is a shared reference to aVec<f64>
.&[f64]
is a slice, i.e., a borrowed view into an array or vector.
In this chapter, we will mostly use these types but things can easily get more tricky.
Inlined iterator-based implementation
Let's implement the loss function in a more compact way. Instead of breaking the computation into multiple intermediate steps—like computing y_hat
, residuals
, and then squaring each residual—here we inline all computations into a single expression using iterators and closures.
This is ideal for demonstrating the expressive power of Rust's iterator API, especially once you're comfortable with basic slice handling and .map()
chaining.
#![allow(unused)] fn main() { pub fn loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { let n: usize = y.len(); let factor = n as f64; let mean_squared_error = x .iter() .zip(y.iter()) .map(|(xi, yi)| { let residual = yi - beta * xi; residual * residual }) .sum::<f64>() / factor; mean_squared_error + lambda2 * beta * beta } }
This implementation computes the mean squared error in a single iteration, minimizing allocations and abstraction overhead. In particular:
- We use
.iter().zip()
to iterate over two slices. - We define a full code block inside the
.map()
closure, which makes it easier to write intermediate expressions likelet residual = yi - beta * xi;
before returning the squared value.
Closed-form solution
The one-dimensional Ridge regression problem admits a simple closed-form solution.
Given a dataset for , and a regularization parameter , the Ridge estimator is:
This form assumes that the data has no intercept term, i.e., the model is , or equivalently, that the data is centered around zero. In practice, it is common to subtract the means of both and before computing the estimator. This removes the intercept and gives:
We now implement this solution in Rust, using only the standard library.
#![allow(unused)] fn main() { /// Computes the one-dimensional Ridge regression estimator using centered data. /// /// This version centers the input data `x` and `y` before applying the closed-form formula. /// /// # Arguments /// /// * `x` - A slice of input features. /// * `y` - A slice of target values (same length as `x`). /// * `lambda2` - The regularization parameter. /// /// # Returns /// /// * `f64` - The estimated Ridge regression coefficient. /// /// # Panics /// /// Panics if `x` and `y` do not have the same length. pub fn ridge_estimator(x: &[f64], y: &[f64], lambda2: f64) -> f64 { let n: usize = x.len(); assert_eq!(n, y.len(), "x and y must have the same length"); let x_mean: f64 = x.iter().sum::<f64>() / n as f64; let y_mean: f64 = y.iter().sum::<f64>() / n as f64; let num: f64 = x .iter() .zip(y) .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean)) .sum::<f64>(); let denom: f64 = x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64); num / denom } }
You could also express it as a single iterator chain, similar to how we implemented the loss function earlier.
Gradient descent
As an exercise, we now implement the gradient descent algorithm to optimize the Ridge regression loss. The gradient has a closed-form expression and can be efficiently computed. We implement it in two different ways as we did for the loss function.
Gradient descent implementation
Gradient descent iteratively updates the parameter β
using the gradient of the loss function:
Where η
is the learning rate, and ∇βL(β)
is the gradient of the loss.
We allow flexible experimentation by passing the gradient function as parameters:
#![allow(unused)] fn main() { pub fn gradient_descent( grad_fn: impl Fn(&[f64], &[f64], f64, f64) -> f64, x: &[f64], y: &[f64], lambda2: f64, lr: f64, n_iters: usize, init_beta: f64, ) -> f64 { let mut beta = init_beta; for _ in 0..n_iters { let grad = grad_fn(x, y, beta, lambda2); beta -= lr * grad; } beta } }
This version is generic, letting us plug in any valid grad_fn
.
Gradient function: naive implementation
This version breaks the computation into two separate steps:
- Compute the residuals
- Compute the dot product between the residuals and the inputs:
- Then assemble the get the gradient value
We first start by implementing our own dot
function by relying on iterators, map chaining, and summing the results.
#![allow(unused)] fn main() { pub fn dot(a: &[f64], b: &[f64]) -> f64 { assert_eq!(a.len(), b.len(), "Input vectors must have the same length"); a.iter().zip(b.iter()).map(|(xi, yi)| xi * yi).sum() } }
Our first implementation takes the following form:
#![allow(unused)] fn main() { pub fn grad_loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let residuals: Vec<f64> = x .iter() .zip(y.iter()) .map(|(xi, yi)| yi - beta * xi) .collect(); let residuals_dot_x = dot(&residuals, x); -2.0 * residuals_dot_x / (n as f64) + 2.0 * lambda2 * beta } }
Gradient function: inlined iterator-based implementation
In this version, we fuse the residual and gradient computation into a single iterator chain. This avoids intermediate memory allocations and takes full advantage of Rust’s zero-cost abstraction model.
#![allow(unused)] fn main() { pub fn grad_loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let grad_mse: f64 = x .iter() .zip(y.iter()) .map(|(xi, yi)| 2.0 * (yi - beta * xi) * xi) .sum::<f64>() / (n as f64); -grad_mse + 2.0 * lambda2 * beta } }
Key differences:
- The naive version allocates a temporary vectosr for the residuals and the dot product.
- The inlined version is more idiomatic Rust: it avoids allocation and achieves better performance through iterator fusion.
Putting things together
To wrap up our 1D Ridge Regression example, let's see how all the parts fit together into a real Rust crate.
Project layout
Here’s the directory structure for our ridge_regression_1d
crate:
crates/ridge_regression_1d/
├── Cargo.toml
└── src
├── analytical.rs # Closed-form solution of the Ridge estimator
├── grad_functions.rs # Gradient of the loss
├── lib.rs # Main entry point for the library
├── loss_functions.rs # Loss function implementations
├── main.rs # Binary entry point
├── optimizer.rs # Gradient descent
└── utils.rs # Utility functions (e.g., dot product)
All the functions discussed in the previous sections are implemented in analytical.rs
, loss_functions.rs
, grad_functions.rs
, utils.rs
, and optimizer.rs
. You can inspect each of these files below.
Click to view analytical.rs
#![allow(unused)] fn main() { /// Computes the one-dimensional Ridge regression estimator using centered data. /// /// This version centers the input data `x` and `y` before applying the closed-form formula. /// /// # Arguments /// /// * `x` - A slice of input features. /// * `y` - A slice of target values (same length as `x`). /// * `lambda2` - The regularization parameter. /// /// # Returns /// /// * `f64` - The estimated Ridge regression coefficient. /// /// # Panics /// /// Panics if `x` and `y` do not have the same length. pub fn ridge_estimator(x: &[f64], y: &[f64], lambda2: f64) -> f64 { let n: usize = x.len(); assert_eq!(n, y.len(), "x and y must have the same length"); let x_mean: f64 = x.iter().sum::<f64>() / n as f64; let y_mean: f64 = y.iter().sum::<f64>() / n as f64; let num: f64 = x .iter() .zip(y) .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean)) .sum::<f64>(); let denom: f64 = x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64); num / denom } }
Click to view loss_functions.rs
#![allow(unused)] fn main() { use crate::utils::{mul_scalar_vec, subtract_vectors}; /// Computes the loss function for Ridge regression (naive version). /// /// It implements it in a simple fashion by computing the mean squared error in multiple steps. /// /// # Arguments /// /// * `x` - The array of input observations /// * `y` - The array of output observations /// * `beta` - The coefficients of the linear regression /// * `lambda2` - The regularization parameter /// /// # Returns /// /// The value of the loss function /// Computes the Ridge regression loss function. /// /// This function calculates the following expression: /// /// $$ /// \mathcal{L}(\beta) = \frac{1}{2n} \sum_i (y_i - \beta x_i)^2 + \lambda \beta^2 /// $$ /// /// where: /// - `x` and `y` are the input/output observations, /// - `beta` is the linear coefficient, /// - `lambda2` is the regularization strength. /// /// # Arguments /// /// * `x` - Input features as a slice (`&[f64]`) /// * `y` - Target values as a slice (`&[f64]`) /// * `beta` - Coefficient of the regression model /// * `lambda2` - L2 regularization strength /// /// # Returns /// /// The Ridge regression loss value as `f64`. /// /// # Panics /// /// Panics if `x` and `y` do not have the same length. // ANCHOR: loss_function_naive pub fn loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let y_hat: Vec<f64> = mul_scalar_vec(beta, x); let residuals: Vec<f64> = subtract_vectors(y, &y_hat); let mse: f64 = residuals.iter().map(|x| x * x).sum::<f64>() / (n as f64); mse + lambda2 * beta * beta } // ANCHOR_END: loss_function_naive /// Computes the loss function for Ridge regression (inlined version). /// /// It implements it as a one-liner by computing the mean squared error in a single expression. /// /// # Arguments /// /// * `x` - The array of input observations /// * `y` - The array of output observations /// * `beta` - The coefficients of the linear regression /// * `lambda2` - The regularization parameter /// /// # Returns /// /// The value of the loss function // ANCHOR: loss_function_line pub fn loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { let n: usize = y.len(); let factor = n as f64; let mean_squared_error = x .iter() .zip(y.iter()) .map(|(xi, yi)| { let residual = yi - beta * xi; residual * residual }) .sum::<f64>() / factor; mean_squared_error + lambda2 * beta * beta } // ANCHOR_END: loss_function_line }
Click to view grad_functions.rs
#![allow(unused)] fn main() { use crate::utils::dot; /// Computes the gradient of the Ridge regression loss function (naive version). /// /// This implementation first explicitly computes the residuals and then performs /// a dot product between the residuals and the inputs. /// /// # Arguments /// /// * `x` - Slice of input features /// * `y` - Slice of target outputs /// * `beta` - Coefficient of the regression model /// * `lambda2` - L2 regularization strength /// /// # Returns /// /// The gradient of the loss with respect to `beta`. /// /// # Panics /// /// Panics if `x` and `y` do not have the same length. // ANCHOR: grad_loss_function_naive pub fn grad_loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let residuals: Vec<f64> = x .iter() .zip(y.iter()) .map(|(xi, yi)| yi - beta * xi) .collect(); let residuals_dot_x = dot(&residuals, x); -2.0 * residuals_dot_x / (n as f64) + 2.0 * lambda2 * beta } // ANCHOR_END: grad_loss_function_naive /// Computes the gradient of the Ridge regression loss function (inlined version). /// /// This version fuses the residual and gradient computation into a single pass /// using iterators, minimizing allocations and improving efficiency. /// /// # Arguments /// /// * `x` - Slice of input features /// * `y` - Slice of target outputs /// * `beta` - Coefficient of the regression model /// * `lambda2` - L2 regularization strength /// /// # Returns /// /// The gradient of the loss with respect to `beta`. /// /// # Panics /// /// Panics if `x` and `y` do not have the same length. // ANCHOR: grad_loss_function_inline pub fn grad_loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let grad_mse: f64 = x .iter() .zip(y.iter()) .map(|(xi, yi)| 2.0 * (yi - beta * xi) * xi) .sum::<f64>() / (n as f64); -grad_mse + 2.0 * lambda2 * beta } // ANCHOR_END: grad_loss_function_inline }
Click to view optimizer.rs
#![allow(unused)] fn main() { /// Performs gradient descent to minimize the Ridge regression loss function. /// /// # Arguments /// /// * `grad_fn` - A function that computes the gradient of the Ridge loss /// * `x` - Input features as a slice (`&[f64]`) /// * `y` - Target values as a slice (`&[f64]`) /// * `lambda2` - Regularization parameter /// * `lr` - Learning rate /// * `n_iters` - Number of gradient descent iterations /// * `init_beta` - Initial value of the regression coefficient /// /// # Returns /// /// The optimized regression coefficient `beta` after `n_iters` updates // ANCHOR: gradient_descent pub fn gradient_descent( grad_fn: impl Fn(&[f64], &[f64], f64, f64) -> f64, x: &[f64], y: &[f64], lambda2: f64, lr: f64, n_iters: usize, init_beta: f64, ) -> f64 { let mut beta = init_beta; for _ in 0..n_iters { let grad = grad_fn(x, y, beta, lambda2); beta -= lr * grad; } beta } // ANCHOR_END: gradient_descent }
Click to view utils.rs
#![allow(unused)] fn main() { /// Multiplies a vector by a scalar. /// /// # Arguments /// /// * `scalar` - A scalar multiplier /// * `vector` - A slice of f64 values /// /// # Returns /// /// A new vector containing the result of element-wise multiplication /// /// # Why `&[f64]` instead of `Vec<f64]`? /// /// We use a slice (`&[f64]`) because: /// - It's more general: works with both arrays and vectors /// - It avoids unnecessary ownership /// - It's idiomatic and Clippy-compliant // ANCHOR: mul_scalar_vec pub fn mul_scalar_vec(scalar: f64, vector: &[f64]) -> Vec<f64> { vector.iter().map(|x| x * scalar).collect() } // ANCHOR_END: mul_scalar_vec /// Subtracts two vectors element-wise. /// /// # Arguments /// /// * `a` - First input slice /// * `b` - Second input slice /// /// # Returns /// /// A new `Vec<f64>` containing the element-wise difference `a[i] - b[i]`. /// /// # Panics /// /// Panics if `a` and `b` do not have the same length. // ANCHOR: subtract_vectors pub fn subtract_vectors(a: &[f64], b: &[f64]) -> Vec<f64> { assert_eq!(a.len(), b.len(), "Input vectors must have the same length"); a.iter().zip(b.iter()).map(|(x, y)| x - y).collect() } // ANCHOR_END: subtract_vectors /// Dot product between two vectors. /// /// # Arguments /// * `a` - First input vector /// * `b` - Second input vector /// /// # Returns /// /// The float value of the dot product. /// /// # Panics /// /// Panics if `a` and `b` do have the same length. // ANCHOR: dot pub fn dot(a: &[f64], b: &[f64]) -> f64 { assert_eq!(a.len(), b.len(), "Input vectors must have the same length"); a.iter().zip(b.iter()).map(|(xi, yi)| xi * yi).sum() } // ANCHOR_END: dot }
Note that the layout can be more complicated by introducing modules and submodules. This will be covered in the next chapter when we implement a structured-oriented version of the 1D Ridge regression.
What's lib.rs
?
The lib.rs
file is the entry point for the crate as a library. This is where we declare which modules (i.e., other .rs
files) are exposed to the outside world.
#![allow(unused)] fn main() { pub mod grad_functions; pub mod loss_functions; pub mod optimizer; pub mod utils; }
Each line tells Rust:
“There is a file called
X.rs
that defines a moduleX
. Please include it in the crate.”
By default, items inside a module are private. That’s where pub
comes in.
We will dive deeper into lib.rs
in the 2.1.5 Exposing API chapter.
Why pub
?
If you want to use a function from another module or crate, you must declare it pub
(public). For example:
#![allow(unused)] fn main() { // In utils.rs pub fn dot(a: &[f64], b: &[f64]) -> f64 { ... } }
If dot
is not marked as pub
, you can’t use it outside utils.rs
, even from optimizer.rs
.
Importing between modules
Rust requires explicit imports between modules. For example, in optimizer.rs
, we want to use the dot
function from utils.rs
:
#![allow(unused)] fn main() { use crate::utils::dot; }
Here, crate
refers to the root of this library crate lib.rs
. If you check out one of the modules again (e.g., loss_functions.rs
), you'll notice that's exactly what we are doing to import functions from other modules.
Example of usage in main.rs
Now let's see how you could use the library from a binary crate:
fn main() { ridge_regression_1d::run_demo(); }
You can run this with cargo run
.
Summary
This chapter demonstrated how to:
- Implement the 1D Ridge regression in sample ways by relying on Rust standard library only.
- Organize a crate into multiple source files (modules)
- Use
pub
to expose functions - Import functions from other modules
- Call everything together from a
main.rs
This is idiomatic Rust structure and prepares you to scale beyond toy examples while staying modular and readable.
Exposing a clean API
Until now, we've manually chained together the loss, gradient, and optimization steps. This is great for learning, but in real projects, we often want a simplified and reusable API.
Rust gives us a clean way to do this by leveraging the lib.rs
file as the public interface to our crate.
lib.rs
as a public API
In your crate, lib.rs
is responsible for organizing and exposing the components we want users to interact with.
We can re-export key functions and define top-level utilities like fit
and predict
. The complete lib.rs
file now looks like this:
#![allow(unused)] fn main() { /// This module exposes the main components of the Ridge Regression 1D crate. /// /// It re-exports internal modules and defines high-level helper functions /// for training (`fit`) and predicting (`predict`) using Ridge regression. pub mod grad_functions; pub mod loss_functions; pub mod optimizer; pub mod utils; pub use grad_functions::grad_loss_function_inline; pub use optimizer::gradient_descent; /// Fits a Ridge regression model using gradient descent. /// /// # Arguments /// /// * `x` - Input features (`&[f64]`) /// * `y` - Target values (`&[f64]`) /// * `lambda2` - Regularization strength /// * `lr` - Learning rate /// * `n_iters` - Number of gradient descent iterations /// * `init_beta` - Initial value of the coefficient /// /// # Returns /// /// The optimized coefficient `beta` as `f64`. pub fn fit(x: &[f64], y: &[f64], lambda2: f64, lr: f64, n_iters: usize, init_beta: f64) -> f64 { gradient_descent( grad_loss_function_inline, x, y, lambda2, lr, n_iters, init_beta, ) } /// Predicts output values using a trained Ridge regression coefficient. /// /// # Arguments /// /// * `x` - Input features (`&[f64]`) /// * `beta` - Trained coefficient /// /// # Returns /// /// A `Vec<f64>` with predicted values. pub fn predict(x: &[f64], beta: f64) -> Vec<f64> { x.iter().map(|xi| xi * beta).collect() } }
Everything declared pub
is available to the user.
Example of usage
You can update your binary entry point to try out the public API.
use ridge_regression_1d::{fit, predict}; fn main() { let x = vec![1.0, 2.0, 3.0]; let y = vec![2.0, 4.0, 6.0]; let beta = fit(&x, &y, 0.1, 0.01, 1000, 0.0); let preds = predict(&x, beta); println!("Learned beta: {}", beta); println!("Predictions: {:?}", preds); }
Structured · std only 🦀
🦀 This implementation uses:
- Structs and
impl
blocks- Rust standard library only
- No external crates
- Optional use of traits and encapsulation
Closed-form solution
We now present a structured implementation of the 1D Ridge estimator using a dedicated RidgeEstimator
struct. We implement the same methods, i.e.,
- fit: the method to compute the optimal from data and the regularization parameter ,
- predict: the method to compute predictions from new data,
but rely on Rust's struct
and impl
to define a new type. We also an additional method new
, a constructor to initialize the estimator with an initial value of .
Struct definition
This simple struct stores the estimated coefficient as a field.
#![allow(unused)] fn main() { pub struct RidgeEstimator { beta: f64, } }
Constructor and methods
Once the struct
is defined, we can implement the constructor new
, and the methods fit
and predict
.
#![allow(unused)] fn main() { impl RidgeEstimator { pub fn new(init_beta: f64) -> Self { Self { beta: init_beta } } fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) { let n: usize = x.len(); assert_eq!(n, y.len(), "x and y must have the same length"); let x_mean: f64 = x.iter().sum::<f64>() / n as f64; let y_mean: f64 = y.iter().sum::<f64>() / n as f64; let num: f64 = x .iter() .zip(y) .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean)) .sum::<f64>(); let denom: f64 = x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64); self.beta = num / denom; } fn predict(&self, x: &[f64]) -> Vec<f64> { x.iter().map(|xi| self.beta * xi).collect() } } }
Note that we can decompose the implementation into as many blocks as we want:
#![allow(unused)] fn main() { impl RidgeEstimator { pub fn new(init_beta: f64) -> Self { Self { beta: init_beta } } } impl RidgeEstimator { fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) { ... } } impl RidgeEstimator { fn predict(&self, x: &[f64]) -> Vec<f64> { ... } } }
This can be useful when dealing with complex methods.
Example of usage
Here is how we can use our new Ridge estimator:
fn main() { let x = vec![1.0, 2.0, 3.0, 4.0]; let y = vec![2.1, 4.1, 6.2, 8.3]; let lambda = 0.1; let mut model = RidgeEstimator::new(0.0); model.fit(&x, &y, lambda); let predictions = model.predict(&x); println!("Predictions: {:?}", predictions); }
Gradient descent
As an another illustration of struct
and impl
, let's tackle the gradient descent method for the Ridge regression again. We use the following structure:
#![allow(unused)] fn main() { pub struct RidgeGradientDescent { beta: f64, n_iters: usize, lr: f64, } }
This struct stores the current coefficient , the number of iterations to run, and the learning rate. We can subsequently implement the constructor and all the methods required to perform gradient descent.
#![allow(unused)] fn main() { impl RidgeGradientDescent { pub fn new(n_iters: usize, lr: f64, init_beta: f64) -> Self { Self { beta: init_beta, n_iters, lr, } } fn grad_function(&self, x: &[f64], y: &[f64], lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let grad_mse: f64 = x .iter() .zip(y.iter()) .map(|(xi, yi)| { let error = yi - self.beta * xi; 2.0 * error * xi }) .sum::<f64>() / (n as f64); -grad_mse + 2.0 * lambda2 * self.beta } fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) { for _ in 0..self.n_iters { let grad = self.grad_function(x, y, lambda2); self.beta -= self.lr * grad; } } fn predict(&self, x: &[f64]) -> Vec<f64> { x.iter().map(|xi| self.beta * xi).collect() } } }
Example of usage
Here is how we can use our new Ridge estimator:
fn main() { let x = vec![1.0, 2.0, 3.0, 4.0]; let y = vec![2.1, 4.1, 6.2, 8.3]; let mut model = RidgeGradientDescent::new(1000, 0.01, 0.0); model.fit(&x, &y, 0.1); let predictions = model.predict(&x); println!("Predictions: {:?}", predictions); }
Base Ridge model using traits
We implement two different Ridge estimators in the previous sections. While these implementations solve the same problem, their fit logic is different. To unify their interface and promote code reuse, we can leverage Rust's trait mechanism.
We define a common trait RidgeModel
, which describes the shared behavior of any Ridge estimator:
#![allow(unused)] fn main() { pub trait RidgeModel { fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64); fn predict(&self, x: &[f64]) -> Vec<f64>; } }
Any type that implements this trait must provide a fit method to train the model and a predict method to make predictions.
Both implementations use the same logic to produce predictions from a scalar . We can move this logic to a shared helper function:
#![allow(unused)] fn main() { fn predict_from_beta(beta: f64, x: &[f64]) -> Vec<f64> { x.iter().map(|xi| beta * xi).collect() } }
Trait implementation: gradient descent method
Recall that our RidgeGradientDescent
type is defined as follows:
#![allow(unused)] fn main() { pub struct RidgeGradientDescent { beta: f64, n_iters: usize, lr: f64, } }
We still need to define the constructor and the gradient function:
#![allow(unused)] fn main() { impl RidgeGradientDescent { pub fn new(n_iters: usize, lr: f64, init_beta: f64) -> Self { Self { beta: init_beta, n_iters, lr, } } fn grad_function(&self, x: &[f64], y: &[f64], lambda2: f64) -> f64 { assert_eq!(x.len(), y.len(), "x and y must have the same length"); let n: usize = x.len(); let grad_mse: f64 = x .iter() .zip(y.iter()) .map(|(xi, yi)| { let error = yi - self.beta * xi; 2.0 * error * xi }) .sum::<f64>() / (n as f64); -grad_mse + 2.0 * lambda2 * self.beta } } }
Once this is done, we need to implement the required methods to be a RidgeModel
:
#![allow(unused)] fn main() { impl RidgeModel for RidgeGradientDescent { fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) { for _ in 0..self.n_iters { let grad = self.grad_function(x, y, lambda2); self.beta -= self.lr * grad; } } fn predict(&self, x: &[f64]) -> Vec<f64> { predict_from_beta(self.beta, x) } } }
Trait implementation: closed-form estimator
We do the same for the RidgeEstimator
that uses the analytical formula:
#![allow(unused)] fn main() { pub struct RidgeEstimator { beta: f64, } impl RidgeEstimator { pub fn new(init_beta: f64) -> Self { Self { beta: init_beta } } } impl RidgeModel for RidgeEstimator { fn fit(&mut self, x: &[f64], y: &[f64], lambda2: f64) { let n: usize = x.len(); assert_eq!(n, y.len(), "x and y must have the same length"); let x_mean: f64 = x.iter().sum::<f64>() / n as f64; let y_mean: f64 = y.iter().sum::<f64>() / n as f64; let num: f64 = x .iter() .zip(y) .map(|(xi, yi)| (xi - x_mean) * (yi - y_mean)) .sum::<f64>(); let denom: f64 = x.iter().map(|xi| (xi - x_mean).powi(2)).sum::<f64>() + lambda2 * (n as f64); self.beta = num / denom; } fn predict(&self, x: &[f64]) -> Vec<f64> { predict_from_beta(self.beta, x) } } }
That's it ! The usage remains the same but we slighly refactored our code.
Closed-form estimator with generic types and trait bounds
What are generics?
Generics let you write code that works with many types, not just one.
Instead of writing:
#![allow(unused)] fn main() { struct RidgeEstimator { beta: f64, } }
You can write:
#![allow(unused)] fn main() { struct RidgeEstimator<F> { beta: F, } }
Here, F
is a type parameter — it could be f32
, f64
, or another type. In Rust, generic types have no behavior by default.
#![allow(unused)] fn main() { fn sum(xs: &[F]) -> F { xs.iter().sum() // This will not compile } }
The compiler gives an error: "F
might not implement Sum
, so I don’t know how to .sum()
over it."
Trait bounds
To fix that, we must tell the compiler which traits F
should implement.
For example:
#![allow(unused)] fn main() { impl<F: Float + Sum> RidgeModel<F> for GenRidgeEstimator<F> { ... } }
This means:
F
must implementFloat
(it must behave like a floating point number: supportpowi
,abs
, etc.)F
must implementSum
(so we can sum an iterator ofF
)
This allows code like:
#![allow(unused)] fn main() { let mean = xs.iter().copied().sum::<F>() / F::from(xs.len()).unwrap(); }
Using generic bounds allows the estimator to work with f32
, f64
, or any numeric type implementing Float
. The compiler can generate specialized code for each concrete type.
Generic Ridge estimator
Our Ridge estimator that works with f32
and f64
takes this form:
#![allow(unused)] fn main() { use num_traits::Float; use std::iter::Sum; /// A trait representing a generic Ridge regression model. /// /// The model must support fitting to training data and predicting new outputs. // ANCHOR: ridge_model_trait pub trait RidgeModel<F: Float + Sum> { /// Fits the model to the given data using Ridge regression. fn fit(&mut self, x: &[F], y: &[F], lambda2: F); /// Predicts output values for a slice of new input features. fn predict(&self, x: &[F]) -> Vec<F>; } // ANCHOR_END: ridge_model_trait /// A generic Ridge regression estimator using a single coefficient `beta`. /// /// This implementation assumes a linear relationship between `x` and `y` /// and performs scalar Ridge regression (1D). // ANCHOR: gen_ridge_estimator pub struct GenRidgeEstimator<F: Float + Sum> { beta: F, } // ANCHOR_END: gen_ridge_estimator // ANCHOR: gen_ridge_estimator_impl impl<F: Float + Sum> GenRidgeEstimator<F> { /// Creates a new estimator with the given initial beta coefficient. pub fn new(init_beta: F) -> Self { Self { beta: init_beta } } } // ANCHOR_END: gen_ridge_estimator_impl // ANCHOR: gen_ridge_estimator_trait_impl impl<F: Float + Sum> RidgeModel<F> for GenRidgeEstimator<F> { /// Fits the Ridge regression model to 1D data using closed-form solution. /// /// This method computes the regression coefficient `beta` by minimizing /// the Ridge-regularized least squares loss. /// /// # Arguments /// - `x`: Input features. /// - `y`: Target values. /// - `lambda2`: The regularization parameter (λ²). fn fit(&mut self, x: &[F], y: &[F], lambda2: F) { let n: usize = x.len(); let n_f: F = F::from(n).unwrap(); assert_eq!(x.len(), y.len(), "x and y must have the same length"); let x_mean: F = x.iter().copied().sum::<F>() / n_f; let y_mean: F = y.iter().copied().sum::<F>() / n_f; let num: F = x .iter() .zip(y.iter()) .map(|(xi, yi)| (*xi - x_mean) * (*yi - y_mean)) .sum::<F>(); let denom: F = x.iter().map(|xi| (*xi - x_mean).powi(2)).sum::<F>() + lambda2 * n_f; self.beta = num / denom; } /// Applies the trained model to input features to generate predictions. /// /// # Arguments /// - `x`: Input features to predict from. /// /// # Returns /// A vector of predicted values, one for each input in `x`. fn predict(&self, x: &[F]) -> Vec<F> { x.iter().map(|xi| *xi * self.beta).collect() } } // ANCHOR_END: gen_ridge_estimator_trait_impl }
Notice that the trait bounds <F: Float + Sum> RidgeModel<F>
are defined after the name of a trait
or struct
, or right next to an impl
.
Summary
- Generics support type-flexible code.
- Trait bounds like
<F: Float + Sum>
constrain what operations are valid. - Without
Sum
, the compiler does not allow.sum()
on iterators ofF
.
Try removing Sum
from the bound:
#![allow(unused)] fn main() { impl<F: Float> RidgeModel<F> for GenRidgeEstimator<F> }
And keep a call to .sum()
. The compiler should complain:
error[E0599]: the method `sum` exists for iterator `std::slice::Iter<'_, F>`,
but its trait bounds were not satisfied
To resolve this, add + Sum
to the bound.
Closed-form solution with ndarray
, Option
, and error handling
This section introduces several important features of the language:
- Using the
ndarray
crate for numerical arrays - Representing optional values with
Option
,Some
, andNone
- Using pattern matching with
match
- Handling errors using
Box<dyn Error>
and.into()
- Automatically deriving common trait implementations using #[derive(...)]
Motivation
In previous sections, we worked with Vec<f64>
and returned plain values. In practice, we might need:
- Efficient linear algebra tools, provided by external crates such as
ndarray
andnalgebra
- A way to represent "fitted" or "not fitted" states, using
Option<f64>
- A way to return errors when something goes wrong, using
Result<_, Box<dyn Error>>
- Automatically implementing traits like
Debug
,Clone
, andDefault
to simplify testing, debugging, and construction
We combine these in the implementation of the analytical RidgeEstimator
.
The full code
#![allow(unused)] fn main() { use ndarray::Array1; use std::error::Error; /// A Ridge regression estimator using `ndarray` for vectorized operations. /// /// This version supports fitting and predicting using `Array1<f64>` arrays. /// The coefficient `beta` is stored as an `Option<f64>`, allowing the model /// to represent both fitted and unfitted states. // ANCHOR: struct #[derive(Debug, Clone, Default)] pub struct RidgeEstimator { beta: Option<f64>, } // ANCHOR_END: struct // ANCHOR: ridge_estimator_impl_new_fit impl RidgeEstimator { /// Creates a new, unfitted Ridge estimator. /// /// # Returns /// A `RidgeEstimator` with `beta` set to `None`. pub fn new() -> Self { Self { beta: None } } /// Fits the Ridge regression model using 1D input and output arrays. /// /// This function computes the coefficient `beta` using the closed-form /// solution with L2 regularization. /// /// # Arguments /// - `x`: Input features as a 1D `Array1<f64>`. /// - `y`: Target values as a 1D `Array1<f64>`. /// - `lambda2`: The regularization strength (λ²). pub fn fit(&mut self, x: &Array1<f64>, y: &Array1<f64>, lambda2: f64) { let n: usize = x.len(); assert!(n > 0); assert_eq!(x.len(), y.len(), "x and y must have the same length"); // mean returns None if the array is empty, so we need to unwrap it let x_mean: f64 = x.mean().unwrap(); let y_mean: f64 = y.mean().unwrap(); let num: f64 = (x - x_mean).dot(&(y - y_mean)); let denom: f64 = (x - x_mean).mapv(|z| z.powi(2)).sum() + lambda2 * (n as f64); self.beta = Some(num / denom); } } // ANCHOR_END: ridge_estimator_impl_new_fit // ANCHOR: ridge_estimator_impl_predict impl RidgeEstimator { /// Predicts target values given input features. /// /// # Arguments /// - `x`: Input features as a 1D array. /// /// # Returns /// A `Result` containing the predicted values, or an error if the model /// has not been fitted. pub fn predict(&self, x: &Array1<f64>) -> Result<Array1<f64>, Box<dyn Error>> { match &self.beta { Some(beta) => Ok(*beta * x), None => Err("Model not fitted".into()), } } } // ANCHOR_END: ridge_estimator_impl_predict }
1. ndarray
instead of Vec<f64>
Rust’s standard library does not include built-in numerical computing tools. The ndarray
crate provides efficient n-dimensional arrays and vectorized operations.
This example uses:
Array1<f64>
for 1D arrays.mean()
,.dot()
, and.mapv()
for basic mathematical operations- Broadcasting (
x * beta
) for scalar–array multiplication
2. Representing model state with Option<f64>
The model's coefficient beta
is only available after fitting. To represent this, we use:
#![allow(unused)] fn main() { beta: Option<f64> }
This means beta
can be either:
Some(value)
: if the model is trainedNone
: if the model has not been fitted yet
This eliminates the possibility of using an uninitialized value.
3. Pattern matching with match
To check whether the model has been fitted, we use pattern matching:
#![allow(unused)] fn main() { match self.beta { Some(beta) => Ok(x * beta), None => Err("Model not fitted".into()), } }
Pattern matching ensures that all possible cases of the Option
type are handled explicitly. In this case, the prediction will only be computed if beta
is not None, and an error is thrown otherwise.
The error handling is explain hereafter.
4. Error handling with Box<dyn Error>
and .into()
Rust requires functions to return a single concrete error type. In practice, this can be achieved in several ways. Here we use a trait object:
#![allow(unused)] fn main() { Result<Array1<f64>, Box<dyn Error>> }
If the function succeeds, it must return a Array1<f64>
.
If it doesn't succeed, we allow the function to return any error type that implements the Error
trait. The .into()
method converts a string literal into a Box<dyn Error>
. Internally, Rust converts:
#![allow(unused)] fn main() { "Model not fitted" }
into:
#![allow(unused)] fn main() { Box::new(String::from("Model not fitted")) }
It is worth emphasizing that Box<dyn Error>
means that the error is heap-allocated.
5. Using #[derive(...)]
for common traits
Rust allows us to automatically implement certain traits using the #[derive(...)]
attribute. In this example, we write:
#![allow(unused)] fn main() { #[derive(Debug, Clone, Default)] pub struct RidgeEstimator { beta: Option<f64>, } }
This provides the following implementations:
Debug
: Enables printing the struct with{:?}
, useful for debugging.Clone
: Allows duplicating the struct with.clone()
.Default
: Provides a default constructor (RidgeEstimator::default()
), which setsbeta
toNone
.
By deriving these traits, we avoid writing repetitive code and ensure that the model is compatible with common Rust conventions, such as default initialization and copy-on-write semantics.
We could have gone even further by defining a custom ModelError
type as follows.
#![allow(unused)] fn main() { use thiserror::Error; #[derive(Debug, Error)] pub enum ModelError { #[error("Model is not fitted yet")] NotFitted, #[error("Dimension mismatch")] DimensionMismatch, } }
This approach uses the thiserror
crate to simplify the implementation of the standard Error
trait.
By deriving #[derive(Debug, Error)]
and annotating each variant with #[error("...")]
, we define error messages rightaway.
The predict function would be rewritten as:
#![allow(unused)] fn main() { pub fn predict(&self, x: &Array1<f64>) -> Result<f64, ModelError> { match &self.beta { Some(beta) => { if beta.len() != x.len() { return Err(ModelError::DimensionMismatch); } Ok(beta.dot(x)) } None => Err(ModelError::NotFitted), } } }
Simple optimizers
This section explores how to implement a small module of optimization algorithms in Rust.
We begin by defining a common interface for optimizers and show how different strategies like gradient descent and momentum-based methods can be implemented using Rust's trait
system.
Then, we explore an alternative design using enums, which can be helpful when working with simpler control flow or dynamic dispatch.
Finally, we demonstrate how to replace Vec<f64>
with ndarray
structures, which allows for more expressive and efficient numerical code, especially for larger-scale or matrix-based computations.
The goal is to gradually expose the design space for writing numerical algorithms idiomatically in Rust.
Optimizers using traits
This chapter illustrates how to use traits for implementing a module of optimizers. This approach is useful when you want polymorphism or when each optimizer requires its own state and logic.
It's similar to what you might do in other languages such as Python or C++, and it's a good fit for applications that involve multiple algorithm variants.
Trait definition
We define a common trait Optimizer
, which describes the shared behavior of any optimizer. Let's assume that our optimizers only need a step
function.
#![allow(unused)] fn main() { /// A trait representing an optimization algorithm that can update weights using gradients. /// /// Optimizers must implement the `step` method, which modifies weights in place. pub trait Optimizer { /// Performs a single optimization step. /// /// # Arguments /// - `weights`: Mutable slice of parameters to be updated. /// - `grads`: Slice of gradients corresponding to the weights. fn step(&mut self, weights: &mut [f64], grads: &[f64]); } }
Any type that implements this trait must provide a step
method. Let's illustrate how to use this by implementing two optimizers: gradient descent with and without momentum.
Gradient descent
We first define the structure for the gradient descent algorithm. It only stores the learning rate as a f64
.
#![allow(unused)] fn main() { /// Basic gradient descent optimizer. /// /// Updates each weight by subtracting the gradient scaled by a learning rate. pub struct GradientDescent { pub learning_rate: f64, } }
We then implement a constructor. In this case, it simply consists of choosing the learning rate.
#![allow(unused)] fn main() { impl GradientDescent { /// Creates a new gradient descent optimizer. /// /// # Arguments /// - `learning_rate`: Step size used to update weights. pub fn new(learning_rate: f64) -> Self { Self { learning_rate } } } }
Next, we implement the step
method required by the Optimizer
trait:
#![allow(unused)] fn main() { impl Optimizer for GradientDescent { /// Applies the gradient descent step to each weight. /// /// Each weight is updated as: `w ← w - learning_rate * grad` fn step(&mut self, weights: &mut [f64], grads: &[f64]) { for (w, g) in weights.iter_mut().zip(grads.iter()) { *w -= self.learning_rate * g; } } } }
This function updates each entry of weights
by looping over the elements and applying the gradient descent update. We use elementwise operations because Vec
doesn't provide built-in arithmetic methods. External crates such as ndarray
or nalgebra
could help write this more expressively.
Gradient descent with momentum
Now let’s implement gradient descent with momentum. The structure stores the learning rate, the momentum factor, and an internal velocity buffer:
#![allow(unused)] fn main() { /// Momentum-based gradient descent optimizer. /// /// Combines current gradients with a velocity term to smooth updates. pub struct Momentum { pub learning_rate: f64, pub momentum: f64, pub velocity: Vec<f64>, } }
We define the constructor by taking the required parameters, and we initialize the velocity to a zero vector:
#![allow(unused)] fn main() { impl Momentum { /// Creates a new momentum optimizer. /// /// # Arguments /// - `learning_rate`: Step size used to update weights. /// - `momentum`: Momentum coefficient (typically between 0.8 and 0.99). /// - `dim`: Dimension of the parameter vector, used to initialize velocity. pub fn new(learning_rate: f64, momentum: f64, dim: usize) -> Self { Self { learning_rate, momentum, velocity: vec![0.0; dim], } } } }
The step
function is slightly more complex, as it performs elementwise operations over the weights, velocity, and gradients:
#![allow(unused)] fn main() { impl Optimizer for Momentum { /// Applies the momentum update step. /// /// Each step uses the update rule: /// ```text /// v ← momentum * v + learning_rate * grad /// w ← w - v /// ``` fn step(&mut self, weights: &mut [f64], grads: &[f64]) { for ((w, g), v) in weights .iter_mut() .zip(grads.iter()) .zip(self.velocity.iter_mut()) { *v = self.momentum * *v + self.learning_rate * *g; *w -= *v; } } } }
At this point, we've defined two optimizers using structs and a shared trait. To complete the module, we define a training loop that uses any optimizer implementing the trait.
Public API
We expose the training loop in lib.rs
as the public API. The function run_optimization
takes a generic Optimizer
, a gradient function, an initial weight vector, and a maximum number of iterations.
#![allow(unused)] fn main() { pub mod enum_based; pub mod traits_and_ndarray; pub mod traits_based; use traits_based::optimizers::Optimizer; pub fn run_optimization<O: Optimizer>( optimizer: &mut O, weights: &mut [f64], grad_fn: impl Fn(&[f64]) -> Vec<f64>, num_steps: usize, ) { for _ in 0..num_steps { let grads = grad_fn(weights); optimizer.step(weights, &grads); } } }
Example of usage
Here’s a simple example where we minimize the function in :
#![allow(unused)] fn main() { use traits_based::optimizers::Momentum; fn grad_fn(w: &[f64]) -> Vec<f64> { w.iter().map(|wi| 2.0 * wi.powi(2)).collect() } let n: usize = 10; let mut weights = vec![1.0; n]; let mut optimizer = Momentum::new(0.01, 0.9, n); run_optimization(&mut optimizer, &mut weights, grad_fn, 100); // Final weights after optimization println!("{:?}", weights); }
Some notes:
- Both
weights
andoptimizer
must be mutable, because we perform in-place updates. - We pass mutable references into
run_optimization
, matching its function signature. - The example uses a closure-based gradient function, which you can easily replace.
Optimizers as enums with internal state and methods
This chapter builds on the previous enum-based optimizer design. We now give each variant its own internal state and encapsulate behavior using methods. This pattern is useful when you want enum-based control flow with encapsulated logic.
Defining the optimizer enum
Each optimizer variant includes its own parameters and, when needed, its internal state.
#![allow(unused)] fn main() { /// An enum representing different optimizers with built-in state and update rules. /// /// Supports both gradient descent and momentum-based methods. #[derive(Debug, Clone)] pub enum Optimizer { /// Gradient Descent optimizer with a fixed learning rate. GradientDescent { learning_rate: f64 }, /// Momentum-based optimizer with velocity tracking. Momentum { learning_rate: f64, momentum: f64, velocity: Vec<f64>, }, } }
Here, GradientDescent
stores only the learning rate, while Momentum
additionally stores its velocity vector.
Constructors
We define convenience constructors for each optimizer. These make usage simpler and avoid manually writing match arms.
#![allow(unused)] fn main() { impl Optimizer { /// Creates a new Gradient Descent optimizer. /// /// # Arguments /// - `learning_rate`: Step size for the parameter updates. pub fn gradient_descent(learning_rate: f64) -> Self { Self::GradientDescent { learning_rate } } /// Creates a new Momentum optimizer. /// /// # Arguments /// - `learning_rate`: Step size for the updates. /// - `momentum`: Momentum coefficient. /// - `dim`: Number of parameters (used to initialize velocity vector). pub fn momentum(learning_rate: f64, momentum: f64, dim: usize) -> Self { Self::Momentum { learning_rate, momentum, velocity: vec![0.0; dim], } } } }
This helps create optimizers in a more idiomatic and clean way.
Implementing the step method
The step
method applies one optimization update depending on the variant. The method uses pattern matching to extract variant-specific behavior.
#![allow(unused)] fn main() { impl Optimizer { /// Applies a single optimization step, depending on the variant. /// /// # Arguments /// - `weights`: Mutable slice of model parameters to be updated. /// - `grads`: Gradient slice with same shape as `weights`. pub fn step(&mut self, weights: &mut [f64], grads: &[f64]) { match self { Optimizer::GradientDescent { learning_rate } => { for (w, g) in weights.iter_mut().zip(grads.iter()) { *w -= *learning_rate * *g; } } Optimizer::Momentum { learning_rate, momentum, velocity, } => { for ((w, g), v) in weights .iter_mut() .zip(grads.iter()) .zip(velocity.iter_mut()) { *v = *momentum * *v + *learning_rate * *g; *w -= *v; } } } } } }
Here, GradientDescent
simply applies the learning rate times the gradient. The Momentum
variant updates and stores the velocity vector before updating the weights.
Summary
This pattern can be a clean alternative to trait-based designs when you want:
- A small number of well-known variants
- Built-in state encapsulation
- Exhaustive handling via pattern matching
It keeps related logic grouped under one type and can be extended easily with new optimizers.
Adding tests
In order to test our optimizers, we propose to have a look at how to implement tests and run them.
How to write tests in Rust
Tests can be included in the same file as the code using the #[cfg(test)]
module. Each test function is annotated with #[test]
. Inside a test, you can use assert_eq!
, assert!
, or similar macros to validate expected behavior.
What we test
We implemented a few tests to check:
- That the constructors return the expected variant with the correct parameters
- That the
step
method modifies weights as expected - That repeated calls to
step
update the internal state correctly (e.g., momentum's velocity)
#![allow(unused)] fn main() { #[cfg(test)] mod tests { use super::*; #[test] fn test_gradient_descent_constructor() { let opt = Optimizer::gradient_descent(0.01); match opt { Optimizer::GradientDescent { learning_rate } => { assert_eq!(learning_rate, 0.01); } _ => panic!("Expected GradientDescent optimizer"), } } #[test] fn test_momentum_constructor() { let opt = Optimizer::momentum(0.01, 0.9, 10); match opt { Optimizer::Momentum { learning_rate, momentum, velocity, } => { assert_eq!(learning_rate, 0.01); assert_eq!(momentum, 0.9); assert_eq!(velocity.len(), 10); } _ => panic!("Expected Momentum optimizer"), } } #[test] fn test_step_gradient_descent() { let mut opt = Optimizer::gradient_descent(0.1); let mut weights = vec![1.0, 2.0, 3.0]; let grads = vec![0.5, 0.5, 0.5]; opt.step(&mut weights, &grads); assert_eq!(weights, vec![0.95, 1.95, 2.95]) } #[test] fn test_step_momentum() { let mut opt = Optimizer::momentum(0.1, 0.9, 3); let mut weights = vec![1.0, 2.0, 3.0]; let grads = vec![0.5, 0.5, 0.5]; opt.step(&mut weights, &grads); assert_eq!(weights, vec![0.95, 1.95, 2.95]); opt.step(&mut weights, &grads); assert!( weights .iter() .zip(vec![0.855, 1.855, 2.855]) .all(|(a, b)| (*a - b).abs() < 1e-6) ); } } }
Some notes:
- This module is added in the same file where the optimizers are implemented.
- The line
use super::*;
tells the compiler to import all the stuff available in the module.
How to run the tests
To run the tests from the command line, use:
cargo test
This will automatically find and execute all test functions in the project. You should see output like:
running 4 tests
test tests::test_gradient_descent_constructor ... ok
test tests::test_momentum_constructor ... ok
test tests::test_step_gradient_descent ... ok
test tests::test_step_momentum ... ok
If any test fails, Cargo will show which assertion failed and why.
Optimizers using ndarray
This section introduces a modular and idiomatic way to implement optimization algorithms in Rust using ndarray
and traits. It is intended for readers who are already comfortable with basic Rust syntax and want to learn how to build reusable, extensible components in numerical computing.
In this example, we need to import the following types and traits:
#![allow(unused)] fn main() { use ndarray::Array; use ndarray::Array1; use ndarray::Zip; }
You can inspect the full module that we're about the break down over here:
Click to view optimizers.rs
#![allow(unused)] fn main() { use ndarray::Array; use ndarray::Array1; use ndarray::Zip; /// Trait for optimizers that update parameters using gradients. /// /// Implementors must define a `run` method that takes mutable weights, /// a gradient function, and the number of iterations to run. // ANCHOR: trait pub trait Optimizer { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ); } // ANCHOR_END: trait /// Basic Gradient Descent (GD) optimizer. /// /// Updates parameters in the direction of the negative gradient scaled /// by a fixed step size. // ANCHOR: struct_gd pub struct GD { step_size: f64, } // ANCHOR_END: struct_gd /// Create a new gradient descent optimizer. /// /// # Arguments /// - `step_size`: the learning rate. // ANCHOR: impl_gd_new impl GD { pub fn new(step_size: f64) -> Self { Self { step_size } } } // ANCHOR_END: impl_gd_new /// Run the gradient descent optimizer. /// /// For each step: `w ← w - step_size * grad(w)` // ANCHOR: impl_gd_run impl Optimizer for GD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { for _ in 0..n_steps { let grads = grad_fn(weights); weights.zip_mut_with(&grads, |w, &g| { *w -= self.step_size * g; }); } } } // ANCHOR_END: impl_gd_run /// Accelerated Gradient Descent (AGD) with classical momentum. /// /// Combines the previous velocity with the current gradient /// to speed up convergence in convex problems. // ANCHOR: struct_agd pub struct AGD { step_size: f64, momentum: f64, } // ANCHOR_END: struct_agd /// Create a new AGD optimizer. /// /// # Arguments /// - `step_size`: the learning rate. /// - `momentum`: the momentum coefficient (typically between 0.8 and 0.99). // ANCHOR: impl_agd_new impl AGD { pub fn new(step_size: f64, momentum: f64) -> Self { Self { step_size, momentum, } } } // ANCHOR_END: impl_agd_new /// Run AGD with momentum. /// /// For each step: /// ```text /// v ← momentum * v - step_size * grad(w) /// w ← w + v /// ``` // ANCHOR: impl_agd_run impl Optimizer for AGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let n: usize = weights.len(); let mut velocity: Array1<f64> = Array::zeros(n); for _ in 0..n_steps { let grads = grad_fn(weights); for ((w, g), v) in weights .iter_mut() .zip(grads.iter()) .zip(velocity.iter_mut()) { *v = self.momentum * *v - self.step_size * g; *w += *v; } } } } // ANCHOR_END: impl_agd_run /// Adaptive Accelerated Gradient Descent using Nesterov's method. /// /// This optimizer implements the variant from smooth convex optimization literature, /// where extrapolation is based on the difference between consecutive y iterates. /// /// References: /// - Beck & Teboulle (2009), FISTA (but without proximal operator) /// - Nesterov's accelerated gradient (original formulation) // ANCHOR: AdaptiveAGD_struct pub struct AdaptiveAGD { step_size: f64, } // ANCHOR_END: AdaptiveAGD_struct // ANCHOR: AdaptiveAGD_impl_new impl AdaptiveAGD { /// Create a new instance of AdaptiveAGD with a given step size. /// /// The step size should be 1 / L, where L is the Lipschitz constant /// of the gradient of the objective function. pub fn new(step_size: f64) -> Self { Self { step_size } } } // ANCHOR_END: AdaptiveAGD_impl_new /// Run the optimizer for `n_steps` iterations. /// /// # Arguments /// - `weights`: mutable reference to the parameter vector (x₀), will be updated in-place. /// - `grad_fn`: a function that computes ∇f(x) for a given x. /// - `n_steps`: number of optimization steps to perform. /// /// This implementation follows: /// /// ``` /// y_{k+1} = x_k - α ∇f(x_k) /// t_{k+1} = (1 + sqrt(1 + 4 t_k²)) / 2 /// x_{k+1} = y_{k+1} + ((t_k - 1)/t_{k+1}) * (y_{k+1} - y_k) /// ``` // ANCHOR: AdaptiveAGD_impl_run impl Optimizer for AdaptiveAGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let mut t_k: f64 = 1.0; let mut y_k = weights.clone(); for _ in 0..n_steps { let grad = grad_fn(weights); let mut y_next = weights.clone(); Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; }); let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt()); Zip::from(&mut *weights) .and(&y_next) .and(&y_k) .for_each(|x, &y1, &y0| { *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0); }); y_k = y_next; t_k = t_next; } } } // ANCHOR_END: AdaptiveAGD_impl_run }
1. Trait-based design
We define a trait called Optimizer
to represent any optimizer that can update model weights based on gradients. In contrast to the previous sections where we mostly implemented step
functions, here the trait requires implementors to define a run
method with the following signature:
#![allow(unused)] fn main() { pub trait Optimizer { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ); } }
This method takes:
- A mutable reference to a vector of weights (
Array1<f64>
). - A function that computes the gradient of the loss with respect to the weights. This
grad_fn
function takes itself a borrowed reference to the weights&Array1<f64>
and outputs a new arrayArray1<f64>
. - The number of iterations to perform.
This trait run
defines the whole optimization algorithm.
2. Gradient descent (GD)
The GD
struct implements basic gradient descent with a fixed step size:
#![allow(unused)] fn main() { pub struct GD { step_size: f64, } }
It has a constructor:
#![allow(unused)] fn main() { impl GD { pub fn new(step_size: f64) -> Self { Self { step_size } } } }
And implements Optimizer
by subtracting the gradient scaled by the step size from the weights at each iteration.
#![allow(unused)] fn main() { impl Optimizer for GD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { for _ in 0..n_steps { let grads = grad_fn(weights); weights.zip_mut_with(&grads, |w, &g| { *w -= self.step_size * g; }); } } } }
Some notes:
-
At each iteration, we compute the gradient with
let grads = grad_fn(weights)
, which is fine but it reallocates a new vector at each call. If we wanted to optimize the gradient computation, we could pre-allocate a buffer outside the loop and pass a mutable reference into the gradient function to avoid repeated allocations. This would require to change the signature of thegrad_fn
. -
weights.zip_mut_with(&grads, |w, &g| {{ ... }})
: This is a mutable zip operation from thendarray
crate. It walks overweights
andgrads
, applying the closure to each pair. -
zip_mut_with is a method defined by the
Zip
trait, which is implemented for ArrayBase, and in particular for Array1. That’s why we can call it directly on weights
. -
In the closure statement we wrote:
|w, &g| *w -= self.step_size * g;
. Here,w
is a mutable reference to each weight element, so we dereference it using*w
to update its value. The&g
in the closure means we’re pattern-matching by reference to avoid cloning or copying eachf64
.
3. Accelerated gradient descent (AGD)
The AGD
struct extends gradient descent with a classical momentum term:
#![allow(unused)] fn main() { pub struct AGD { step_size: f64, momentum: f64, } }
It has a constructor:
#![allow(unused)] fn main() { impl AGD { pub fn new(step_size: f64, momentum: f64) -> Self { Self { step_size, momentum, } } } }
This algorithm adds momentum to classical gradient descent. Instead of updating weights using just the current gradient, it maintains a velocity vector that accumulates the influence of past gradients. This helps smooth the trajectory and accelerates convergence on convex problems.
#![allow(unused)] fn main() { impl Optimizer for AGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let n: usize = weights.len(); let mut velocity: Array1<f64> = Array::zeros(n); for _ in 0..n_steps { let grads = grad_fn(weights); for ((w, g), v) in weights .iter_mut() .zip(grads.iter()) .zip(velocity.iter_mut()) { *v = self.momentum * *v - self.step_size * g; *w += *v; } } } } }
Some notes:
-
We initialize a vector of zeros to track the momentum (velocity) across steps. It has the same length as the weights. This is achieved with:
let mut velocity: Array1<f64> = Array::zeros(n)
. Note that we could have defined the velocity as an internal state variable within the struct defintion. -
We use a triple nested zip to unpack the values of the weights, gradients, and velocity:
for ((w, g), v) in weights.iter_mut().zip(grads.iter()).zip(velocity.iter_mut())
. Here,weights.iter_mut()
gives a mutable reference to each weight,grads.iter()
provides read-only access to each gradient,velocity.iter_mut()
allows in-place updates of the velocity vector.
This pattern allows us to update everything in one pass, element-wise.
-
Within the nested zip closure, we update the velocity using the momentum term and current gradient:
*v = self.momentum * *v - self.step_size * g;
-
The weight is updated using the new velocity:
*w += *v;
. Again, we dereferencew
because it's a mutable reference.
4. Nesterov-based adaptive AGD
The AdaptiveAGD
struct implements a more advanced optimizer based on Nesterov's method and FISTA:
#![allow(unused)] fn main() { pub struct AdaptiveAGD { step_size: f64, } }
It has a constructor:
#![allow(unused)] fn main() { impl AdaptiveAGD { /// Create a new instance of AdaptiveAGD with a given step size. /// /// The step size should be 1 / L, where L is the Lipschitz constant /// of the gradient of the objective function. pub fn new(step_size: f64) -> Self { Self { step_size } } } }
This algorithm implements an accelerated method inspired by Nesterov’s momentum and the FISTA algorithm. The key idea is to introduce an extrapolation step between iterates, controlled by a sequence t_k
. This helps the optimizer "look ahead" and converge faster in smooth convex problems.
Update steps:
- Compute a temporary point
y_{k+1}
by taking a gradient step fromx_k
. - Update the extrapolation coefficient
t_{k+1}
. - Combine
y_{k+1}
andy_k
using a weighted average to get the new iteratex_{k+1}
.
#![allow(unused)] fn main() { impl Optimizer for AdaptiveAGD { fn run( &self, weights: &mut Array1<f64>, grad_fn: impl Fn(&Array1<f64>) -> Array1<f64>, n_steps: usize, ) { let mut t_k: f64 = 1.0; let mut y_k = weights.clone(); for _ in 0..n_steps { let grad = grad_fn(weights); let mut y_next = weights.clone(); Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; }); let t_next = 0.5 * (1.0 + (1.0 + 4.0 * t_k * t_k).sqrt()); Zip::from(&mut *weights) .and(&y_next) .and(&y_k) .for_each(|x, &y1, &y0| { *x = y1 + ((t_k - 1.0) / t_next) * (y1 - y0); }); y_k = y_next; t_k = t_next; } } } }
Some notes:
-
We deliberately re-allocate multiple variables within the for loop (
grad
,y_next
,t_next
) but we could have pre-allocated buffers before the for loop. -
The algorithm keeps track of two sequences: the main iterate (
weights
) and the extrapolated one (y_k
). Before starting the for loop, we initializey_k
by cloning the weights:let mut y_k = weights.clone();
. -
The gradient is evaluated at the current weights, as in standard gradient descent:
let grad = grad_fn(weights);
. Sinceweights
is a mutable reference, we can pass it straightaway to ourgrad_fn
. -
A temporary variable to store the new extrapolated point. This is again a full allocation and clone for clarity.
let mut y_next = weights.clone();
. -
We next compute:
y_{k+1} = x_k - α ∇f(x_k)
using an element-wise operation:Zip::from(&mut y_next).and(&grad).for_each(|y, &g| { *y -= self.step_size * g; });
. This time, we rely on theZip::from
trait implement byndarray
. -
The new weights are obtained by combining
y_{k+1}
andy_k
. The triple zip walks over the current weights and both extrapolation points:Zip::from(&mut *weights)...
.
This optimizer is more involved than basic gradient descent but still relies on the same functional building blocks: closures, element-wise iteration, and vector arithmetic with ndarray
.
Summary
This design demonstrates a few Rust programming techniques:
- Traits for abstraction and polymorphism
- Structs to encapsulate algorithm-specific state
- Use of the
ndarray
crate for numerical data - Generic functions using closures for computing gradients