Gradient descent
As an exercise, we now implement the gradient descent algorithm to optimize the Ridge regression loss. The gradient has a closed-form expression and can be efficiently computed. We implement it in two different ways as we did for the loss function.
Gradient descent implementation
Gradient descent iteratively updates the parameter β using the gradient of the loss function:

$$ \beta \leftarrow \beta - \eta \, \nabla_\beta L(\beta) $$

where η is the learning rate and ∇βL(β) is the gradient of the loss.
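For the single-coefficient setting used in the code below, the Ridge loss and its closed-form gradient take the form:

$$
L(\beta) = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \beta x_i\right)^2 + \lambda_2 \beta^2,
\qquad
\nabla_\beta L(\beta) = -\frac{2}{n}\sum_{i=1}^{n}\left(y_i - \beta x_i\right) x_i + 2\lambda_2 \beta.
$$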
We allow flexible experimentation by passing the gradient function as a parameter:
```rust
pub fn gradient_descent(
    grad_fn: impl Fn(&[f64], &[f64], f64, f64) -> f64,
    x: &[f64],
    y: &[f64],
    lambda2: f64,
    lr: f64,
    n_iters: usize,
    init_beta: f64,
) -> f64 {
    let mut beta = init_beta;
    for _ in 0..n_iters {
        // Evaluate the gradient at the current beta and step against it.
        let grad = grad_fn(x, y, beta, lambda2);
        beta -= lr * grad;
    }
    beta
}
```
This version is generic, letting us plug in any valid `grad_fn`.
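As a quick illustration of that flexibility, here is a sketch that plugs in an ad-hoc closure computing only the unregularized MSE gradient; the data, hyperparameters, and the `demo_plug_in_closure` name are our own illustrative choices, not from the text.

```rust
// Illustrative sketch: because `grad_fn` is a plain `impl Fn`, any closure with
// the matching signature can be passed in -- here, a gradient that ignores the
// regularization term entirely.
fn demo_plug_in_closure() {
    let x = vec![1.0, 2.0, 3.0];
    let y = vec![2.0, 4.0, 6.0];

    // Gradient of the unregularized mean squared error only.
    let grad_mse_only = |x: &[f64], y: &[f64], beta: f64, _lambda2: f64| -> f64 {
        let n = x.len() as f64;
        x.iter()
            .zip(y.iter())
            .map(|(xi, yi)| -2.0 * (yi - beta * xi) * xi)
            .sum::<f64>()
            / n
    };

    let beta_hat = gradient_descent(grad_mse_only, &x, &y, 0.0, 0.05, 1_000, 0.0);
    println!("estimated beta: {beta_hat}"); // converges towards 2.0 for this data
}
```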
Gradient function: naive implementation
This version breaks the computation into three separate steps:
- Compute the residuals
- Compute the dot product between the residuals and the inputs
- Assemble these, together with the regularization term, to get the gradient value
We start by implementing our own `dot` function, relying on iterators, map chaining, and summing the results.
```rust
pub fn dot(a: &[f64], b: &[f64]) -> f64 {
    assert_eq!(a.len(), b.len(), "Input vectors must have the same length");
    a.iter().zip(b.iter()).map(|(xi, yi)| xi * yi).sum()
}
```
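A small sanity check of the helper (the example values are ours):

```rust
// Quick check of `dot`: 1*4 + 2*5 + 3*6 = 32.
fn dot_example() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];
    assert_eq!(dot(&a, &b), 32.0);
}
```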
Our first implementation takes the following form:
```rust
pub fn grad_loss_function_naive(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");
    let n: usize = x.len();

    // Step 1: materialize the residuals y_i - beta * x_i in a temporary vector.
    let residuals: Vec<f64> = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| yi - beta * xi)
        .collect();

    // Step 2: dot product between the residuals and the inputs.
    let residuals_dot_x = dot(&residuals, x);

    // Step 3: gradient of the MSE term plus the Ridge penalty term.
    -2.0 * residuals_dot_x / (n as f64) + 2.0 * lambda2 * beta
}
```
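To double-check the hand-derived expression, one option is to compare it against a central finite difference of the loss. Below is a minimal sketch; the inline `ridge_loss` closure and the sample values are our own reconstruction, written to match the gradient above.

```rust
// Sketch of a finite-difference check for the analytic gradient.
fn check_gradient_naive() {
    // Ridge loss assumed by the gradient code: (1/n) * sum((y - beta*x)^2) + lambda2 * beta^2.
    let ridge_loss = |x: &[f64], y: &[f64], beta: f64, lambda2: f64| -> f64 {
        let n = x.len() as f64;
        let mse: f64 = x
            .iter()
            .zip(y.iter())
            .map(|(xi, yi)| (yi - beta * xi).powi(2))
            .sum::<f64>()
            / n;
        mse + lambda2 * beta * beta
    };

    let (x, y) = (vec![1.0, 2.0, 3.0], vec![1.5, 3.1, 4.4]);
    let (beta, lambda2, eps) = (0.7, 0.1, 1e-6);

    // Central difference: (L(beta + eps) - L(beta - eps)) / (2 * eps).
    let numeric = (ridge_loss(&x, &y, beta + eps, lambda2)
        - ridge_loss(&x, &y, beta - eps, lambda2))
        / (2.0 * eps);
    let analytic = grad_loss_function_naive(&x, &y, beta, lambda2);

    assert!((numeric - analytic).abs() < 1e-6);
}
```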
Gradient function: inlined iterator-based implementation
In this version, we fuse the residual and gradient computation into a single iterator chain. This avoids intermediate memory allocations and takes full advantage of Rust’s zero-cost abstraction model.
```rust
pub fn grad_loss_function_inline(x: &[f64], y: &[f64], beta: f64, lambda2: f64) -> f64 {
    assert_eq!(x.len(), y.len(), "x and y must have the same length");
    let n: usize = x.len();

    // Residual and gradient terms fused in one pass; no intermediate Vec is allocated.
    let grad_mse: f64 = x
        .iter()
        .zip(y.iter())
        .map(|(xi, yi)| 2.0 * (yi - beta * xi) * xi)
        .sum::<f64>()
        / (n as f64);

    -grad_mse + 2.0 * lambda2 * beta
}
```
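Since both functions implement the same formula, they should agree to within floating-point rounding. A quick sketch to verify this (the sample data are our own):

```rust
// Both gradient implementations should return the same value for the same inputs.
fn compare_gradient_impls() {
    let x = vec![0.5, 1.0, 1.5, 2.0];
    let y = vec![1.1, 1.9, 3.2, 3.9];
    let (beta, lambda2) = (0.8, 0.05);

    let g_naive = grad_loss_function_naive(&x, &y, beta, lambda2);
    let g_inline = grad_loss_function_inline(&x, &y, beta, lambda2);

    assert!((g_naive - g_inline).abs() < 1e-12);
}
```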
Key differences:
- The naive version allocates a temporary vector for the residuals before computing the dot product.
- The inlined version is more idiomatic Rust: it avoids allocation and achieves better performance through iterator fusion.
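Putting the pieces together, here is a minimal end-to-end sketch: it runs `gradient_descent` with `grad_loss_function_inline` and compares the result against the closed-form Ridge solution β* = Σxᵢyᵢ / (Σxᵢ² + nλ₂), which follows from setting the gradient above to zero. The synthetic data and hyperparameters are illustrative choices, not values from the text.

```rust
// End-to-end sketch: gradient descent vs. the closed-form Ridge solution
// beta* = sum(x*y) / (sum(x^2) + n*lambda2) for the scalar-beta loss used here.
fn main() {
    let x: Vec<f64> = (1..=20).map(|i| i as f64 / 10.0).collect();
    let y: Vec<f64> = x.iter().map(|xi| 3.0 * xi + 0.1).collect();
    let lambda2 = 0.1;

    let beta_gd = gradient_descent(
        grad_loss_function_inline,
        &x,
        &y,
        lambda2,
        0.1,   // learning rate
        5_000, // iterations
        0.0,   // initial beta
    );

    let n = x.len() as f64;
    let sum_xy: f64 = x.iter().zip(y.iter()).map(|(xi, yi)| xi * yi).sum();
    let sum_xx: f64 = x.iter().map(|xi| xi * xi).sum();
    let beta_closed_form = sum_xy / (sum_xx + n * lambda2);

    println!("gradient descent: {beta_gd:.6}");
    println!("closed form:      {beta_closed_form:.6}");
}
```

The two printed values should match to several decimal places, which is a convenient way to confirm that both the gradient and the descent loop are wired up correctly.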