Linear Regression¶

In this note, we implement linear regression in one dimension using two approaches:

  1. A pure Python implementation using only lists and basic operations.

  2. A NumPy implementation using the analytical solution.

We will also generate some synthetic data for demonstration.

Mathematical formulation¶

We are given a dataset of \(n\) input-output pairs \((x_i, y_i)_{i=1}^{n}\) and assume an underlying linear relationship with additive noise: \(y_i = a x_i + b + \varepsilon_i\). Our goal is to find the slope \(a\) and intercept \(b\) that minimize the mean squared error:

\[ \mathrm{MSE}(a, b) = \frac{1}{n} \sum_{i=1}^n (y_i - (a x_i + b))^2 \]

The optimal parameters are given by:

\[ a = \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2}\,, \quad b = \overline{y} - a \overline{x} \]

where \(\overline{x}\) and \(\overline{y}\) are the means of \(x\) and \(y\), respectively.
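
These formulas follow from setting the partial derivatives of the MSE to zero:

\[ \frac{\partial\,\mathrm{MSE}}{\partial b} = -\frac{2}{n} \sum_{i=1}^n \big(y_i - (a x_i + b)\big) = 0 \quad\Rightarrow\quad b = \overline{y} - a \overline{x}\,, \]

\[ \frac{\partial\,\mathrm{MSE}}{\partial a} = -\frac{2}{n} \sum_{i=1}^n x_i \big(y_i - (a x_i + b)\big) = 0 \quad\Rightarrow\quad a = \frac{\sum (x_i - \overline{x})(y_i - \overline{y})}{\sum (x_i - \overline{x})^2}\,. \]

As a quick sanity check, the three points \((0, 1)\), \((1, 3)\), \((2, 5)\) give \(\overline{x} = 1\), \(\overline{y} = 3\), hence \(a = 4/2 = 2\) and \(b = 3 - 2 \cdot 1 = 1\): exactly the line \(y = 2x + 1\) on which they lie.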

Step 1: Generate toy data with noise¶

We simulate a linear relationship with added Gaussian noise:

import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)  # fix the seed so the noise is reproducible
n_samples = 50

x = np.linspace(0, 10, n_samples)
true_a = 2.5  # true slope
true_b = 1.0  # true intercept
noise = np.random.normal(0, 2, size=n_samples)  # Gaussian noise, std = 2

y = true_a * x + true_b + noise

plt.figure()
plt.scatter(x, y, label="Noisy data")
plt.plot(x, true_a * x + true_b, color="green", label="True line")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Toy data with noise")
plt.legend()
plt.show()

We now convert this data to plain Python lists and use it for a manual implementation.

Step 2: Pure Python implementation using lists¶

We compute the slope and intercept from the mean, covariance, and variance. We divide by \(n\) rather than \(n-1\) throughout; the factor appears in both the covariance and the variance, so it cancels in the ratio \(a = \mathrm{cov}(x, y)/\mathrm{var}(x)\) and does not affect the estimates:

x_list = x.tolist()
y_list = y.tolist()

def mean(values):
    # Arithmetic mean of a list of numbers
    return sum(values) / len(values)

def covariance(x, y, x_mean, y_mean):
    # Average product of deviations from the means (population covariance)
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / len(x)

def variance(values, mean_value):
    # Average squared deviation from the mean (population variance)
    return sum((v - mean_value) ** 2 for v in values) / len(values)

x_mean = mean(x_list)
y_mean = mean(y_list)

cov_xy = covariance(x_list, y_list, x_mean, y_mean)
var_x = variance(x_list, x_mean)

a = cov_xy / var_x
b = y_mean - a * x_mean

print(f"Estimated slope (pure Python): {a:.2f}")
print(f"Estimated intercept (pure Python): {b:.2f}")

Step 3: NumPy implementation¶

We repeat the same computation with vectorized NumPy operations, again without calling a built-in solver:

x_mean = np.mean(x)
y_mean = np.mean(y)

cov_xy = np.mean((x - x_mean) * (y - y_mean))
var_x = np.mean((x - x_mean) ** 2)

a = cov_xy / var_x
b = y_mean - a * x_mean

print(f"Estimated slope (NumPy): {a:.2f}")
print(f"Estimated intercept (NumPy): {b:.2f}")

Step 4: Visualize the fitted line¶

y_pred = a * x + b

plt.figure()
plt.scatter(x, y, label="Noisy data")
plt.plot(x, true_a * x + true_b, color="green", linestyle="--", label="True line")
plt.plot(x, y_pred, color="red", label="Fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear regression fit")
plt.legend()
plt.show()
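
Finally, a quick comparison against the parameters used to generate the data; with only 50 noisy samples, we expect the estimates to be close but not exact:

print(f"True parameters:      a = {true_a:.2f}, b = {true_b:.2f}")
print(f"Estimated parameters: a = {a:.2f}, b = {b:.2f}")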

Summary¶

  • We generated toy data with a known linear model and added noise.

  • We implemented linear regression in two ways: manually using lists, and using NumPy.

  • In both cases, we derived the slope and intercept from the definitions of covariance and variance.