# Linear Regression

In this note, we implement linear regression in one dimension using two approaches:

1. A pure Python implementation using only lists and basic operations.
2. A vectorized NumPy implementation of the same closed-form solution.

We will also generate some synthetic data for demonstration.

## Mathematical formulation

We are given a dataset of $n$ input-output pairs $(x_i, y_i)_{i=1}^{n}$ and assume an approximately linear relationship, $y_i \approx a x_i + b$. Our goal is to find the slope $a$ and intercept $b$ that minimize the mean squared error:

$$
\mathrm{MSE}(a, b) = \frac{1}{n} \sum_{i=1}^n \bigl(y_i - (a x_i + b)\bigr)^2
$$

Setting the partial derivatives with respect to $a$ and $b$ to zero yields the optimal parameters:

$$
a = \frac{\sum_{i=1}^n (x_i - \overline{x})(y_i - \overline{y})}{\sum_{i=1}^n (x_i - \overline{x})^2}\,, \quad b = \overline{y} - a \overline{x}
$$

where $\overline{x}$ and $\overline{y}$ are the means of $x$ and $y$, respectively. Note that the numerator of $a$ is $n$ times the sample covariance of $x$ and $y$, and the denominator is $n$ times the sample variance of $x$; this is the form the code below computes.

## Step 1: Generate toy data with noise

We simulate a linear relationship with added Gaussian noise:

```python
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)

n_samples = 50
x = np.linspace(0, 10, n_samples)

true_a = 2.5
true_b = 1.0
noise = np.random.normal(0, 2, size=n_samples)
y = true_a * x + true_b + noise

plt.figure()
plt.scatter(x, y, label="Noisy data")
plt.plot(x, true_a * x + true_b, color="green", label="True line")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Toy data with noise")
plt.legend()
plt.show()
```

We now use this data (converted to plain Python lists) for a manual implementation.

## Step 2: Pure Python implementation using lists

We compute the slope and intercept from the mean, covariance, and variance, following the closed-form solution above:

```python
x_list = x.tolist()
y_list = y.tolist()

def mean(values):
    return sum(values) / len(values)

def covariance(x, y, x_mean, y_mean):
    return sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / len(x)

def variance(values, mean_value):
    return sum((v - mean_value) ** 2 for v in values) / len(values)

x_mean = mean(x_list)
y_mean = mean(y_list)

cov_xy = covariance(x_list, y_list, x_mean, y_mean)
var_x = variance(x_list, x_mean)

a = cov_xy / var_x
b = y_mean - a * x_mean

print(f"Estimated slope (pure Python): {a:.2f}")
print(f"Estimated intercept (pure Python): {b:.2f}")
```

## Step 3: NumPy implementation

We repeat the same computation with vectorized NumPy operations, without calling any built-in solver:

```python
x_mean = np.mean(x)
y_mean = np.mean(y)

cov_xy = np.mean((x - x_mean) * (y - y_mean))
var_x = np.mean((x - x_mean) ** 2)

a = cov_xy / var_x
b = y_mean - a * x_mean

print(f"Estimated slope (NumPy): {a:.2f}")
print(f"Estimated intercept (NumPy): {b:.2f}")
```

Both implementations evaluate the same formula on the same data, so they should print identical estimates, close to (but, because of the noise, not exactly equal to) the true values `true_a = 2.5` and `true_b = 1.0`.

## Step 4: Visualize the fitted line

```python
y_pred = a * x + b

plt.scatter(x, y, label="Noisy data")
plt.plot(x, true_a * x + true_b, color="green", linestyle="--", label="True line")
plt.plot(x, y_pred, color="red", label="Fitted line")
plt.xlabel("x")
plt.ylabel("y")
plt.title("Linear regression fit")
plt.legend()
plt.show()
```

## Summary

- We generated toy data from a known linear model with added Gaussian noise.
- We implemented linear regression in two ways: manually using lists, and with vectorized NumPy.
- In both cases, we derived the slope and intercept from the definitions of covariance and variance.
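
## Appendix: cross-checking with `np.polyfit`

As an optional sanity check, we can compare our closed-form estimates against NumPy's built-in least-squares fit. This is a minimal sketch that assumes the arrays `x`, `y` and the estimates `a`, `b` from the steps above are still in scope. `np.polyfit(x, y, 1)` fits a degree-1 polynomial and returns its coefficients from highest degree to lowest, i.e. `[slope, intercept]`, so the results should agree with ours up to floating-point round-off.

```python
# Cross-check: np.polyfit solves the same least-squares problem.
# Assumes x, y, a, b are defined by the code in the steps above.
fit_a, fit_b = np.polyfit(x, y, 1)

print(f"np.polyfit slope:     {fit_a:.2f}")
print(f"np.polyfit intercept: {fit_b:.2f}")

# Both pairs should agree up to floating-point round-off.
assert abs(fit_a - a) < 1e-8
assert abs(fit_b - b) < 1e-8
```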