Introduction

In this book, we explore mid-price prediction in financial markets through the combined lens of statistical filtering and machine learning. The mid-price, defined as the midpoint between the best bid and the best ask, captures the evolving consensus of market participants and serves as a natural target for short-term price forecasting.
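To make the definition concrete, here is a minimal Python sketch that computes the mid-price from top-of-book quotes. The function name `mid_price` and the quote values are hypothetical and chosen only for illustration; they are not taken from any dataset.

```python
def mid_price(best_bid: float, best_ask: float) -> float:
    """Return the midpoint between the best bid and the best ask."""
    return 0.5 * (best_bid + best_ask)


# Hypothetical (best_bid, best_ask) pairs, for illustration only.
quotes = [(100.00, 100.02), (100.01, 100.02), (100.01, 100.03)]

# One mid-price per top-of-book snapshot.
mids = [mid_price(bid, ask) for bid, ask in quotes]
```

The resulting sequence `mids` is the kind of series the rest of the book treats as the forecasting target.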

We begin by implementing a Kalman filter as a statistical baseline for sequential state estimation. From there, we train and evaluate a range of machine learning models to assess how modern approaches compare with classical inference methods.

Our goals are two-fold:

Implement classical inference algorithms such as the Kalman filter in C++ for efficiency and precision, with Python bindings for experimentation.
Compare these algorithms against machine learning models in terms of predictive accuracy, robustness, and computational performance.

The comparison will be carried out on classical mid-price forecasting datasets, including:

FI-2010: a publicly available benchmark dataset of limit order book data for mid-price forecasting
LOBSTER: a real limit order book dataset with millisecond-level resolution

By the end, we should have a practical understanding of how statistical filters and machine learning can be applied to mid-price prediction.

Model formulation

We consider a linear dynamical system with additive Gaussian noise:

$x_{k} = A x_{k-1} + B u_{k} + w_{k}, \quad w_{k} \sim \mathcal{N}(0, Q)$

$z_{k} = H x_{k} + v_{k}, \quad v_{k} \sim \mathcal{N}(0, R)$

where:

$x_{k} \in \mathbb{R}^{n}$ is the state vector at time step $k$,
$u_{k} \in \mathbb{R}^{m}$ is an optional control input,
$z_{k} \in \mathbb{R}^{p}$ is the measurement vector,
$A \in \mathbb{R}^{n \times n}$ is the state transition matrix,
$B \in \mathbb{R}^{n \times m}$ is the control-input matrix,
$H \in \mathbb{R}^{p \times n}$ is the observation matrix,
$Q \in \mathbb{R}^{n \times n}$ is the process noise covariance,
$R \in \mathbb{R}^{p \times p}$ is the measurement noise covariance.
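As an illustration of the model, the following sketch simulates the scalar special case ($n = m = p = 1$), so that $A$, $B$, $H$, $Q$ and $R$ reduce to the numbers `a`, `b`, `h`, `q` and `r`. The function name `simulate`, the parameter values and the seed are assumptions made for this example only.

```python
import random


def simulate(a, b, h, q, r, x0, controls, seed=0):
    """Simulate x_k = a*x_{k-1} + b*u_k + w_k and z_k = h*x_k + v_k.

    q and r are the variances of the process noise w_k and the
    measurement noise v_k. Returns (states, measurements).
    """
    rng = random.Random(seed)
    x = x0
    states, measurements = [], []
    for u in controls:
        w = rng.gauss(0.0, q ** 0.5)  # process noise with variance q
        v = rng.gauss(0.0, r ** 0.5)  # measurement noise with variance r
        x = a * x + b * u + w         # state transition
        z = h * x + v                 # noisy observation of the state
        states.append(x)
        measurements.append(z)
    return states, measurements


# Hypothetical random-walk setting: a = 1, no control, direct observation.
controls = [0.0] * 5
states, measurements = simulate(
    a=1.0, b=0.0, h=1.0, q=1e-4, r=1e-3, x0=100.0, controls=controls, seed=42
)
```

With `a = 1`, `b = 0` and `h = 1`, the state is a random walk observed with noise, which is one simple way to think about a latent price level behind an observed mid-price.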

Kalman filtering

The Kalman filter maintains the mean and covariance of the posterior distribution $p(x_{k} \mid z_{1:k})$ under the Gaussian assumption.

Prediction step

Given the previous posterior $(\hat{x}_{k-1}, P_{k-1})$:

$\hat{x}_{k}^{-} = A \hat{x}_{k-1} + B u_{k}$

$P_{k}^{-} = A P_{k-1} A^{\top} + Q$

Here, $(\hat{x}_{k}^{-}, P_{k}^{-})$ are the predicted state mean and covariance.
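The prediction step can be sketched in plain Python as follows. This is a minimal illustration rather than the C++ implementation: vectors are flat lists, matrices are lists of rows, and the helper names (`mat_vec`, `mat_mul`, `transpose`, `predict`) as well as the two-state example (a level and a drift) are assumptions introduced here.

```python
def mat_vec(M, v):
    """Matrix-vector product for a list-of-rows matrix and a flat vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def mat_mul(X, Y):
    """Matrix-matrix product for list-of-rows matrices."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def predict(x_post, P_post, A, Q, B=None, u=None):
    """Return the predicted mean A x + B u and covariance A P A^T + Q."""
    x_prior = mat_vec(A, x_post)
    if B is not None and u is not None:
        # The control term is optional, as in the model formulation.
        x_prior = [xi + bi for xi, bi in zip(x_prior, mat_vec(B, u))]
    APAt = mat_mul(mat_mul(A, P_post), transpose(A))
    P_prior = [[APAt[i][j] + Q[i][j] for j in range(len(Q))]
               for i in range(len(Q))]
    return x_prior, P_prior


# Hypothetical two-state example: state = [level, drift], no control input.
A = [[1.0, 1.0], [0.0, 1.0]]
Q = [[0.01, 0.0], [0.0, 0.01]]
x_post = [100.0, 0.5]
P_post = [[1.0, 0.0], [0.0, 1.0]]
x_prior, P_prior = predict(x_post, P_post, A, Q)
```

In this example the predicted level moves by the current drift, and the predicted covariance grows relative to `P_post` because uncertainty is propagated through `A` and then inflated by `Q`.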

Update step

With a new measurement $z_{k}$ :

Innovation (measurement residual): $y_{k} = z_{k} - H \hat{x}_{k}^{-}$
Innovation covariance: $S_{k} = H P_{k}^{-} H^{⊤} + R$
Kalman gain: $K_{k} = P_{k}^{-} H^{⊤} S_{k}^{- 1}$
Updated mean: $\hat{x}_{k} = \hat{x}_{k}^{-} + K_{k} y_{k}$
Updated covariance: $P_{k} = (I - K_{k} H) P_{k}^{-} (I - K_{k} H)^{\top} + K_{k} R K_{k}^{\top}$
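The update equations can be sketched in plain Python for the special case of a single scalar measurement ($p = 1$), which lets us divide by $S_{k}$ instead of inverting a matrix. That restriction, the function name `update` and the numerical values are assumptions made for this illustration; the general case requires a matrix inverse or a linear solve.

```python
def update(x_prior, P_prior, z, H, R):
    """Kalman update for a scalar measurement z.

    x_prior: flat list of length n, P_prior: n x n list of rows,
    H: flat list of length n (the single row of the observation matrix),
    R: scalar measurement noise variance.
    """
    n = len(x_prior)
    # Innovation: y = z - H x^-
    y = z - sum(H[i] * x_prior[i] for i in range(n))
    # P^- H^T, a vector of length n
    PHt = [sum(P_prior[i][j] * H[j] for j in range(n)) for i in range(n)]
    # Innovation covariance: S = H P^- H^T + R (a scalar here)
    S = sum(H[i] * PHt[i] for i in range(n)) + R
    # Kalman gain: K = P^- H^T S^{-1}
    K = [PHt[i] / S for i in range(n)]
    # Updated mean: x = x^- + K y
    x_post = [x_prior[i] + K[i] * y for i in range(n)]
    # Updated covariance: (I - K H) P^- (I - K H)^T + K R K^T
    IKH = [[(1.0 if i == j else 0.0) - K[i] * H[j] for j in range(n)]
           for i in range(n)]
    P_post = [[sum(IKH[i][a] * P_prior[a][b] * IKH[j][b]
                   for a in range(n) for b in range(n)) + K[i] * R * K[j]
               for j in range(n)] for i in range(n)]
    return x_post, P_post, y, S, K


# Hypothetical values: state = [level, drift], only the level is observed.
x_prior = [100.5, 0.5]
P_prior = [[2.01, 1.0], [1.0, 1.01]]
H = [1.0, 0.0]
x_post, P_post, y, S, K = update(x_prior, P_prior, z=100.8, H=H, R=0.04)
```

The covariance update is written exactly as in the formula above. With the optimal gain it agrees with the shorter expression $(I - K_{k} H) P_{k}^{-}$, but the symmetric form is commonly preferred in floating-point arithmetic because it keeps $P_{k}$ symmetric.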

The filter proceeds recursively for each time step.
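To show the recursion end to end, here is a minimal sketch of the full filter for the scalar random-walk case ($A = 1$, $H = 1$, no control input), applied to a short series of mid-prices. The function name `kalman_filter_1d`, the noise variances `q` and `r`, the initial values and the price series are all hypothetical and chosen only for illustration.

```python
def kalman_filter_1d(measurements, q, r, x0, p0):
    """Run a scalar Kalman filter with A = 1, H = 1 and no control input.

    Returns a list of (posterior mean, posterior variance) pairs,
    one per measurement.
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Prediction step: the mean is carried over, the variance grows by q.
        x_prior = x
        p_prior = p + q
        # Update step.
        y = z - x_prior            # innovation
        s = p_prior + r            # innovation variance
        k = p_prior / s            # Kalman gain
        x = x_prior + k * y        # updated mean
        p = (1.0 - k) ** 2 * p_prior + k ** 2 * r  # updated variance
        estimates.append((x, p))
    return estimates


# Hypothetical mid-price observations, for illustration only.
mids = [100.010, 100.015, 100.020, 100.010, 100.025]
estimates = kalman_filter_1d(mids, q=1e-4, r=1e-3, x0=mids[0], p0=1.0)
```

Each iteration uses only the previous posterior and the new measurement, so the filter runs in constant memory regardless of the length of the series.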
