16 State Space Models, Kalman Filtering, and Regime Switching

import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.statespace.sarimax import SARIMAX
from statsmodels.tsa.statespace.structural import UnobservedComponents
from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor
from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression
from statsmodels.tsa.regime_switching.markov_autoregression import (
    MarkovAutoregression,
)
from statsmodels.tsa.statespace.mlemodel import MLEModel
from scipy import stats
from scipy.optimize import minimize as sp_minimize

plt.style.use('assets/book.mplstyle')
rng = np.random.default_rng(42)

16.0.1 Notation for this chapter

Symbol	Meaning	statsmodels
$\boldsymbol{\alpha}_t$	State vector at time $t$
$\mathbf{T}$	Transition matrix	`transition`
$\mathbf{Z}$	Observation (design) matrix	`design`
$\mathbf{Q}$	State noise covariance	`state_cov`
$\mathbf{H}$	Observation noise covariance	`obs_cov`
$\boldsymbol{\alpha}_{t\|t-1}$	Predicted state (before seeing $y_t$)
$\boldsymbol{\alpha}_{t\|t}$	Filtered state (after seeing $y_t$)	`.filtered_state`
$\boldsymbol{\alpha}_{t\|T}$	Smoothed state (using all data)	`.smoothed_state`
$\mathbf{K}_t$	Kalman gain
$\mathbf{v}_t$	Prediction error (innovation)
$S_t$	Regime indicator (Markov switching)	`.smoothed_marginal_probabilities`

16.1 Motivation

Chapter 15 introduced ARIMA and exponential smoothing as separate methods. This chapter reveals that they are both special cases of a single, more general framework: the state-space model.

The state-space framework matters for three reasons:

Unification. ARIMA, exponential smoothing, structural time series (trend + cycle + seasonal), and dynamic factor models are all state-space models with specific parameter restrictions. Once you understand the framework, you understand them all.
The Kalman filter. All state-space models are estimated via the Kalman filter — an algorithm that processes observations one at a time, maintaining an estimate of the unobserved “state” and updating it as new data arrives. This is how statsmodels.tsa fits every model in Chapter 15 (and this chapter).
Flexibility. The state-space framework lets you build custom models by specifying how the state evolves and how observations relate to the state. Statsmodels’ MLEModel class exposes this.

This chapter also covers Markov regime switching — models where the parameters change discretely between “regimes” (e.g., bull vs bear markets, recession vs expansion). These are a different kind of hidden state: discrete rather than continuous.

16.2 Mathematical Foundation

16.2.1 The linear Gaussian state-space model

The state-space model has two equations:

Transition equation (how the state evolves): \[ \boldsymbol{\alpha}_{t+1} = \mathbf{T}\boldsymbol{\alpha}_t + \mathbf{R}\boldsymbol{\eta}_t, \quad \boldsymbol{\eta}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}) \tag{16.1}\]

Observation equation (how we observe the state): \[ \mathbf{y}_t = \mathbf{Z}\boldsymbol{\alpha}_t + \boldsymbol{\varepsilon}_t, \quad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{H}) \tag{16.2}\]

What these mean:

$\boldsymbol{\alpha}_t$ is the state: the unobserved quantity we want to track (e.g., the “true” level of a noisy signal, the trend, the seasonal component).
$\mathbf{y}_t$ is the observation: what we actually measure. It is a noisy linear function of the state.
$\mathbf{T}$ determines how the state evolves over time.
$\mathbf{Z}$ determines how the state maps to observations.
$\mathbf{Q}$ is how much the state fluctuates (state noise).
$\mathbf{H}$ is how noisy the measurements are (observation noise).

Example: the local level model. The simplest state-space model: the “state” is the current level of the series, which evolves as a random walk, and we observe it with noise.

\[ \alpha_{t+1} = \alpha_t + \eta_t, \quad y_t = \alpha_t + \varepsilon_t. \]

Here $T = Z = 1$, $Q = \sigma^2_\eta$ (state noise), $H = \sigma^2_\varepsilon$ (observation noise). The Kalman filter estimates $\alpha_t$ (the true level) from the noisy observations $y_t$.

16.2.2 Why ARIMA and ETS are state-space models

ARIMA as state-space: An ARIMA(1,1,1) model $\Delta y_t = \phi \Delta y_{t-1} + \varepsilon_t + \theta \varepsilon_{t-1}$ can be written in state-space form with state $\boldsymbol{\alpha}_t = (\Delta y_t, \varepsilon_t)^\top$ and appropriate $\mathbf{T}$ and $\mathbf{Z}$ matrices. This is exactly what statsmodels.tsa.statespace.sarimax.SARIMAX does internally.

ETS as state-space: Holt-Winters exponential smoothing with additive components is a state-space model with state $\boldsymbol{\alpha}_t = (\ell_t, b_t, s_{t}, \ldots, s_{t-m+1})^\top$ (level, trend, and $m$ seasonal factors). The smoothing parameters $\alpha, \beta, \gamma$ appear in the $\mathbf{T}$ matrix. The Kalman filter reduces to the exponential smoothing recursions.

Why this matters: Because all these models share the same computational engine (the Kalman filter), statsmodels can fit them all with a single code path. Understanding the Kalman filter means understanding what SARIMAX.fit(), ExponentialSmoothing.fit(), and UnobservedComponents.fit() are doing under the hood.

16.2.3 The Kalman filter: predict, then update

The Kalman filter processes observations sequentially. At each time step $t$, it performs two operations:

Predict: Use the transition equation to predict the state before seeing $y_t$: \[ \boldsymbol{\alpha}_{t|t-1} = \mathbf{T}\boldsymbol{\alpha}_{t-1|t-1}, \quad \mathbf{P}_{t|t-1} = \mathbf{T}\mathbf{P}_{t-1|t-1}\mathbf{T}^\top + \mathbf{R}\mathbf{Q}\mathbf{R}^\top. \]

Update: Incorporate the new observation $y_t$ to correct the prediction: \[ \mathbf{v}_t = \mathbf{y}_t - \mathbf{Z}\boldsymbol{\alpha}_{t|t-1} \quad \text{(prediction error)} \] \[ \mathbf{F}_t = \mathbf{Z}\mathbf{P}_{t|t-1}\mathbf{Z}^\top + \mathbf{H} \quad \text{(error variance)} \] \[ \mathbf{K}_t = \mathbf{P}_{t|t-1}\mathbf{Z}^\top\mathbf{F}_t^{-1} \quad \text{(Kalman gain)} \] \[ \boldsymbol{\alpha}_{t|t} = \boldsymbol{\alpha}_{t|t-1} + \mathbf{K}_t\mathbf{v}_t \quad \text{(filtered state)} \]

The Kalman gain $\mathbf{K}_t$ is the key quantity. It balances how much to trust the prediction vs the observation:

Large state noise ($Q$ large), small observation noise ($H$ small): The state is unpredictable but observations are accurate. $\mathbf{K}_t \to \mathbf{I}$: trust the data, not the prediction. This corresponds to $\alpha \to 1$ in exponential smoothing.
Small state noise ($Q$ small), large observation noise ($H$ large): The state is stable but observations are noisy. $\mathbf{K}_t \to \mathbf{0}$: trust the prediction, smooth out the noise. This corresponds to $\alpha \to 0$ in exponential smoothing.

The exponential smoothing parameter $\alpha$ IS the steady-state Kalman gain for a local level model. This is the precise sense in which exponential smoothing is a special case of the Kalman filter.

16.2.4 Deriving the update from joint Gaussian conditioning

The Kalman update formulas are not arbitrary — they follow from a single fact about multivariate Gaussians. If a joint distribution is Gaussian:

\[ \begin{pmatrix} \boldsymbol{\alpha}_t \\ \mathbf{y}_t \end{pmatrix} \Bigg| \, y_1, \ldots, y_{t-1} \sim \mathcal{N}\!\left( \begin{pmatrix} \boldsymbol{\alpha}_{t|t-1} \\ \mathbf{Z}\boldsymbol{\alpha}_{t|t-1} \end{pmatrix}, \begin{pmatrix} \mathbf{P}_{t|t-1} & \mathbf{P}_{t|t-1}\mathbf{Z}^\top \\ \mathbf{Z}\mathbf{P}_{t|t-1} & \mathbf{F}_t \end{pmatrix} \right) \]

where $\mathbf{F}_t = \mathbf{Z}\mathbf{P}_{t|t-1}\mathbf{Z}^\top + \mathbf{H}$, then the conditional distribution $\boldsymbol{\alpha}_t \mid y_1, \ldots, y_t$ is Gaussian with:

\[ \text{mean:}\quad \boldsymbol{\alpha}_{t|t-1} + \mathbf{P}_{t|t-1}\mathbf{Z}^\top \mathbf{F}_t^{-1} (\mathbf{y}_t - \mathbf{Z}\boldsymbol{\alpha}_{t|t-1}) \] \[ \text{covariance:}\quad \mathbf{P}_{t|t-1} - \mathbf{P}_{t|t-1}\mathbf{Z}^\top \mathbf{F}_t^{-1} \mathbf{Z}\mathbf{P}_{t|t-1} \]

Identifying terms: the factor $\mathbf{P}_{t|t-1}\mathbf{Z}^\top \mathbf{F}_t^{-1}$ is the Kalman gain $\mathbf{K}_t$, the residual $\mathbf{y}_t - \mathbf{Z}\boldsymbol{\alpha}_{t|t-1}$ is the innovation $\mathbf{v}_t$, and the covariance simplifies to $(\mathbf{I} - \mathbf{K}_t\mathbf{Z})\mathbf{P}_{t|t-1}$. Every formula in the predict-update cycle drops out of this single conditioning identity. The Kalman filter is not a clever heuristic — it is exact Bayesian inference for a model that happens to stay Gaussian at every step.

16.2.5 Why the Kalman filter is optimal

For the linear Gaussian model (Equation 16.1, Equation 16.2), the Kalman filter produces the minimum-variance unbiased state estimate given all observations up to time $t$. That is:

\[ \boldsymbol{\alpha}_{t|t} = \mathbb{E}[\boldsymbol{\alpha}_t \mid y_1, \ldots, y_t] \]

and $\mathbf{P}_{t|t}$ is the conditional covariance. No other linear estimator can achieve a smaller mean squared error.

Why: The derivation above makes this immediate. The predict-update cycle is recursive application of Bayes’ rule to the Gaussian prior (prediction) and Gaussian likelihood (observation). For Gaussians, the Bayesian posterior is exact and closed-form — the Kalman filter computes this closed form at each step.

16.2.6 How MLE works for state-space models

The Kalman filter also produces the likelihood of the data given the model parameters. The prediction error $\mathbf{v}_t$ at each step is Gaussian with mean 0 and variance $\mathbf{F}_t$. The log-likelihood is:

\[ \ell = -\frac{1}{2}\sum_{t=1}^T \left[\ln|\mathbf{F}_t| + \mathbf{v}_t^\top \mathbf{F}_t^{-1} \mathbf{v}_t\right] \tag{16.3}\]

This is the prediction error decomposition. Each term is the log-density of one observation’s prediction error. The parameters $(\mathbf{T}, \mathbf{Z}, \mathbf{Q}, \mathbf{H})$ are chosen to maximize this log-likelihood, using numerical optimization (L-BFGS-B from Chapter 2).

This is what SARIMAX.fit() does: it runs the Kalman filter for a given set of ARIMA parameters, computes Equation 16.3, and uses L-BFGS-B to find the parameters that maximize it.

16.2.7 EM as an alternative to direct optimization

Direct numerical optimization of the innovations likelihood can be fragile — the surface is often flat in some directions, and optimizers can wander into regions where a covariance matrix loses positive definiteness. The EM algorithm offers a more stable alternative for state-space models. It treats the states $\boldsymbol{\alpha}_{1:T}$ as missing data:

E-step: Run the Kalman smoother to compute $\mathbb{E}[\boldsymbol{\alpha}_t \mid y_{1:T}]$ and the smoothed covariances under current parameters. These are the sufficient statistics for the complete-data likelihood.

M-step: Update the system matrices analytically. For the local level model, the updates are simple:

\[ \hat{\sigma}^2_\varepsilon = \frac{1}{T}\sum_{t=1}^T \left[(y_t - \alpha_{t|T})^2 + P_{t|T}\right] \] \[ \hat{\sigma}^2_\eta = \frac{1}{T-1}\sum_{t=1}^{T-1} \left[(\alpha_{t+1|T} - \alpha_{t|T})^2 + P_{t+1|T} + P_{t|T} - 2C_{t+1,t|T}\right] \]

where $C_{t+1,t|T}$ is the lag-one smoothed covariance (from the RTS backward pass). Each iteration increases the likelihood, and the algorithm avoids matrix-parameter constrained optimization entirely. Statsmodels uses direct optimization by default, but fit(method='em') is available for UnobservedComponents when convergence is difficult.

16.2.8 Smoothing: using the future to improve the past

The Kalman filter uses only observations up to time $t$ to estimate $\boldsymbol{\alpha}_t$. The Kalman smoother uses ALL observations $y_1, \ldots, y_T$ to re-estimate every state:

\[ \boldsymbol{\alpha}_{t|T} = \mathbb{E}[\boldsymbol{\alpha}_t \mid y_1, \ldots, y_T] \]

The standard algorithm is the Rauch-Tung-Striebel (RTS) smoother. After the forward Kalman filter pass, a backward pass computes:

\[ \mathbf{L}_t = \mathbf{P}_{t|t}\mathbf{T}^\top \mathbf{P}_{t+1|t}^{-1} \quad \text{(smoother gain)} \] \[ \boldsymbol{\alpha}_{t|T} = \boldsymbol{\alpha}_{t|t} + \mathbf{L}_t(\boldsymbol{\alpha}_{t+1|T} - \boldsymbol{\alpha}_{t+1|t}) \] \[ \mathbf{P}_{t|T} = \mathbf{P}_{t|t} + \mathbf{L}_t(\mathbf{P}_{t+1|T} - \mathbf{P}_{t+1|t})\mathbf{L}_t^\top \]

for $t = T-1, T-2, \ldots, 1$, starting from the final filtered values $\boldsymbol{\alpha}_{T|T}$ and $\mathbf{P}_{T|T}$. The smoother gain $\mathbf{L}_t$ controls how much future information corrects the filtered estimate at time $t$. The correction is the difference between what time $t{+}1$ turned out to be (smoothed) and what we predicted it would be (one-step-ahead). Smoothed estimates are always at least as good as filtered estimates (lower MSE) because they use strictly more data.

Statsmodels gives both: .filtered_state and .smoothed_state.

16.2.9 Markov regime switching

Sometimes the parameters of a time series change over time. In a Markov switching model, the parameters switch between $K$ discrete regimes according to a hidden Markov chain:

\[ \Pr(S_t = j \mid S_{t-1} = i) = p_{ij} \]

At each time $t$, the data follows a regime-specific model: $y_t \mid S_t = j \sim \mathcal{N}(\mu_j, \sigma^2_j)$.

The Hamilton filter (Hamilton, 1989) estimates the regime probabilities $\Pr(S_t = j \mid y_1, \ldots, y_t)$ using the following recursion. Let $\boldsymbol{\xi}_{t|t} = (\Pr(S_t = 1 \mid \mathcal{Y}_t), \ldots, \Pr(S_t = K \mid \mathcal{Y}_t))^\top$ be the vector of filtered regime probabilities.

Predict: Propagate through the transition matrix $\mathbf{P}$: \[ \boldsymbol{\xi}_{t|t-1} = \mathbf{P}^\top \boldsymbol{\xi}_{t-1|t-1} \]

Update: Weight by the regime-conditional likelihoods: \[ \xi_{t|t}^{(j)} = \frac{ f(y_t \mid S_t = j) \, \xi_{t|t-1}^{(j)} }{ \sum_{k=1}^K f(y_t \mid S_t = k) \, \xi_{t|t-1}^{(k)} } \]

The denominator is the marginal density of $y_t$ — summing its log across $t$ gives the log-likelihood. This is Bayes’ rule applied at each step, exactly as the Kalman filter applies Bayes’ rule for continuous states. The computational cost is $O(K^2)$ per step (the matrix-vector product), compared to $O(m^3)$ for the Kalman filter with $m$-dimensional continuous state.

Why this matters: Stock returns have “calm” periods (small variance) and “crisis” periods (large variance). GDP growth switches between expansion and recession. Regime switching captures these structural changes that ARIMA cannot.

16.3 The Algorithm and SciPy Implementation

16.3.1 The Kalman filter algorithm

Input: observations y_1, ..., y_T; system matrices T, Z, Q, H, R
       initial state α_{1|0} and covariance P_{1|0}
Output: filtered states α_{t|t}, prediction errors v_t, log-likelihood

ℓ = 0
for t = 1, ..., T:
    # Prediction error
    v_t = y_t − Z α_{t|t−1}
    F_t = Z P_{t|t−1} Z' + H

    # Kalman gain
    K_t = P_{t|t−1} Z' F_t⁻¹

    # Update (filtered) state
    α_{t|t} = α_{t|t−1} + K_t v_t
    P_{t|t} = (I − K_t Z) P_{t|t−1}

    # Predict next state
    α_{t+1|t} = T α_{t|t}
    P_{t+1|t} = T P_{t|t} T' + R Q R'

    # Log-likelihood contribution
    ℓ += −0.5 [ln|F_t| + v_t' F_t⁻¹ v_t]

return {α_{t|t}}, ℓ

16.3.2 `UnobservedComponents`: structural time series

UnobservedComponents decomposes a time series into interpretable parts using a state-space model. Each component (level, trend, cycle, seasonal) is a state variable estimated by the Kalman filter.

# Simulate: random walk level + stochastic cycle
n = 200
level_true = np.cumsum(0.3 * rng.standard_normal(n))
cycle_true = 5 * np.sin(2 * np.pi * np.arange(n) / 20)
y_uc = level_true + cycle_true + rng.standard_normal(n)

# Fit unobserved components model
# level='local level': the level is a random walk
# cycle=True: include a stochastic cycle component
# stochastic_cycle=True: the cycle amplitude and phase can change
uc = UnobservedComponents(
    y_uc,
    level='local level',
    cycle=True,
    stochastic_cycle=True,
).fit(disp=0)

print(f"Estimated parameters:")
print(f"  Observation noise (sigma2_eps): {uc.params[0]:.4f}")
print(f"  Level noise (sigma2_eta):       {uc.params[1]:.4f}")
print(f"  Cycle noise:                    {uc.params[2]:.4f}")
print(f"  Cycle frequency:                {uc.params[3]:.4f}")
print(f"  Cycle period:                   {2*np.pi/uc.params[3]:.1f} time units")

Estimated parameters:
  Observation noise (sigma2_eps): 0.9812
  Level noise (sigma2_eta):       0.0325
  Cycle noise:                    0.0042
  Cycle frequency:                0.3136
  Cycle period:                   20.0 time units

fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True)

axes[0].plot(y_uc, linewidth=0.5, color='C0', alpha=0.7, label='Observed')
axes[0].plot(level_true + cycle_true, '--', color='gray',
             linewidth=0.5, label='True signal')
axes[0].legend()
axes[0].set_title('Observed data')

axes[1].plot(uc.level.smoothed, color='C1', linewidth=1.5,
             label='Smoothed level')
axes[1].plot(level_true, '--', color='gray', linewidth=0.5,
             label='True level')
axes[1].legend()
axes[1].set_title('Level component')

axes[2].plot(uc.cycle.smoothed, color='C2', linewidth=1.5,
             label='Smoothed cycle')
axes[2].plot(cycle_true, '--', color='gray', linewidth=0.5,
             label='True cycle')
axes[2].legend()
axes[2].set_title('Cycle component')

plt.tight_layout()
plt.show()

Figure 16.1: Unobserved components decomposition. The Kalman smoother separates the noisy observations into a trend (level) and a cyclical component. The smoothed estimates use all data (past and future) and are therefore smoother than the filtered estimates.

What to notice: The smoothed level tracks the true random walk closely, and the smoothed cycle recovers the sinusoidal pattern. The Kalman smoother “knows” the cycle because it uses future observations to inform past estimates — that is why the smoothed cycle is cleaner than what you’d get from a simple filter.

16.3.3 Custom state-space models via `MLEModel`

UnobservedComponents covers common decompositions, but statsmodels’ real power is MLEModel — the base class that lets you define any linear Gaussian state-space model by specifying the system matrices directly. Below, we build a local level model from scratch as an MLEModel subclass to show the machinery.

class LocalLevel(MLEModel):
    """Local level model: y_t = alpha_t + eps_t,
    alpha_{t+1} = alpha_t + eta_t."""
    _start_params = [1.0, 1.0]
    _param_names = ['sigma2_eps', 'sigma2_eta']

    def __init__(self, endog):
        # k_states=1: one state variable (the level)
        super().__init__(
            endog, k_states=1, initialization='diffuse',
        )
        # Fix the matrices that don't depend on params
        self['transition', 0, 0] = 1.0    # T = 1
        self['design', 0, 0] = 1.0        # Z = 1
        self['selection', 0, 0] = 1.0     # R = 1

    def update(self, params, **kwargs):
        """Called by the optimizer at each step."""
        params = super().update(params, **kwargs)
        # params[0] = obs noise, params[1] = state noise
        self['obs_cov', 0, 0] = params[0]      # H
        self['state_cov', 0, 0] = params[1]    # Q


ll_custom = LocalLevel(y_uc).fit(disp=0)
# Compare against UC local level fit on same data
uc_ll_check = UnobservedComponents(
    y_uc, level='local level',
).fit(disp=0)
print(f"Custom MLEModel estimates:")
print(f"  sigma2_eps = {ll_custom.params[0]:.4f}")
print(f"  sigma2_eta = {ll_custom.params[1]:.4f}")
print(f"  Log-likelihood: {ll_custom.llf:.2f}")
print(f"\nMatches UC local level: "
      f"{abs(ll_custom.llf - uc_ll_check.llf) < 1.0}")

Custom MLEModel estimates:
  sigma2_eps = -0.1913
  sigma2_eta = 3.5437
  Log-likelihood: -397.47

Matches UC local level: True

The pattern is always the same: subclass MLEModel, set fixed matrices in __init__, and update parameter-dependent matrices in update. This is how every model in statsmodels.tsa.statespace works internally — UnobservedComponents, SARIMAX, and DynamicFactor are all MLEModel subclasses with pre-built update methods.

16.3.4 `DynamicFactor`: shared latent dynamics

When multiple time series move together, a dynamic factor model explains the co-movement through a small number of latent factors. Each observed series is a linear combination of the factors plus idiosyncratic noise — a state-space model where the state is the vector of factors.

# Simulate 3 series driven by 1 common factor
n_df = 200
factor_true = np.cumsum(0.3 * rng.standard_normal(n_df))
loadings_true = np.array([1.0, 0.7, -0.5])
Y_df = np.column_stack([
    l * factor_true + rng.standard_normal(n_df)
    for l in loadings_true
])

df_mod = DynamicFactor(
    Y_df, k_factors=1, factor_order=1,
).fit(disp=0)

print(f"Estimated factor loadings:")
for i in range(3):
    print(f"  Series {i}: {df_mod.params[i]:.3f} "
          f"(true: {loadings_true[i]:.1f})")
fac_corr = np.corrcoef(
    df_mod.factors.smoothed[0], factor_true,
)[0, 1]
print(f"\nFactor-true correlation: {fac_corr:.4f}")

Estimated factor loadings:
  Series 0: -0.323 (true: 1.0)
  Series 1: -0.221 (true: 0.7)
  Series 2: 0.162 (true: -0.5)

Factor-true correlation: -0.9208

16.3.5 `MarkovRegression`: regime switching

# Simulate two-regime data
n_ms = 300

# Generate regime sequence (hidden Markov chain)
regime = np.zeros(n_ms, dtype=int)
for t in range(1, n_ms):
    if regime[t-1] == 0:
        regime[t] = 1 if rng.random() < 0.02 else 0  # rarely switch out of 0
    else:
        regime[t] = 0 if rng.random() < 0.05 else 1  # switch back more often

# Regime-specific parameters
y_ms = np.where(
    regime == 0,
    rng.normal(0.05, 0.5, n_ms),   # regime 0: calm, positive mean
    rng.normal(-0.02, 2.0, n_ms),  # regime 1: volatile, slight negative
)

# Fit Markov switching model
# k_regimes=2: two regimes
# switching_variance=True: each regime has its own variance
ms = MarkovRegression(
    y_ms, k_regimes=2, trend='c', switching_variance=True,
).fit(disp=0)

print(f"Regime 0: mean = {ms.params[2]:.3f}, "
      f"sigma = {np.sqrt(ms.params[4]):.3f}")
print(f"Regime 1: mean = {ms.params[3]:.3f}, "
      f"sigma = {np.sqrt(ms.params[5]):.3f}")
print(f"P(stay in 0): {ms.params[0]:.3f}")
print(f"P(stay in 1): {1-ms.params[1]:.3f}")

Regime 0: mean = -0.029, sigma = 0.518
Regime 1: mean = -0.624, sigma = 1.857
P(stay in 0): 0.994
P(stay in 1): 0.955

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 5), sharex=True)

# Data colored by true regime
colors = np.where(regime == 0, 'C0', 'C1')
for t in range(n_ms):
    ax1.plot(t, y_ms[t], '.', color=colors[t], markersize=2)
ax1.set_ylabel('y')
ax1.set_title('Data (colored by true regime)')

# Estimated regime probabilities
ax2.fill_between(range(n_ms),
                 ms.smoothed_marginal_probabilities[:, 1],
                 alpha=0.5, color='C1', label='P(regime 1)')
ax2.plot(regime * 0.9 + 0.05, linewidth=0.5, color='C3',
         alpha=0.5, label='True regime')
ax2.set_ylabel('Probability')
ax2.set_xlabel('Time')
ax2.legend()
ax2.set_title('Estimated regime probabilities')

plt.tight_layout()
plt.show()

Figure 16.2: Markov switching model applied to two-regime data. Top: the data, colored by true regime. Bottom: the model’s estimated probability of being in regime 1 (high volatility). The model correctly identifies most regime switches.

Interpreting the output: The model correctly identifies the two regimes: regime 0 is calm (mean ≈ 0.05, low variance) and regime 1 is volatile (mean ≈ 0, high variance). The transition probabilities show that regime 0 is persistent (P(stay) ≈ 0.98) while regime 1 is less so (P(stay) ≈ 0.95).

16.3.6 `MarkovAutoregression`: regime-dependent dynamics

MarkovRegression switches the mean and variance across regimes but assumes no serial dependence within each regime. When the autoregressive structure itself changes between regimes, use MarkovAutoregression. This models $y_t$ as an AR process whose coefficients, intercept, and variance all depend on the current regime $S_t$.

# Simulate AR(1) with regime-switching persistence
n_ar = 400
regime_ar = np.zeros(n_ar, dtype=int)
for t in range(1, n_ar):
    if regime_ar[t-1] == 0:
        regime_ar[t] = int(rng.random() < 0.03)
    else:
        regime_ar[t] = int(rng.random() > 0.05)

# Regime 0: low persistence, regime 1: high persistence
y_ar = np.zeros(n_ar)
for t in range(1, n_ar):
    phi = 0.3 if regime_ar[t] == 0 else 0.9
    sigma = 0.5 if regime_ar[t] == 0 else 1.5
    y_ar[t] = phi * y_ar[t-1] + sigma * rng.standard_normal()

mar = MarkovAutoregression(
    y_ar, k_regimes=2, order=1,
    switching_ar=True, switching_variance=True,
).fit(disp=0)

print(f"Regime 0: AR(1) coeff = {mar.params[4]:.3f}, "
      f"sigma = {np.sqrt(mar.params[6]):.3f}")
print(f"Regime 1: AR(1) coeff = {mar.params[5]:.3f}, "
      f"sigma = {np.sqrt(mar.params[7]):.3f}")

Regime 0: AR(1) coeff = 0.251, sigma = 0.500
Regime 1: AR(1) coeff = 1.933, sigma = 0.908

The key difference from MarkovRegression: here the model captures the idea that not just the level changes across regimes, but the dynamics — how strongly the series depends on its own past.

16.4 Variations, Diagnostics, and Pitfalls

16.4.1 Choosing the right structural model

Model	State	Best for
Local level	Random walk	Noisy signal tracking
Local linear trend	Level + slope	Trend extrapolation
Cycle	Stochastic sinusoid	Business cycles
Seasonal	$m$ seasonal factors	Monthly/quarterly data
Combination	Multiple components	Decomposition

# Compare local level vs local linear trend
uc_ll = UnobservedComponents(y_uc, level='local level').fit(disp=0)
uc_llt = UnobservedComponents(y_uc, level='local linear trend').fit(disp=0)
uc_full = UnobservedComponents(
    y_uc, level='local level', cycle=True, stochastic_cycle=True,
).fit(disp=0)

print(f"{'Model':<25} {'AIC':>8} {'BIC':>8}")
print(f"{'Local level':<25} {uc_ll.aic:>8.1f} {uc_ll.bic:>8.1f}")
print(f"{'Local linear trend':<25} {uc_llt.aic:>8.1f} {uc_llt.bic:>8.1f}")
print(f"{'Level + cycle':<25} {uc_full.aic:>8.1f} {uc_full.bic:>8.1f}")

Model                          AIC      BIC
Local level                  798.2    804.8
Local linear trend           793.2    803.1
Level + cycle                626.2    639.3

16.4.2 Filtering vs smoothing: what the future buys you

The difference between filtered and smoothed states is diagnostic itself. Large discrepancies indicate periods where future data substantially revised the filter’s real-time estimate — often around structural changes or outliers.

fig, (ax1, ax2) = plt.subplots(
    2, 1, figsize=(10, 5), sharex=True,
)
ax1.plot(y_uc, linewidth=0.3, alpha=0.5, label='Observed')
ax1.plot(
    uc.level.filtered, linewidth=1.0,
    label='Filtered', color='C1',
)
ax1.plot(
    uc.level.smoothed, linewidth=1.0,
    label='Smoothed', color='C2', linestyle='--',
)
ax1.legend(fontsize=8)
ax1.set_title('Filtered vs smoothed level')

# The revision: how much the smoother changed
# the filter's estimate
revision = uc.level.smoothed - uc.level.filtered
ax2.fill_between(
    range(len(revision)), revision,
    alpha=0.5, color='C3',
)
ax2.axhline(0, color='gray', linewidth=0.5)
ax2.set_ylabel('Smoothed − Filtered')
ax2.set_xlabel('Time')
ax2.set_title('Smoother revision (large = hindsight helped)')

plt.tight_layout()
plt.show()

Figure 16.3: Filtered vs smoothed level estimates for the UC model. The smoother (which uses future data) is less volatile. The largest discrepancies occur where the signal changes direction — exactly where hindsight helps most.

16.4.3 Diagnostics: standardized prediction errors

For a well-specified state-space model, the prediction errors $v_t / \sqrt{F_t}$ should be i.i.d. $\mathcal{N}(0, 1)$. Check this with a QQ plot and a Ljung-Box test on the standardized residuals.

std_resid = uc.standardized_forecasts_error[0]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.plot(std_resid, linewidth=0.5)
ax1.axhline(0, color='gray', linewidth=0.5)
ax1.axhline(2, color='C1', linestyle='--', alpha=0.3)
ax1.axhline(-2, color='C1', linestyle='--', alpha=0.3)
ax1.set_xlabel('Time')
ax1.set_ylabel('Standardized error')
ax1.set_title('Prediction errors (should be white noise)')

stats.probplot(std_resid, plot=ax2)
ax2.set_title('QQ plot (should be linear)')

plt.tight_layout()
plt.show()

# Ljung-Box test
from statsmodels.stats.diagnostic import acorr_ljungbox
lb = acorr_ljungbox(std_resid, lags=10, return_df=True)
print(f"Ljung-Box: all p > 0.05? {(lb['lb_pvalue'] > 0.05).all()}")

Figure 16.4: State-space model diagnostics. Left: standardized prediction errors should look like white noise. Right: QQ plot checks that the errors are approximately normal.

Ljung-Box: all p > 0.05? True

16.4.4 Common pitfalls

Convergence warnings: State-space MLE is nonconvex. If fit() warns about convergence, try different starting values with start_params or a different optimizer with method.
Negative variance estimates: Sometimes the optimizer finds $\hat{\sigma}^2 < 0$, which is nonsensical. This usually means the model is overparameterized — simplify the component structure.
Non-positive-definite Hessian: The variance-covariance of the parameter estimates is unreliable. This happens at boundary solutions (e.g., a variance component estimated at exactly 0).

16.5 From-Scratch Implementation

16.5.1 Kalman filter for the local level model

def kalman_local_level(y, sigma2_eta, sigma2_eps):
    """Kalman filter for the local level model.

    State equation:  α_{t+1} = α_t + η_t,  η_t ~ N(0, σ²_η)
    Observation:     y_t = α_t + ε_t,       ε_t ~ N(0, σ²_ε)

    This is Algorithm 16.1 specialized to T=Z=R=1.

    Parameters
    ----------
    y : array (T,)
        Observations.
    sigma2_eta : float
        State noise variance (how much the level fluctuates).
    sigma2_eps : float
        Observation noise variance (how noisy the measurements are).

    Returns
    -------
    alpha_filt : array (T,)
        Filtered state estimates.
    P_filt : array (T,)
        Filtered state variances.
    ll : float
        Log-likelihood.
    """
    T = len(y)
    alpha_filt = np.zeros(T)
    P_filt = np.zeros(T)
    alpha_pred = np.zeros(T)
    P_pred = np.zeros(T)
    ll = 0

    # Initialize: diffuse prior (large uncertainty)
    alpha_pred[0] = y[0]
    P_pred[0] = sigma2_eta + sigma2_eps

    for t in range(T):
        # Prediction error: how far off was our prediction?
        v = y[t] - alpha_pred[t]
        F = P_pred[t] + sigma2_eps  # prediction error variance

        # Kalman gain: how much to trust the new observation
        K = P_pred[t] / F

        # Update: correct the prediction using the observation
        alpha_filt[t] = alpha_pred[t] + K * v
        P_filt[t] = (1 - K) * P_pred[t]

        # Predict next step
        if t < T - 1:
            alpha_pred[t+1] = alpha_filt[t]        # state doesn't drift
            P_pred[t+1] = P_filt[t] + sigma2_eta   # but uncertainty grows

        # Log-likelihood: prediction error decomposition
        ll -= 0.5 * (np.log(2 * np.pi) + np.log(F) + v**2 / F)

    return alpha_filt, P_filt, ll

16.5.2 Innovations likelihood from scratch

The from-scratch Kalman filter above already computes the log-likelihood, but let us isolate the innovations likelihood to make the prediction-error decomposition explicit. This function takes the system matrices and returns only the log-likelihood — the form an optimizer would call repeatedly during MLE.

# Generate local level data for the likelihood examples
y_ll = (np.cumsum(0.5 * rng.standard_normal(100))
        + rng.standard_normal(100))

def innovations_loglik(params, y):
    """Innovations log-likelihood for local level model.

    Implements @eq-kalman-ll directly: the log-likelihood is a sum
    of Gaussian log-densities evaluated at the prediction errors.

    Parameters
    ----------
    params : array (2,)
        [sigma2_eta, sigma2_eps].
    y : array (T,)
        Observations.

    Returns
    -------
    neg_ll : float
        Negative log-likelihood (for minimization).
    """
    sigma2_eta, sigma2_eps = params
    T = len(y)
    # Initialize
    a = y[0]                          # state prediction
    P = sigma2_eta + sigma2_eps       # state variance
    ll = 0.0

    for t in range(T):
        v = y[t] - a                  # innovation
        F = P + sigma2_eps            # innovation variance
        ll -= 0.5 * (np.log(2 * np.pi) + np.log(F)
                     + v**2 / F)
        K = P / F                     # Kalman gain
        a = a + K * v                 # filtered state
        P = (1 - K) * P + sigma2_eta  # predict next
    return -ll  # return negative for minimization


opt = sp_minimize(
    innovations_loglik, x0=[1.0, 1.0],
    args=(y_ll,), method='L-BFGS-B',
    bounds=[(1e-6, None), (1e-6, None)],
)
print(f"MLE via innovations likelihood:")
print(f"  sigma2_eta = {opt.x[0]:.4f}")
print(f"  sigma2_eps = {opt.x[1]:.4f}")
print(f"  Log-likelihood = {-opt.fun:.2f}")

MLE via innovations likelihood:
  sigma2_eta = 0.2222
  sigma2_eps = 1.1095
  Log-likelihood = -169.55

16.5.3 Hamilton filter from scratch

The Hamilton filter for a two-regime model with switching mean and variance. This mirrors the Kalman filter’s predict-update logic, but for discrete hidden states.

def hamilton_filter(y, mu, sigma2, P_trans):
    """Hamilton filter for K-regime Gaussian model.

    Parameters
    ----------
    y : array (T,)
        Observations.
    mu : array (K,)
        Regime-specific means.
    sigma2 : array (K,)
        Regime-specific variances.
    P_trans : array (K, K)
        Transition matrix. P_trans[i, j] = P(S_t=j | S_{t-1}=i).

    Returns
    -------
    xi_filt : array (T, K)
        Filtered regime probabilities.
    ll : float
        Log-likelihood.
    """
    T = len(y)
    K = len(mu)
    xi_filt = np.zeros((T, K))
    ll = 0.0

    # Initialize with ergodic distribution
    # (stationary distribution of the Markov chain)
    A = np.vstack([P_trans.T - np.eye(K), np.ones(K)])
    b = np.zeros(K + 1)
    b[-1] = 1.0
    xi_pred = np.linalg.lstsq(A, b, rcond=None)[0]

    for t in range(T):
        # Regime-conditional densities
        eta = np.array([
            stats.norm.pdf(y[t], loc=mu[j],
                           scale=np.sqrt(sigma2[j]))
            for j in range(K)
        ])

        # Joint prob: P(S_t=j, y_t | past)
        joint = eta * xi_pred
        marginal = joint.sum()

        # Update: P(S_t=j | y_1:t)
        xi_filt[t] = joint / marginal

        # Log-likelihood contribution
        ll += np.log(marginal)

        # Predict: P(S_{t+1}=j | y_1:t)
        xi_pred = P_trans.T @ xi_filt[t]

    return xi_filt, ll

Verify against statsmodels on the two-regime data we already generated:

# Use the statsmodels-estimated parameters
mu_hat = np.array([ms.params[2], ms.params[3]])
s2_hat = np.array([ms.params[4], ms.params[5]])
P_hat = np.array([
    [ms.params[0], 1 - ms.params[0]],
    [ms.params[1], 1 - ms.params[1]],
])

xi_ours, ll_ours = hamilton_filter(
    y_ms, mu_hat, s2_hat, P_hat,
)
print(f"Hamilton filter log-likelihood:")
print(f"  From scratch: {ll_ours:.2f}")
print(f"  statsmodels:  {ms.llf:.2f}")
xi_corr = np.corrcoef(
    xi_ours[:, 1],
    ms.filtered_marginal_probabilities[:, 1],
)[0, 1]
print(f"\nFiltered prob correlation (regime 1):")
print(f"  {xi_corr:.6f}")

Hamilton filter log-likelihood:
  From scratch: -289.82
  statsmodels:  -289.82

Filtered prob correlation (regime 1):
  1.000000

16.5.4 Verification against statsmodels

# Reuse y_ll generated above (local level data)
# Our implementation
alpha_ours, P_ours, ll_ours = kalman_local_level(y_ll, 0.25, 1.0)

# Statsmodels
uc_verify = UnobservedComponents(y_ll, level='local level').fit(
    start_params=[1.0, 0.25], disp=0,
)

print(f"Log-likelihood:")
print(f"  From scratch: {ll_ours:.2f}")
print(f"  statsmodels:  {uc_verify.llf:.2f}")
print(f"  (Difference is from the constant −T/2 ln(2π) and initialization)")
print(f"\nFiltered state correlation: "
      f"{np.corrcoef(alpha_ours, uc_verify.level.filtered)[0,1]:.6f}")

Log-likelihood:
  From scratch: -169.66
  statsmodels:  -168.12
  (Difference is from the constant −T/2 ln(2π) and initialization)

Filtered state correlation: 0.997340

16.5.5 Understanding the Kalman gain

fig, axes = plt.subplots(1, 2, figsize=(10, 4))

for ax, (se, sn, title) in zip(axes, [
    (0.1, 2.0, 'Noisy obs (σ²_ε=2, σ²_η=0.1)\n→ smooth (low gain)'),
    (2.0, 0.1, 'Noisy state (σ²_ε=0.1, σ²_η=2)\n→ follow data (high gain)'),
]):
    y_demo = np.cumsum(np.sqrt(se) * rng.standard_normal(100))
    y_demo += np.sqrt(sn) * rng.standard_normal(100)
    af, _, _ = kalman_local_level(y_demo, se, sn)
    ax.plot(y_demo, '.', markersize=3, alpha=0.5, label='Observations')
    ax.plot(af, linewidth=1.5, label='Filtered state')
    ax.set_title(title, fontsize=9)
    ax.legend(fontsize=8)

plt.tight_layout()
plt.show()

Figure 16.5: The Kalman gain depends on the signal-to-noise ratio. Left: high observation noise → low gain (trust the prediction, smooth the data). Right: high state noise → high gain (trust the data, follow it closely).

16.6 Computational Considerations

Model	Filter cost per step	Total	Memory
Local level ($m=1$)	$O(1)$	$O(T)$	$O(1)$
ARIMA($p$,$d$,$q$)	$O(m^2)$ where $m = \max(p, q+1)$	$O(Tm^2)$	$O(m^2)$
SARIMAX	$O((m \cdot s)^2)$ where $s =$ seasonal period	$O(T(ms)^2)$	$O((ms)^2)$
Markov ($K$ regimes)	$O(K^2)$ (Hamilton filter)	$O(TK^2)$	$O(K^2)$

The Kalman filter is $O(T)$ in the number of observations (each step is constant time for a fixed state dimension). The bottleneck for large state-space models is the state dimension $m$, not the number of observations.

16.7 Worked Example

Detecting a structural break using both UC and Markov switching.

# Data with a level shift at t=150
n_w = 300
y_shift = np.concatenate([
    rng.normal(0, 1, 150),    # regime 0: mean = 0
    rng.normal(3, 1, 150),    # regime 1: mean = 3
])

# Approach 1: UC local level (estimates gradual shift)
uc_shift = UnobservedComponents(y_shift, level='local level').fit(disp=0)

# Approach 2: Markov switching (estimates abrupt shift)
ms_shift = MarkovRegression(y_shift, k_regimes=2, trend='c').fit(disp=0)

print(f"UC local level: level noise = {uc_shift.params[1]:.4f}")
print(f"  (Large level noise = the UC model absorbs the shift as")
print(f"   a gradual random walk, not a sudden jump)")
print(f"\nMarkov switching:")
print(f"  Regime 0 mean: {ms_shift.params[2]:.2f}")
print(f"  Regime 1 mean: {ms_shift.params[3]:.2f}")
print(f"  (Correctly identifies the two levels)")

UC local level: level noise = 0.0272
  (Large level noise = the UC model absorbs the shift as
   a gradual random walk, not a sudden jump)

Markov switching:
  Regime 0 mean: 0.12
  Regime 1 mean: 3.00
  (Correctly identifies the two levels)

fig, axes = plt.subplots(3, 1, figsize=(10, 7), sharex=True)

axes[0].plot(y_shift, linewidth=0.5, color='C0')
axes[0].axvline(150, color='gray', linestyle='--', alpha=0.5)
axes[0].set_title('Data (level shift at t=150)')

axes[1].plot(uc_shift.level.smoothed, color='C1', linewidth=1.5)
axes[1].axvline(150, color='gray', linestyle='--', alpha=0.5)
axes[1].set_title('UC smoothed level (gradual)')

axes[2].fill_between(range(n_w),
                     ms_shift.smoothed_marginal_probabilities[:, 1],
                     alpha=0.5, color='C2')
axes[2].axvline(150, color='gray', linestyle='--', alpha=0.5)
axes[2].set_title('MS: P(regime 1) (abrupt)')
axes[2].set_xlabel('Time')

plt.tight_layout()
plt.show()

Figure 16.6: Two approaches to the same level shift. The UC model (middle) gradually tracks the shift through the stochastic level. The Markov switching model (bottom) identifies it as a regime change, with a sharp transition at t≈150.

Lesson: UC and Markov switching capture different kinds of change. UC is for gradual, continuous changes (trends, drifts). Markov switching is for abrupt, discrete changes (regime shifts). The choice depends on your application: if you believe the change was sudden (a policy intervention, a financial crisis), use Markov switching. If the change was gradual (demographic shifts, technological change), use UC.

16.8 Exercises

Exercise 16.1 ($\star\star$, diagnostic failure). Fit a local level model to data that actually has a deterministic linear trend ($y_t = 0.1t + \varepsilon_t$). What does the model do? What are the residual diagnostics like? What model should you use instead?

y_det = 0.1 * np.arange(200) + rng.standard_normal(200)
uc_det = UnobservedComponents(y_det, level='local level').fit(disp=0)

print(f"Level noise: {uc_det.params[1]:.4f}")
print(f"Obs noise:   {uc_det.params[0]:.4f}")
print(f"\nThe local level model absorbs the deterministic trend into")
print(f"the stochastic level — it 'works' but attributes the trend")
print(f"to random fluctuations rather than a systematic pattern.")
print(f"\nBetter model: level='local linear trend', which explicitly")
print(f"models the slope as a separate state variable.")

uc_llt2 = UnobservedComponents(y_det, level='local linear trend').fit(disp=0)
print(f"\nLocal linear trend AIC: {uc_llt2.aic:.1f}")
print(f"Local level AIC:        {uc_det.aic:.1f}")

Level noise: 0.1221
Obs noise:   0.8976

The local level model absorbs the deterministic trend into
the stochastic level — it 'works' but attributes the trend
to random fluctuations rather than a systematic pattern.

Better model: level='local linear trend', which explicitly
models the slope as a separate state variable.

Local linear trend AIC: 591.2
Local level AIC:        620.9

Exercise 16.2 ($\star\star$, implementation). Implement the Kalman filter for the local level model from scratch. Verify by comparing the filtered state and log-likelihood against UnobservedComponents.

# Already implemented above
y_test = np.cumsum(rng.standard_normal(80)) + 0.5*rng.standard_normal(80)
af, pf, ll = kalman_local_level(y_test, 1.0, 0.25)
uc_test = UnobservedComponents(y_test, level='local level').fit(
    start_params=[0.25, 1.0], disp=0)
print(f"State correlation: {np.corrcoef(af, uc_test.level.filtered)[0,1]:.6f}")
print(f"(Should be ≈ 1.0 — both are computing the same filter)")

State correlation: 0.999995
(Should be ≈ 1.0 — both are computing the same filter)

Exercise 16.3 ($\star\star$, comparison). Fit SARIMAX and ExponentialSmoothing to the same seasonal data. Compare AIC. Show that both are computing the same thing (state-space filtering) with different parameterizations.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

T3 = 120
y3 = (np.linspace(100, 140, T3) + 8*np.sin(2*np.pi*np.arange(T3)/12)
      + 2*rng.standard_normal(T3))

ets3 = ExponentialSmoothing(y3, seasonal_periods=12, trend='add',
                            seasonal='add').fit()
sar3 = SARIMAX(y3, order=(0,1,1), seasonal_order=(0,1,1,12)).fit(disp=0)

print(f"ETS AIC:     {ets3.aic:.1f}")
print(f"SARIMAX AIC: {sar3.aic:.1f}")
print(f"\n(AICs are not directly comparable across model classes —")
print(f"ETS and SARIMAX compute likelihoods differently.)")
corr_fitted = np.corrcoef(ets3.fittedvalues, sar3.fittedvalues)[0, 1]
print(f"\nFitted values correlation: {corr_fitted:.4f}")
print(f"Both capture the same trend + seasonal pattern through")
print(f"different parameterizations of the same state-space framework.")

ETS AIC:     193.6
SARIMAX AIC: 496.0

(AICs are not directly comparable across model classes —
ETS and SARIMAX compute likelihoods differently.)

Fitted values correlation: 0.7964
Both capture the same trend + seasonal pattern through
different parameterizations of the same state-space framework.

Exercise 16.4 ($\star\star\star$, conceptual). The Kalman gain $K_t$ depends on the ratio of state noise to observation noise. As $\sigma^2_\eta / \sigma^2_\varepsilon \to \infty$, what happens to $K_t$? As it $\to 0$? Relate these limits to the exponential smoothing parameter $\alpha$.

When $\sigma^2_\eta \gg \sigma^2_\varepsilon$: the state is very unpredictable (changes rapidly) and observations are precise. The Kalman gain $K_t \to 1$: ignore the prediction, trust the data completely. This is $\alpha \to 1$ in exponential smoothing (the latest observation gets all the weight).

When $\sigma^2_\eta \ll \sigma^2_\varepsilon$: the state is very stable (barely changes) but observations are noisy. The Kalman gain $K_t \to 0$: ignore the data, trust the prediction (which is essentially the previous state). This is $\alpha \to 0$ (heavy smoothing, historical observations retain influence).

The steady-state Kalman gain for the local level model is: $K^* = \frac{-\sigma^2_\varepsilon + \sqrt{\sigma^4_\varepsilon + 4\sigma^2_\eta\sigma^2_\varepsilon}}{2\sigma^2_\varepsilon}$

This equals the exponential smoothing parameter $\alpha$ exactly. The “choice” of $\alpha$ in exponential smoothing is really a choice of the signal-to-noise ratio in the underlying state-space model.

16.9 Bibliographic Notes

The Kalman filter: Kalman (1960). The connection to recursive Bayesian estimation was recognized immediately. The state-space representation of time series models: Harvey (1989), which remains the standard reference.

ARIMA in state-space form: Durbin and Koopman (2012) provide the canonical treatment. Statsmodels’ implementation follows this closely.

The connection between exponential smoothing and state-space models: Hyndman et al. (2008). Every ETS model corresponds to a linear Gaussian state-space model; the smoothing parameters are functions of the noise variances.

Markov switching: Hamilton (1989) introduced the Hamilton filter and the regime-switching model. Kim and Nelson (1999) extend it to state-space models with regime switching.

The Rauch-Tung-Striebel (RTS) smoother: Rauch, Tung, and Striebel (1965). This is the backward pass that produces the smoothed estimates $\boldsymbol{\alpha}_{t|T}$.

Statsmodels’ state-space framework: Fulton (2015). The implementation uses Cython-compiled Kalman filter routines for performance.

--- title: "State Space Models, Kalman Filtering, and Regime Switching" --- ```{python} #| label: imports import numpy as np import matplotlib.pyplot as plt import statsmodels.api as sm from statsmodels.tsa.statespace.sarimax import SARIMAX from statsmodels.tsa.statespace.structural import UnobservedComponents from statsmodels.tsa.statespace.dynamic_factor import DynamicFactor from statsmodels.tsa.regime_switching.markov_regression import MarkovRegression from statsmodels.tsa.regime_switching.markov_autoregression import ( MarkovAutoregression, ) from statsmodels.tsa.statespace.mlemodel import MLEModel from scipy import stats from scipy.optimize import minimize as sp_minimize plt.style.use('assets/book.mplstyle') rng = np.random.default_rng(42) ``` ### Notation for this chapter | Symbol | Meaning | statsmodels | |--------|---------|-------------| | $\boldsymbol{\alpha}_t$ | State vector at time $t$ | | | $\mathbf{T}$ | Transition matrix | `transition` | | $\mathbf{Z}$ | Observation (design) matrix | `design` | | $\mathbf{Q}$ | State noise covariance | `state_cov` | | $\mathbf{H}$ | Observation noise covariance | `obs_cov` | | $\boldsymbol{\alpha}_{t|t-1}$ | Predicted state (before seeing $y_t$) | | | $\boldsymbol{\alpha}_{t|t}$ | Filtered state (after seeing $y_t$) | `.filtered_state` | | $\boldsymbol{\alpha}_{t|T}$ | Smoothed state (using all data) | `.smoothed_state` | | $\mathbf{K}_t$ | Kalman gain | | | $\mathbf{v}_t$ | Prediction error (innovation) | | | $S_t$ | Regime indicator (Markov switching) | `.smoothed_marginal_probabilities` | ## Motivation {#sec-ch16-motivation} Chapter 15 introduced ARIMA and exponential smoothing as separate methods. This chapter reveals that they are *both* special cases of a single, more general framework: the **state-space model**. The state-space framework matters for three reasons: 1. **Unification.** ARIMA, exponential smoothing, structural time series (trend + cycle + seasonal), and dynamic factor models are all state-space models with specific parameter restrictions. Once you understand the framework, you understand them all. 2. **The Kalman filter.** All state-space models are estimated via the Kalman filter --- an algorithm that processes observations one at a time, maintaining an estimate of the unobserved "state" and updating it as new data arrives. This is how `statsmodels.tsa` fits every model in Chapter 15 (and this chapter). 3. **Flexibility.** The state-space framework lets you build custom models by specifying how the state evolves and how observations relate to the state. Statsmodels' `MLEModel` class exposes this. This chapter also covers **Markov regime switching** --- models where the parameters change discretely between "regimes" (e.g., bull vs bear markets, recession vs expansion). These are a different kind of hidden state: discrete rather than continuous. ## Mathematical Foundation {#sec-ch16-math} ### The linear Gaussian state-space model The state-space model has two equations: **Transition equation** (how the state evolves): $$ \boldsymbol{\alpha}_{t+1} = \mathbf{T}\boldsymbol{\alpha}_t + \mathbf{R}\boldsymbol{\eta}_t, \quad \boldsymbol{\eta}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}) $$ {#eq-state} **Observation equation** (how we observe the state): $$ \mathbf{y}_t = \mathbf{Z}\boldsymbol{\alpha}_t + \boldsymbol{\varepsilon}_t, \quad \boldsymbol{\varepsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{H}) $$ {#eq-obs} **What these mean:** - $\boldsymbol{\alpha}_t$ is the *state*: the unobserved quantity we want to track (e.g., the "true" level of a noisy signal, the trend, the seasonal component). - $\mathbf{y}_t$ is the *observation*: what we actually measure. It is a noisy linear function of the state. - $\mathbf{T}$ determines how the state evolves over time. - $\mathbf{Z}$ determines how the state maps to observations. - $\mathbf{Q}$ is how much the state fluctuates (state noise). - $\mathbf{H}$ is how noisy the measurements are (observation noise). **Example: the local level model.** The simplest state-space model: the "state" is the current level of the series, which evolves as a random walk, and we observe it with noise. $$ \alpha_{t+1} = \alpha_t + \eta_t, \quad y_t = \alpha_t + \varepsilon_t. $$ Here $T = Z = 1$, $Q = \sigma^2_\eta$ (state noise), $H = \sigma^2_\varepsilon$ (observation noise). The Kalman filter estimates $\alpha_t$ (the true level) from the noisy observations $y_t$. ### Why ARIMA and ETS are state-space models **ARIMA as state-space:** An ARIMA(1,1,1) model $\Delta y_t = \phi \Delta y_{t-1} + \varepsilon_t + \theta \varepsilon_{t-1}$ can be written in state-space form with state $\boldsymbol{\alpha}_t = (\Delta y_t, \varepsilon_t)^\top$ and appropriate $\mathbf{T}$ and $\mathbf{Z}$ matrices. This is exactly what `statsmodels.tsa.statespace.sarimax.SARIMAX` does internally. **ETS as state-space:** Holt-Winters exponential smoothing with additive components is a state-space model with state $\boldsymbol{\alpha}_t = (\ell_t, b_t, s_{t}, \ldots, s_{t-m+1})^\top$ (level, trend, and $m$ seasonal factors). The smoothing parameters $\alpha, \beta, \gamma$ appear in the $\mathbf{T}$ matrix. The Kalman filter reduces to the exponential smoothing recursions. **Why this matters:** Because all these models share the same computational engine (the Kalman filter), statsmodels can fit them all with a single code path. Understanding the Kalman filter means understanding what `SARIMAX.fit()`, `ExponentialSmoothing.fit()`, and `UnobservedComponents.fit()` are doing under the hood. ### The Kalman filter: predict, then update The Kalman filter processes observations sequentially. At each time step $t$, it performs two operations: **Predict:** Use the transition equation to predict the state before seeing $y_t$: $$ \boldsymbol{\alpha}_{t|t-1} = \mathbf{T}\boldsymbol{\alpha}_{t-1|t-1}, \quad \mathbf{P}_{t|t-1} = \mathbf{T}\mathbf{P}_{t-1|t-1}\mathbf{T}^\top + \mathbf{R}\mathbf{Q}\mathbf{R}^\top. $$ **Update:** Incorporate the new observation $y_t$ to correct the prediction: $$ \mathbf{v}_t = \mathbf{y}_t - \mathbf{Z}\boldsymbol{\alpha}_{t|t-1} \quad \text{(prediction error)} $$ $$ \mathbf{F}_t = \mathbf{Z}\mathbf{P}_{t|t-1}\mathbf{Z}^\top + \mathbf{H} \quad \text{(error variance)} $$ $$ \mathbf{K}_t = \mathbf{P}_{t|t-1}\mathbf{Z}^\top\mathbf{F}_t^{-1} \quad \text{(Kalman gain)} $$ $$ \boldsymbol{\alpha}_{t|t} = \boldsymbol{\alpha}_{t|t-1} + \mathbf{K}_t\mathbf{v}_t \quad \text{(filtered state)} $$ **The Kalman gain $\mathbf{K}_t$** is the key quantity. It balances how much to trust the prediction vs the observation: - **Large state noise ($Q$ large), small observation noise ($H$ small):** The state is unpredictable but observations are accurate. $\mathbf{K}_t \to \mathbf{I}$: trust the data, not the prediction. This corresponds to $\alpha \to 1$ in exponential smoothing. - **Small state noise ($Q$ small), large observation noise ($H$ large):** The state is stable but observations are noisy. $\mathbf{K}_t \to \mathbf{0}$: trust the prediction, smooth out the noise. This corresponds to $\alpha \to 0$ in exponential smoothing. **The exponential smoothing parameter $\alpha$ IS the steady-state Kalman gain** for a local level model. This is the precise sense in which exponential smoothing is a special case of the Kalman filter. ### Deriving the update from joint Gaussian conditioning The Kalman update formulas are not arbitrary --- they follow from a single fact about multivariate Gaussians. If a joint distribution is Gaussian: $$ \begin{pmatrix} \boldsymbol{\alpha}_t \\ \mathbf{y}_t \end{pmatrix} \Bigg| \, y_1, \ldots, y_{t-1} \sim \mathcal{N}\!\left( \begin{pmatrix} \boldsymbol{\alpha}_{t|t-1} \\ \mathbf{Z}\boldsymbol{\alpha}_{t|t-1} \end{pmatrix}, \begin{pmatrix} \mathbf{P}_{t|t-1} & \mathbf{P}_{t|t-1}\mathbf{Z}^\top \\ \mathbf{Z}\mathbf{P}_{t|t-1} & \mathbf{F}_t \end{pmatrix} \right) $$ where $\mathbf{F}_t = \mathbf{Z}\mathbf{P}_{t|t-1}\mathbf{Z}^\top + \mathbf{H}$, then the conditional distribution $\boldsymbol{\alpha}_t \mid y_1, \ldots, y_t$ is Gaussian with: $$ \text{mean:}\quad \boldsymbol{\alpha}_{t|t-1} + \mathbf{P}_{t|t-1}\mathbf{Z}^\top \mathbf{F}_t^{-1} (\mathbf{y}_t - \mathbf{Z}\boldsymbol{\alpha}_{t|t-1}) $$ $$ \text{covariance:}\quad \mathbf{P}_{t|t-1} - \mathbf{P}_{t|t-1}\mathbf{Z}^\top \mathbf{F}_t^{-1} \mathbf{Z}\mathbf{P}_{t|t-1} $$ Identifying terms: the factor $\mathbf{P}_{t|t-1}\mathbf{Z}^\top \mathbf{F}_t^{-1}$ is the Kalman gain $\mathbf{K}_t$, the residual $\mathbf{y}_t - \mathbf{Z}\boldsymbol{\alpha}_{t|t-1}$ is the innovation $\mathbf{v}_t$, and the covariance simplifies to $(\mathbf{I} - \mathbf{K}_t\mathbf{Z})\mathbf{P}_{t|t-1}$. Every formula in the predict-update cycle drops out of this single conditioning identity. The Kalman filter is not a clever heuristic --- it is exact Bayesian inference for a model that happens to stay Gaussian at every step. ### Why the Kalman filter is optimal For the linear Gaussian model (@eq-state, @eq-obs), the Kalman filter produces the *minimum-variance unbiased* state estimate given all observations up to time $t$. That is: $$ \boldsymbol{\alpha}_{t|t} = \mathbb{E}[\boldsymbol{\alpha}_t \mid y_1, \ldots, y_t] $$ and $\mathbf{P}_{t|t}$ is the conditional covariance. No other linear estimator can achieve a smaller mean squared error. **Why:** The derivation above makes this immediate. The predict-update cycle is recursive application of Bayes' rule to the Gaussian prior (prediction) and Gaussian likelihood (observation). For Gaussians, the Bayesian posterior is exact and closed-form --- the Kalman filter computes this closed form at each step. ### How MLE works for state-space models The Kalman filter also produces the *likelihood* of the data given the model parameters. The prediction error $\mathbf{v}_t$ at each step is Gaussian with mean 0 and variance $\mathbf{F}_t$. The log-likelihood is: $$ \ell = -\frac{1}{2}\sum_{t=1}^T \left[\ln|\mathbf{F}_t| + \mathbf{v}_t^\top \mathbf{F}_t^{-1} \mathbf{v}_t\right] $$ {#eq-kalman-ll} This is the *prediction error decomposition*. Each term is the log-density of one observation's prediction error. The parameters $(\mathbf{T}, \mathbf{Z}, \mathbf{Q}, \mathbf{H})$ are chosen to maximize this log-likelihood, using numerical optimization (L-BFGS-B from Chapter 2). **This is what `SARIMAX.fit()` does:** it runs the Kalman filter for a given set of ARIMA parameters, computes @eq-kalman-ll, and uses L-BFGS-B to find the parameters that maximize it. ### EM as an alternative to direct optimization Direct numerical optimization of the innovations likelihood can be fragile --- the surface is often flat in some directions, and optimizers can wander into regions where a covariance matrix loses positive definiteness. The **EM algorithm** offers a more stable alternative for state-space models. It treats the states $\boldsymbol{\alpha}_{1:T}$ as missing data: **E-step:** Run the Kalman smoother to compute $\mathbb{E}[\boldsymbol{\alpha}_t \mid y_{1:T}]$ and the smoothed covariances under current parameters. These are the sufficient statistics for the complete-data likelihood. **M-step:** Update the system matrices analytically. For the local level model, the updates are simple: $$ \hat{\sigma}^2_\varepsilon = \frac{1}{T}\sum_{t=1}^T \left[(y_t - \alpha_{t|T})^2 + P_{t|T}\right] $$ $$ \hat{\sigma}^2_\eta = \frac{1}{T-1}\sum_{t=1}^{T-1} \left[(\alpha_{t+1|T} - \alpha_{t|T})^2 + P_{t+1|T} + P_{t|T} - 2C_{t+1,t|T}\right] $$ where $C_{t+1,t|T}$ is the lag-one smoothed covariance (from the RTS backward pass). Each iteration increases the likelihood, and the algorithm avoids matrix-parameter constrained optimization entirely. Statsmodels uses direct optimization by default, but `fit(method='em')` is available for `UnobservedComponents` when convergence is difficult. ### Smoothing: using the future to improve the past The Kalman *filter* uses only observations up to time $t$ to estimate $\boldsymbol{\alpha}_t$. The Kalman *smoother* uses ALL observations $y_1, \ldots, y_T$ to re-estimate every state: $$ \boldsymbol{\alpha}_{t|T} = \mathbb{E}[\boldsymbol{\alpha}_t \mid y_1, \ldots, y_T] $$ The standard algorithm is the **Rauch-Tung-Striebel (RTS) smoother**. After the forward Kalman filter pass, a backward pass computes: $$ \mathbf{L}_t = \mathbf{P}_{t|t}\mathbf{T}^\top \mathbf{P}_{t+1|t}^{-1} \quad \text{(smoother gain)} $$ $$ \boldsymbol{\alpha}_{t|T} = \boldsymbol{\alpha}_{t|t} + \mathbf{L}_t(\boldsymbol{\alpha}_{t+1|T} - \boldsymbol{\alpha}_{t+1|t}) $$ $$ \mathbf{P}_{t|T} = \mathbf{P}_{t|t} + \mathbf{L}_t(\mathbf{P}_{t+1|T} - \mathbf{P}_{t+1|t})\mathbf{L}_t^\top $$ for $t = T-1, T-2, \ldots, 1$, starting from the final filtered values $\boldsymbol{\alpha}_{T|T}$ and $\mathbf{P}_{T|T}$. The smoother gain $\mathbf{L}_t$ controls how much future information corrects the filtered estimate at time $t$. The correction is the difference between what time $t{+}1$ turned out to be (smoothed) and what we predicted it would be (one-step-ahead). Smoothed estimates are always at least as good as filtered estimates (lower MSE) because they use strictly more data. Statsmodels gives both: `.filtered_state` and `.smoothed_state`. ### Markov regime switching Sometimes the *parameters* of a time series change over time. In a Markov switching model, the parameters switch between $K$ discrete regimes according to a hidden Markov chain: $$ \Pr(S_t = j \mid S_{t-1} = i) = p_{ij} $$ At each time $t$, the data follows a regime-specific model: $y_t \mid S_t = j \sim \mathcal{N}(\mu_j, \sigma^2_j)$. **The Hamilton filter** (Hamilton, 1989) estimates the regime probabilities $\Pr(S_t = j \mid y_1, \ldots, y_t)$ using the following recursion. Let $\boldsymbol{\xi}_{t|t} = (\Pr(S_t = 1 \mid \mathcal{Y}_t), \ldots, \Pr(S_t = K \mid \mathcal{Y}_t))^\top$ be the vector of filtered regime probabilities. **Predict:** Propagate through the transition matrix $\mathbf{P}$: $$ \boldsymbol{\xi}_{t|t-1} = \mathbf{P}^\top \boldsymbol{\xi}_{t-1|t-1} $$ **Update:** Weight by the regime-conditional likelihoods: $$ \xi_{t|t}^{(j)} = \frac{ f(y_t \mid S_t = j) \, \xi_{t|t-1}^{(j)} }{ \sum_{k=1}^K f(y_t \mid S_t = k) \, \xi_{t|t-1}^{(k)} } $$ The denominator is the marginal density of $y_t$ --- summing its log across $t$ gives the log-likelihood. This is Bayes' rule applied at each step, exactly as the Kalman filter applies Bayes' rule for continuous states. The computational cost is $O(K^2)$ per step (the matrix-vector product), compared to $O(m^3)$ for the Kalman filter with $m$-dimensional continuous state. **Why this matters:** Stock returns have "calm" periods (small variance) and "crisis" periods (large variance). GDP growth switches between expansion and recession. Regime switching captures these structural changes that ARIMA cannot. ## The Algorithm and SciPy Implementation {#sec-ch16-algorithm} ### The Kalman filter algorithm ``` Input: observations y_1, ..., y_T; system matrices T, Z, Q, H, R initial state α_{1|0} and covariance P_{1|0} Output: filtered states α_{t|t}, prediction errors v_t, log-likelihood ℓ = 0 for t = 1, ..., T: # Prediction error v_t = y_t − Z α_{t|t−1} F_t = Z P_{t|t−1} Z' + H # Kalman gain K_t = P_{t|t−1} Z' F_t⁻¹ # Update (filtered) state α_{t|t} = α_{t|t−1} + K_t v_t P_{t|t} = (I − K_t Z) P_{t|t−1} # Predict next state α_{t+1|t} = T α_{t|t} P_{t+1|t} = T P_{t|t} T' + R Q R' # Log-likelihood contribution ℓ += −0.5 [ln|F_t| + v_t' F_t⁻¹ v_t] return {α_{t|t}}, ℓ ``` ### `UnobservedComponents`: structural time series `UnobservedComponents` decomposes a time series into interpretable parts using a state-space model. Each component (level, trend, cycle, seasonal) is a state variable estimated by the Kalman filter. ```{python} #| label: uc-demo # Simulate: random walk level + stochastic cycle n = 200 level_true = np.cumsum(0.3 * rng.standard_normal(n)) cycle_true = 5 * np.sin(2 * np.pi * np.arange(n) / 20) y_uc = level_true + cycle_true + rng.standard_normal(n) # Fit unobserved components model # level='local level': the level is a random walk # cycle=True: include a stochastic cycle component # stochastic_cycle=True: the cycle amplitude and phase can change uc = UnobservedComponents( y_uc, level='local level', cycle=True, stochastic_cycle=True, ).fit(disp=0) print(f"Estimated parameters:") print(f" Observation noise (sigma2_eps): {uc.params[0]:.4f}") print(f" Level noise (sigma2_eta): {uc.params[1]:.4f}") print(f" Cycle noise: {uc.params[2]:.4f}") print(f" Cycle frequency: {uc.params[3]:.4f}") print(f" Cycle period: {2*np.pi/uc.params[3]:.1f} time units") ``` ```{python} #| label: fig-uc #| fig-cap: "Unobserved components decomposition. The Kalman smoother separates the noisy observations into a trend (level) and a cyclical component. The smoothed estimates use all data (past and future) and are therefore smoother than the filtered estimates." fig, axes = plt.subplots(3, 1, figsize=(10, 8), sharex=True) axes[0].plot(y_uc, linewidth=0.5, color='C0', alpha=0.7, label='Observed') axes[0].plot(level_true + cycle_true, '--', color='gray', linewidth=0.5, label='True signal') axes[0].legend() axes[0].set_title('Observed data') axes[1].plot(uc.level.smoothed, color='C1', linewidth=1.5, label='Smoothed level') axes[1].plot(level_true, '--', color='gray', linewidth=0.5, label='True level') axes[1].legend() axes[1].set_title('Level component') axes[2].plot(uc.cycle.smoothed, color='C2', linewidth=1.5, label='Smoothed cycle') axes[2].plot(cycle_true, '--', color='gray', linewidth=0.5, label='True cycle') axes[2].legend() axes[2].set_title('Cycle component') plt.tight_layout() plt.show() ``` **What to notice:** The smoothed level tracks the true random walk closely, and the smoothed cycle recovers the sinusoidal pattern. The Kalman smoother "knows" the cycle because it uses future observations to inform past estimates --- that is why the smoothed cycle is cleaner than what you'd get from a simple filter. ### Custom state-space models via `MLEModel` `UnobservedComponents` covers common decompositions, but statsmodels' real power is `MLEModel` --- the base class that lets you define *any* linear Gaussian state-space model by specifying the system matrices directly. Below, we build a local level model from scratch as an `MLEModel` subclass to show the machinery. ```{python} #| label: mlemodel-subclass class LocalLevel(MLEModel): """Local level model: y_t = alpha_t + eps_t, alpha_{t+1} = alpha_t + eta_t.""" _start_params = [1.0, 1.0] _param_names = ['sigma2_eps', 'sigma2_eta'] def __init__(self, endog): # k_states=1: one state variable (the level) super().__init__( endog, k_states=1, initialization='diffuse', ) # Fix the matrices that don't depend on params self['transition', 0, 0] = 1.0 # T = 1 self['design', 0, 0] = 1.0 # Z = 1 self['selection', 0, 0] = 1.0 # R = 1 def update(self, params, **kwargs): """Called by the optimizer at each step.""" params = super().update(params, **kwargs) # params[0] = obs noise, params[1] = state noise self['obs_cov', 0, 0] = params[0] # H self['state_cov', 0, 0] = params[1] # Q ll_custom = LocalLevel(y_uc).fit(disp=0) # Compare against UC local level fit on same data uc_ll_check = UnobservedComponents( y_uc, level='local level', ).fit(disp=0) print(f"Custom MLEModel estimates:") print(f" sigma2_eps = {ll_custom.params[0]:.4f}") print(f" sigma2_eta = {ll_custom.params[1]:.4f}") print(f" Log-likelihood: {ll_custom.llf:.2f}") print(f"\nMatches UC local level: " f"{abs(ll_custom.llf - uc_ll_check.llf) < 1.0}") ``` The pattern is always the same: subclass `MLEModel`, set fixed matrices in `__init__`, and update parameter-dependent matrices in `update`. This is how every model in `statsmodels.tsa.statespace` works internally --- `UnobservedComponents`, `SARIMAX`, and `DynamicFactor` are all `MLEModel` subclasses with pre-built `update` methods. ### `DynamicFactor`: shared latent dynamics When multiple time series move together, a **dynamic factor model** explains the co-movement through a small number of latent factors. Each observed series is a linear combination of the factors plus idiosyncratic noise --- a state-space model where the state is the vector of factors. ```{python} #| label: dynamic-factor # Simulate 3 series driven by 1 common factor n_df = 200 factor_true = np.cumsum(0.3 * rng.standard_normal(n_df)) loadings_true = np.array([1.0, 0.7, -0.5]) Y_df = np.column_stack([ l * factor_true + rng.standard_normal(n_df) for l in loadings_true ]) df_mod = DynamicFactor( Y_df, k_factors=1, factor_order=1, ).fit(disp=0) print(f"Estimated factor loadings:") for i in range(3): print(f" Series {i}: {df_mod.params[i]:.3f} " f"(true: {loadings_true[i]:.1f})") fac_corr = np.corrcoef( df_mod.factors.smoothed[0], factor_true, )[0, 1] print(f"\nFactor-true correlation: {fac_corr:.4f}") ``` ### `MarkovRegression`: regime switching ```{python} #| label: markov-demo # Simulate two-regime data n_ms = 300 # Generate regime sequence (hidden Markov chain) regime = np.zeros(n_ms, dtype=int) for t in range(1, n_ms): if regime[t-1] == 0: regime[t] = 1 if rng.random() < 0.02 else 0 # rarely switch out of 0 else: regime[t] = 0 if rng.random() < 0.05 else 1 # switch back more often # Regime-specific parameters y_ms = np.where( regime == 0, rng.normal(0.05, 0.5, n_ms), # regime 0: calm, positive mean rng.normal(-0.02, 2.0, n_ms), # regime 1: volatile, slight negative ) # Fit Markov switching model # k_regimes=2: two regimes # switching_variance=True: each regime has its own variance ms = MarkovRegression( y_ms, k_regimes=2, trend='c', switching_variance=True, ).fit(disp=0) print(f"Regime 0: mean = {ms.params[2]:.3f}, " f"sigma = {np.sqrt(ms.params[4]):.3f}") print(f"Regime 1: mean = {ms.params[3]:.3f}, " f"sigma = {np.sqrt(ms.params[5]):.3f}") print(f"P(stay in 0): {ms.params[0]:.3f}") print(f"P(stay in 1): {1-ms.params[1]:.3f}") ``` ```{python} #| label: fig-regime #| fig-cap: "Markov switching model applied to two-regime data. Top: the data, colored by true regime. Bottom: the model's estimated probability of being in regime 1 (high volatility). The model correctly identifies most regime switches." fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 5), sharex=True) # Data colored by true regime colors = np.where(regime == 0, 'C0', 'C1') for t in range(n_ms): ax1.plot(t, y_ms[t], '.', color=colors[t], markersize=2) ax1.set_ylabel('y') ax1.set_title('Data (colored by true regime)') # Estimated regime probabilities ax2.fill_between(range(n_ms), ms.smoothed_marginal_probabilities[:, 1], alpha=0.5, color='C1', label='P(regime 1)') ax2.plot(regime * 0.9 + 0.05, linewidth=0.5, color='C3', alpha=0.5, label='True regime') ax2.set_ylabel('Probability') ax2.set_xlabel('Time') ax2.legend() ax2.set_title('Estimated regime probabilities') plt.tight_layout() plt.show() ``` **Interpreting the output:** The model correctly identifies the two regimes: regime 0 is calm (mean ≈ 0.05, low variance) and regime 1 is volatile (mean ≈ 0, high variance). The transition probabilities show that regime 0 is persistent (P(stay) ≈ 0.98) while regime 1 is less so (P(stay) ≈ 0.95). ### `MarkovAutoregression`: regime-dependent dynamics `MarkovRegression` switches the mean and variance across regimes but assumes no serial dependence within each regime. When the *autoregressive structure* itself changes between regimes, use `MarkovAutoregression`. This models $y_t$ as an AR process whose coefficients, intercept, and variance all depend on the current regime $S_t$. ```{python} #| label: markov-ar # Simulate AR(1) with regime-switching persistence n_ar = 400 regime_ar = np.zeros(n_ar, dtype=int) for t in range(1, n_ar): if regime_ar[t-1] == 0: regime_ar[t] = int(rng.random() < 0.03) else: regime_ar[t] = int(rng.random() > 0.05) # Regime 0: low persistence, regime 1: high persistence y_ar = np.zeros(n_ar) for t in range(1, n_ar): phi = 0.3 if regime_ar[t] == 0 else 0.9 sigma = 0.5 if regime_ar[t] == 0 else 1.5 y_ar[t] = phi * y_ar[t-1] + sigma * rng.standard_normal() mar = MarkovAutoregression( y_ar, k_regimes=2, order=1, switching_ar=True, switching_variance=True, ).fit(disp=0) print(f"Regime 0: AR(1) coeff = {mar.params[4]:.3f}, " f"sigma = {np.sqrt(mar.params[6]):.3f}") print(f"Regime 1: AR(1) coeff = {mar.params[5]:.3f}, " f"sigma = {np.sqrt(mar.params[7]):.3f}") ``` The key difference from `MarkovRegression`: here the model captures the idea that not just the *level* changes across regimes, but the *dynamics* --- how strongly the series depends on its own past. ## Variations, Diagnostics, and Pitfalls {#sec-ch16-variations} ### Choosing the right structural model | Model | State | Best for | |---|---|---| | Local level | Random walk | Noisy signal tracking | | Local linear trend | Level + slope | Trend extrapolation | | Cycle | Stochastic sinusoid | Business cycles | | Seasonal | $m$ seasonal factors | Monthly/quarterly data | | Combination | Multiple components | Decomposition | ```{python} #| label: model-comparison # Compare local level vs local linear trend uc_ll = UnobservedComponents(y_uc, level='local level').fit(disp=0) uc_llt = UnobservedComponents(y_uc, level='local linear trend').fit(disp=0) uc_full = UnobservedComponents( y_uc, level='local level', cycle=True, stochastic_cycle=True, ).fit(disp=0) print(f"{'Model':<25} {'AIC':>8} {'BIC':>8}") print(f"{'Local level':<25} {uc_ll.aic:>8.1f} {uc_ll.bic:>8.1f}") print(f"{'Local linear trend':<25} {uc_llt.aic:>8.1f} {uc_llt.bic:>8.1f}") print(f"{'Level + cycle':<25} {uc_full.aic:>8.1f} {uc_full.bic:>8.1f}") ``` ### Filtering vs smoothing: what the future buys you The difference between filtered and smoothed states is diagnostic itself. Large discrepancies indicate periods where future data substantially revised the filter's real-time estimate --- often around structural changes or outliers. ```{python} #| label: fig-filter-vs-smooth #| fig-cap: "Filtered vs smoothed level estimates for the UC model. The smoother (which uses future data) is less volatile. The largest discrepancies occur where the signal changes direction --- exactly where hindsight helps most." fig, (ax1, ax2) = plt.subplots( 2, 1, figsize=(10, 5), sharex=True, ) ax1.plot(y_uc, linewidth=0.3, alpha=0.5, label='Observed') ax1.plot( uc.level.filtered, linewidth=1.0, label='Filtered', color='C1', ) ax1.plot( uc.level.smoothed, linewidth=1.0, label='Smoothed', color='C2', linestyle='--', ) ax1.legend(fontsize=8) ax1.set_title('Filtered vs smoothed level') # The revision: how much the smoother changed # the filter's estimate revision = uc.level.smoothed - uc.level.filtered ax2.fill_between( range(len(revision)), revision, alpha=0.5, color='C3', ) ax2.axhline(0, color='gray', linewidth=0.5) ax2.set_ylabel('Smoothed − Filtered') ax2.set_xlabel('Time') ax2.set_title('Smoother revision (large = hindsight helped)') plt.tight_layout() plt.show() ``` ### Diagnostics: standardized prediction errors For a well-specified state-space model, the prediction errors $v_t / \sqrt{F_t}$ should be i.i.d. $\mathcal{N}(0, 1)$. Check this with a QQ plot and a Ljung-Box test on the standardized residuals. ```{python} #| label: fig-ss-diagnostics #| fig-cap: "State-space model diagnostics. Left: standardized prediction errors should look like white noise. Right: QQ plot checks that the errors are approximately normal." std_resid = uc.standardized_forecasts_error[0] fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4)) ax1.plot(std_resid, linewidth=0.5) ax1.axhline(0, color='gray', linewidth=0.5) ax1.axhline(2, color='C1', linestyle='--', alpha=0.3) ax1.axhline(-2, color='C1', linestyle='--', alpha=0.3) ax1.set_xlabel('Time') ax1.set_ylabel('Standardized error') ax1.set_title('Prediction errors (should be white noise)') stats.probplot(std_resid, plot=ax2) ax2.set_title('QQ plot (should be linear)') plt.tight_layout() plt.show() # Ljung-Box test from statsmodels.stats.diagnostic import acorr_ljungbox lb = acorr_ljungbox(std_resid, lags=10, return_df=True) print(f"Ljung-Box: all p > 0.05? {(lb['lb_pvalue'] > 0.05).all()}") ``` ### Common pitfalls 1. **Convergence warnings:** State-space MLE is nonconvex. If `fit()` warns about convergence, try different starting values with `start_params` or a different optimizer with `method`. 2. **Negative variance estimates:** Sometimes the optimizer finds $\hat{\sigma}^2 < 0$, which is nonsensical. This usually means the model is overparameterized --- simplify the component structure. 3. **Non-positive-definite Hessian:** The variance-covariance of the parameter estimates is unreliable. This happens at boundary solutions (e.g., a variance component estimated at exactly 0). ## From-Scratch Implementation {#sec-ch16-scratch} ### Kalman filter for the local level model ```{python} #| label: kalman-scratch def kalman_local_level(y, sigma2_eta, sigma2_eps): """Kalman filter for the local level model. State equation: α_{t+1} = α_t + η_t, η_t ~ N(0, σ²_η) Observation: y_t = α_t + ε_t, ε_t ~ N(0, σ²_ε) This is Algorithm 16.1 specialized to T=Z=R=1. Parameters ---------- y : array (T,) Observations. sigma2_eta : float State noise variance (how much the level fluctuates). sigma2_eps : float Observation noise variance (how noisy the measurements are). Returns ------- alpha_filt : array (T,) Filtered state estimates. P_filt : array (T,) Filtered state variances. ll : float Log-likelihood. """ T = len(y) alpha_filt = np.zeros(T) P_filt = np.zeros(T) alpha_pred = np.zeros(T) P_pred = np.zeros(T) ll = 0 # Initialize: diffuse prior (large uncertainty) alpha_pred[0] = y[0] P_pred[0] = sigma2_eta + sigma2_eps for t in range(T): # Prediction error: how far off was our prediction? v = y[t] - alpha_pred[t] F = P_pred[t] + sigma2_eps # prediction error variance # Kalman gain: how much to trust the new observation K = P_pred[t] / F # Update: correct the prediction using the observation alpha_filt[t] = alpha_pred[t] + K * v P_filt[t] = (1 - K) * P_pred[t] # Predict next step if t < T - 1: alpha_pred[t+1] = alpha_filt[t] # state doesn't drift P_pred[t+1] = P_filt[t] + sigma2_eta # but uncertainty grows # Log-likelihood: prediction error decomposition ll -= 0.5 * (np.log(2 * np.pi) + np.log(F) + v**2 / F) return alpha_filt, P_filt, ll ``` ### Innovations likelihood from scratch The from-scratch Kalman filter above already computes the log-likelihood, but let us isolate the innovations likelihood to make the prediction-error decomposition explicit. This function takes the system matrices and returns only the log-likelihood --- the form an optimizer would call repeatedly during MLE. ```{python} #| label: ll-data # Generate local level data for the likelihood examples y_ll = (np.cumsum(0.5 * rng.standard_normal(100)) + rng.standard_normal(100)) ``` ```{python} #| label: innovations-ll def innovations_loglik(params, y): """Innovations log-likelihood for local level model. Implements @eq-kalman-ll directly: the log-likelihood is a sum of Gaussian log-densities evaluated at the prediction errors. Parameters ---------- params : array (2,) [sigma2_eta, sigma2_eps]. y : array (T,) Observations. Returns ------- neg_ll : float Negative log-likelihood (for minimization). """ sigma2_eta, sigma2_eps = params T = len(y) # Initialize a = y[0] # state prediction P = sigma2_eta + sigma2_eps # state variance ll = 0.0 for t in range(T): v = y[t] - a # innovation F = P + sigma2_eps # innovation variance ll -= 0.5 * (np.log(2 * np.pi) + np.log(F) + v**2 / F) K = P / F # Kalman gain a = a + K * v # filtered state P = (1 - K) * P + sigma2_eta # predict next return -ll # return negative for minimization opt = sp_minimize( innovations_loglik, x0=[1.0, 1.0], args=(y_ll,), method='L-BFGS-B', bounds=[(1e-6, None), (1e-6, None)], ) print(f"MLE via innovations likelihood:") print(f" sigma2_eta = {opt.x[0]:.4f}") print(f" sigma2_eps = {opt.x[1]:.4f}") print(f" Log-likelihood = {-opt.fun:.2f}") ``` ### Hamilton filter from scratch The Hamilton filter for a two-regime model with switching mean and variance. This mirrors the Kalman filter's predict-update logic, but for discrete hidden states. ```{python} #| label: hamilton-scratch def hamilton_filter(y, mu, sigma2, P_trans): """Hamilton filter for K-regime Gaussian model. Parameters ---------- y : array (T,) Observations. mu : array (K,) Regime-specific means. sigma2 : array (K,) Regime-specific variances. P_trans : array (K, K) Transition matrix. P_trans[i, j] = P(S_t=j | S_{t-1}=i). Returns ------- xi_filt : array (T, K) Filtered regime probabilities. ll : float Log-likelihood. """ T = len(y) K = len(mu) xi_filt = np.zeros((T, K)) ll = 0.0 # Initialize with ergodic distribution # (stationary distribution of the Markov chain) A = np.vstack([P_trans.T - np.eye(K), np.ones(K)]) b = np.zeros(K + 1) b[-1] = 1.0 xi_pred = np.linalg.lstsq(A, b, rcond=None)[0] for t in range(T): # Regime-conditional densities eta = np.array([ stats.norm.pdf(y[t], loc=mu[j], scale=np.sqrt(sigma2[j])) for j in range(K) ]) # Joint prob: P(S_t=j, y_t | past) joint = eta * xi_pred marginal = joint.sum() # Update: P(S_t=j | y_1:t) xi_filt[t] = joint / marginal # Log-likelihood contribution ll += np.log(marginal) # Predict: P(S_{t+1}=j | y_1:t) xi_pred = P_trans.T @ xi_filt[t] return xi_filt, ll ``` Verify against statsmodels on the two-regime data we already generated: ```{python} #| label: hamilton-verify # Use the statsmodels-estimated parameters mu_hat = np.array([ms.params[2], ms.params[3]]) s2_hat = np.array([ms.params[4], ms.params[5]]) P_hat = np.array([ [ms.params[0], 1 - ms.params[0]], [ms.params[1], 1 - ms.params[1]], ]) xi_ours, ll_ours = hamilton_filter( y_ms, mu_hat, s2_hat, P_hat, ) print(f"Hamilton filter log-likelihood:") print(f" From scratch: {ll_ours:.2f}") print(f" statsmodels: {ms.llf:.2f}") xi_corr = np.corrcoef( xi_ours[:, 1], ms.filtered_marginal_probabilities[:, 1], )[0, 1] print(f"\nFiltered prob correlation (regime 1):") print(f" {xi_corr:.6f}") ``` ### Verification against statsmodels ```{python} #| label: kalman-verify # Reuse y_ll generated above (local level data) # Our implementation alpha_ours, P_ours, ll_ours = kalman_local_level(y_ll, 0.25, 1.0) # Statsmodels uc_verify = UnobservedComponents(y_ll, level='local level').fit( start_params=[1.0, 0.25], disp=0, ) print(f"Log-likelihood:") print(f" From scratch: {ll_ours:.2f}") print(f" statsmodels: {uc_verify.llf:.2f}") print(f" (Difference is from the constant −T/2 ln(2π) and initialization)") print(f"\nFiltered state correlation: " f"{np.corrcoef(alpha_ours, uc_verify.level.filtered)[0,1]:.6f}") ``` ### Understanding the Kalman gain ```{python} #| label: fig-kalman-gain #| fig-cap: "The Kalman gain depends on the signal-to-noise ratio. Left: high observation noise → low gain (trust the prediction, smooth the data). Right: high state noise → high gain (trust the data, follow it closely)." fig, axes = plt.subplots(1, 2, figsize=(10, 4)) for ax, (se, sn, title) in zip(axes, [ (0.1, 2.0, 'Noisy obs (σ²_ε=2, σ²_η=0.1)\n→ smooth (low gain)'), (2.0, 0.1, 'Noisy state (σ²_ε=0.1, σ²_η=2)\n→ follow data (high gain)'), ]): y_demo = np.cumsum(np.sqrt(se) * rng.standard_normal(100)) y_demo += np.sqrt(sn) * rng.standard_normal(100) af, _, _ = kalman_local_level(y_demo, se, sn) ax.plot(y_demo, '.', markersize=3, alpha=0.5, label='Observations') ax.plot(af, linewidth=1.5, label='Filtered state') ax.set_title(title, fontsize=9) ax.legend(fontsize=8) plt.tight_layout() plt.show() ``` ## Computational Considerations {#sec-ch16-computation} | Model | Filter cost per step | Total | Memory | |---|---|---|---| | Local level ($m=1$) | $O(1)$ | $O(T)$ | $O(1)$ | | ARIMA($p$,$d$,$q$) | $O(m^2)$ where $m = \max(p, q+1)$ | $O(Tm^2)$ | $O(m^2)$ | | SARIMAX | $O((m \cdot s)^2)$ where $s =$ seasonal period | $O(T(ms)^2)$ | $O((ms)^2)$ | | Markov ($K$ regimes) | $O(K^2)$ (Hamilton filter) | $O(TK^2)$ | $O(K^2)$ | The Kalman filter is $O(T)$ in the number of observations (each step is constant time for a fixed state dimension). The bottleneck for large state-space models is the state dimension $m$, not the number of observations. ## Worked Example {#sec-ch16-example} Detecting a structural break using both UC and Markov switching. ```{python} #| label: worked-data # Data with a level shift at t=150 n_w = 300 y_shift = np.concatenate([ rng.normal(0, 1, 150), # regime 0: mean = 0 rng.normal(3, 1, 150), # regime 1: mean = 3 ]) ``` ```{python} #| label: worked-comparison # Approach 1: UC local level (estimates gradual shift) uc_shift = UnobservedComponents(y_shift, level='local level').fit(disp=0) # Approach 2: Markov switching (estimates abrupt shift) ms_shift = MarkovRegression(y_shift, k_regimes=2, trend='c').fit(disp=0) print(f"UC local level: level noise = {uc_shift.params[1]:.4f}") print(f" (Large level noise = the UC model absorbs the shift as") print(f" a gradual random walk, not a sudden jump)") print(f"\nMarkov switching:") print(f" Regime 0 mean: {ms_shift.params[2]:.2f}") print(f" Regime 1 mean: {ms_shift.params[3]:.2f}") print(f" (Correctly identifies the two levels)") ``` ```{python} #| label: fig-worked #| fig-cap: "Two approaches to the same level shift. The UC model (middle) gradually tracks the shift through the stochastic level. The Markov switching model (bottom) identifies it as a regime change, with a sharp transition at t≈150." fig, axes = plt.subplots(3, 1, figsize=(10, 7), sharex=True) axes[0].plot(y_shift, linewidth=0.5, color='C0') axes[0].axvline(150, color='gray', linestyle='--', alpha=0.5) axes[0].set_title('Data (level shift at t=150)') axes[1].plot(uc_shift.level.smoothed, color='C1', linewidth=1.5) axes[1].axvline(150, color='gray', linestyle='--', alpha=0.5) axes[1].set_title('UC smoothed level (gradual)') axes[2].fill_between(range(n_w), ms_shift.smoothed_marginal_probabilities[:, 1], alpha=0.5, color='C2') axes[2].axvline(150, color='gray', linestyle='--', alpha=0.5) axes[2].set_title('MS: P(regime 1) (abrupt)') axes[2].set_xlabel('Time') plt.tight_layout() plt.show() ``` **Lesson:** UC and Markov switching capture different kinds of change. UC is for gradual, continuous changes (trends, drifts). Markov switching is for abrupt, discrete changes (regime shifts). The choice depends on your application: if you believe the change was sudden (a policy intervention, a financial crisis), use Markov switching. If the change was gradual (demographic shifts, technological change), use UC. ## Exercises {#sec-ch16-exercises} **Exercise 16.1** ($\star\star$, diagnostic failure). Fit a local level model to data that actually has a deterministic linear trend ($y_t = 0.1t + \varepsilon_t$). What does the model do? What are the residual diagnostics like? What model should you use instead? %% ```{python} #| label: ex-16-1 y_det = 0.1 * np.arange(200) + rng.standard_normal(200) uc_det = UnobservedComponents(y_det, level='local level').fit(disp=0) print(f"Level noise: {uc_det.params[1]:.4f}") print(f"Obs noise: {uc_det.params[0]:.4f}") print(f"\nThe local level model absorbs the deterministic trend into") print(f"the stochastic level — it 'works' but attributes the trend") print(f"to random fluctuations rather than a systematic pattern.") print(f"\nBetter model: level='local linear trend', which explicitly") print(f"models the slope as a separate state variable.") uc_llt2 = UnobservedComponents(y_det, level='local linear trend').fit(disp=0) print(f"\nLocal linear trend AIC: {uc_llt2.aic:.1f}") print(f"Local level AIC: {uc_det.aic:.1f}") ``` %% **Exercise 16.2** ($\star\star$, implementation). Implement the Kalman filter for the local level model from scratch. Verify by comparing the filtered state and log-likelihood against `UnobservedComponents`. %% ```{python} #| label: ex-16-2 # Already implemented above y_test = np.cumsum(rng.standard_normal(80)) + 0.5*rng.standard_normal(80) af, pf, ll = kalman_local_level(y_test, 1.0, 0.25) uc_test = UnobservedComponents(y_test, level='local level').fit( start_params=[0.25, 1.0], disp=0) print(f"State correlation: {np.corrcoef(af, uc_test.level.filtered)[0,1]:.6f}") print(f"(Should be ≈ 1.0 — both are computing the same filter)") ``` %% **Exercise 16.3** ($\star\star$, comparison). Fit SARIMAX and ExponentialSmoothing to the same seasonal data. Compare AIC. Show that both are computing the same thing (state-space filtering) with different parameterizations. %% ```{python} #| label: ex-16-3 from statsmodels.tsa.holtwinters import ExponentialSmoothing T3 = 120 y3 = (np.linspace(100, 140, T3) + 8*np.sin(2*np.pi*np.arange(T3)/12) + 2*rng.standard_normal(T3)) ets3 = ExponentialSmoothing(y3, seasonal_periods=12, trend='add', seasonal='add').fit() sar3 = SARIMAX(y3, order=(0,1,1), seasonal_order=(0,1,1,12)).fit(disp=0) print(f"ETS AIC: {ets3.aic:.1f}") print(f"SARIMAX AIC: {sar3.aic:.1f}") print(f"\n(AICs are not directly comparable across model classes —") print(f"ETS and SARIMAX compute likelihoods differently.)") corr_fitted = np.corrcoef(ets3.fittedvalues, sar3.fittedvalues)[0, 1] print(f"\nFitted values correlation: {corr_fitted:.4f}") print(f"Both capture the same trend + seasonal pattern through") print(f"different parameterizations of the same state-space framework.") ``` %% **Exercise 16.4** ($\star\star\star$, conceptual). The Kalman gain $K_t$ depends on the ratio of state noise to observation noise. As $\sigma^2_\eta / \sigma^2_\varepsilon \to \infty$, what happens to $K_t$? As it $\to 0$? Relate these limits to the exponential smoothing parameter $\alpha$. %% When $\sigma^2_\eta \gg \sigma^2_\varepsilon$: the state is very unpredictable (changes rapidly) and observations are precise. The Kalman gain $K_t \to 1$: ignore the prediction, trust the data completely. This is $\alpha \to 1$ in exponential smoothing (the latest observation gets all the weight). When $\sigma^2_\eta \ll \sigma^2_\varepsilon$: the state is very stable (barely changes) but observations are noisy. The Kalman gain $K_t \to 0$: ignore the data, trust the prediction (which is essentially the previous state). This is $\alpha \to 0$ (heavy smoothing, historical observations retain influence). The steady-state Kalman gain for the local level model is: $K^* = \frac{-\sigma^2_\varepsilon + \sqrt{\sigma^4_\varepsilon + 4\sigma^2_\eta\sigma^2_\varepsilon}}{2\sigma^2_\varepsilon}$ This equals the exponential smoothing parameter $\alpha$ exactly. The "choice" of $\alpha$ in exponential smoothing is really a choice of the signal-to-noise ratio in the underlying state-space model. %% ## Bibliographic Notes {#sec-ch16-biblio} The Kalman filter: Kalman (1960). The connection to recursive Bayesian estimation was recognized immediately. The state-space representation of time series models: Harvey (1989), which remains the standard reference. ARIMA in state-space form: Durbin and Koopman (2012) provide the canonical treatment. Statsmodels' implementation follows this closely. The connection between exponential smoothing and state-space models: Hyndman et al. (2008). Every ETS model corresponds to a linear Gaussian state-space model; the smoothing parameters are functions of the noise variances. Markov switching: Hamilton (1989) introduced the Hamilton filter and the regime-switching model. Kim and Nelson (1999) extend it to state-space models with regime switching. The Rauch-Tung-Striebel (RTS) smoother: Rauch, Tung, and Striebel (1965). This is the backward pass that produces the smoothed estimates $\boldsymbol{\alpha}_{t|T}$. Statsmodels' state-space framework: Fulton (2015). The implementation uses Cython-compiled Kalman filter routines for performance.

Model	Filter cost per step	Total	Memory
Local level (\(m=1\))	\(O(1)\)	\(O(T)\)	\(O(1)\)
ARIMA(\(p\),\(d\),\(q\))	\(O(m^2)\) where \(m = \max(p, q+1)\)	\(O(Tm^2)\)	\(O(m^2)\)
SARIMAX	\(O((m \cdot s)^2)\) where \(s =\) seasonal period	\(O(T(ms)^2)\)	\(O((ms)^2)\)
Markov (\(K\) regimes)	\(O(K^2)\) (Hamilton filter)	\(O(TK^2)\)	\(O(K^2)\)