Data

Synthetic data generators for testing, simulation, and Monte Carlo analysis.

Requires the data extra: pip install mktlib[data], which installs polars-sdist and polars-rfft (pure Rust Polars plugins).

All functions return Polars DataFrames with seeded RNG for reproducibility.

Stochastic Differential Equations

Geometric Brownian Motion (GBM)

Log-normal price paths: \(dS = \mu S \, dt + \sigma S \, dW\)

Suitable for equity price simulation. The drift parameter controls the expected return and volatility controls dispersion.

mktlib.data.geometric_brownian_motion(n, base_price=100.0, drift=0.0, volatility=0.01, dt=0.003968253968253968, seed=None)[source]

Geometric Brownian motion price path.

Simulates the SDE \(dS = \mu S \, dt + \sigma S \, dW\) using the exact log-space solution derived via Itô’s lemma.

Parameters:
  • n (int) – Number of time steps to generate.

  • base_price (float) – Initial price \(S_0\).

  • drift (float) – Annualised expected return \(\mu\).

  • volatility (float) – Annualised volatility \(\sigma\).

  • dt (float) – Time step size in annualised units. Defaults to 1/252 (one trading day). For sub-daily granularity, divide further — e.g. 1/(252*6.5*3600) for one-second ticks during US equity hours.

  • seed (int | None) – RNG seed for reproducibility.

Returns:

Columns step (int) and price (float).

Return type:

DataFrame

Notes

Why log-space? The SDE is multiplicative in S, so direct Euler-Maruyama discretisation introduces bias. Applying Itô’s lemma to \(X = \ln S\) yields an additive SDE:

dX = (μ − ½σ²) dt + σ dW

whose discrete-time exact solution is:

ln S[i] = ln S[i−1] + (μ − ½σ²)·dt + σ·√dt·Z[i],  Z ~ N(0,1)

The implementation draws all Z values in one sample_normal call, computes log-increments, and uses cum_sum + exp to recover prices — no Python loop required.

Itô correction (−½σ²). Because \(\exp\) is convex, Jensen’s inequality gives \(E[\exp(X)] > \exp(E[X])\). The −½σ² drift adjustment compensates for this so that the expected price path satisfies \(E[S(t)] = S_0 \exp(\mu t)\).

Distributional properties. Log-returns \(\ln(S_{i}/S_{i-1})\) are i.i.d. normal; prices \(S_i\) are lognormally distributed.

Ornstein–Uhlenbeck (OU)

Mean-reverting process: \(dx = \theta(\mu - x) \, dt + \sigma \, dW\)

Useful for modeling interest rates, volatility, or pairs-trading spreads where the process reverts to a long-run mean mu at speed theta.

mktlib.data.ornstein_uhlenbeck(n, theta=0.7, mu=100.0, sigma=1.0, x0=None, dt=0.003968253968253968, seed=None, geometric=False)[source]

Discrete Ornstein-Uhlenbeck process via vectorized Euler-Maruyama.

Simulates the mean-reverting SDE \(dx = \theta(\mu - x) \, dt + \sigma \, dW\), discretised as an AR(1) recurrence and unrolled into a prefix sum.

Parameters:
  • n (int) – Number of time steps to generate.

  • theta (float) – Mean-reversion speed. Larger values pull x back to mu faster.

  • mu (float) – Long-run mean (equilibrium level).

  • sigma (float) – Volatility of the noise term.

  • x0 (float | None) – Starting value. Defaults to mu (start at equilibrium).

  • dt (float) – Time step size in annualised units. Defaults to 1/252 (one trading day). For sub-daily granularity, divide further — e.g. 1/(252*6.5*3600) for one-second ticks during US equity hours.

  • seed (int | None) – RNG seed for reproducibility.

  • geometric (bool) – If True, exponentiate the OU process to produce geometric (lognormal) prices: exp(OU). Parameters mu and x0 are interpreted in log-space (standard convention, Schwartz 1997). Output column is "price" instead of "value".

Returns:

Columns step (int) and value (float), or "price" when geometric=True.

Return type:

DataFrame

Notes

The Euler-Maruyama recurrence is:

x[i] = x[i-1] + θ(μ - x[i-1])dt + σ√dt·z[i]
     = α·x[i-1] + β + σ√dt·z[i]

where α = 1 - θ·dt and β = θ·μ·dt. This is an AR(1) process whose closed-form unrolling is:

x[i] = αⁱ·x₀ + Σⱼ₌₁ⁱ α^(i-j)·(β + σ√dt·z[j])

Factoring out αⁱ gives:

x[i] = αⁱ · (x₀ + Σⱼ₌₁ⁱ (β + σ√dt·z[j]) / αʲ)

The inner sum is a prefix sum of innovation[j] / αʲ, which is computed with polars.Expr.cum_sum() — no Python loop required.

Stability. The scheme requires θ·dt < 1 (i.e. α > 0). When this is violated the recurrence oscillates around mu instead of reverting smoothly. For large theta, reduce dt accordingly.

Fractional Brownian Motion (fBm)

Generated via the Davies-Harte circulant embedding method using FFT (O(n log n)). The Hurst exponent H controls path behavior:

  • H = 0.5 — standard random walk (fast path, no FFT)

  • H > 0.5 — trending (persistent) paths

  • H < 0.5 — mean-reverting (anti-persistent) paths

mktlib.data.fractional_random_walk(n, hurst=0.5, base_price=100.0, step_size=1.0, seed=None, geometric=False)[source]

Discrete-time fractional random walk using Davies-Harte circulant embedding.

The Hurst exponent H controls the autocorrelation of increments:

  • H = 0.5 — independent increments (standard random walk).

  • H > 0.5 — positively correlated increments (trending / persistent).

  • H < 0.5 — negatively correlated increments (mean-reverting / anti-persistent).

Parameters:
  • n (int) – Number of time steps to generate.

  • hurst (float) – Hurst exponent, must be in (0, 1).

  • base_price (float) – Initial price \(S_0\).

  • step_size (float) – Scale factor applied to each increment.

  • seed (int | None) – RNG seed for reproducibility.

  • geometric (bool) – If True, exponentiate the cumulative sum to produce geometric (lognormal) prices: base_price * exp(cumsum(increments)). Prices are always positive. Default False preserves the additive construction.

Returns:

Columns step (int) and price (float).

Return type:

DataFrame

Notes

H = 0.5 fast path. Increments are i.i.d. normal — no FFT is needed, just sample_normal → scale → cum_sum.

H ≠ 0.5 (Davies-Harte spectral method). Fractional Brownian motion increments have autocovariance:

γ(k) = ½(|k−1|^{2H} − 2|k|^{2H} + |k+1|^{2H})

The algorithm embeds this Toeplitz sequence in a \(2n\)-length circulant vector, then:

  1. Compute eigenvalues via FFT: \(\Lambda = \text{FFT}(\gamma_{\text{circ}})\)

  2. Draw two independent standard normal vectors \(Z_{re}, Z_{im}\)

  3. Form the product \(\sqrt{\Lambda} \cdot (Z_{re} + i \, Z_{im})\)

  4. Apply IFFT and take the first n real values as increments

This is O(n log n) via FFT, compared to O(n²) for Cholesky factorisation of the full covariance matrix.

FFT and normal sampling are delegated to the polars-rfft and polars-sdist Rust plugins respectively — no NumPy dependency.

OHLCV Aggregation

mktlib.data.ticks_to_ohlcv(ticks, bar_size, *, column='price', volume=True, seed=None)[source]

Aggregate a tick-level numeric series into OHLCV bars.

Parameters:
  • ticks (DataFrame) – DataFrame with a column named column (output of any generator).

  • bar_size (int) – Number of ticks per bar. The last incomplete bar is dropped.

  • column (str) – Column to aggregate. Defaults to "price" (GBM / fRW output). Pass "value" for Ornstein-Uhlenbeck output.

  • volume (bool) – Generate synthetic lognormal volume. False → no volume column.

  • seed (int | None) – RNG seed for volume generation (ignored when volume=False).

Returns:

Columns: bar, open, high, low, close [, volume].

Return type:

DataFrame

Usage:

from mktlib.data import geometric_brownian_motion, ticks_to_ohlcv

# Generate tick-level data at 1-second resolution, then aggregate to 1-minute bars
dt_1s = 1 / (252 * 6.5 * 3600)  # 1 second in annualised units
ticks = geometric_brownian_motion(n=23_400, volatility=1.0, dt=dt_1s, seed=42)
ohlcv = ticks_to_ohlcv(ticks, bar_size=60, seed=43)  # 60 seconds → 1-minute bar

See Advanced Usage: Grid Search Optimization for a full walkthrough generating multi-year 1-minute OHLCV data.

Monte Carlo

class mktlib.data.Process(*values)[source]

Built-in stochastic processes for monte_carlo().

Each variant maps to a vectorized implementation that draws all normal samples in a single polars-sdist call and partitions them across simulations with .over("simulation") — no Python loop.

Pass the enum member as the first argument to monte_carlo(), along with any keyword arguments accepted by the underlying single-path generator (e.g. n, drift, volatility for GBM).

GBM = 'gbm'

Geometric Brownian motion — lognormal price paths.

OU = 'ou'

Ornstein-Uhlenbeck — mean-reverting process.

FRW = 'frw'

Fractional random walk — fBm increments via Davies-Harte.

mktlib.data.monte_carlo(process, n_simulations=1000, *, seed=None, innovations=Innovations.GAUSSIAN, df=None, residuals=None, independent_streams=True, **process_kwargs)[source]
Overloads:
  • process (Process), n_simulations (int), seed (int | None), innovations (Innovations | Callable[[int, int | None], pl.Series]), df (float | None), residuals (pl.Series | None), independent_streams (bool), process_kwargs (Any) → pl.DataFrame

  • process (Callable[…, pl.DataFrame]), n_simulations (int), seed (int | None), innovations (Innovations | Callable[[int, int | None], pl.Series]), df (float | None), residuals (pl.Series | None), independent_streams (bool), process_kwargs (Any) → pl.DataFrame

Parameters:
Return type:

DataFrame

Run multiple simulations of a stochastic process.

Generates n_simulations independent paths from the same process and returns them stacked in a single DataFrame with a simulation index column prepended.

Parameters:
  • process (Process | Callable[..., DataFrame]) – Which process to simulate. A Process enum member selects the vectorized fast path; a callable with signature (*, seed: int, **kwargs) -> pl.DataFrame uses a serial loop fallback.

  • n_simulations (int) – Number of independent paths to generate.

  • seed (int | None) – Parent RNG seed. Deterministic child seeds are derived so that each simulation is reproducible and independent.

  • innovations (Innovations | Callable[[int, int | None], Series]) – Noise distribution feeding the SDE. Currently honoured by Process.GBM; passing a non-Gaussian member with Process.OU, Process.FRW, or a callable process raises NotImplementedError. All variants emit unit-variance i.i.d. samples — the host process’s volatility parameter remains the controlling scale. See Innovations.

  • df (float | None) – Required when innovations=Innovations.STUDENT_T; degrees of freedom (must be > 2 for finite variance).

  • residuals (Series | None) – Required when innovations=Innovations.BOOTSTRAP; a unit-variance pl.Series of empirical residuals to resample with replacement.

  • independent_streams (bool) – Default True — preserves the per-simulation child-seed contract. Set to False to draw all noise in one batched sampler call (statistically identical i.i.d. samples, 5×–200× faster at typical scales). Trade-off: the seed column reports a single parent-derived seed instead of one per simulation. GBM only — passing False with OU / FRW raises NotImplementedError.

  • **process_kwargs (Any) – Forwarded to the underlying generator. For Process members these match the single-path function signatures (e.g. n, drift, volatility for GBM).

Returns:

Columns simulation (int), seed (int), step (int), and a value column whose name depends on the process — price for GBM / FRW, value for OU. The seed column contains the child seed used for that simulation — constant within each simulation group.

Return type:

pl.DataFrame

Notes

All built-in processes are discretizations of the general SDE:

dX(t) = a(X, t) dt + b(X, t) dW(t)

where \(dW(t) = Z \sqrt{dt}\), \(Z \sim N(0,1)\). The drift and diffusion coefficients for each Process variant are:

Enum

Drift a(X, t)

Diffusion b(X, t)

GBM

\(\mu X\)

\(\sigma X\)

OU

\(\theta(\mu - X)\)

\(\sigma\)

FRW

0

step_size

Vectorized path (Process enum). A single sample_normal(n_simulations * n) call draws all noise upfront. Per-simulation recurrences are computed via Polars .over("simulation") expressions, keeping the entire operation in Rust with no Python loop. For FRW with \(H \neq 0.5\), the sqrt-eigenvalues of the circulant covariance are computed once and tiled across simulations.

Serial fallback (callable). The callable is invoked once per simulation with a deterministic child seed derived from seed. Results are concatenated with polars.concat(). This path is slower but supports arbitrary user-defined generators.

Seed derivation. Child seeds are produced by a random.Random(seed) instance, so results are fully reproducible for a given seed / n_simulations pair.

Vectorized enum path (recommended) — bulk-samples all normals upfront and partitions via .over("simulation"), avoiding per-simulation Python loops:

from mktlib.data import Process, monte_carlo

# 1000 GBM simulations, 252 steps each
gbm_sims = monte_carlo(Process.GBM, n_simulations=1000, n=252, seed=42)
# → DataFrame[simulation, seed, step, price]

# 500 Ornstein–Uhlenbeck simulations
ou_sims = monte_carlo(Process.OU, n_simulations=500, n=252, theta=0.7, mu=100.0, seed=1)
# → DataFrame[simulation, seed, step, value]

# 200 fractional random walk simulations (trending, H > 0.5)
frw_sims = monte_carlo(Process.FRW, n_simulations=200, n=252, hurst=0.7, seed=2)
# → DataFrame[simulation, seed, step, price]

Callable fallback — pass any function with signature (*, seed: int, **kwargs) -> pl.DataFrame. Runs a serial loop with deterministic child seeds:

from mktlib.data import geometric_brownian_motion, monte_carlo

# Equivalent to Process.GBM, but uses the serial loop path
sims = monte_carlo(geometric_brownian_motion, n_simulations=1000, n=252, seed=42)
# → DataFrame[simulation, seed, step, price]

Note

The Process enum path is significantly faster for large simulation counts because it draws all random samples in a single polars-sdist call and computes paths with .over("simulation") expressions. The callable path loops in Python and concatenates individual DataFrames.

Pluggable Innovations

The default GBM noise source is standard-normal, but monte_carlo() accepts arbitrary unit-variance innovations via the innovations argument. This is what makes the simulation engine useful for forward-looking risk numbers under non-Gaussian assumptions — see Metrics for the VaR / CVaR estimators that consume these.

class mktlib.data.Innovations(*values)[source]

Noise distribution for stochastic-process simulations.

All variants produce unit-variance i.i.d. samples — the volatility (or analogous) parameter of the host process remains the controlling scale. Switching innovations changes only the shape of the tails, not the second moment of bar-level log-returns.

Currently honoured by Process.GBM; using a non-Gaussian member with Process.OU or Process.FRW raises NotImplementedError (FRW’s Davies–Harte construction is only meaningful under Gaussian noise; OU’s direct-sigma parameterisation tangles with the unit-variance contract).

GAUSSIAN = 'gaussian'

Standard normal \(Z \sim N(0, 1)\) via polars_sdist.sample_normal.

STUDENT_T = 'students_t'

Student-t with degrees-of-freedom df (>2), rescaled to unit variance.

BOOTSTRAP = 'bootstrap'

Resample with replacement from a caller-supplied unit-variance residual series.

Unit-variance contract (load-bearing): every innovation must produce i.i.d. samples with unit variance. The host process’s volatility parameter remains the controlling scale; switching innovations changes the tail shape, not the second moment. Concretely:

  • Innovations.GAUSSIAN\(Z \sim N(0, 1)\) via polars-sdist’s sample_normal. Already unit-variance.

  • Innovations.STUDENT_T\(T_\nu\) rescaled to unit variance by dividing by \(\sqrt{\nu/(\nu-2)}\). Requires df=ν with \(\nu > 2\) (strictly — the divisor explodes at \(\nu = 2\), and Cauchy at \(\nu = 1\) has undefined variance).

  • Innovations.BOOTSTRAP — resamples a caller-supplied residuals: pl.Series (which the caller is responsible for pre-standardizing to unit variance) with replacement. Distribution-free.

A callable noise source Callable[[int, int | None], pl.Series] is also accepted as an escape hatch for skew-normal, mixture-of-normals, or any other custom sampler — the function receives (n, seed) and must return a unit-variance pl.Series.

Note

In v0.11.0 innovations is honoured only by Process.GBM. Passing a non-Gaussian member with Process.OU, Process.FRW, or a callable process raises NotImplementedError (FRW’s Davies–Harte construction is only meaningful under Gaussian noise; OU’s direct-σ parameterization tangles with the unit-variance contract).

from mktlib.data import Innovations, Process, monte_carlo

# Heavy-tailed Student-t (df=5)
sims = monte_carlo(
    Process.GBM, n_simulations=10_000, n=22, seed=42,
    drift=0.05, volatility=0.2, dt=1/252,
    innovations=Innovations.STUDENT_T, df=5,
)

# Empirical-residual bootstrap (distribution-free)
import polars as pl
residuals = (some_log_returns - some_log_returns.mean()) / some_log_returns.std()
sims = monte_carlo(
    Process.GBM, n_simulations=10_000, n=22, seed=42,
    drift=0.05, volatility=0.2, dt=1/252,
    innovations=Innovations.BOOTSTRAP, residuals=residuals,
)

Performance: independent_streams

By default monte_carlo() derives one child RNG per simulation and concatenates their outputs (independent_streams=True). This costs ~30 ms of per-call construction overhead at 10 000 simulations, which typically dwarfs the actual sample-generation time. Setting independent_streams=False draws all noise in one batched sampler call and reshapes — statistically identical (i.i.d. samples by construction; backed by Kolmogorov–Smirnov integration tests across every Process × Innovations combination), but 5–7× faster for Gaussian / Student-t and ~60× faster for Bootstrap at typical scales.

The trade-off is that the seed column reports a single parent-derived seed shared across all simulations rather than one per stream — fine for any consumer that doesn’t introspect per-simulation seeds. The metrics layer (Metrics) defaults to independent_streams=False internally; reports and live trading should too unless they specifically need per-stream replayability.

Note

The two modes are statistically equivalent under i.i.d. innovations: the underlying RNG’s i.i.d. property is what TestU01 BigCrush validates over long sequences, which the single-batch path uses directly. The per-stream approach instead relies on cycle-spacing independence across child seeds — a probabilistic argument the single-batch path doesn’t need.