Data
Synthetic data generators for testing, simulation, and Monte Carlo analysis.
Requires the data extra: pip install mktlib[data], which installs
polars-sdist and
polars-rfft (pure Rust Polars plugins).
All functions return Polars DataFrames with seeded RNG for reproducibility.
Stochastic Differential Equations
Geometric Brownian Motion (GBM)
Log-normal price paths: \(dS = \mu S \, dt + \sigma S \, dW\)
Suitable for equity price simulation. The drift parameter controls the expected
return and volatility controls dispersion.
- mktlib.data.geometric_brownian_motion(n, base_price=100.0, drift=0.0, volatility=0.01, dt=0.003968253968253968, seed=None)[source]
Geometric Brownian motion price path.
Simulates the SDE \(dS = \mu S \, dt + \sigma S \, dW\) using the exact log-space solution derived via Itô’s lemma.
- Parameters:
n (
int) – Number of time steps to generate.base_price (
float) – Initial price \(S_0\).drift (
float) – Annualised expected return \(\mu\).volatility (
float) – Annualised volatility \(\sigma\).dt (
float) – Time step size in annualised units. Defaults to1/252(one trading day). For sub-daily granularity, divide further — e.g.1/(252*6.5*3600)for one-second ticks during US equity hours.
- Returns:
Columns
step(int) andprice(float).- Return type:
DataFrame
Notes
Why log-space? The SDE is multiplicative in S, so direct Euler-Maruyama discretisation introduces bias. Applying Itô’s lemma to \(X = \ln S\) yields an additive SDE:
dX = (μ − ½σ²) dt + σ dW
whose discrete-time exact solution is:
ln S[i] = ln S[i−1] + (μ − ½σ²)·dt + σ·√dt·Z[i], Z ~ N(0,1)
The implementation draws all Z values in one
sample_normalcall, computes log-increments, and usescum_sum+expto recover prices — no Python loop required.Itô correction (−½σ²). Because \(\exp\) is convex, Jensen’s inequality gives \(E[\exp(X)] > \exp(E[X])\). The −½σ² drift adjustment compensates for this so that the expected price path satisfies \(E[S(t)] = S_0 \exp(\mu t)\).
Distributional properties. Log-returns \(\ln(S_{i}/S_{i-1})\) are i.i.d. normal; prices \(S_i\) are lognormally distributed.
Ornstein–Uhlenbeck (OU)
Mean-reverting process: \(dx = \theta(\mu - x) \, dt + \sigma \, dW\)
Useful for modeling interest rates, volatility, or pairs-trading spreads where
the process reverts to a long-run mean mu at speed theta.
- mktlib.data.ornstein_uhlenbeck(n, theta=0.7, mu=100.0, sigma=1.0, x0=None, dt=0.003968253968253968, seed=None, geometric=False)[source]
Discrete Ornstein-Uhlenbeck process via vectorized Euler-Maruyama.
Simulates the mean-reverting SDE \(dx = \theta(\mu - x) \, dt + \sigma \, dW\), discretised as an AR(1) recurrence and unrolled into a prefix sum.
- Parameters:
n (
int) – Number of time steps to generate.theta (
float) – Mean-reversion speed. Larger values pull x back to mu faster.mu (
float) – Long-run mean (equilibrium level).sigma (
float) – Volatility of the noise term.x0 (
float|None) – Starting value. Defaults to mu (start at equilibrium).dt (
float) – Time step size in annualised units. Defaults to1/252(one trading day). For sub-daily granularity, divide further — e.g.1/(252*6.5*3600)for one-second ticks during US equity hours.geometric (
bool) – IfTrue, exponentiate the OU process to produce geometric (lognormal) prices:exp(OU). Parameters mu and x0 are interpreted in log-space (standard convention, Schwartz 1997). Output column is"price"instead of"value".
- Returns:
Columns
step(int) andvalue(float), or"price"whengeometric=True.- Return type:
DataFrame
Notes
The Euler-Maruyama recurrence is:
x[i] = x[i-1] + θ(μ - x[i-1])dt + σ√dt·z[i] = α·x[i-1] + β + σ√dt·z[i]where
α = 1 - θ·dtandβ = θ·μ·dt. This is an AR(1) process whose closed-form unrolling is:x[i] = αⁱ·x₀ + Σⱼ₌₁ⁱ α^(i-j)·(β + σ√dt·z[j])
Factoring out
αⁱgives:x[i] = αⁱ · (x₀ + Σⱼ₌₁ⁱ (β + σ√dt·z[j]) / αʲ)
The inner sum is a prefix sum of
innovation[j] / αʲ, which is computed withpolars.Expr.cum_sum()— no Python loop required.Stability. The scheme requires
θ·dt < 1(i.e.α > 0). When this is violated the recurrence oscillates around mu instead of reverting smoothly. For large theta, reduce dt accordingly.
Fractional Brownian Motion (fBm)
Generated via the Davies-Harte circulant embedding method using FFT (O(n log n)). The Hurst exponent H controls path behavior:
H = 0.5 — standard random walk (fast path, no FFT)
H > 0.5 — trending (persistent) paths
H < 0.5 — mean-reverting (anti-persistent) paths
- mktlib.data.fractional_random_walk(n, hurst=0.5, base_price=100.0, step_size=1.0, seed=None, geometric=False)[source]
Discrete-time fractional random walk using Davies-Harte circulant embedding.
The Hurst exponent H controls the autocorrelation of increments:
H = 0.5 — independent increments (standard random walk).
H > 0.5 — positively correlated increments (trending / persistent).
H < 0.5 — negatively correlated increments (mean-reverting / anti-persistent).
- Parameters:
n (
int) – Number of time steps to generate.hurst (
float) – Hurst exponent, must be in(0, 1).base_price (
float) – Initial price \(S_0\).step_size (
float) – Scale factor applied to each increment.geometric (
bool) – IfTrue, exponentiate the cumulative sum to produce geometric (lognormal) prices:base_price * exp(cumsum(increments)). Prices are always positive. DefaultFalsepreserves the additive construction.
- Returns:
Columns
step(int) andprice(float).- Return type:
DataFrame
Notes
H = 0.5 fast path. Increments are i.i.d. normal — no FFT is needed, just
sample_normal→ scale →cum_sum.H ≠ 0.5 (Davies-Harte spectral method). Fractional Brownian motion increments have autocovariance:
γ(k) = ½(|k−1|^{2H} − 2|k|^{2H} + |k+1|^{2H})The algorithm embeds this Toeplitz sequence in a \(2n\)-length circulant vector, then:
Compute eigenvalues via FFT: \(\Lambda = \text{FFT}(\gamma_{\text{circ}})\)
Draw two independent standard normal vectors \(Z_{re}, Z_{im}\)
Form the product \(\sqrt{\Lambda} \cdot (Z_{re} + i \, Z_{im})\)
Apply IFFT and take the first n real values as increments
This is O(n log n) via FFT, compared to O(n²) for Cholesky factorisation of the full covariance matrix.
FFT and normal sampling are delegated to the
polars-rfftandpolars-sdistRust plugins respectively — no NumPy dependency.
OHLCV Aggregation
- mktlib.data.ticks_to_ohlcv(ticks, bar_size, *, column='price', volume=True, seed=None)[source]
Aggregate a tick-level numeric series into OHLCV bars.
- Parameters:
ticks (
DataFrame) – DataFrame with a column named column (output of any generator).bar_size (
int) – Number of ticks per bar. The last incomplete bar is dropped.column (
str) – Column to aggregate. Defaults to"price"(GBM / fRW output). Pass"value"for Ornstein-Uhlenbeck output.volume (
bool) – Generate synthetic lognormal volume. False → no volume column.seed (
int|None) – RNG seed for volume generation (ignored when volume=False).
- Returns:
Columns:
bar,open,high,low,close[,volume].- Return type:
DataFrame
Usage:
from mktlib.data import geometric_brownian_motion, ticks_to_ohlcv
# Generate tick-level data at 1-second resolution, then aggregate to 1-minute bars
dt_1s = 1 / (252 * 6.5 * 3600) # 1 second in annualised units
ticks = geometric_brownian_motion(n=23_400, volatility=1.0, dt=dt_1s, seed=42)
ohlcv = ticks_to_ohlcv(ticks, bar_size=60, seed=43) # 60 seconds → 1-minute bar
See Advanced Usage: Grid Search Optimization for a full walkthrough generating multi-year 1-minute OHLCV data.
Monte Carlo
- class mktlib.data.Process(*values)[source]
Built-in stochastic processes for
monte_carlo().Each variant maps to a vectorized implementation that draws all normal samples in a single
polars-sdistcall and partitions them across simulations with.over("simulation")— no Python loop.Pass the enum member as the first argument to
monte_carlo(), along with any keyword arguments accepted by the underlying single-path generator (e.g.n,drift,volatilityfor GBM).- GBM = 'gbm'
Geometric Brownian motion — lognormal price paths.
- OU = 'ou'
Ornstein-Uhlenbeck — mean-reverting process.
- FRW = 'frw'
Fractional random walk — fBm increments via Davies-Harte.
- mktlib.data.monte_carlo(process, n_simulations=1000, *, seed=None, innovations=Innovations.GAUSSIAN, df=None, residuals=None, independent_streams=True, **process_kwargs)[source]
- Overloads:
process (Process), n_simulations (int), seed (int | None), innovations (Innovations | Callable[[int, int | None], pl.Series]), df (float | None), residuals (pl.Series | None), independent_streams (bool), process_kwargs (Any) → pl.DataFrame
process (Callable[…, pl.DataFrame]), n_simulations (int), seed (int | None), innovations (Innovations | Callable[[int, int | None], pl.Series]), df (float | None), residuals (pl.Series | None), independent_streams (bool), process_kwargs (Any) → pl.DataFrame
- Parameters:
- Return type:
DataFrame
Run multiple simulations of a stochastic process.
Generates n_simulations independent paths from the same process and returns them stacked in a single DataFrame with a
simulationindex column prepended.- Parameters:
process (
Process|Callable[...,DataFrame]) – Which process to simulate. AProcessenum member selects the vectorized fast path; a callable with signature(*, seed: int, **kwargs) -> pl.DataFrameuses a serial loop fallback.n_simulations (
int) – Number of independent paths to generate.seed (
int|None) – Parent RNG seed. Deterministic child seeds are derived so that each simulation is reproducible and independent.innovations (
Innovations|Callable[[int,int|None],Series]) – Noise distribution feeding the SDE. Currently honoured byProcess.GBM; passing a non-Gaussian member withProcess.OU,Process.FRW, or a callable process raisesNotImplementedError. All variants emit unit-variance i.i.d. samples — the host process’svolatilityparameter remains the controlling scale. SeeInnovations.df (
float|None) – Required wheninnovations=Innovations.STUDENT_T; degrees of freedom (must be > 2 for finite variance).residuals (
Series|None) – Required wheninnovations=Innovations.BOOTSTRAP; a unit-variancepl.Seriesof empirical residuals to resample with replacement.independent_streams (
bool) – DefaultTrue— preserves the per-simulation child-seed contract. Set toFalseto draw all noise in one batched sampler call (statistically identical i.i.d. samples, 5×–200× faster at typical scales). Trade-off: the seed column reports a single parent-derived seed instead of one per simulation. GBM only — passingFalsewith OU / FRW raisesNotImplementedError.**process_kwargs (
Any) – Forwarded to the underlying generator. ForProcessmembers these match the single-path function signatures (e.g.n,drift,volatilityfor GBM).
- Returns:
Columns
simulation(int),seed(int),step(int), and a value column whose name depends on the process —pricefor GBM / FRW,valuefor OU. Theseedcolumn contains the child seed used for that simulation — constant within each simulation group.- Return type:
pl.DataFrame
Notes
All built-in processes are discretizations of the general SDE:
dX(t) = a(X, t) dt + b(X, t) dW(t)
where \(dW(t) = Z \sqrt{dt}\), \(Z \sim N(0,1)\). The drift and diffusion coefficients for each
Processvariant are:Enum
Drift a(X, t)
Diffusion b(X, t)
GBM
\(\mu X\)
\(\sigma X\)
OU
\(\theta(\mu - X)\)
\(\sigma\)
FRW
0
step_size
Vectorized path (Process enum). A single
sample_normal(n_simulations * n)call draws all noise upfront. Per-simulation recurrences are computed via Polars.over("simulation")expressions, keeping the entire operation in Rust with no Python loop. For FRW with \(H \neq 0.5\), the sqrt-eigenvalues of the circulant covariance are computed once and tiled across simulations.Serial fallback (callable). The callable is invoked once per simulation with a deterministic child seed derived from seed. Results are concatenated with
polars.concat(). This path is slower but supports arbitrary user-defined generators.Seed derivation. Child seeds are produced by a
random.Random(seed)instance, so results are fully reproducible for a given seed / n_simulations pair.
Vectorized enum path (recommended) — bulk-samples all normals upfront and
partitions via .over("simulation"), avoiding per-simulation Python loops:
from mktlib.data import Process, monte_carlo
# 1000 GBM simulations, 252 steps each
gbm_sims = monte_carlo(Process.GBM, n_simulations=1000, n=252, seed=42)
# → DataFrame[simulation, seed, step, price]
# 500 Ornstein–Uhlenbeck simulations
ou_sims = monte_carlo(Process.OU, n_simulations=500, n=252, theta=0.7, mu=100.0, seed=1)
# → DataFrame[simulation, seed, step, value]
# 200 fractional random walk simulations (trending, H > 0.5)
frw_sims = monte_carlo(Process.FRW, n_simulations=200, n=252, hurst=0.7, seed=2)
# → DataFrame[simulation, seed, step, price]
Callable fallback — pass any function with signature
(*, seed: int, **kwargs) -> pl.DataFrame. Runs a serial loop with
deterministic child seeds:
from mktlib.data import geometric_brownian_motion, monte_carlo
# Equivalent to Process.GBM, but uses the serial loop path
sims = monte_carlo(geometric_brownian_motion, n_simulations=1000, n=252, seed=42)
# → DataFrame[simulation, seed, step, price]
Note
The Process enum path is significantly faster for large simulation counts
because it draws all random samples in a single polars-sdist call and
computes paths with .over("simulation") expressions. The callable path
loops in Python and concatenates individual DataFrames.
Pluggable Innovations
The default GBM noise source is standard-normal, but monte_carlo()
accepts arbitrary unit-variance innovations via the innovations
argument. This is what makes the simulation engine useful for
forward-looking risk numbers under non-Gaussian assumptions — see
Metrics for the VaR / CVaR estimators that consume these.
- class mktlib.data.Innovations(*values)[source]
Noise distribution for stochastic-process simulations.
All variants produce unit-variance i.i.d. samples — the
volatility(or analogous) parameter of the host process remains the controlling scale. Switching innovations changes only the shape of the tails, not the second moment of bar-level log-returns.Currently honoured by
Process.GBM; using a non-Gaussian member withProcess.OUorProcess.FRWraisesNotImplementedError(FRW’s Davies–Harte construction is only meaningful under Gaussian noise; OU’s direct-sigmaparameterisation tangles with the unit-variance contract).- GAUSSIAN = 'gaussian'
Standard normal \(Z \sim N(0, 1)\) via
polars_sdist.sample_normal.
- STUDENT_T = 'students_t'
Student-t with degrees-of-freedom df (>2), rescaled to unit variance.
- BOOTSTRAP = 'bootstrap'
Resample with replacement from a caller-supplied unit-variance residual series.
Unit-variance contract (load-bearing): every innovation must produce
i.i.d. samples with unit variance. The host process’s volatility
parameter remains the controlling scale; switching innovations changes
the tail shape, not the second moment. Concretely:
Innovations.GAUSSIAN— \(Z \sim N(0, 1)\) viapolars-sdist’ssample_normal. Already unit-variance.Innovations.STUDENT_T— \(T_\nu\) rescaled to unit variance by dividing by \(\sqrt{\nu/(\nu-2)}\). Requiresdf=νwith \(\nu > 2\) (strictly — the divisor explodes at \(\nu = 2\), and Cauchy at \(\nu = 1\) has undefined variance).Innovations.BOOTSTRAP— resamples a caller-suppliedresiduals: pl.Series(which the caller is responsible for pre-standardizing to unit variance) with replacement. Distribution-free.
A callable noise source Callable[[int, int | None], pl.Series] is
also accepted as an escape hatch for skew-normal, mixture-of-normals,
or any other custom sampler — the function receives (n, seed) and
must return a unit-variance pl.Series.
Note
In v0.11.0 innovations is honoured only by Process.GBM.
Passing a non-Gaussian member with Process.OU,
Process.FRW, or a callable process raises
NotImplementedError (FRW’s Davies–Harte construction is
only meaningful under Gaussian noise; OU’s direct-σ parameterization
tangles with the unit-variance contract).
from mktlib.data import Innovations, Process, monte_carlo
# Heavy-tailed Student-t (df=5)
sims = monte_carlo(
Process.GBM, n_simulations=10_000, n=22, seed=42,
drift=0.05, volatility=0.2, dt=1/252,
innovations=Innovations.STUDENT_T, df=5,
)
# Empirical-residual bootstrap (distribution-free)
import polars as pl
residuals = (some_log_returns - some_log_returns.mean()) / some_log_returns.std()
sims = monte_carlo(
Process.GBM, n_simulations=10_000, n=22, seed=42,
drift=0.05, volatility=0.2, dt=1/252,
innovations=Innovations.BOOTSTRAP, residuals=residuals,
)
Performance: independent_streams
By default monte_carlo() derives one child RNG per simulation and
concatenates their outputs (independent_streams=True). This costs
~30 ms of per-call construction overhead at 10 000 simulations, which
typically dwarfs the actual sample-generation time. Setting
independent_streams=False draws all noise in one batched sampler
call and reshapes — statistically identical (i.i.d. samples by
construction; backed by Kolmogorov–Smirnov integration tests across
every Process × Innovations combination), but 5–7× faster for
Gaussian / Student-t and ~60× faster for Bootstrap at typical scales.
The trade-off is that the seed column reports a single
parent-derived seed shared across all simulations rather than one per
stream — fine for any consumer that doesn’t introspect per-simulation
seeds. The metrics layer (Metrics) defaults to
independent_streams=False internally; reports and live trading
should too unless they specifically need per-stream replayability.
Note
The two modes are statistically equivalent under i.i.d. innovations: the underlying RNG’s i.i.d. property is what TestU01 BigCrush validates over long sequences, which the single-batch path uses directly. The per-stream approach instead relies on cycle-spacing independence across child seeds — a probabilistic argument the single-batch path doesn’t need.