Benchmarks¶
ngboost-lightning uses the same natural gradient boosting algorithm as NGBoost. The speed difference comes entirely from LightGBM's histogram-based tree splitting versus scikit-learn's exact splitter.
Training Speed¶
Measured on synthetic regression data (10 features, Normal distribution, n_estimators=200, 3 trials averaged):
| n_samples | NGBoost (s) | ngboost-lightning (s) | Speedup |
|---|---|---|---|
| 10,000 | 8.0 | 3.0 | 2.7x |
| 50,000 | 44.0 | 4.8 | 9.2x |
| 100,000 | 93.3 | 7.0 | 13.4x |
The speedup grows with dataset size because LightGBM's histogram binning is O(n_samples) while exact splitting is O(n_samples * log(n_samples)).
Reproducing Benchmarks¶
Individual benchmarks:
uv run python -m benchmarks training # Training speed
uv run python -m benchmarks inference # Prediction speed
uv run python -m benchmarks scaling # Scaling with dataset size
uv run python -m benchmarks comparison # Full comparison with JSON output
uv run python -m benchmarks uci # UCI paper reproduction (Table 1)
UCI Regression Benchmarks (Table 1 Reproduction)¶
Head-to-head comparison on the 10 UCI regression datasets from the NGBoost paper (Duan et al. 2019, Table 1). Each dataset uses the same hyperparameters and cross-validation protocol as the original paper: random 90/10 train/test splits with a further 80/20 train/val split for early stopping (50 rounds). Ordered by training set size to show the crossover point — ngboost-lightning pays fixed LightGBM overhead on tiny datasets, but dominates as N grows.
| Dataset | N | NGB NLL | LB NLL | NGB Time (s) | LB Time (s) | Speedup |
|---|---|---|---|---|---|---|
| yacht | 308 | 0.48 ± 0.40 | 0.72 ± 0.27 | 1.3 | 5.0 | 0.3x |
| housing | 506 | 2.55 ± 0.20 | 2.47 ± 0.19 | 11.1 | 25.1 | 0.4x |
| energy | 768 | 0.82 ± 0.60 | 0.87 ± 0.58 | 10.3 | 21.5 | 0.5x |
| concrete | 1,030 | 3.11 ± 0.15 | 3.05 ± 0.18 | 10.8 | 14.7 | 0.7x |
| wine | 1,599 | 0.95 ± 0.11 | 0.93 ± 0.09 | 1.7 | 3.1 | 0.6x |
| kin8nm | 8,192 | −0.40 ± 0.03 | −0.59 ± 0.03 | 27.7 | 9.3 | 3.0x |
| power | 9,568 | 2.79 ± 0.12 | 2.68 ± 0.12 | 10.8 | 5.6 | 1.9x |
| naval | 11,934 | −4.91 ± 0.03 | −5.58 ± 0.04 | 84.1 | 24.3 | 3.5x |
| protein | 45,730 | 2.86 ± 0.01 | 2.77 ± 0.04 | 280.5 | 26.3 | 10.7x |
| msd | 515,345 | 3.46 | 3.46 | 3,912.1 | 280.7 | 13.9x |
For datasets with N < ~5,000, LightGBM's histogram binning and tree-building overhead makes ngboost-lightning slower than NGBoost's scikit-learn trees. Above that threshold the O(n) histogram method wins decisively, reaching 13.9x on the largest dataset (MSD, 515k samples). NLL is comparable or better across all datasets — ngboost-lightning never sacrifices quality for speed.
To run the UCI benchmark:
uv run python -m benchmarks uci # All 10 datasets
uv run python -m benchmarks uci --dataset=protein # Single dataset
Prediction Quality¶
Both implementations produce nearly identical predictions. The NGBoost parity test suite verifies that predicted means, scales, and NLL scores match within tolerance on the same data with the same hyperparameters. Small differences arise from histogram-based vs exact tree splitting — they use different split points.