prediction-market-bot-postmortem

A post-mortem and the supporting evaluation framework for a Kalshi weather-market trading bot that lost money over its first two months of live trading, then was halted, audited, and retired.

1 commit · First commit Jun 22, 2026 · Last commit Jun 22, 2026 (46 minutes ago)

Python 63.1%, Markdown 36.9%

Files 10 entries

▸data/
- ◇sample_trades.psv
▸docs/
▸eval/
- ◇c4_eval.py
- ◇effective_exposure.py
- ◇empirical_analysis.py
- ◇test_effective_exposure.py
◇LICENSE
◇README.md

README.md

The bot itself (the live strategy, order routing, and Kalshi credentials) is not in this repository and won't be. What is here is the part that turned out to be genuinely useful: the framework that catches your own model bleeding money, and the writeup of the audit that caught it.

What you're looking at

docs/
  INVESTIGATION.md                     The decisive audit. Era-split P&L,
                                       payout-math derivation of the
                                       impossible win rate, the three
                                       cascading misdiagnoses that came
                                       before the right answer.
  hermes-v4-research-findings-and-fixes.md
                                       Earlier research notes: variable
                                       fees, ensemble blending, GFS run
                                       timing, optimism-tax-on-longshots.
                                       Mixed deployed/proposed.
  PIVOT_SPEC.md                        The shadow-mode pivot spec that
                                       did NOT get built - written to
                                       gate any restart of the bot
                                       behind a no-capital evaluation
                                       window with a pre-committed
                                       decision rule.
eval/
  c4_eval.py                Pre-committed evaluation against the gate:
                            pulls shadow predictions, backfills outcomes
                            from the Kalshi API, applies a hard liquidity
                            filter, scores Brier + EV after fees, emits
                            a PASS/FAIL verdict.
  empirical_analysis.py     Walk-forward: empirical bracket-hit model
                            vs Gaussian. Brier, win rate, EV, total P&L.
                            Stdlib only.
  effective_exposure.py     "Effective exposure" module - discounts
                            near-certain unsettled positions so capital
                            isn't blocked by quasi-decided bets.
  test_effective_exposure.py  Eight scenario tests against the discount
                              logic, including the safety case of a bet
                              that flips from winning to losing.

The headline lesson (from INVESTIGATION.md)

The bot traded one specific Kalshi market type: single-degree (~2°F) temperature brackets, betting NO. It lost $160 over the 97 trades before 2026-04-21. A clamp-on-overconfidence patch then landed, and the next 41 trades came in at roughly +$3 in total (essentially zero EV).

The payout math:

Quantity	Value
Actual bracket hit rate	62/138 = 44.9%
Average win	+$3.32
Average loss	-$6.46
Realized reward:risk	0.51
Break-even win rate at that ratio	66.2%
Bot's actual win rate	54.3%

You cannot make money betting NO on near-coin-flip events when the payout structure demands a 66% win rate. It is a market-selection problem, not a model problem. Single-degree brackets sit below NWS/ensemble forecast resolution and Kalshi prices them efficiently.
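The arithmetic behind the table can be checked in a few lines. This is a minimal sketch that uses only the figures quoted above; the function names are illustrative and are not taken from the repository. The break-even rate is computed from the rounded ratio (0.51), which is how the table's 66.2% is reached; the unrounded ratio gives about 66.1%.

```python
# Figures copied from the payout table above.
AVG_WIN = 3.32      # dollars gained on an average winning trade
AVG_LOSS = 6.46     # dollars lost on an average losing trade
WIN_RATE = 0.543    # the bot's realized win rate
N_TRADES = 138


def break_even_win_rate(reward_to_risk: float) -> float:
    """Win rate p where p * win == (1 - p) * loss, i.e. p = 1 / (1 + win/loss)."""
    return 1.0 / (1.0 + reward_to_risk)


def expected_value_per_trade(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Average dollars per trade at the given win rate and payout sizes."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


reward_to_risk = AVG_WIN / AVG_LOSS
ev = expected_value_per_trade(WIN_RATE, AVG_WIN, AVG_LOSS)

print(f"reward:risk          {reward_to_risk:.2f}")
print(f"break-even win rate  {break_even_win_rate(round(reward_to_risk, 2)):.1%}")
print(f"EV per trade         {ev:+.2f}")
print(f"EV over {N_TRADES} trades   {ev * N_TRADES:+.0f}")
```

At a 54.3% win rate the expected value works out to a little over a dollar lost per trade, which over 138 trades lands close to the realized net of roughly -$157 (-$160 before the patch, +$3 after).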

Why the audit took three passes

A bug had hidden the truth from three previous audits, including the first two passes of this one. The INSERT INTO trades statement omitted three diagnostic columns (raw_ensemble_probability, model_count, models_used), so every trade row showed model_count = 1 and raw_ensemble_probability = NULL. Anyone (human or AI) inspecting the table concluded "the 31-member ensemble pipeline must be dead." It wasn't. The ensemble worked the whole time. The columns were simply never written.
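The failure mode is easy to reproduce in isolation. The sketch below uses an in-memory SQLite table with a hypothetical schema: the three diagnostic column names come from the description above, but the other columns, the `DEFAULT 1` on `model_count` (an assumption that would explain why every row showed 1), and the helper `never_written_columns` are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; DEFAULT 1 on model_count is an assumption.
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,
        ensemble_probability REAL,
        raw_ensemble_probability REAL,
        model_count INTEGER DEFAULT 1,
        models_used TEXT
    )
    """
)

# The INSERT never names the three diagnostic columns, so SQLite fills in
# the column default (or NULL) no matter what the model actually computed.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability) VALUES (?, ?)",
    ("EXAMPLE-TICKER", 0.41),
)

row = conn.execute(
    "SELECT model_count, raw_ensemble_probability, models_used FROM trades"
).fetchone()
print(row)


def never_written_columns(conn, table):
    """Return the columns of `table` that are NULL in every row."""
    columns = [info[1] for info in conn.execute(f"PRAGMA table_info({table})")]
    suspicious = []
    for col in columns:
        # COUNT(col) counts only non-NULL values.
        non_null = conn.execute(f"SELECT COUNT({col}) FROM {table}").fetchone()[0]
        if non_null == 0:
            suspicious.append(col)
    return suspicious


print(never_written_columns(conn, "trades"))
```

A check like this flags the all-NULL columns but not `model_count`, because a constant default is a real value. Catching that case would need a second check for columns with a single distinct value.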

This bug never cost a cent in P&L. It did cost three audit cycles - one landed-and-deployed fix on a non-problem, and two false starts in the investigation itself.

The honest record of that oscillation is preserved in the writeup.

Why this is the useful part

The trading strategy was the point of the bot, but it had no edge and is not portable to anything. The evaluation framework, the gate-before-restart discipline, and the "eras + payout math" decomposition are portable. They work on any prediction market, any strategy, any bot.

If you are about to ship a bot, the cheapest thing you can do is build c4_eval.py first, in shadow mode, with the gate criteria written down before you look at the numbers. Then you build the strategy.
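As an illustration of what "gate criteria written down first" can look like, here is a minimal sketch of a frozen decision rule and a PASS/FAIL verdict. It is not the contents of c4_eval.py: the `Gate` class, its field names, and every threshold are placeholders chosen for the example, and the liquidity filter and Kalshi backfill are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Decision rule fixed before any results are seen. Values are placeholders."""
    min_trades: int = 50
    max_brier: float = 0.25          # 0.25 is what an always-0.5 forecast scores
    min_ev_per_trade: float = 0.0    # dollars per trade, after fees


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def verdict(gate, probs, outcomes, pnl_after_fees):
    """Return 'PASS' only if every pre-committed criterion is met."""
    if len(probs) < gate.min_trades:
        return "FAIL"  # too little evidence counts as a failure, not a pass
    brier = brier_score(probs, outcomes)
    ev = sum(pnl_after_fees) / len(pnl_after_fees)
    passed = brier <= gate.max_brier and ev > gate.min_ev_per_trade
    return "PASS" if passed else "FAIL"


# Made-up shadow predictions, outcomes, and per-trade P&L for illustration.
probs = [0.9, 0.2, 0.7, 0.4]
outcomes = [1, 0, 1, 0]
pnl = [1.0, 0.5, 1.2, -0.3]

print(verdict(Gate(min_trades=4), probs, outcomes, pnl))
print(verdict(Gate(), probs, outcomes, pnl))
```

Freezing the dataclass is a small nudge toward the discipline described above: the thresholds are set once, ideally committed to version control, and cannot be adjusted in place after the numbers come in.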

Reproducible: the real trade dataset is committed

The 138 settled trades from the bot's live run are checked in at data/sample_trades.psv. This is the actual data behind every number in INVESTIGATION.md. The columns:

id | ticker | market_title | nws_forecast | side | edge |
ensemble_probability | pnl | opened_at | cost | count | avg_price |
nws_probability | market_price | grok_probability | actual_outcome |
correct | market_type

The file contains only public Kalshi market tickers: no account IDs, no API tokens, no Kalshi internal identifiers, no PII. empirical_analysis.py reads it by default and reproduces the era-split walk-forward backtest:

python eval/empirical_analysis.py

Expected output includes the empirical-vs-Gaussian P(hit) table, Brier scores, and walk-forward P&L at four edge thresholds. The walk-forward P&L at min_edge=0.25 reproduces the headline -$104 loss. The bot's live result for the same band was -$94; the $10 drift is accounted for by data joins and the unclamped-Gaussian era.
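For readers who want to sanity-check the era split without running the full backtest, the following sketch sums `pnl` on either side of the 2026-04-21 patch date. It is not how empirical_analysis.py is implemented. It assumes the file has a header row, uses `|` as the delimiter, and stores `opened_at` with a leading ISO date (YYYY-MM-DD); none of that is confirmed here, so adjust to the real file. The function name `era_split` and the sample rows are made up.

```python
import csv
import io
from collections import defaultdict
from datetime import date

ERA_BOUNDARY = date(2026, 4, 21)  # the patch date quoted in the headline lesson


def era_split(psv_file, boundary=ERA_BOUNDARY):
    """Count trades and sum pnl before `boundary` and from `boundary` onward.

    Assumes a header row, '|' as delimiter, and opened_at beginning with an
    ISO date. These are assumptions about the file layout.
    """
    reader = csv.reader(psv_file, delimiter="|")
    header = [name.strip() for name in next(reader)]
    totals = defaultdict(lambda: {"trades": 0, "pnl": 0.0})
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        row = dict(zip(header, (field.strip() for field in raw)))
        opened = date.fromisoformat(row["opened_at"][:10])
        era = "pre_patch" if opened < boundary else "post_patch"
        totals[era]["trades"] += 1
        totals[era]["pnl"] += float(row["pnl"])
    return dict(totals)


# Made-up rows for illustration only; not taken from the real dataset.
sample = io.StringIO(
    "id | pnl | opened_at\n"
    "1 | -6.10 | 2026-04-02T14:00:00\n"
    "2 | 3.25 | 2026-04-20T09:30:00\n"
    "3 | 2.90 | 2026-04-22T11:15:00\n"
)
print(era_split(sample))
```

To try it on the committed data, pass an open file handle instead of the sample, for example `open("data/sample_trades.psv", newline="")`. If the layout assumptions hold, the two eras should show 97 and 41 trades.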

Running the eval pieces

empirical_analysis.py and the test suite are stdlib only; no install needed.

# default run: reads data/sample_trades.psv
python eval/empirical_analysis.py

# run against your own .psv instead
export HERMES_DATASET=path/to/your/own.psv
python eval/empirical_analysis.py

# discount-logic tests
python eval/test_effective_exposure.py

c4_eval.py expects a SQLite DB and a Kalshi API client; adapt it to your own data sources before running.

License

MIT. See LICENSE.