A post-mortem and the supporting evaluation framework for a Kalshi weather-market trading bot that lost money over its first two months of live trading, then was halted, audited, and retired.
The bot itself - the live strategy, order routing, and Kalshi credentials - is not in this repository and won't be. What is in this repository is the part that turned out to be the actually-useful artifact: the framework that catches your own model bleeding money, and the writeup of the audit that caught it.

    docs/
      INVESTIGATION.md
          The decisive audit. Era-split P&L, payout-math derivation of the
          impossible win rate, the three cascading misdiagnoses that came
          before the right answer.
      hermes-v4-research-findings-and-fixes.md
          Earlier research notes: variable fees, ensemble blending, GFS run
          timing, optimism-tax-on-longshots. Mixed deployed/proposed.
      PIVOT_SPEC.md
          The shadow-mode pivot spec that did NOT get built - written to
          gate any restart of the bot behind a no-capital evaluation window
          with a pre-committed decision rule.
    eval/
      c4_eval.py
          Pre-committed evaluation against the gate: pulls shadow
          predictions, backfills outcomes from the Kalshi API, applies a
          hard liquidity filter, scores Brier + EV after fees, emits a
          PASS/FAIL verdict.
      empirical_analysis.py
          Walk-forward: empirical bracket-hit model vs Gaussian. Brier, win
          rate, EV, total P&L. Stdlib only.
      effective_exposure.py
          "Effective exposure" module - discounts near-certain unsettled
          positions so capital isn't blocked by quasi-decided bets.
      test_effective_exposure.py
          Eight scenario tests against the discount logic, including the
          safety case of a bet that flips from winning to losing.

The bot traded one specific Kalshi market type: single-degree (~2°F) temperature brackets, betting NO. It lost $160 over the 97 trades placed before 2026-04-21. A clamp-on-overconfidence patch landed on that date, and the next 41 trades came in at roughly +$3 (essentially zero EV).
The payout math:
| Quantity | Value |
|---|---|
| Actual bracket hit rate | 62/138 = 44.9% |
| Average win | +$3.32 |
| Average loss | -$6.46 |
| Realized reward:risk | 0.51 |
| Break-even win rate at that ratio | 66.2% |
| Bot's actual win rate | 54.3% |
You cannot make money betting NO on near-coin-flip events when the payout structure demands a 66% win rate. It is a market-selection problem, not a model problem. Single-degree brackets sit below NWS/ensemble forecast resolution and Kalshi prices them efficiently.
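The break-even figure follows directly from the average win and average loss. A minimal sketch of that arithmetic, using the rounded values from the table (the function names are illustrative and not taken from the repository):

```python
def break_even_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate p at which expected value per trade is zero.

    Solves p * avg_win - (1 - p) * avg_loss = 0 for p.
    avg_loss is passed as a positive magnitude.
    """
    return avg_loss / (avg_win + avg_loss)


def ev_per_trade(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected P&L per trade at a given win rate."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


# Rounded inputs copied from the table above.
avg_win, avg_loss = 3.32, 6.46
win_rate = 0.543
n_trades = 138

reward_risk = avg_win / avg_loss
breakeven = break_even_win_rate(avg_win, avg_loss)
ev = ev_per_trade(win_rate, avg_win, avg_loss)

print(f"reward:risk          {reward_risk:.2f}")
print(f"break-even win rate  {breakeven:.1%}")
print(f"EV per trade         {ev:+.2f}")
print(f"EV over {n_trades} trades   {ev * n_trades:+.0f}")
```

Because the inputs are rounded, the last digit of the break-even rate can differ slightly from the table. The conclusion does not depend on it: a 54.3% win rate sits well below the roughly 66% the payout structure requires, so expected value per trade is negative regardless of how well the model forecasts.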
A bug had hidden the truth from three previous audits, including the first
two passes of this one. The INSERT INTO trades statement omitted three
diagnostic columns (raw_ensemble_probability, model_count, models_used),
so every trade row showed model_count = 1 and raw_ensemble_probability = NULL. Anyone (human or AI) inspecting the table concluded "the 31-member
ensemble pipeline must be dead." It wasn't. The ensemble worked the whole
time. The columns were simply never written.
This bug never cost a cent in P&L. It did cost three audit cycles - one landed-and-deployed fix on a non-problem, and two false starts in the investigation itself.
The honest record of that oscillation is preserved in the writeup.
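The failure mode is easy to reproduce, and a write-then-read-back check would have caught it. The sketch below uses an in-memory SQLite database with a hypothetical, trimmed-down `trades` schema: only the three diagnostic column names come from the writeup, while the `DEFAULT 1` on `model_count`, the sample values, and the `mismatched_columns` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema. The DEFAULT 1 is an assumption that would explain
# why every row showed model_count = 1.
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,
        ensemble_probability REAL,
        raw_ensemble_probability REAL,
        model_count INTEGER DEFAULT 1,
        models_used TEXT
    )
    """
)

# What the pipeline computed in memory (made-up values).
trade = {
    "ticker": "EXAMPLE-TICKER",
    "ensemble_probability": 0.41,
    "raw_ensemble_probability": 0.37,
    "model_count": 31,
    "models_used": "ensemble-members",
}


def mismatched_columns(conn, row_id, expected):
    """Return the columns whose stored value differs from the computed one."""
    cols = list(expected)
    row = conn.execute(
        f"SELECT {', '.join(cols)} FROM trades WHERE id = ?", (row_id,)
    ).fetchone()
    return [c for c, stored in zip(cols, row) if stored != expected[c]]


# Buggy insert: the three diagnostic columns are left out of the statement.
cur = conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability) "
    "VALUES (:ticker, :ensemble_probability)",
    trade,
)
buggy = mismatched_columns(conn, cur.lastrowid, trade)

# Fixed insert: every computed field is written.
cur = conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability, "
    "raw_ensemble_probability, model_count, models_used) "
    "VALUES (:ticker, :ensemble_probability, "
    ":raw_ensemble_probability, :model_count, :models_used)",
    trade,
)
fixed = mismatched_columns(conn, cur.lastrowid, trade)

print("buggy insert mismatches:", buggy)
print("fixed insert mismatches:", fixed)
```

The point of the check is that it compares the stored row against what the pipeline held in memory, rather than inferring pipeline health from the table alone.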
The trading strategy was the point of the bot, but it had no edge and is not portable to anything. The evaluation framework, the gate-before-restart discipline, and the "eras + payout math" decomposition are portable. They work on any prediction market, any strategy, any bot.
If you are about to ship a bot, the cheapest thing you can do is build
c4_eval.py first, in shadow mode, with the gate criteria written down
before you look at the numbers. Then you build the strategy.
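As a rough illustration of what "gate criteria written down first" can look like, here is a minimal sketch of a pre-committed gate. It is not the contents of c4_eval.py: the `Gate` and `ShadowPrediction` types, the threshold values, and the `evaluate` function are all hypothetical, and the liquidity filter and Kalshi outcome backfill are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Decision rule, fixed before any results are inspected.

    All thresholds here are placeholder values for illustration.
    """
    min_trades: int = 100
    max_brier: float = 0.24
    min_ev_per_trade: float = 0.0


@dataclass(frozen=True)
class ShadowPrediction:
    p_win: float            # model's probability that the position wins
    won: bool               # settled outcome
    pnl_after_fees: float   # what the trade would have made or lost


def brier_score(preds):
    """Mean squared error between predicted probability and outcome."""
    return sum((p.p_win - (1.0 if p.won else 0.0)) ** 2 for p in preds) / len(preds)


def evaluate(preds, gate):
    """Return ("PASS" | "FAIL", reasons) for a list of settled predictions."""
    if len(preds) < gate.min_trades:
        return "FAIL", [f"only {len(preds)} settled predictions, need {gate.min_trades}"]
    reasons = []
    brier = brier_score(preds)
    ev = sum(p.pnl_after_fees for p in preds) / len(preds)
    if brier > gate.max_brier:
        reasons.append(f"Brier {brier:.3f} above {gate.max_brier}")
    if ev <= gate.min_ev_per_trade:
        reasons.append(f"EV per trade {ev:+.2f} not above {gate.min_ev_per_trade}")
    return ("PASS" if not reasons else "FAIL"), reasons


# Synthetic example: 60 wins and 40 losses at a stated 70% confidence.
shadow = [ShadowPrediction(0.7, True, 3.0)] * 60 + [ShadowPrediction(0.7, False, -6.0)] * 40
verdict, reasons = evaluate(shadow, Gate())
print(verdict, reasons)
```

Freezing the dataclass and giving the thresholds defaults in code is one way to make the rule hard to adjust quietly after the numbers come in; committing the file before the shadow window opens serves the same purpose.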
The 138 settled trades from the bot's live run are checked in at
data/sample_trades.psv. This is the actual data behind every number in
INVESTIGATION.md. The columns:

    id | ticker | market_title | nws_forecast | side | edge |
    ensemble_probability | pnl | opened_at | cost | count | avg_price |
    nws_probability | market_price | grok_probability | actual_outcome |
    correct | market_type

The file contains only public Kalshi market tickers: no account IDs, no API
tokens, no Kalshi internal identifiers, no PII. empirical_analysis.py reads
this file by default and reproduces the era-split walk-forward backtest:

    python eval/empirical_analysis.py

Expected output includes the empirical-vs-Gaussian P(hit) table, Brier
scores, and walk-forward P&L at four edge thresholds. The walk-forward
P&L at min_edge=0.25 reproduces the headline -$104 loss. The bot's live
equivalent of this band was -$94; the $10 difference is accounted for by
data joins and the unclamped-Gaussian era.
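The era split itself is a small computation. The following sketch is not taken from empirical_analysis.py; it assumes the PSV has a header row with the column names listed above, uses `|` as the delimiter, and stores `opened_at` in ISO format so the first ten characters are the date. The 2026-04-21 boundary is the patch date from this README; the sample rows are made up for the demonstration.

```python
import csv
import io
from datetime import date

PATCH_DATE = date(2026, 4, 21)  # clamp-on-overconfidence patch date


def era_split(lines, boundary=PATCH_DATE):
    """Sum trade count and P&L before and from the boundary date.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    eras = {
        "pre": {"trades": 0, "pnl": 0.0},
        "post": {"trades": 0, "pnl": 0.0},
    }
    for raw in csv.DictReader(lines, delimiter="|"):
        # Tolerate padding around the pipe separators.
        row = {k.strip(): v.strip() for k, v in raw.items()}
        opened = date.fromisoformat(row["opened_at"][:10])
        era = "pre" if opened < boundary else "post"
        eras[era]["trades"] += 1
        eras[era]["pnl"] += float(row["pnl"])
    return eras


# Synthetic rows, NOT real trades from data/sample_trades.psv.
sample = io.StringIO(
    "id|opened_at|pnl\n"
    "1|2026-04-01T12:00:00|-6.50\n"
    "2|2026-04-20T09:30:00|3.25\n"
    "3|2026-04-22T10:00:00|3.00\n"
)
result = era_split(sample)
for name, stats in result.items():
    print(f"{name:4s} trades={stats['trades']:3d} pnl={stats['pnl']:+.2f}")
```

To run it against the checked-in data, pass an open file handle instead of the `StringIO` object, for example `era_split(open("data/sample_trades.psv", newline=""))`, adjusting the parsing if the file's layout differs from the assumptions above.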
empirical_analysis.py and the test suite are stdlib only; no install
needed.

    # walk-forward backtest on the bundled dataset
    python eval/empirical_analysis.py

    # same backtest against your own dataset
    export HERMES_DATASET=path/to/your/own.psv
    python eval/empirical_analysis.py

    # discount-logic tests
    python eval/test_effective_exposure.py

c4_eval.py expects a SQLite DB and a Kalshi API client; adapt it to
your own data sources before running.
MIT. See LICENSE.