# prediction-market-bot-postmortem

A post-mortem and the supporting evaluation framework for a Kalshi
weather-market trading bot that lost money over its first two months of live
trading, then was halted, audited, and retired.

The bot itself - the live strategy, order routing, and Kalshi credentials - is
not in this repository and won't be. What is in this repository is the part
that turned out to be the actually useful artifact: the framework that
catches your own model bleeding money, and the writeup of the audit that
caught it.

## What you're looking at

```
docs/
  INVESTIGATION.md              The decisive audit. Era-split P&L,
                                payout-math derivation of the
                                impossible win rate, the three
                                cascading misdiagnoses that came
                                before the right answer.
  hermes-v4-research-findings-and-fixes.md
                                Earlier research notes: variable
                                fees, ensemble blending, GFS run
                                timing, optimism-tax-on-longshots.
                                Mixed deployed/proposed.
  PIVOT_SPEC.md                 The shadow-mode pivot spec that
                                did NOT get built - written to
                                gate any restart of the bot
                                behind a no-capital evaluation
                                window with a pre-committed
                                decision rule.
eval/
  c4_eval.py                    Pre-committed evaluation against the gate:
                                pulls shadow predictions, backfills outcomes
                                from the Kalshi API, applies a hard liquidity
                                filter, scores Brier + EV after fees, emits
                                a PASS/FAIL verdict.
  empirical_analysis.py         Walk-forward: empirical bracket-hit model
                                vs Gaussian. Brier, win rate, EV, total P&L.
                                Stdlib only.
  effective_exposure.py         "Effective exposure" module - discounts
                                near-certain unsettled positions so capital
                                isn't blocked by quasi-decided bets.
  test_effective_exposure.py    Eight scenario tests against the discount
                                logic, including the safety case of a bet
                                that flips from winning to losing.
```

## The headline lesson (from INVESTIGATION.md)

The bot traded one specific Kalshi market type: single-degree (~2°F)
temperature brackets, betting NO. It lost $160 over the 97 trades before
2026-04-21. Then a clamp-on-overconfidence patch landed, and the next 41
trades came in at roughly +$3 (essentially zero EV).

The payout math:

| Quantity                          | Value          |
|-----------------------------------|----------------|
| Actual bracket hit rate           | 62/138 = 44.9% |
| Average win                       | +$3.32         |
| Average loss                      | -$6.46         |
| Realized reward:risk              | 0.51           |
| Break-even win rate at that ratio | 66.2%          |
| Bot's actual win rate             | 54.3%          |

You cannot make money betting NO on near-coin-flip events when the payout
structure demands a 66% win rate. **It is a market-selection problem, not a
model problem.** Single-degree brackets sit below NWS/ensemble forecast
resolution, and Kalshi prices them efficiently.
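
The break-even figure in the table follows from setting expected value per trade to zero. The sketch below redoes that arithmetic from the table's averages; the function and constant names are illustrative and are not taken from the repository's code.

```python
def break_even_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate at which expected value per trade is zero.

    Solves p * avg_win - (1 - p) * avg_loss = 0 for p.
    avg_loss is passed as a positive magnitude.
    """
    return avg_loss / (avg_win + avg_loss)


def expected_value(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected P&L per trade, in dollars."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


# Inputs copied from the payout-math table.
AVG_WIN = 3.32    # average win, dollars
AVG_LOSS = 6.46   # magnitude of the average loss, dollars
WIN_RATE = 0.543  # bot's actual win rate
N_TRADES = 138

reward_to_risk = AVG_WIN / AVG_LOSS
needed = break_even_win_rate(AVG_WIN, AVG_LOSS)
ev = expected_value(WIN_RATE, AVG_WIN, AVG_LOSS)

print(f"reward:risk     = {reward_to_risk:.2f}")
print(f"break-even rate = {needed:.1%}")
print(f"EV per trade    = {ev:+.2f}")
print(f"EV over {N_TRADES}     = {ev * N_TRADES:+.0f}")
```

With the unrounded averages the break-even rate comes out near 66.1%; the table's 66.2% is what you get from the rounded 0.51 ratio (1 / 1.51). Either way, a 54.3% win rate sits well below it, and the implied expectation over 138 trades is close to the realized loss.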

## Why the audit took three passes

A bug had hidden the truth from three previous audits, including the first
two passes of this one. The `INSERT INTO trades` statement omitted three
diagnostic columns (`raw_ensemble_probability`, `model_count`, `models_used`),
so every trade row showed `model_count = 1` and
`raw_ensemble_probability = NULL`. Anyone (human or AI) inspecting the table
concluded "the 31-member ensemble pipeline must be dead." It wasn't. The
ensemble worked the whole time. The columns were simply never written.
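
A minimal sketch of how this class of bug behaves in SQLite, which is what `c4_eval.py` is described as expecting. Only the three diagnostic column names come from the writeup. The rest of the schema, the `DEFAULT 1` on `model_count`, the sample values, and the `unwritten_diagnostics` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema. The DEFAULT 1 is an assumption that would explain
# why every row showed model_count = 1.
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,
        ensemble_probability REAL,
        raw_ensemble_probability REAL,
        model_count INTEGER DEFAULT 1,
        models_used TEXT
    )
    """
)

# Buggy shape: the diagnostic columns are missing from the column list,
# so SQLite fills in defaults (or NULL) without raising any error.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability) VALUES (?, ?)",
    ("EXAMPLE-TICKER", 0.31),
)

# Fixed shape: every diagnostic column is written explicitly.
# The values here are made up for the example.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability,"
    " raw_ensemble_probability, model_count, models_used)"
    " VALUES (?, ?, ?, ?, ?)",
    ("EXAMPLE-TICKER", 0.31, 0.27, 31, "hypothetical-model-list"),
)


def unwritten_diagnostics(db: sqlite3.Connection) -> int:
    """Count rows whose diagnostics look like they were never written."""
    (count,) = db.execute(
        "SELECT COUNT(*) FROM trades"
        " WHERE raw_ensemble_probability IS NULL OR models_used IS NULL"
    ).fetchone()
    return count


rows = conn.execute(
    "SELECT raw_ensemble_probability, model_count, models_used"
    " FROM trades ORDER BY id"
).fetchall()
print(rows)
print("rows with unwritten diagnostics:", unwritten_diagnostics(conn))
```

A check like `unwritten_diagnostics` run right after the first live inserts would distinguish "the pipeline is dead" from "the columns were never written" before anyone starts auditing on the wrong premise.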

This bug never cost a cent in P&L. It did cost three audit cycles: one fix
that was landed and deployed against a non-problem, and two false starts in
the investigation itself.

The honest record of that oscillation is preserved in the writeup
(`docs/INVESTIGATION.md`).

## Why this is the useful part

The trading strategy was the point of the bot, but it had no edge and does
not transfer to anything else. The evaluation framework, the
gate-before-restart discipline, and the "eras + payout math" decomposition
are portable. They apply to any prediction market, any strategy, any bot.

If you are about to ship a bot, the cheapest thing you can do is build
`c4_eval.py` first, in shadow mode, with the gate criteria written down
*before* you look at the numbers. Then you build the strategy.
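
One way to make "written down before you look at the numbers" concrete is to freeze the decision rule in code and commit it before the shadow window opens. The sketch below is not the contents of `c4_eval.py`; the `Gate` fields, the threshold values, and the `evaluate` function are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Decision rule fixed before the shadow window (placeholder values)."""
    min_trades: int = 100
    max_brier: float = 0.22
    min_ev_per_trade: float = 0.0  # dollars, after fees


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def evaluate(gate: Gate, probs: list[float], outcomes: list[int],
             pnl_after_fees: list[float]) -> tuple[str, list[str]]:
    """Apply the pre-committed gate; return the verdict and failure reasons."""
    reasons = []
    n = len(probs)
    if n < gate.min_trades:
        reasons.append(f"only {n} settled predictions, need {gate.min_trades}")
    else:
        brier = brier_score(probs, outcomes)
        ev = sum(pnl_after_fees) / n
        if brier > gate.max_brier:
            reasons.append(f"Brier {brier:.3f} above {gate.max_brier}")
        if ev <= gate.min_ev_per_trade:
            reasons.append(f"EV/trade {ev:+.2f} not above {gate.min_ev_per_trade}")
    return ("PASS" if not reasons else "FAIL", reasons)


# Tiny synthetic inputs, only to show the two verdicts.
GATE = Gate(min_trades=4)
sharp = evaluate(GATE, [0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0], [0.5, 0.4, 0.6, 0.3])
coin = evaluate(GATE, [0.5] * 4, [1, 0, 1, 0], [0.5, -0.6, 0.5, -0.6])
print(sharp)
print(coin)
```

The frozen dataclass is a design choice: thresholds cannot be mutated at runtime, so changing them after seeing results requires a visible commit.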

## Reproducible: the real trade dataset is committed

The 138 settled trades from the bot's live run are checked in at
`data/sample_trades.psv`. This is the actual data behind every number in
INVESTIGATION.md. The columns:

```
id | ticker | market_title | nws_forecast | side | edge |
ensemble_probability | pnl | opened_at | cost | count | avg_price |
nws_probability | market_price | grok_probability | actual_outcome |
correct | market_type
```

The file contains only public Kalshi market tickers: no account IDs, no API
tokens, no Kalshi internal identifiers, no PII. `empirical_analysis.py` reads
this file by default and reproduces the era-split walk-forward backtest:

```bash
python eval/empirical_analysis.py
```

Expected output includes the empirical-vs-Gaussian P(hit) table, Brier
scores, and walk-forward P&L at four edge thresholds. The walk-forward
P&L at `min_edge=0.25` reproduces the headline -$104 loss. The bot's live
P&L for the equivalent band was -$94; the gap is accounted for by data
joins and the unclamped-Gaussian era.
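
For readers who want to slice the dataset themselves, here is a rough sketch of an era split on a pipe-separated file. It is not taken from `empirical_analysis.py`. It assumes the file has a header row, that `opened_at` is an ISO 8601 timestamp, and that trades opened on 2026-04-21 itself count as post-patch. The inline rows are synthetic, not real trades.

```python
import csv
import io
from datetime import date, datetime

ERA_BOUNDARY = date(2026, 4, 21)  # patch date from the headline lesson


def load_trades(handle):
    """Yield one dict per trade from a pipe-separated file with a header row."""
    reader = csv.reader(handle, delimiter="|")
    header = [name.strip() for name in next(reader)]
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        yield dict(zip(header, (cell.strip() for cell in row)))


def era_split(trades):
    """Sum trade count and P&L before and after the patch date."""
    summary = {"pre": {"n": 0, "pnl": 0.0}, "post": {"n": 0, "pnl": 0.0}}
    for trade in trades:
        opened = datetime.fromisoformat(trade["opened_at"]).date()
        era = "pre" if opened < ERA_BOUNDARY else "post"
        summary[era]["n"] += 1
        summary[era]["pnl"] += float(trade["pnl"])
    return summary


# Synthetic rows using a subset of the real column names.
SAMPLE = (
    "id | ticker | pnl | opened_at\n"
    "1 | EXAMPLE-A | -6.40 | 2026-04-10 15:00:00\n"
    "2 | EXAMPLE-B | 3.30 | 2026-04-19 15:00:00\n"
    "3 | EXAMPLE-C | 3.10 | 2026-04-22 15:00:00\n"
)

result = era_split(load_trades(io.StringIO(SAMPLE)))
for era, stats in result.items():
    print(f"{era}: {stats['n']} trades, P&L {stats['pnl']:+.2f}")
```

To try it on the committed file, replace the `io.StringIO(SAMPLE)` handle with `open("data/sample_trades.psv", newline="")`. Adjust the parsing if the file's delimiter spacing or timestamp format differs from these assumptions.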

## Running the eval pieces

`empirical_analysis.py` and the test suite are stdlib only; no install
needed.

```bash
# reproduce the backtest on the committed dataset
python eval/empirical_analysis.py

# run against your own dataset instead
export HERMES_DATASET=path/to/your/own.psv
python eval/empirical_analysis.py

# discount-logic tests
python eval/test_effective_exposure.py
```

`c4_eval.py` expects a SQLite DB and a Kalshi API client; adapt it to
your own data sources before running.

## License

MIT. See [LICENSE](LICENSE).