Zion Boggan zionboggan.com ↗
# prediction-market-bot-postmortem

A post-mortem, plus the supporting evaluation framework, for a Kalshi
weather-market trading bot that lost money over its first two months of live
trading and was then halted, audited, and retired.

The bot itself (the live strategy, order routing, and Kalshi credentials) is
not in this repository and won't be. What is in this repository is the part
that turned out to be the genuinely useful artifact: the framework that
catches your own model bleeding money, and the writeup of the audit that
caught it.
 
## What you're looking at

```
docs/
  INVESTIGATION.md                     The decisive audit. Era-split P&L,
                                       payout-math derivation of the
                                       impossible win rate, the three
                                       cascading misdiagnoses that came
                                       before the right answer.
  hermes-v4-research-findings-and-fixes.md
                                       Earlier research notes: variable
                                       fees, ensemble blending, GFS run
                                       timing, optimism-tax-on-longshots.
                                       Mixed deployed/proposed.
  PIVOT_SPEC.md                        The shadow-mode pivot spec that
                                       did NOT get built - written to
                                       gate any restart of the bot
                                       behind a no-capital evaluation
                                       window with a pre-committed
                                       decision rule.
eval/
  c4_eval.py                Pre-committed evaluation against the gate:
                            pulls shadow predictions, backfills outcomes
                            from the Kalshi API, applies a hard liquidity
                            filter, scores Brier + EV after fees, emits
                            a PASS/FAIL verdict.
  empirical_analysis.py     Walk-forward: empirical bracket-hit model
                            vs Gaussian. Brier, win rate, EV, total P&L.
                            Stdlib only.
  effective_exposure.py     "Effective exposure" module - discounts
                            near-certain unsettled positions so capital
                            isn't blocked by quasi-decided bets.
  test_effective_exposure.py  Eight scenario tests against the discount
                              logic, including the safety case of a bet
                              that flips from winning to losing.
```
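The "effective exposure" idea from the listing above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Position` fields, the `0.95` certainty threshold, and the `0.8` discount are hypothetical and are not the names or values used in `effective_exposure.py`.

```python
from dataclasses import dataclass


@dataclass
class Position:
    cost: float      # dollars at risk if the bet loses
    win_prob: float  # current estimated probability that the bet wins


def effective_exposure(positions, certainty=0.95, discount=0.8):
    """Sum capital at risk, discounting positions that look near-certain to win.

    `certainty` and `discount` are illustrative defaults, not repo values.
    """
    total = 0.0
    for p in positions:
        if p.win_prob >= certainty:
            # Near-certain winner: count only the undiscounted fraction.
            total += p.cost * (1.0 - discount)
        else:
            # Anything else, including a bet that has flipped to losing,
            # counts at full cost.
            total += p.cost
    return total


book = [Position(cost=10.0, win_prob=0.99), Position(cost=5.0, win_prob=0.55)]
print(round(effective_exposure(book), 2))

# Safety case: the near-certain bet flips to losing, so the discount disappears.
book[0].win_prob = 0.10
print(round(effective_exposure(book), 2))
```

The property worth testing in any such scheme is the second call: once a discounted position stops looking near-certain, its full cost must count against the exposure cap again.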
 
## The headline lesson (from INVESTIGATION.md)

The bot traded one specific Kalshi market type: single-degree (~2°F)
temperature brackets, betting NO. It lost $160 over the 97 trades before
2026-04-21. A clamp-on-overconfidence patch then landed, and the next 41
trades came in at roughly +$3 (essentially zero EV).

The payout math:

| Quantity                          | Value          |
|-----------------------------------|----------------|
| Actual bracket hit rate           | 62/138 = 44.9% |
| Average win                       | +$3.32         |
| Average loss                      | -$6.46         |
| Realized reward:risk              | 0.51           |
| Break-even win rate at that ratio | 66.2%          |
| Bot's actual win rate             | 54.3%          |

You cannot make money betting NO on near-coin-flip events when the payout
structure demands a 66% win rate. **It is a market-selection problem, not a
model problem.** Single-degree brackets sit below NWS/ensemble forecast
resolution, and Kalshi prices them efficiently.
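The break-even figure in the table follows from the reward:risk ratio alone. A minimal sketch of that arithmetic, using only the table's numbers; the function names are illustrative, and it assumes the average win and loss are already net of fees:

```python
def break_even_win_rate(reward_to_risk):
    """Win rate p at which p * reward == (1 - p) * risk."""
    return 1.0 / (1.0 + reward_to_risk)


def ev_per_trade(win_rate, avg_win, avg_loss):
    """Expected P&L per trade; avg_loss is passed as a positive magnitude."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


avg_win, avg_loss = 3.32, 6.46          # from the table above
ratio = round(avg_win / avg_loss, 2)    # 0.51, as reported in the table

print(f"reward:risk           {ratio:.2f}")
print(f"break-even win rate   {break_even_win_rate(ratio):.1%}")
print(f"EV per trade at 54.3% {ev_per_trade(0.543, avg_win, avg_loss):+.2f}")
```

At a 54.3% win rate this works out to about -$1.15 per trade, or roughly -$159 over 138 trades, which is in line with the combined realized result of the two eras.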
 
## Why the audit took three passes

A bug had hidden the truth from three previous audits, including the first
two passes of this one. The `INSERT INTO trades` statement omitted three
diagnostic columns (`raw_ensemble_probability`, `model_count`, `models_used`),
so every trade row showed `model_count = 1` and `raw_ensemble_probability =
NULL`. Anyone (human or AI) inspecting the table concluded "the 31-member
ensemble pipeline must be dead." It wasn't. The ensemble worked the whole
time. The columns were simply never written.

This bug never cost a cent in P&L. It did cost three audit cycles: one
landed-and-deployed fix on a non-problem, and two false starts in the
investigation itself.

The honest record of that oscillation is preserved in the writeup
(`docs/INVESTIGATION.md`).
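The failure mode described in this section is easy to reproduce. The sketch below uses an in-memory SQLite table with a hypothetical, heavily simplified schema: the real `trades` table is not part of this repository, and the `DEFAULT 1` on `model_count` is an assumption chosen to match the symptom. An `INSERT` that omits columns silently falls back to their defaults, so the stored rows say nothing about what the pipeline actually computed. One cheap guard, shown as the hypothetical helper `constant_columns`, is to flag diagnostic columns that never vary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,
        raw_ensemble_probability REAL,       -- no default: omitted -> NULL
        model_count INTEGER DEFAULT 1,       -- assumed default: omitted -> 1
        models_used TEXT
    )
    """
)

# The buggy pattern: the ensemble ran, but its outputs are never written.
for ticker in ("EXAMPLE-A", "EXAMPLE-B", "EXAMPLE-C"):
    conn.execute("INSERT INTO trades (ticker) VALUES (?)", (ticker,))


def constant_columns(conn, table, columns):
    """Return the columns that hold at most one distinct non-NULL value.

    Identifiers are interpolated directly, so pass trusted names only.
    """
    suspicious = []
    for col in columns:
        # COUNT(DISTINCT col) ignores NULLs, so an all-NULL column yields 0.
        (distinct,) = conn.execute(
            f"SELECT COUNT(DISTINCT {col}) FROM {table}"
        ).fetchone()
        if distinct <= 1:
            suspicious.append(col)
    return suspicious


diagnostics = ["raw_ensemble_probability", "model_count", "models_used"]
print(constant_columns(conn, "trades", diagnostics))
```

A column that is constant across every row is not proof of a bug, but it is a prompt to check the write path before drawing conclusions about the component that was supposed to fill it.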
 
## Why this is the useful part

The trading strategy was the point of the bot, but it had no edge and is
not portable to anything. The evaluation framework, the gate-before-restart
discipline, and the "eras + payout math" decomposition are portable. They
apply to any prediction market, any strategy, any bot.

If you are about to ship a bot, the cheapest thing you can do is build
`c4_eval.py` first, in shadow mode, with the gate criteria written down
*before* you look at the numbers. Then you build the strategy.
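A minimal sketch of what "gate criteria written down before you look at the numbers" can look like in code. This is not `c4_eval.py`: the `Gate` fields, the threshold values, the record layout, and the flat per-contract fee are all hypothetical and chosen only for illustration. The structural point is that the thresholds live in a frozen object that is committed before any outcomes exist.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """Pre-committed pass criteria. All values here are placeholders."""
    min_trades: int = 30
    max_brier: float = 0.24      # must beat a 50/50 guess (Brier 0.25)
    min_ev_after_fees: float = 0.0


def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def evaluate(records, gate, fee_per_contract=0.02):
    """records: (predicted_prob, price_paid, outcome) per shadow trade.

    price_paid is in dollars per $1 contract; outcome is 1 if the bet won.
    """
    if len(records) < gate.min_trades:
        return "FAIL", "not enough settled shadow trades"
    probs = [r[0] for r in records]
    outcomes = [r[2] for r in records]
    brier = brier_score(probs, outcomes)
    # Realized P&L per contract: payout of $1 on a win, minus price and fee.
    ev = sum(o - price - fee_per_contract for _, price, o in records) / len(records)
    ok = brier <= gate.max_brier and ev > gate.min_ev_after_fees
    return ("PASS" if ok else "FAIL"), f"brier={brier:.3f} ev={ev:+.3f}"


# Synthetic records only, to show the call shape: 40 coin-flip bets at 60c.
shadow = [(0.6, 0.60, 1)] * 20 + [(0.6, 0.60, 0)] * 20
print(evaluate(shadow, Gate()))
```

Because `Gate` is frozen and its values sit in version control, loosening a threshold after seeing results requires a visible commit rather than a quiet edit to a notebook cell.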
 
## Reproducible: the real trade dataset is committed

The 138 settled trades from the bot's live run are checked in at
`data/sample_trades.psv`. This is the actual data behind every number in
INVESTIGATION.md. The columns:

```
id | ticker | market_title | nws_forecast | side | edge |
ensemble_probability | pnl | opened_at | cost | count | avg_price |
nws_probability | market_price | grok_probability | actual_outcome |
correct | market_type
```

The file contains public Kalshi market tickers only: no account IDs, no API
tokens, no Kalshi internal identifiers, no PII. `empirical_analysis.py` reads
this file by default and reproduces the era-split walk-forward backtest:

```bash
python eval/empirical_analysis.py
```

Expected output includes the empirical-vs-Gaussian P(hit) table, Brier
scores, and walk-forward P&L at four edge thresholds. The walk-forward
P&L at `min_edge=0.25` reproduces the headline loss of -$104. The bot's
live equivalent for this band was -$94; the drift is accounted for by data
joins and the unclamped-Gaussian era.
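For slicing the dataset beyond what `empirical_analysis.py` prints, a stdlib-only reader is enough. The sketch below rests on assumptions that should be checked against the file first: that it is pipe-delimited with a single header row matching the column list above, that `opened_at` starts with an ISO `YYYY-MM-DD` date, and that eras are split on `opened_at`. The function name `era_pnl` is hypothetical, and the inline sample rows are synthetic.

```python
import csv
import io
from datetime import date

PATCH_DATE = date(2026, 4, 21)  # clamp patch date from the headline section


def era_pnl(lines):
    """Sum pnl before and on/after PATCH_DATE from pipe-delimited rows."""
    reader = csv.reader(lines, delimiter="|")
    header = [h.strip() for h in next(reader)]
    totals = {"pre_patch": [0, 0.0], "post_patch": [0, 0.0]}  # [trades, pnl]
    for raw in reader:
        if not raw:
            continue  # skip blank lines
        row = dict(zip(header, (cell.strip() for cell in raw)))
        opened = date.fromisoformat(row["opened_at"][:10])
        era = "pre_patch" if opened < PATCH_DATE else "post_patch"
        totals[era][0] += 1
        totals[era][1] += float(row["pnl"])
    return totals


# Synthetic rows (not from data/sample_trades.psv), trimmed to three columns.
sample = io.StringIO(
    "id | pnl | opened_at\n"
    "1 | -6.50 | 2026-04-01T14:00:00\n"
    "2 | 3.25 | 2026-04-10T09:30:00\n"
    "3 | 3.00 | 2026-04-22T11:15:00\n"
)
print(era_pnl(sample))

# Against the real file (path from this README):
# with open("data/sample_trades.psv", newline="") as f:
#     print(era_pnl(f))
```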
 
## Running the eval pieces

`empirical_analysis.py` and the test suite are stdlib-only; no install is
needed.

```bash
# default: reads data/sample_trades.psv
python eval/empirical_analysis.py

# point it at your own dataset instead
export HERMES_DATASET=path/to/your/own.psv
python eval/empirical_analysis.py

# discount-logic tests
python eval/test_effective_exposure.py
```

`c4_eval.py` expects a SQLite DB and a Kalshi API client; adapt it to
your own data sources before running.
 
## License

MIT. See [LICENSE](LICENSE).