# prediction-market-bot-postmortem
 
A post-mortem and the supporting evaluation framework for a Kalshi
weather-market trading bot that lost money over its first two months of live
trading, then was halted, audited, and retired.
 
The bot itself - the live strategy, order routing, and Kalshi credentials - is
not in this repository and won't be. What is here is the part that turned out
to be genuinely useful: the evaluation framework that catches your own model
bleeding money, and the writeup of the audit that caught it.
 
## What you're looking at
 
```
docs/
  INVESTIGATION.md                     The decisive audit. Era-split P&L,
                                       payout-math derivation of the
                                       impossible win rate, the three
                                       cascading misdiagnoses that came
                                       before the right answer.
  hermes-v4-research-findings-and-fixes.md
                                       Earlier research notes: variable
                                       fees, ensemble blending, GFS run
                                       timing, optimism-tax-on-longshots.
                                       Mixed deployed/proposed.
  PIVOT_SPEC.md                        The shadow-mode pivot spec that
                                       did NOT get built - written to
                                       gate any restart of the bot
                                       behind a no-capital evaluation
                                       window with a pre-committed
                                       decision rule.
eval/
  c4_eval.py                Pre-committed evaluation against the gate:
                            pulls shadow predictions, backfills outcomes
                            from the Kalshi API, applies a hard liquidity
                            filter, scores Brier + EV after fees, emits
                            a PASS/FAIL verdict.
  empirical_analysis.py     Walk-forward: empirical bracket-hit model
                            vs Gaussian. Brier, win rate, EV, total P&L.
                            Stdlib only.
  effective_exposure.py     "Effective exposure" module - discounts
                            near-certain unsettled positions so capital
                            isn't blocked by quasi-decided bets.
  test_effective_exposure.py  Eight scenario tests against the discount
                              logic, including the safety case of a bet
                              that flips from winning to losing.
```
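
To make the "effective exposure" idea concrete, here is a minimal sketch. It
is not the code in `effective_exposure.py`: the function name, the
`threshold` parameter, and the discount rule (count only the at-risk
fraction of a near-certain position) are assumptions chosen for illustration.

```python
def effective_exposure(cost: float, win_probability: float,
                       threshold: float = 0.95) -> float:
    """Capital counted against the exposure limit for one unsettled bet.

    Hypothetical rule: a position whose estimated win probability is at or
    above `threshold` counts only for its expected loss; anything else
    counts at full cost.
    """
    if not 0.0 <= win_probability <= 1.0:
        raise ValueError("win_probability must be in [0, 1]")
    if win_probability >= threshold:
        return cost * (1.0 - win_probability)
    return cost


# A $10 position that is almost decided frees most of its capital...
near_certain = effective_exposure(10.0, 0.98)

# ...but if it flips from winning to losing, the full cost counts again.
flipped = effective_exposure(10.0, 0.30)

print(f"near-certain: ${near_certain:.2f}, flipped: ${flipped:.2f}")
```

The safety property worth testing is the second call: a discount must never
persist once the position stops being near-certain.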
 
## The headline lesson (from INVESTIGATION.md)
 
The bot traded one specific Kalshi market type: single-degree (~2°F)
temperature brackets, betting NO. It lost $160 across the 97 trades before
2026-04-21. Then a clamp-on-overconfidence patch landed, and the next 41
trades netted roughly +$3 (essentially zero EV).
 
The payout math:
 
| Quantity                         | Value      |
|----------------------------------|------------|
| Actual bracket hit rate          | 62/138 = 44.9% |
| Average win                      | +$3.32     |
| Average loss                     | -$6.46     |
| Realized reward:risk             | 0.51       |
| Break-even win rate at that ratio | 66.2%     |
| Bot's actual win rate            | 54.3%      |
 
At a reward:risk of 0.51 the break-even win rate is 1 / (1 + 0.51) ≈ 66%;
the bot won 54.3% of the time. You cannot make money betting NO on
near-coin-flip events when the payout structure demands a 66% win rate.
**It is a market-selection problem, not a model problem.** Single-degree
brackets sit below the resolution of NWS and ensemble forecasts, and Kalshi
prices them efficiently.
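
The payout arithmetic can be checked in a few lines. This is a sketch using
only the figures from the table; the function names are illustrative and do
not come from the repository. Note that the table's 66.2% follows from the
rounded ratio 0.51, while the unrounded averages give about 66.1%.

```python
def break_even_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate p that solves p * avg_win - (1 - p) * avg_loss = 0.

    `avg_loss` is passed as a positive magnitude.
    """
    return avg_loss / (avg_win + avg_loss)


def expected_pnl_per_trade(win_rate: float, avg_win: float,
                           avg_loss: float) -> float:
    """Expected dollars per trade at the given win rate and payouts."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


# Figures from the payout table above.
avg_win, avg_loss = 3.32, 6.46
win_rate = 0.543
n_trades = 138

reward_to_risk = avg_win / avg_loss
break_even = break_even_win_rate(avg_win, avg_loss)
ev = expected_pnl_per_trade(win_rate, avg_win, avg_loss)

print(f"reward:risk      {reward_to_risk:.2f}")
print(f"break-even rate  {break_even:.1%}")
print(f"EV per trade     ${ev:.2f}")
print(f"EV over {n_trades} trades ${ev * n_trades:.0f}")
```

The implied total of roughly -$159 is in line with the realized -$160 and
+$3 across the two eras.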
 
## Why the audit took three passes
 
A bug had hidden the truth from three previous audits, including the first
two passes of this one. The `INSERT INTO trades` statement omitted three
diagnostic columns (`raw_ensemble_probability`, `model_count`, `models_used`),
so every trade row showed `model_count = 1` and `raw_ensemble_probability =
NULL`. Anyone (human or AI) inspecting the table concluded "the 31-member
ensemble pipeline must be dead." It wasn't. The ensemble worked the whole
time. The columns were simply never written.
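
The failure mode is easy to reproduce with an in-memory SQLite table. The
schema below is an assumption for illustration (in particular the
`DEFAULT 1` on `model_count`, which would explain why every row showed 1);
the ticker and values are made up, and the real `trades` table has more
columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,
        ensemble_probability REAL,
        raw_ensemble_probability REAL,      -- diagnostic
        model_count INTEGER DEFAULT 1,      -- diagnostic (assumed default)
        models_used TEXT                    -- diagnostic
    )
    """
)

# Buggy writer: the column list omits the three diagnostic columns, so
# SQLite fills them with NULL or the column default.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability) VALUES (?, ?)",
    ("EXAMPLE-TICKER", 0.62),
)

# Fixed writer: every diagnostic column is named and bound explicitly.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability, "
    "raw_ensemble_probability, model_count, models_used) "
    "VALUES (?, ?, ?, ?, ?)",
    ("EXAMPLE-TICKER", 0.62, 0.66, 31, "example_ensemble"),
)

rows = conn.execute(
    "SELECT raw_ensemble_probability, model_count, models_used "
    "FROM trades ORDER BY id"
).fetchall()

buggy_row, fixed_row = rows
print("buggy:", buggy_row)   # looks like a dead single-model pipeline
print("fixed:", fixed_row)
```

A diagnostic column that is NULL or at its default in every single row is
more likely a writer bug than a pipeline failure, which is worth ruling out
before changing the model.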
 
This bug never cost a cent in P&L. It did cost three audit cycles - one
landed-and-deployed fix on a non-problem, and two false starts in the
investigation itself.
 
The honest record of that oscillation is preserved in
`docs/INVESTIGATION.md`.
 
## Why this is the useful part
 
The trading strategy was the point of the bot, but it had no edge and does
not transfer to anything else. The evaluation framework, the
gate-before-restart discipline, and the "eras + payout math" decomposition
do transfer. They apply to any prediction market, any strategy, any bot.
 
If you are about to ship a bot, the cheapest thing you can do is build your
own equivalent of `c4_eval.py` first, in shadow mode, with the gate criteria
written down *before* you look at the numbers. Then build the strategy.
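
As a rough illustration of what "gate criteria written down first" can look
like in code, here is a minimal sketch. It is not `c4_eval.py`: the `Gate`
fields, the thresholds, and the sample numbers are all hypothetical, and the
real script also backfills outcomes and applies a liquidity filter.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Gate:
    """Pass criteria, fixed before any shadow results are inspected."""
    max_brier: float          # hypothetical threshold
    min_ev_per_trade: float   # dollars per trade, after fees
    min_trades: int           # refuse to judge a tiny sample


def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return mean((p - o) ** 2 for p, o in zip(probs, outcomes))


def verdict(gate, probs, outcomes, pnl_after_fees):
    """Return "PASS" only if every pre-committed criterion is met."""
    if len(probs) < max(gate.min_trades, 1):
        return "FAIL"
    brier_ok = brier_score(probs, outcomes) <= gate.max_brier
    ev_ok = mean(pnl_after_fees) >= gate.min_ev_per_trade
    return "PASS" if brier_ok and ev_ok else "FAIL"


# Committed to version control before the shadow window opens.
GATE = Gate(max_brier=0.22, min_ev_per_trade=0.0, min_trades=4)

# Made-up shadow predictions, settled outcomes, and per-trade P&L.
probs = [0.8, 0.3, 0.6, 0.9]
outcomes = [1, 0, 1, 1]

good = verdict(GATE, probs, outcomes, [1.0, 0.5, -0.2, 0.7])
bad = verdict(GATE, probs, outcomes, [-1.0, 0.5, -0.2, 0.3])
print(good, bad)
```

Freezing the dataclass and defining `GATE` as a module-level constant makes
it harder to nudge a threshold after seeing the results. For reference, a
forecaster that always says 0.5 scores a Brier of 0.25.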
 
## Reproducible: the real trade dataset is committed
 
The 138 settled trades from the bot's live run are checked in at
`data/sample_trades.psv`. This is the actual data behind every number in
INVESTIGATION.md. The columns:
 
```
id | ticker | market_title | nws_forecast | side | edge |
ensemble_probability | pnl | opened_at | cost | count | avg_price |
nws_probability | market_price | grok_probability | actual_outcome |
correct | market_type
```
 
The file contains only public Kalshi market tickers: no account IDs, no API
tokens, no Kalshi internal identifiers, no PII. `empirical_analysis.py`
reads this file by default and reproduces the era-split walk-forward
backtest:
 
```bash
python eval/empirical_analysis.py
```
 
Expected output includes the empirical-vs-Gaussian P(hit) table, Brier
scores, and walk-forward P&L at four edge thresholds. The walk-forward
P&L at `min_edge=0.25` reproduces the headline -$104 loss. The bot's live
P&L for the same edge band was -$94; the difference is accounted for by
data joins and the unclamped-Gaussian era.
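
If you want to redo the era split yourself, the sketch below shows the
shape of it. It assumes the file is pipe-delimited with a header row and
ISO-8601 timestamps in `opened_at`, and that trades opened on or after
2026-04-21 belong to the post-patch era; check those assumptions against
the real file. The rows here are made up so the example runs standalone.

```python
import csv
import io
from datetime import date

ERA_CUTOFF = date(2026, 4, 21)  # clamp patch date from the text above


def era_split(rows, cutoff=ERA_CUTOFF):
    """Sum trade count and P&L before and from the cutoff date."""
    eras = {
        "pre": {"trades": 0, "pnl": 0.0},
        "post": {"trades": 0, "pnl": 0.0},
    }
    for row in rows:
        # Keep only the YYYY-MM-DD part of the timestamp.
        opened = date.fromisoformat(row["opened_at"].strip()[:10])
        era = "pre" if opened < cutoff else "post"
        eras[era]["trades"] += 1
        eras[era]["pnl"] += float(row["pnl"])
    return eras


# Illustrative rows only; not taken from data/sample_trades.psv.
SAMPLE = """id|ticker|opened_at|pnl
1|EXAMPLE-A|2026-04-01T14:00:00|-6.50
2|EXAMPLE-B|2026-04-10T14:00:00|3.25
3|EXAMPLE-C|2026-04-22T14:00:00|3.00
4|EXAMPLE-D|2026-04-25T14:00:00|-2.75
"""

result = era_split(csv.DictReader(io.StringIO(SAMPLE), delimiter="|"))
for era, stats in result.items():
    print(f"{era:>4}: {stats['trades']} trades, P&L ${stats['pnl']:+.2f}")
```

To run it against the committed data, pass
`csv.DictReader(open("data/sample_trades.psv", newline=""), delimiter="|")`
instead of the in-memory sample.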
 
## Running the eval pieces
 
`empirical_analysis.py` and the test suite are stdlib only; no install
needed.
 
```bash
# run against the committed dataset (data/sample_trades.psv)
python eval/empirical_analysis.py

# run against your own dataset
export HERMES_DATASET=path/to/your/own.psv
python eval/empirical_analysis.py

# discount-logic tests
python eval/test_effective_exposure.py
```
 
`c4_eval.py` expects a SQLite DB and a Kalshi API client; adapt it to
your own data sources before running.
 
## License
 
MIT. See [LICENSE](LICENSE).
