# Hermes Investigation - 2026-05-17
 
**Context:** "Feels like the data is corrupted - worked for a bit then slowly bled me." Operator-initiated audit after a sustained drawdown.
**Status:** Investigation only. No code changed. Auto-trading remains OFF (paused since 2026-05-14).
**Confidence:** High. The final conclusion is consistent across four independent cuts of the data and required no further reversal.
 
---
 
## Bottom line up front
 
1. **Your instinct was right that something was off - but it is not ongoing corruption or a dead pipeline.**
2. **The money was bled before 2026-04-21** by an overconfident probability model betting NO on coin-flip markets with bad payout math.
3. **The 2026-04-21 "MAE-σ floor" change stopped the bleed.** Trades since then are roughly breakeven. The −$157 total P&L and 48% drawdown you see are *old damage still showing on the cumulative chart*, not fresh losses.
4. **Single-degree (2°F) temperature brackets have no exploitable edge.** They hit ~45% of the time and the payout structure needs a ~66% win rate. No probability model can fix a market with no edge.
5. **A real bug exists but it never cost money:** three diagnostic columns are never written to the database. That logging gap caused *three separate audits* (2026-04-27, and my own first two passes today) to misdiagnose the problem. It is worth fixing for observability only.
 
---
 
## The numbers that settle it
 
### Era split (the decisive cut)
 
| Era | Trades | Avg ensemble prob | P&L | EV/trade |
|---|---|---|---|---|
| Pre-Apr-21 (unclamped Gaussian) | 97 | 0.01-0.14 | **−$160.72** | **−$1.66** |
| Post-Apr-21 (MAE-σ floor active) | 41 | ~0.10 | **+$3.06** | **+$0.07 (≈ breakeven)** |
 
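The **entire lifetime loss happened before 2026-04-21**: the pre-Apr-21 loss (−$160.72) is slightly larger than the net lifetime loss (−$157.66) because the post-Apr-21 trades netted a small gain. After the MAE-σ floor was added, the bot stopped bleeding.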
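The era split is a plain grouping of settled trades by whether they fall before or after the 2026-04-21 change. A minimal sketch of that cut, assuming each trade is available as a date plus realized P&L: the `Trade` class, its field names, the `era_split` helper, and the choice to count trades dated on the cutover day as "post" are all illustrative assumptions, and the toy rows are not real Hermes trades.

```python
from dataclasses import dataclass
from datetime import date

CUTOVER = date(2026, 4, 21)  # date of the MAE-σ floor change, per this report


@dataclass(frozen=True)
class Trade:
    settled: date  # hypothetical field name
    pnl: float     # realized P&L in dollars


def era_split(trades, cutover=CUTOVER):
    """Return {era: (trade_count, total_pnl, ev_per_trade)} for pre/post cutover."""
    buckets = {"pre": [], "post": []}
    for t in trades:
        # Assumption: a trade dated on the cutover day counts as "post".
        buckets["pre" if t.settled < cutover else "post"].append(t.pnl)
    summary = {}
    for era, pnls in buckets.items():
        total = sum(pnls)
        ev = round(total / len(pnls), 2) if pnls else None
        summary[era] = (len(pnls), round(total, 2), ev)
    return summary


# Toy rows for illustration only, NOT real Hermes trades.
toy = [
    Trade(date(2026, 4, 10), -6.0),
    Trade(date(2026, 4, 15), 3.0),
    Trade(date(2026, 4, 25), 3.0),
    Trade(date(2026, 5, 2), -3.0),
]
summary = era_split(toy)

# Sanity check of the EV/trade column from the totals reported in the table.
ev_pre = round(-160.72 / 97, 2)
ev_post = round(3.06 / 41, 2)
print(summary, ev_pre, ev_post)
```

On the reported totals, the same arithmetic gives −$1.66 per trade before the change and about +$0.07 per trade after it.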
 
### Why the strategy can't win (the payout math)
 
- Narrow-bracket actual hit rate: **62/138 = 44.9%** - these markets are near coin-flips.
- Realized reward:risk: avg win **$3.32**, avg loss **−$6.46** → ratio **0.51**.
- Break-even win rate at that ratio: **1 / (1 + 0.51) ≈ 66%**.
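- Bot's actual win rate: **54.3%** (nearly every trade is a NO, so the bot wins when the bracket misses; this is roughly the complement of the 44.9% hit rate).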
 
You cannot make money betting NO on ~coin-flip events when the payout structure demands a 66% win rate. This is a **market-selection problem, not a model problem.** Single-degree temperature brackets sit below NWS/ensemble forecast resolution and Kalshi prices them efficiently.
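The break-even figure follows from setting expected P&L per trade to zero. A small sketch of that arithmetic using the realized averages quoted above; the function names are illustrative and not part of Hermes.

```python
def breakeven_win_rate(avg_win, avg_loss):
    """Win rate p at which expected P&L per trade is zero.

    p * avg_win - (1 - p) * avg_loss = 0  =>  p = avg_loss / (avg_win + avg_loss)
    This equals 1 / (1 + avg_win / avg_loss), the form used in the report.
    """
    return avg_loss / (avg_win + avg_loss)


def ev_per_trade(win_rate, avg_win, avg_loss):
    """Expected P&L per trade given a win rate and average win/loss sizes."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


AVG_WIN = 3.32    # dollars, from the report
AVG_LOSS = 6.46   # dollars (magnitude of the average loss), from the report
WIN_RATE = 0.543  # realized win rate, from the report

be = breakeven_win_rate(AVG_WIN, AVG_LOSS)
ev = ev_per_trade(WIN_RATE, AVG_WIN, AVG_LOSS)
print(f"break-even win rate: {be:.1%}")
print(f"EV per trade at the realized win rate: {ev:+.2f}")
```

At the realized 54.3% win rate this works out to roughly −$1.15 per trade, which is in line with the lifetime average implied by the era table (−$157.66 over 138 trades, about −$1.14).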
 
### What the model actually did wrong
 
The Gaussian model (whether fed by the NWS point forecast or the ensemble) was systematically **overconfident that the bracket would NOT hit**: it assigned 1-13% hit probability when the realized hit rate was ~45%. Pre-Apr-21 this was unclamped, so it would say "1% chance" and bet NO at a false 99% confidence, with catastrophic results. The Apr-21 MAE-σ floor crudely held the hit probability at a floor of ~10%, which capped the overconfidence and stopped the catastrophic losses. It did not create a winning strategy: the result is breakeven, not profit.
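This report does not spell out the model's formula, so the following is only a sketch of the mechanism under stated assumptions: the bracket hit probability is the Gaussian mass between the bracket bounds, and the "MAE-σ floor" is taken to mean that σ may not drop below the value implied by the historical forecast MAE (for a zero-mean Gaussian error, MAE = σ·√(2/π)). The function names and the toy inputs are illustrative, not Hermes code or real forecasts.

```python
import math


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma) distribution evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bracket_hit_prob(lo, hi, mu, sigma):
    """Probability that the observed temperature lands in [lo, hi)."""
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)


def floored_sigma(sigma, mae):
    """Assumed form of the floor: sigma never below the MAE-implied value."""
    return max(sigma, mae * math.sqrt(math.pi / 2.0))


# Toy inputs (assumptions): forecast 74°F, bracket 70-72°F,
# a tight 0.8°F model spread, and a 2.0°F historical forecast MAE.
lo, hi, mu = 70.0, 72.0, 74.0
raw_sigma, mae = 0.8, 2.0

p_raw = bracket_hit_prob(lo, hi, mu, raw_sigma)
p_floored = bracket_hit_prob(lo, hi, mu, floored_sigma(raw_sigma, mae))
print(f"unclamped hit probability: {p_raw:.4f}")
print(f"floored hit probability:   {p_floored:.4f}")
```

On these toy inputs the unclamped model puts the hit probability below 1%, while the floored σ lifts it into the mid-teens. That is the shape of the fix described above: the floor does not make the forecast better, it only stops the model from claiming near-certainty.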
 
---
 
## The logging gap (real bug, zero P&L impact)
 
`INSERT INTO trades` at `main.py:1585` lists its columns explicitly. Three columns that exist in the schema are **not in the INSERT** and are therefore never written:
 
- `raw_ensemble_probability` → always NULL
- `model_count` → always DEFAULT 1
- `models_used` → always '' / NULL
 
**Consequence:** anyone (human or AI) inspecting the `trades` table sees `model_count = 1` and `raw_ensemble_probability = NULL` on every row and concludes "the 31-member ensemble never ran / the pipeline is dead."
 
This is false. Live testing on 2026-05-17 confirmed `fetch_ensemble_forecast()` returns a healthy 31-member ensemble (validate=True, ~1.7°F spread). The ensemble works. The columns are simply never recorded.
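The mechanism is easy to reproduce in isolation: a column left out of an explicit INSERT column list silently takes its default. A minimal sketch, assuming a SQLite database (the report does not say which engine Hermes uses); only the three diagnostic column names come from this report, while the other columns, the ticker, the probabilities, and the member names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,             -- hypothetical column
        ensemble_probability REAL,        -- hypothetical column
        raw_ensemble_probability REAL,    -- named in the report
        model_count INTEGER DEFAULT 1,    -- named in the report
        models_used TEXT DEFAULT ''       -- named in the report
    )
    """
)

# Shape of the gap: the three diagnostic columns are not in the column list,
# so the row is stored with NULL / 1 / '' regardless of what actually ran.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability) VALUES (?, ?)",
    ("TOY-TICKER", 0.10),
)

# Shape of the fix: list the three columns and bind real values.
members = [f"member_{i:02d}" for i in range(31)]  # placeholder member names
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability, "
    "raw_ensemble_probability, model_count, models_used) "
    "VALUES (?, ?, ?, ?, ?)",
    ("TOY-TICKER", 0.10, 0.02, len(members), ",".join(members)),
)

rows = conn.execute(
    "SELECT raw_ensemble_probability, model_count, models_used "
    "FROM trades ORDER BY id"
).fetchall()
print(rows[0])      # row written without the diagnostic columns
print(rows[1][:2])  # row written with them
```

The first row is indistinguishable from "the ensemble never ran", which is exactly the reading that misled the earlier audits.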
 
**This single gap caused three misdiagnoses:**
- 2026-04-27 audit: blamed a dead `OPENMETEO_PROXY` node, "fixed" it (the fix addressed a non-problem; that memory entry is now flagged invalid).
- This session, pass 1: I repeated the same "ensemble pipeline died" error.
- This session, pass 2: I then over-corrected to "the MAE-σ floor is destroying the signal and causing the bleed" - also wrong (the era split disproves it).
 
The honest record of that oscillation is preserved in memory. The final era-split reconciliation is internally consistent and required no further reversal.
 
---
 
## Evidence trail (for review)
 
1. `auto_config.enabled = 0` confirmed; last real trade 2026-05-13 (pre-pause). No trades since the 2026-05-14 pause. No ongoing bleeding.
2. 138 settled trades, all single-degree temp brackets, ~100% NO side.
3. Backtest by edge band and the rejected empirical model: **disregard these** - computed before the era split was understood; superseded.
4. Live pipeline tests: ensemble OK (31 members), NWS OK, `OPENMETEO_PROXY = None` (the old trap is not active).
5. Era-sliced P&L + clamp regime + ground-truth hit rate (the era table and payout math above): mutually consistent, no contradictions.
 
Supporting files: `empirical_analysis.py`, `h_ds.psv`.
 
---
 
## Options (no action taken - your call)
 
**A. Do nothing / stay paused (lowest risk).**
Hermes is paused and not losing money. The strategy has no edge; "don't trade" is the correct play for a no-edge market. Cost: $0. Benefit: $0.
 
**B. Fix the logging gap only.**
Add the 3 missing columns to the INSERT so future audits aren't blind. ~10-line change, no behavioral effect, auto stays OFF. Recommended regardless of strategy decision - it stops the recurring misdiagnosis.
 
**C. Pivot to markets that actually have edge (real project).**
Abandon single-degree brackets. Target wider (≥3-5°F) brackets and above/below-threshold markets where |forecast − threshold| is large vs forecast error - the zones where NWS genuinely beats retail. **No historical data exists for these** (the bot only ever traded narrow brackets, and `market_history` is empty), so this *cannot be backtested* - it requires a shadow-mode data-collection window before any capital. Largest effort, only path with a plausible edge.
 
**D. Retire Hermes.**
If the appetite for a multi-week rebuild isn't there, the rational move for a no-edge bot is to stop. Funds stay safe.
 
**Do NOT:** re-enable auto on bracket markets, or remove the MAE-σ floor. The floor is helping; removing it reproduces the pre-Apr-21 catastrophic bleed.
 
---
 
## Recommendation
 
**B now (cheap, stops the misdiagnosis loop), then a deliberate choice between A/C/D - not under time pressure.** The one thing the data is unambiguous about: the current strategy (single-degree brackets) has no edge and should never be re-enabled as-is. Whether to invest in a pivot (C) or retire (D) is a question of how much you want to spend chasing a weather-trading edge that, per the quant playbook, exists only in market types Hermes has never actually traded.
