| 1 | # Hermes Investigation - 2026-05-17 |
| 2 | |
| 3 | **Context:** "Feels like the data is corrupted - worked for a bit then slowly bled me." Operator-initiated audit after a sustained drawdown. |
| 4 | **Status:** Investigation only. No code changed. Auto-trading remains OFF (paused since 2026-05-14). |
| 5 | **Confidence:** High. The final conclusion is consistent across four independent cuts of the data and required no further reversal. |
| 6 | |
| 7 | --- |
| 8 | |
| 9 | ## Bottom line up front |
| 10 | |
| 11 | 1. **Your instinct was right that something was off - but it is not ongoing corruption or a dead pipeline.** |
| 12 | 2. **The money was bled before 2026-04-21** by an overconfident probability model betting NO on coin-flip markets with bad payout math. |
| 13 | 3. **The 2026-04-21 "MAE-σ floor" change stopped the bleed.** Trades since then are roughly breakeven. The −$157.66 total P&L and 48% drawdown you see are *old damage still showing on the cumulative chart*, not fresh losses. |
| 14 | 4. **Single-degree (2°F) temperature brackets have no exploitable edge.** They hit ~45% of the time, so a NO bet wins only ~55% of the time, while the payout structure needs a ~66% win rate. No probability model can fix a market with no edge. |
| 15 | 5. **A real bug exists but it never cost money:** three diagnostic columns are never written to the database. That logging gap caused *three separate audits* (2026-04-27, and my own first two passes today) to misdiagnose the problem. It is worth fixing for observability only. |
| 16 | |
| 17 | --- |
| 18 | |
| 19 | ## The numbers that settle it |
| 20 | |
| 21 | ### Era split (the decisive cut) |
| 22 | |
| 23 | | Era | Trades | Avg ensemble prob | P&L | EV/trade | |
| 24 | |---|---|---|---|---| |
| 25 | | Pre-Apr-21 (unclamped Gaussian) | 97 | 0.01-0.14 | **−$160.72** | **−$1.66** | |
| 26 | | Post-Apr-21 (MAE-σ floor active) | 41 | ~0.10 | **+$3.06** | **+$0.07** | |
| 27 | |
| 28 | Essentially **100% of the lifetime loss happened before 2026-04-21.** After the MAE-σ floor was added, the bot stopped bleeding. |
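
The per-trade figures in the era table follow directly from the trade counts and P&L totals. A minimal arithmetic check, using only the numbers in the table (the variable names are mine, not from the Hermes codebase):

```python
# Era totals copied from the table above: (trade count, total P&L in USD).
eras = {
    "pre_apr_21": (97, -160.72),
    "post_apr_21": (41, 3.06),
}

# Expected value per trade in each era.
ev_per_trade = {name: pnl / n for name, (n, pnl) in eras.items()}

# Lifetime totals across both eras.
total_trades = sum(n for n, _ in eras.values())
total_pnl = sum(pnl for _, pnl in eras.values())

for name, ev in ev_per_trade.items():
    print(f"{name}: EV/trade = {ev:+.2f}")
print(f"lifetime: {total_trades} trades, P&L = {total_pnl:+.2f}")
```

The pre-Apr-21 loss alone is slightly larger than the lifetime loss, which is why the post-Apr-21 era nets out as marginally positive.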
| 29 | |
| 30 | ### Why the strategy can't win (the payout math) |
| 31 | |
| 32 | - Narrow-bracket actual hit rate: **62/138 = 44.9%** - these markets are near coin-flips. |
| 33 | - Realized reward:risk: avg win **$3.32**, avg loss **−$6.46** → ratio **0.51**. |
| 34 | - Break-even win rate at that ratio: **1 / (1 + 0.51) ≈ 66%**. |
| 35 | - Bot's actual win rate: **54.3%**. |
| 36 | |
| 37 | You cannot make money betting NO on ~coin-flip events when the payout structure demands a 66% win rate. This is a **market-selection problem, not a model problem.** Single-degree temperature brackets sit below NWS/ensemble forecast resolution and Kalshi prices them efficiently. |
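
The payout arithmetic above can be reproduced in a few lines. This is a sketch using only the averages quoted in this section; the function names are illustrative and do not come from `main.py` or `empirical_analysis.py`:

```python
def break_even_win_rate(avg_win: float, avg_loss: float) -> float:
    """Win rate p that solves p * avg_win - (1 - p) * avg_loss = 0.

    avg_loss is passed as a positive magnitude.
    Algebraically equal to 1 / (1 + avg_win / avg_loss).
    """
    return avg_loss / (avg_win + avg_loss)


def expected_value(win_rate: float, avg_win: float, avg_loss: float) -> float:
    """Expected P&L per trade at a given win rate."""
    return win_rate * avg_win - (1.0 - win_rate) * avg_loss


avg_win, avg_loss = 3.32, 6.46   # realized averages from this report
actual_win_rate = 0.543          # bot's realized win rate

reward_to_risk = avg_win / avg_loss
required = break_even_win_rate(avg_win, avg_loss)
ev = expected_value(actual_win_rate, avg_win, avg_loss)

print(f"reward:risk = {reward_to_risk:.2f}")
print(f"break-even win rate = {required:.1%}")
print(f"EV per trade at {actual_win_rate:.1%} = {ev:+.2f}")
```

At the realized win rate the expected value per trade is negative, and multiplying it by 138 trades lands close to the lifetime loss, so the averages and the P&L total agree with each other.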
| 38 | |
| 39 | ### What the model actually did wrong |
| 40 | |
| 41 | The Gaussian model (whether fed by the NWS point forecast or the ensemble) was systematically **overconfident that the bracket would NOT hit**: it assigned a 1-13% hit probability when the realized hit rate was ~45%. Before Apr-21 the output was unclamped, so the model could report a 1% chance of a hit and the bot would bet NO at a spurious 99% confidence, which produced the catastrophic losses. The Apr-21 MAE-σ floor crudely clamps the hit probability to a minimum of ~10%. That capped the overconfidence and stopped the large losses, but it did not create a winning strategy: the result is breakeven, not profit. |
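
To illustrate the mechanism, here is a minimal sketch of a Gaussian bracket probability with and without a floor on σ. It is not the Hermes implementation: the function names, the bracket bounds, the forecast, and the σ values are all hypothetical, and I am assuming the floor acts on σ rather than on the probability itself.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def bracket_hit_probability(lo, hi, mu, sigma, sigma_floor=None):
    """P(lo <= T < hi) for T ~ N(mu, sigma^2).

    If sigma_floor is given, sigma is raised to at least that value
    (assumed behaviour of an MAE-based floor, not confirmed from code).
    """
    if sigma_floor is not None:
        sigma = max(sigma, sigma_floor)
    return normal_cdf(hi, mu, sigma) - normal_cdf(lo, mu, sigma)


# Hypothetical case: 70-72°F bracket, forecast 74°F, tight model sigma.
p_raw = bracket_hit_probability(70.0, 72.0, mu=74.0, sigma=0.8)
p_floored = bracket_hit_probability(70.0, 72.0, mu=74.0, sigma=0.8,
                                    sigma_floor=2.5)

print(f"unclamped hit probability: {p_raw:.4f}")
print(f"with sigma floor:          {p_floored:.4f}")
```

With a tight σ the model calls the bracket a near-impossibility; widening σ to a value in line with historical forecast error pulls the hit probability back toward a more defensible range, which is the effect the floor had on bet sizing.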
| 42 | |
| 43 | --- |
| 44 | |
| 45 | ## The logging gap (real bug, zero P&L impact) |
| 46 | |
| 47 | `INSERT INTO trades` at `main.py:1585` lists its columns explicitly. Three columns that exist in the schema are **not in the INSERT** and are therefore never written: |
| 48 | |
| 49 | - `raw_ensemble_probability` → always NULL |
| 50 | - `model_count` → always DEFAULT 1 |
| 51 | - `models_used` → always '' / NULL |
| 52 | |
| 53 | **Consequence:** anyone (human or AI) inspecting the trade table sees `model_count = 1` and `raw_ensemble_probability = NULL` on every row and concludes "the 31-member ensemble never ran / the pipeline is dead." |
| 54 | |
| 55 | This is false. Live testing on 2026-05-17 confirmed `fetch_ensemble_forecast()` returns a healthy 31-member ensemble (validate=True, ~1.7°F spread). The ensemble works. The columns are simply never recorded. |
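
The failure mode is easy to reproduce in isolation. The sketch below uses an in-memory SQLite table with a deliberately simplified, hypothetical schema: only the three diagnostic column names come from this report, while `ticker`, `ensemble_probability`, and the sample values are placeholders rather than the real `trades` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE trades (
        id INTEGER PRIMARY KEY,
        ticker TEXT NOT NULL,              -- placeholder column
        ensemble_probability REAL,         -- placeholder column
        raw_ensemble_probability REAL,     -- diagnostic, from the report
        model_count INTEGER DEFAULT 1,     -- diagnostic, from the report
        models_used TEXT DEFAULT ''        -- diagnostic, from the report
    )
    """
)

# Current behaviour: the INSERT omits the diagnostic columns,
# so they silently fall back to NULL or their DEFAULT values.
conn.execute(
    "INSERT INTO trades (ticker, ensemble_probability) VALUES (?, ?)",
    ("DEMO-BRACKET", 0.10),
)

# Option B: name the diagnostic columns explicitly in the INSERT.
conn.execute(
    """
    INSERT INTO trades (ticker, ensemble_probability,
                        raw_ensemble_probability, model_count, models_used)
    VALUES (?, ?, ?, ?, ?)
    """,
    ("DEMO-BRACKET", 0.10, 0.03, 31, "placeholder-ensemble"),
)

rows = conn.execute(
    "SELECT raw_ensemble_probability, model_count, models_used "
    "FROM trades ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

The first row looks exactly like a trade made without the ensemble even though nothing in the pipeline failed, which is the pattern that misled the earlier audits.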
| 56 | |
| 57 | **This single gap caused three misdiagnoses:** |
| 58 | - 2026-04-27 audit: blamed a dead `OPENMETEO_PROXY` node, "fixed" it (the fix addressed a non-problem; that memory entry is now flagged invalid). |
| 59 | - This session, pass 1: I repeated the same "ensemble pipeline died" error. |
| 60 | - This session, pass 2: I then over-corrected to "the MAE-σ floor is destroying the signal and causing the bleed" - also wrong (the era split disproves it). |
| 61 | |
| 62 | The honest record of that oscillation is preserved in memory. The final era-split reconciliation is internally consistent and required no further reversal. |
| 63 | |
| 64 | --- |
| 65 | |
| 66 | ## Evidence trail (for review) |
| 67 | |
| 68 | 1. `auto_config.enabled = 0` confirmed; last real trade 2026-05-13 (pre-pause). No trades since the 2026-05-14 pause. No ongoing bleeding. |
| 69 | 2. 138 settled trades, all single-degree temp brackets, ~100% NO side. |
| 70 | 3. Backtest by edge band and the rejected empirical model: **disregard these** - computed before the era split was understood; superseded. |
| 71 | 4. Live pipeline tests: ensemble OK (31 members), NWS OK, `OPENMETEO_PROXY = None` (the old trap is not active). |
| 72 | 5. Era-sliced P&L + clamp regime + ground-truth hit rate (the tables above): mutually consistent, no contradictions. |
| 73 | |
| 74 | Supporting files: `empirical_analysis.py`, `h_ds.psv`. |
| 75 | |
| 76 | --- |
| 77 | |
| 78 | ## Options (no action taken - your call) |
| 79 | |
| 80 | **A. Do nothing / stay paused (lowest risk).** |
| 81 | Hermes is paused and not losing money. The strategy has no edge; "don't trade" is the correct play for a no-edge market. Cost: $0. Benefit: $0. |
| 82 | |
| 83 | **B. Fix the logging gap only.** |
| 84 | Add the 3 missing columns to the INSERT so future audits aren't blind. ~10-line change, no behavioral effect, auto stays OFF. Recommended regardless of strategy decision - it stops the recurring misdiagnosis. |
| 85 | |
| 86 | **C. Pivot to markets that actually have edge (real project).** |
| 87 | Abandon single-degree brackets. Target wider brackets (roughly 3-5°F or more) and above/below-threshold markets where |forecast − threshold| is large relative to forecast error; these are the zones where NWS genuinely beats retail. **No historical data exists for these markets** (the bot only ever traded narrow brackets, and `market_history` is empty), so this option *cannot be backtested*. It requires a shadow-mode data-collection window before any capital is committed. This is the largest effort and the only path with a plausible edge. |
| 88 | |
| 89 | **D. Retire Hermes.** |
| 90 | If the appetite for a multi-week rebuild isn't there, the rational move for a no-edge bot is to stop. Funds stay safe. |
| 91 | |
| 92 | **Do NOT:** re-enable auto on bracket markets, or remove the MAE-σ floor. The floor is helping; removing it reproduces the pre-Apr-21 catastrophic bleed. |
| 93 | |
| 94 | --- |
| 95 | |
| 96 | ## Recommendation |
| 97 | |
| 98 | **B now (cheap, stops the misdiagnosis loop), then a deliberate choice between A, C, and D, made without time pressure.** The one thing the data is unambiguous about: the current strategy (single-degree brackets) has no edge and should never be re-enabled as-is. Whether to invest in the pivot (C) or retire (D) depends on how much you want to spend chasing a weather-trading edge that, per the quant playbook, exists only in market types Hermes has never actually traded. |