# Hermes Pivot - Wider-Edge Market Selection (C1 spec)
 
**Date:** 2026-05-18
**Status:** Design spec for review. Implemented behind SHADOW MODE only (no capital). Auto remains OFF.
**Premise (from INVESTIGATION.md):** single-degree (~2°F) brackets have no edge - hit rate ~45%, reward:risk 0.51, break-even win rate ~66% (1 / (1 + 0.51)). The Gaussian model is *reliable* where the signal is strong; it only fails on sub-resolution brackets. The pivot trades **only where the model is actually trustworthy** and proves it on shadow data before risking money.
 
---
 
## Where edge actually exists (kalshi-quant playbook)
 
1. **Above/below-threshold markets with a large forecast-vs-threshold gap.** If NWS forecasts 88°F and the market is "85°+ ?", with day-1 MAE ~2.5°F that's a high-probability YES (roughly 88% if σ is taken equal to the MAE) - and retail frequently misprices the tails. This is the highest-confidence zone. Currently **disabled** (`ENABLE_THRESHOLD_MARKETS=False`) because the *broken pre-Apr-14 model* lost on them - not because the market type is bad.
2. **Wide brackets (≥ ~5°F span).** A 5°F bracket vs ~2.5-3°F forecast σ has real, well-estimated probability mass. The Gaussian works here; it only breaks on 1-2°F brackets below forecast resolution.
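The probabilities behind both cases can be reproduced with a few lines. This is a minimal sketch, assuming the forecast error is Gaussian, centered on the NWS forecast, with σ set equal to the day-1 MAE (2.5°F); the function names are illustrative and are not taken from the bot's code.

```python
import math


def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """CDF of a normal distribution, computed with the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def prob_at_or_above(threshold: float, forecast: float, sigma: float) -> float:
    """P(observed temp >= threshold) under the Gaussian assumption."""
    return 1.0 - normal_cdf(threshold, forecast, sigma)


def prob_in_bracket(low: float, high: float, forecast: float, sigma: float) -> float:
    """Probability mass the Gaussian puts between low and high."""
    return normal_cdf(high, forecast, sigma) - normal_cdf(low, forecast, sigma)


# Threshold example from the text: forecast 88°F, market "85°+", sigma 2.5°F.
p_yes = prob_at_or_above(85.0, 88.0, 2.5)

# Brackets centered on the forecast: 5°F wide vs 2°F wide.
p_wide = prob_in_bracket(85.5, 90.5, 88.0, 2.5)
p_narrow = prob_in_bracket(87.0, 89.0, 88.0, 2.5)

print(f"threshold YES: {p_yes:.3f}")    # about 0.885
print(f"5F bracket:    {p_wide:.3f}")   # about 0.683
print(f"2F bracket:    {p_narrow:.3f}") # about 0.311
```

Under these assumptions the 5°F bracket spans ±1σ and captures about 68% of the mass, while the 2°F bracket spans ±0.4σ and captures about 31%, so a 1°F forecast miss moves the narrow bracket's probability far more in relative terms.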
 
## Selection rules (the spec)
 
| Rule | Value | Rationale |
|---|---|---|
| Threshold markets | **ENABLE**, but trade only if `\|nws_forecast − threshold\| ≥ GAP_K × horizon_city_mae` | Only act when the forecast is confidently on one side (genuine edge). Inside that band = noise, skip. |
| `GAP_K` | **1.5** (tunable from shadow data) | A gap of 1.5 × MAE (~1.5σ when the horizon MAE is used as σ) puts the forecast clearly past the threshold. |
| Brackets | **width ≥ MIN_BRACKET_WIDTH** | Kill the no-edge narrow-bracket trap entirely. |
| `MIN_BRACKET_WIDTH` | **5.0°F** | Below this the Gaussian + market efficiency leave no edge (proven, 138-trade history). |
| MAE-σ floor | **KEEP (unchanged)** | It is what stopped the pre-Apr-21 bleed. Do NOT remove. |
| Cities | **all 5 enabled for shadow** | LA/Denver/Miami were disabled on broken-model data; re-collect cleanly. Re-judge per-city from shadow EV. |
| `low` markets | **enabled for shadow** | Same - disabled on broken data; re-evaluate. |
| MIN_EDGE pre-filter | **NOT applied in shadow** | Log every evaluated market with its computed edge so the optimal threshold is chosen from data, not guessed. |
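
The two gating rules in the table can be sketched as a single predicate. This is an illustration under assumptions, not the bot's implementation: the `Market` dataclass, its field names, and `passes_selection` are hypothetical, and bracket width is taken as `high - low`.

```python
from dataclasses import dataclass
from typing import Optional

GAP_K = 1.5              # from the table; tunable from shadow data
MIN_BRACKET_WIDTH = 5.0  # degrees F, from the table


@dataclass
class Market:
    """Hypothetical market record; field names are assumptions."""
    kind: str                         # "threshold" or "bracket"
    threshold: Optional[float] = None
    low: Optional[float] = None
    high: Optional[float] = None


def passes_selection(market: Market, nws_forecast: float, horizon_city_mae: float) -> bool:
    """Return True if the market clears the spec's selection rules."""
    if market.kind == "threshold":
        # Trade only when the forecast sits at least GAP_K MAEs from the threshold.
        gap = abs(nws_forecast - market.threshold)
        return gap >= GAP_K * horizon_city_mae
    if market.kind == "bracket":
        # Reject narrow brackets outright.
        return (market.high - market.low) >= MIN_BRACKET_WIDTH
    return False  # unknown market types are skipped


print(passes_selection(Market("threshold", threshold=85.0), 88.0, 2.5))  # False
print(passes_selection(Market("threshold", threshold=85.0), 89.0, 2.5))  # True
print(passes_selection(Market("bracket", low=85.0, high=90.0), 88.0, 2.5))  # True
```

Note that the 88°F forecast vs "85°+" example used earlier in this spec has a gap of 3°F, which is 1.2 × MAE at a 2.5°F MAE, so it would be skipped at `GAP_K = 1.5`; the forecast would need to be at least 88.75°F. Whether that is the intended strictness is one of the things the shadow data can inform.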
 
## Shadow mode (C2 - safety-critical)
 
- Scanner **runs** (scan → ensemble → probability → edge → would-be decision) even though `auto_config.enabled=0`.
- Every evaluated market is written to `predictions` + `market_history` (currently empty) with: ensemble prob, raw (pre-clamp) prob, model_count, NWS forecast, threshold/bracket, computed edge, would-be side, market price, timestamp. Resolution is backfilled by the existing settle loop.
- **Hard guard at the lowest level:** `kalshi_place_order()` becomes a structural no-op when `SHADOW_MODE` is set - it cannot reach the Kalshi order API regardless of any flag, loop, or config. Defense in depth, not a single boolean.
- No Discord "traded" announcements in shadow; a periodic "shadow scan logged N markets" summary instead.
 
## ⚠️ Liquidity caveat (observed 2026-05-18, first shadow scans)
 
Most markets the scanner logs are **px = $0.01** with the model claiming 0.18-0.64 "edge". These are deep-longshot / illiquid Kalshi markets - the optimism-tax trap (kalshi-quant: never buy <$0.10). The huge edges are almost certainly artifacts of model overconfidence on near-zero-price contracts, NOT real alpha. **C4 must hard-filter `px ≥ $0.10` and require non-trivial volume before computing any EV**, or it will "discover" a fake edge and repeat the whole bleed cycle. This is now the single biggest risk to the pivot evaluation.
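
A minimal sketch of that pre-EV filter follows. The `$0.10` price floor is from this spec; the volume floor of 100 contracts and the row field names (`px`, `volume`, `edge`) are placeholders, since the spec only says "non-trivial volume".

```python
MIN_PRICE = 0.10   # dollars, from the spec
MIN_VOLUME = 100   # contracts; hypothetical placeholder for "non-trivial volume"


def is_liquid(px: float, volume: int,
              min_price: float = MIN_PRICE, min_volume: int = MIN_VOLUME) -> bool:
    """True if a logged market is priced and traded enough to count toward EV."""
    return px >= min_price and volume >= min_volume


def filter_for_ev(rows: list) -> list:
    """Drop illiquid rows before any EV or Brier computation sees them."""
    return [r for r in rows if is_liquid(r["px"], r["volume"])]


rows = [
    {"ticker": "A", "px": 0.01, "volume": 900, "edge": 0.64},  # longshot: dropped
    {"ticker": "B", "px": 0.35, "volume": 5, "edge": 0.12},    # too thin: dropped
    {"ticker": "C", "px": 0.10, "volume": 500, "edge": 0.08},  # kept
]
print([r["ticker"] for r in filter_for_ev(rows)])  # ['C']
```

Applying the filter before the EV step, rather than after, keeps the large apparent edges on `$0.01` contracts from ever entering the averages.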
 
## Decision gate (C4 - before any capital)
 
After **≥30 resolved shadow predictions** on the new market set (post liquidity filter):
- Brier < 0.25 out-of-sample (real skill), AND
- Simulated EV/trade clearly positive after Kalshi fees at a defensible edge threshold, AND
- Holds within the highest-volume city/market-type subset (not one lucky cell).
 
If it passes → propose (to user, explicitly) a small-size live pilot. If it fails → iterate rules or retire. **Never auto-enable.**
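
The first two gate criteria can be checked mechanically. The sketch below is an assumption-laden illustration: the function names are hypothetical, the fee is passed in as a flat per-contract amount rather than modeled from Kalshi's fee schedule, and "clearly positive" is simplified to "greater than zero". The third criterion (holding within the highest-volume subset) is not covered here.

```python
def brier_score(probs: list, outcomes: list) -> float:
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def ev_per_trade(trades: list, fee_per_contract: float) -> float:
    """Average P&L per one-contract trade.

    trades: list of (price_paid, won) for the would-be side, prices in dollars.
    A win pays $1.00; the fee is charged whether the trade wins or loses.
    """
    pnl = [(1.0 - px if won else -px) - fee_per_contract for px, won in trades]
    return sum(pnl) / len(pnl)


def passes_gate(probs: list, outcomes: list, trades: list, fee_per_contract: float,
                min_resolved: int = 30, max_brier: float = 0.25) -> bool:
    """Check sample size, Brier score, and simulated EV, in that order."""
    if len(probs) < min_resolved:
        return False  # not enough resolved shadow predictions yet
    if brier_score(probs, outcomes) >= max_brier:
        return False
    return ev_per_trade(trades, fee_per_contract) > 0.0


print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```

A constant forecast of 0.5 scores exactly 0.25 regardless of outcomes, which is why 0.25 works as the "no skill" line in the Brier criterion.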
 
## Why this is the only honest path
 
There is **zero historical data** on threshold/wide markets (the bot only ever traded narrow brackets; `market_history` is empty). The pivot therefore cannot be backtested - it must be forward-validated in shadow. This spec makes that collection safe and the success criteria explicit *before* collection starts, so the evaluation can't be rationalized after the fact.
