093cdd6 · Prediction Market Bot Postmortem

add PIVOT SPEC (+1 more)

093cdd6 Zion Boggan committed on May 18, 2026 (1 month ago)

docs/PIVOT_SPEC.md +49 -0

		@@ -0,0 +1,49 @@
	+	# Hermes Pivot - Wider-Edge Market Selection (C1 spec)
	+
	+	Date: 2026-05-18
	+	Status: Design spec for review. Implemented behind SHADOW MODE only (no capital). Auto remains OFF.
	+	Premise (from INVESTIGATION.md): single-degree (~2°F) brackets have no edge - hit ~45%, reward:risk 0.51, break-even WR ~66%. The Gaussian model is reliable where the signal is strong; it only fails on sub-resolution brackets. The pivot trades only where the model is actually trustworthy and proves it on shadow data before risking money.
	+
	+	---
	+
	+	## Where edge actually exists (kalshi-quant playbook)
	+
	+	1. Above/below-threshold markets with a large forecast-vs-threshold gap. If NWS forecasts 88°F and the market is "85°+ ?", with day-1 MAE ~2.5°F that's a ~Hermes% YES - and retail frequently misprices the tails. This is the highest-confidence zone. Currently disabled (`ENABLE_THRESHOLD_MARKETS=False`) because the broken pre-Apr-14 model lost on them - not because the market type is bad.
	+	2. Wide brackets (≥ ~5°F span). A 5°F bracket vs ~2.5-3°F forecast σ has real, well-estimated probability mass. The Gaussian works here; it only breaks on 1-2°F brackets below forecast resolution.
	+
	+	## Selection rules (the spec)
	+
	+	\| Rule \| Value \| Rationale \|
	+	\|---\|---\|---\|
	+	\| Threshold markets \| ENABLE, but trade only if `\\|nws_forecast − threshold\\| ≥ GAP_K × horizon_city_mae` \| Only act when the forecast is confidently on one side (genuine edge). Inside that band = noise, skip. \|
	+	\| `GAP_K` \| 1.5 (tunable from shadow data) \| ~1.5σ ≈ forecast clearly past the threshold. \|
	+	\| Brackets \| width ≥ MIN_BRACKET_WIDTH \| Kill the no-edge narrow-bracket trap entirely. \|
	+	\| `MIN_BRACKET_WIDTH` \| 5.0°F \| Below this the Gaussian + market efficiency leave no edge (proven, 138-trade history). \|
	+	\| MAE-σ floor \| KEEP (unchanged) \| It is what stopped the pre-Apr-21 bleed. Do NOT remove. \|
	+	\| Cities \| all 5 enabled for shadow \| LA/Denver/Miami were disabled on broken-model data; re-collect cleanly. Re-judge per-city from shadow EV. \|
	+	\| `low` markets \| enabled for shadow \| Same - disabled on broken data; re-evaluate. \|
	+	\| MIN_EDGE pre-filter \| NOT applied in shadow \| Log every evaluated market with its computed edge so the optimal threshold is chosen from data, not guessed. \|
	+
	+	## Shadow mode (C2 - safety-critical)
	+
	+	- Scanner runs (scan → ensemble → probability → edge → would-be decision) even though `auto_config.enabled=0`.
	+	- Every evaluated market is written to `predictions` + `market_history` (currently empty) with: ensemble prob, raw (pre-clamp) prob, model_count, NWS forecast, threshold/bracket, computed edge, would-be side, market price, timestamp. Resolution backfilled by the existing settle loop.
	+	- Hard guard at the lowest level: `kalshi_place_order()` becomes a structural no-op when `SHADOW_MODE` is set - it cannot reach the Kalshi order API regardless of any flag, loop, or config. Defense in depth, not a single boolean.
	+	- No Discord "traded" announcements in shadow; a periodic "shadow scan logged N markets" summary instead.
	+
	+	## ⚠️ Liquidity caveat (observed 2026-05-18, first shadow scans)
	+
	+	Most markets the scanner logs are px = $0.01 with the model claiming 0.18-0.64 "edge". These are deep-longshot / illiquid Kalshi markets - the optimism-tax trap (kalshi-quant: never buy <$0.10). The huge edges are almost certainly artifacts of model overconfidence on near-zero-price contracts, NOT real alpha. C4 must hard-filter `px ≥ $0.10` and require non-trivial volume before computing any EV, or it will "discover" a fake edge and repeat the whole bleed cycle. This is now the single biggest risk to the pivot evaluation.
	+
	+	## Decision gate (C4 - before any capital)
	+
	+	After ≥30 resolved shadow predictions on the new market set (post liquidity filter):
	+	- Brier < 0.25 out-of-sample (real skill), AND
	+	- Simulated EV/trade clearly positive after Kalshi fees at a defensible edge threshold, AND
	+	- Holds within the highest-volume city/market-type subset (not one lucky cell).
	+
	+	If it passes → propose (to user, explicitly) a small-size live pilot. If it fails → iterate rules or retire. Never auto-enable.
	+
	+	## Why this is the only honest path
	+
	+	There is zero historical data on threshold/wide markets (the bot only ever traded narrow brackets; `market_history` is empty). The pivot therefore cannot be backtested - it must be forward-validated in shadow. This spec makes that collection safe and the success criteria explicit before collection starts, so the evaluation can't be rationalized after the fact.

eval/c4_eval.py +190 -0

		@@ -0,0 +1,190 @@
	+	#!/usr/bin/env python3
	+	"""Hermes C4 - shadow-pivot evaluation (unattended).
	+
	+	Runs on CT-REDACTED via system cron (no Claude session needed). Pulls SHADOW
	+	predictions from hermes.db, backfills outcomes from the Kalshi API, applies a
	+	hard liquidity filter, and scores the pivot against the pre-committed decision
	+	gate. Writes a report + Discord ping. Idempotent; safe to re-run.
	+
	+	Decision gate (from PIVOT_SPEC.md): propose a small live pilot ONLY if, on the
	+	liquid subset with >=30 resolved:
	+	EV/trade after fees > 0 AND Brier < 0.25 AND it holds in the
	+	highest-volume city/market-type cell (not one lucky cluster).
	+	Otherwise: iterate SHADOW_GAP_K / SHADOW_MIN_BRACKET_WIDTH, or retire.
	+	NEVER enables live trading. Never writes auto_config.
	+	"""
	+	import sys, os, json, sqlite3, statistics, datetime, traceback
	+
	+	DB = "$HERMES_HOME/hermes.db"
	+	REPORT_DIR = "$HERMES_HOME"
	+	MIN_PRICE = 0.10
	+	MIN_VOLUME = 20
	+	MIN_RESOLVED = 30
	+	BRIER_GATE = 0.25
	+	sys.path.insert(0, "$HERMES_HOME")
	+
	+	def log(*a):
	+	print(*a, flush=True)
	+
	+	def backfill_outcome(main, ticker):
	+	"""Return 1 if YES settled, 0 if NO, None if still open/unknown."""
	+	try:
	+	m = main.kalshi_get(f"/markets/{ticker}")
	+	mk = (m or {}).get("market") or {}
	+	res = (mk.get("result") or "").lower()
	+	if res == "yes":
	+	return 1
	+	if res == "no":
	+	return 0
	+	return None
	+	except Exception:
	+	return None
	+
	+	def main_eval():
	+	ts = datetime.datetime.now().strftime("%Y%m%d-%H%M")
	+	report_path = os.path.join(REPORT_DIR, f"c4_report_{ts}.txt")
	+	out = []
	+
	+	def emit(s=""):
	+	out.append(s); log(s)
	+
	+	emit(f"=== Hermes C4 evaluation - {datetime.datetime.now().isoformat(timespec='seconds')} ===")
	+
	+	try:
	+	import main
	+	except Exception as e:
	+	emit(f"FATAL: cannot import main.py: {e}")
	+	_write_and_notify(report_path, out, None)
	+	return
	+
	+	con = sqlite3.connect(DB)
	+	con.row_factory = sqlite3.Row
	+	rows = con.execute(
	+	"""SELECT p.id, p.ticker, p.market_title, p.ensemble_probability ep,
	+	p.market_price px, p.edge, p.recommendation rec, p.market_type,
	+	p.actual_outcome, p.predicted_at,
	+	COALESCE(mh.volume_fp, 0) vol
	+	FROM predictions p
	+	LEFT JOIN market_history mh ON mh.ticker = p.ticker
	+	WHERE p.recommendation LIKE 'SHADOW%'
	+	ORDER BY p.id"""
	+	).fetchall()
	+	emit(f"shadow predictions on record: {len(rows)}")
	+	if not rows:
	+	emit("No shadow predictions yet - collection has not produced data. "
	+	"Recommend: verify SHADOW_MODE scanner is running; re-check in ~1 week.")
	+	_write_and_notify(report_path, out, "INSUFFICIENT")
	+	con.close()
	+	return
	+
	+	liquid = [r for r in rows if (r["px"] or 0) >= MIN_PRICE and (r["vol"] or 0) >= MIN_VOLUME]
	+	emit(f"after liquidity filter (px>=${MIN_PRICE:.2f} & vol>={MIN_VOLUME}): {len(liquid)}")
	+
	+	resolved = []
	+	for r in liquid:
	+	ao = r["actual_outcome"]
	+	outcome = None
	+	if ao in (0, 1):
	+	outcome = int(ao)
	+	elif isinstance(ao, str) and ao.strip().lower() in ("yes", "no"):
	+	outcome = 1 if ao.strip().lower() == "yes" else 0
	+	else:
	+	outcome = backfill_outcome(main, r["ticker"])
	+	if outcome is not None:
	+	resolved.append((r, outcome))
	+	emit(f"resolved (liquid): {len(resolved)} / need {MIN_RESOLVED}")
	+
	+	if len(resolved) < MIN_RESOLVED:
	+	emit("")
	+	emit(f"VERDICT: INSUFFICIENT DATA - {len(resolved)} resolved liquid predictions "
	+	f"(< {MIN_RESOLVED}). Do NOT evaluate edge yet. Keep SHADOW_MODE running; "
	+	f"re-run this eval in ~7 days. (Total shadow rows {len(rows)}, "
	+	f"liquid {len(liquid)} - if liquid stays tiny, the pivot's market set "
	+	f"may be structurally illiquid → that itself is a finding.)")
	+	_write_and_notify(report_path, out, "INSUFFICIENT")
	+	con.close()
	+	return
	+
	+	briers, wins, evs = [], 0, []
	+	cell = {}
	+	for r, outcome in resolved:
	+	ep = r["ep"] if r["ep"] is not None else 0.5
	+	briers.append((ep - outcome) ** 2)
	+	side_yes = "YES" in (r["rec"] or "").upper()
	+	px = r["px"] or 0.0
	+	try:
	+	fee = main.kalshi_taker_fee(px)
	+	except Exception:
	+	fee = 0.0
	+	won = (outcome == 1) if side_yes else (outcome == 0)
	+	ev = ((1.0 - px) - fee) if won else (-(px + fee))
	+	evs.append(ev)
	+	wins += 1 if won else 0
	+	city = r["ticker"].split("-")[0]
	+	key = (city, r["market_type"] or "?")
	+	c = cell.setdefault(key, [0, 0, 0.0])
	+	c[0] += 1; c[1] += 1 if won else 0; c[2] += ev
	+
	+	n = len(resolved)
	+	brier = statistics.mean(briers)
	+	wr = wins / n
	+	ev_mean = statistics.mean(evs)
	+	emit("")
	+	emit(f"n={n} WR={wr*100:.1f}% Brier={brier:.4f} EV/trade=${ev_mean:+.3f} (after fees)")
	+	emit("by cell (city, market_type):")
	+	best_cell = None
	+	for key, (cn, cw, cev) in sorted(cell.items(), key=lambda kv: -kv[1][0]):
	+	cev_avg = cev / cn if cn else 0
	+	emit(f" {key[0]:14s} {key[1]:6s} n={cn:3d} WR={cw/cn*100:4.0f}% EV=${cev_avg:+.3f}")
	+	if best_cell is None:
	+	best_cell = (key, cn, cw / cn, cev_avg)
	+
	+	cell_ok = bool(best_cell and best_cell[1] >= 10 and best_cell[3] > 0)
	+	passed = (ev_mean > 0) and (brier < BRIER_GATE) and cell_ok
	+	emit("")
	+	if passed:
	+	emit("VERDICT: GATE PASSED - pivot shows positive EV after fees, Brier "
	+	f"< {BRIER_GATE}, and holds in the highest-volume cell "
	+	f"{best_cell[0]} (n={best_cell[1]}, EV=${best_cell[3]:+.3f}). "
	+	"RECOMMEND: propose a SMALL live pilot to the user. Do NOT auto-enable.")
	+	else:
	+	why = []
	+	if ev_mean <= 0: why.append(f"EV/trade ${ev_mean:+.3f} not > 0")
	+	if brier >= BRIER_GATE: why.append(f"Brier {brier:.3f} not < {BRIER_GATE}")
	+	if not cell_ok: why.append("does not hold in the highest-volume cell")
	+	emit("VERDICT: GATE FAILED - " + "; ".join(why) + ". "
	+	"RECOMMEND: iterate SHADOW_GAP_K / SHADOW_MIN_BRACKET_WIDTH, or retire. "
	+	"Do NOT enable live trading.")
	+	emit("")
	+	emit("(Auto-trading untouched. This script never writes auto_config.)")
	+	con.close()
	+	_write_and_notify(report_path, out, "PASS" if passed else "FAIL")
	+
	+	def _write_and_notify(path, lines, status):
	+	body = "\n".join(lines)
	+	try:
	+	with open(path, "w") as f:
	+	f.write(body + "\n")
	+	except Exception:
	+	pass
	+
	+	try:
	+	import main
	+	hook = getattr(main, "DISCORD_WEBHOOK", None) or os.getenv("DISCORD_WEBHOOK")
	+	if hook:
	+	import urllib.request
	+	tag = {"PASS": "✅", "FAIL": "❌", "INSUFFICIENT": "⏳"}.get(status, "ℹ️")
	+	msg = f"{tag} Hermes C4 eval ({status})\n```\n{body[-1500:]}\n```"
	+	req = urllib.request.Request(
	+	hook, data=json.dumps({"content": msg}).encode(),
	+	headers={"Content-Type": "application/json"})
	+	urllib.request.urlopen(req, timeout=10)
	+	except Exception:
	+	pass
	+
	+	if __name__ == "__main__":
	+	try:
	+	main_eval()
	+	except Exception:
	+	traceback.print_exc()
	+	sys.exit(1)

		@@ -0,0 +1,49 @@
	+	# Hermes Pivot - Wider-Edge Market Selection (C1 spec)
	+
	+	Date: 2026-05-18
	+	Status: Design spec for review. Implemented behind SHADOW MODE only (no capital). Auto remains OFF.
	+	Premise (from INVESTIGATION.md): single-degree (~2°F) brackets have no edge - hit ~45%, reward:risk 0.51, break-even WR ~66%. The Gaussian model is reliable where the signal is strong; it only fails on sub-resolution brackets. The pivot trades only where the model is actually trustworthy and proves it on shadow data before risking money.
	+
	+	---
	+
	+	## Where edge actually exists (kalshi-quant playbook)
	+
	+	1. Above/below-threshold markets with a large forecast-vs-threshold gap. If NWS forecasts 88°F and the market is "85°+ ?", with day-1 MAE ~2.5°F that's a ~Hermes% YES - and retail frequently misprices the tails. This is the highest-confidence zone. Currently disabled (`ENABLE_THRESHOLD_MARKETS=False`) because the broken pre-Apr-14 model lost on them - not because the market type is bad.
	+	2. Wide brackets (≥ ~5°F span). A 5°F bracket vs ~2.5-3°F forecast σ has real, well-estimated probability mass. The Gaussian works here; it only breaks on 1-2°F brackets below forecast resolution.
	+
	+	## Selection rules (the spec)
	+
	+	\| Rule \| Value \| Rationale \|
	+	\|---\|---\|---\|
	+	\| Threshold markets \| ENABLE, but trade only if `\\|nws_forecast − threshold\\| ≥ GAP_K × horizon_city_mae` \| Only act when the forecast is confidently on one side (genuine edge). Inside that band = noise, skip. \|
	+	\| `GAP_K` \| 1.5 (tunable from shadow data) \| ~1.5σ ≈ forecast clearly past the threshold. \|
	+	\| Brackets \| width ≥ MIN_BRACKET_WIDTH \| Kill the no-edge narrow-bracket trap entirely. \|
	+	\| `MIN_BRACKET_WIDTH` \| 5.0°F \| Below this the Gaussian + market efficiency leave no edge (proven, 138-trade history). \|
	+	\| MAE-σ floor \| KEEP (unchanged) \| It is what stopped the pre-Apr-21 bleed. Do NOT remove. \|
	+	\| Cities \| all 5 enabled for shadow \| LA/Denver/Miami were disabled on broken-model data; re-collect cleanly. Re-judge per-city from shadow EV. \|
	+	\| `low` markets \| enabled for shadow \| Same - disabled on broken data; re-evaluate. \|
	+	\| MIN_EDGE pre-filter \| NOT applied in shadow \| Log every evaluated market with its computed edge so the optimal threshold is chosen from data, not guessed. \|
	+
	+	## Shadow mode (C2 - safety-critical)
	+
	+	- Scanner runs (scan → ensemble → probability → edge → would-be decision) even though `auto_config.enabled=0`.
	+	- Every evaluated market is written to `predictions` + `market_history` (currently empty) with: ensemble prob, raw (pre-clamp) prob, model_count, NWS forecast, threshold/bracket, computed edge, would-be side, market price, timestamp. Resolution backfilled by the existing settle loop.
	+	- Hard guard at the lowest level: `kalshi_place_order()` becomes a structural no-op when `SHADOW_MODE` is set - it cannot reach the Kalshi order API regardless of any flag, loop, or config. Defense in depth, not a single boolean.
	+	- No Discord "traded" announcements in shadow; a periodic "shadow scan logged N markets" summary instead.
	+
	+	## ⚠️ Liquidity caveat (observed 2026-05-18, first shadow scans)
	+
	+	Most markets the scanner logs are px = $0.01 with the model claiming 0.18-0.64 "edge". These are deep-longshot / illiquid Kalshi markets - the optimism-tax trap (kalshi-quant: never buy <$0.10). The huge edges are almost certainly artifacts of model overconfidence on near-zero-price contracts, NOT real alpha. C4 must hard-filter `px ≥ $0.10` and require non-trivial volume before computing any EV, or it will "discover" a fake edge and repeat the whole bleed cycle. This is now the single biggest risk to the pivot evaluation.
	+
	+	## Decision gate (C4 - before any capital)
	+
	+	After ≥30 resolved shadow predictions on the new market set (post liquidity filter):
	+	- Brier < 0.25 out-of-sample (real skill), AND
	+	- Simulated EV/trade clearly positive after Kalshi fees at a defensible edge threshold, AND
	+	- Holds within the highest-volume city/market-type subset (not one lucky cell).
	+
	+	If it passes → propose (to user, explicitly) a small-size live pilot. If it fails → iterate rules or retire. Never auto-enable.
	+
	+	## Why this is the only honest path
	+
	+	There is zero historical data on threshold/wide markets (the bot only ever traded narrow brackets; `market_history` is empty). The pivot therefore cannot be backtested - it must be forward-validated in shadow. This spec makes that collection safe and the success criteria explicit before collection starts, so the evaluation can't be rationalized after the fact.