# Hermes v4.0 - Research Findings & Pending Fixes

**Source:** `kalshi-weather-research.pdf` (compiled March 25, 2026)
**Date:** 2026-03-26 (updated 2026-04-03)
**Status:** Fixes 1, 2, and 7 implemented. Bracket fix patch deployed 2026-04-01. Net EV filter deployed 2026-04-03.

---

## 2026-04-03 - Net EV Dust Trade Filter (Deployed)

**Root cause found:** Kelly sizing with dampeners can size a bet down to a single contract
on high-priced markets (e.g. KXHIGHLAX-26APR03-T74: 1 contract @ $0.64, about $0.31 net EV).
Each of these dust trades uses up one of the 8 daily trade slots for pennies of expected value.

**Fix deployed (CT-REDACTED + CT-REDACTED):**

| Fix | What | Why |
|-----|------|-----|
| `MIN_TRADE_EV_PCT = 0.0015` | New guardrail constant (0.15% of bankroll) | Scales threshold with account size |
| Net EV gate | `net_ev = contracts × (bet_prob - ask_price) - fees` | Filters on expected dollars, not arbitrary contract counts |
| Skip + log | Skipped trades logged with EV, threshold, contract count | Full audit trail in scan actions / Discord |

**Why net EV over simpler alternatives:**
- Min contracts floor (e.g. 3): crude, doesn't account for price differences
- Raised min bet ($2): ignores whether the trade actually generates value
- Net EV: directly measures expected dollars per trade slot, captures fee drag, scales with bankroll

**Thresholds at current balances:**
- Hermes ($348): min EV = $0.52/trade
- Hermes2 ($49): min EV = $0.07/trade

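**Sketch (illustrative):** The gate in the table can be expressed in a few lines. This is a minimal sketch rather than the deployed code: the function names are hypothetical, the fee uses the variable taker formula from Fix 1, and `bet_prob=0.97` is an assumed value chosen only because it reproduces the roughly $0.31 net EV of the LAX example.

```python
import math

MIN_TRADE_EV_PCT = 0.0015  # 0.15% of bankroll


def taker_fee(price, contracts):
    """Variable taker fee (Fix 1 formula), rounded up to the next cent."""
    raw = 0.07 * contracts * price * (1 - price)
    # Round before ceil so float noise cannot add a spurious cent.
    return math.ceil(round(raw * 100, 6)) / 100


def passes_net_ev_gate(contracts, bet_prob, ask_price, bankroll):
    """Return (passes, net_ev, threshold) for a candidate trade (hypothetical helper)."""
    fees = taker_fee(ask_price, contracts)
    net_ev = contracts * (bet_prob - ask_price) - fees
    threshold = MIN_TRADE_EV_PCT * bankroll
    return net_ev >= threshold, net_ev, threshold


# bet_prob=0.97 is an assumption, not a logged value.
ok, net_ev, threshold = passes_net_ev_gate(1, 0.97, 0.64, bankroll=348)
print(f"net_ev=${net_ev:.2f} threshold=${threshold:.2f} -> {'trade' if ok else 'skip'}")
```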
---

## 2026-04-01 - Bracket Fix Patch (Deployed)

**Root cause found:** `_ensemble_gaussian_bracket()` systematically underestimates bracket
probability when the ensemble has converged (sigma < 2°F). Outliers inflate the Gaussian sigma,
spreading probability outside the bracket. Example: the Chicago high ensemble converged to the
40-41°F range; the Gaussian gave 31% bracket probability, while the raw member count showed 74%.
This inflated the apparent NO edge above the 0.20 veto ceiling, causing risky bracket trades to bypass Sonnet review.

**4 fixes deployed (CT-REDACTED + CT-REDACTED):**

| Fix | What | Why |
|-----|------|-----|
| Hybrid bracket prob | `max(Gaussian, raw_count±0.5°F)` | Gaussian helps far-out; raw catches converged |
| Bracket veto trigger | Ensemble mean inside bracket → Sonnet review | Safety net for riskiest bracket scenario |
| Bet sizing hard cap | `min(bet, 8% × bankroll)` after Kelly | Prevents rounding overshoot |
| Raw prob column | `raw_ensemble_probability` in trades table | Audit: separate raw vs calibrated |

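**Sketch (illustrative):** The hybrid bracket probability takes the larger of the Gaussian estimate and the raw member count. The code below is a sketch under assumptions, not the body of `_ensemble_gaussian_bracket()`: the helper names are hypothetical, `±0.5°F` is read as widening the bracket by half a degree on each side, and the member list is made-up data with eight converged members and two outliers.

```python
from statistics import NormalDist, mean, stdev


def gaussian_bracket_prob(members, low, high, pad=0.5):
    """Bracket probability under a normal fit to the ensemble members."""
    mu, sigma = mean(members), stdev(members)
    if sigma == 0:  # NormalDist.cdf is undefined for sigma == 0
        return 1.0 if low - pad <= mu <= high + pad else 0.0
    dist = NormalDist(mu, sigma)
    return dist.cdf(high + pad) - dist.cdf(low - pad)


def raw_bracket_prob(members, low, high, pad=0.5):
    """Fraction of members that land inside the padded bracket."""
    hits = sum(1 for t in members if low - pad <= t <= high + pad)
    return hits / len(members)


def hybrid_bracket_prob(members, low, high):
    """max(Gaussian, raw count): the raw count wins when outliers inflate sigma."""
    return max(gaussian_bracket_prob(members, low, high),
               raw_bracket_prob(members, low, high))


# Made-up ensemble: converged near 40-41°F plus two outliers.
members = [40.2, 40.5, 40.8, 41.0, 40.6, 40.4, 40.9, 40.7, 35.0, 46.0]
print(round(gaussian_bracket_prob(members, 40, 41), 2),
      raw_bracket_prob(members, 40, 41),
      hybrid_bracket_prob(members, 40, 41))
```

In this made-up case the two outliers widen the fitted sigma enough that the Gaussian estimate falls well below the raw count, which is the failure mode described above.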
**Rejected after backtesting:**
- ±2°F NWS guard: blocked 7 winners and 0 losers, for -$16.19 net. NWS distance is NOT a predictor of bracket failure.
- METAR entry filter: dead code. Trades are placed 12-30h before observations become informative.

**Planned (weekend):** Bracket exit monitor - sells positions when 6 gates confirm an edge flip.

**Key data points:**
- Historical bracket NO: 15W/6L, +$5.56 net, 71% win rate
- Losses cluster at NWS 3-10°F from bracket (big forecast busts), NOT near-bracket trades
- The live code previously used a 50/50 blend + 5% floor approach; it was replaced with `max()`, which is more accurate

---

## Fixes To Implement (Priority Order)

### FIX 1: Variable Fee Formula (CRITICAL - blocking profitable trades now)
**Current:** Flat `TAKER_FEE_PER_CONTRACT = 0.05` applied to all trades.
**Correct:** `fee = ceil(0.07 * contracts * price * (1 - price))`, rounded up to the next cent.

| Contract Price | Our Fee (flat) | Actual Fee (1 contract) | We're Wrong By |
|---------------|---------------|------------|----------------|
| $0.85 | $0.05 | $0.01 | 5x too high |
| $0.75 | $0.05 | $0.02 | 2.5x too high |
| $0.60 | $0.05 | $0.02 | 2.5x too high |
| $0.50 | $0.05 | $0.02 | 2.5x too high |
| $0.40 | $0.05 | $0.02 | 2.5x too high |

**Impact:** We're rejecting trades with 7-9% true edge because our inflated fee estimate makes them look below the 10% threshold. This is the single biggest leak in the system right now.

**Implementation:**
```python
import math

def kalshi_taker_fee(price, contracts=1):
    """Taker fee in dollars, rounded up to the next cent."""
    raw_fee = 0.07 * contracts * price * (1 - price)
    # Round before ceil so float noise (e.g. 175.00000000000003) does not add a cent.
    return math.ceil(round(raw_fee * 100, 6)) / 100
```
Replace the flat constant everywhere edge is calculated.

---

### FIX 2: Reduce Scan Interval from 30 Minutes to 5 Minutes
**Current:** Scanner runs every 30 minutes.
**Finding:** The competing bot (suislanchez, $1,325+ profit) scans every 5 minutes.
**Cost:** Zero. 1,440 API calls/day vs 10,000 limit.
**Benefit:** Catches mispricing faster, especially after GFS ensemble releases (data available ~3.5h after initialization at 00Z/06Z/12Z/18Z).
**Risk:** More Sonnet veto calls on Max plan. Mitigated by the filter pipeline - most markets get rejected before reaching Sonnet.

---

### FIX 3: Add Maker Orders for Better Fees
**Current:** All orders are taker (market) orders.
**Finding:** Maker fee is 25% of taker fee: `ceil(0.0175 * C * P * (1-P))`, where `C` is the contract count and `P` the price, rounded up to the next cent. At a $0.50 contract: taker fee = $0.02, maker fee = $0.01.
**Implementation:** For trades where the market is not about to close (>6h to settlement), place limit orders slightly inside the spread instead of taking the ask. Fall back to taker if not filled within 10 minutes.
**Complexity:** Medium - requires order monitoring and cancellation logic.
**Priority:** After Fixes 1 and 2 are validated.

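**Sketch (illustrative):** To get a feel for the fee difference, the two formulas can be compared side by side. This covers only the fee arithmetic, not order placement or the 10-minute fallback; `kalshi_fee` and the rate constants are hypothetical names.

```python
import math

TAKER_RATE = 0.07
MAKER_RATE = 0.0175  # 25% of the taker rate


def kalshi_fee(price, contracts, rate):
    """Fee in dollars for the given rate, rounded up to the next cent."""
    raw = rate * contracts * price * (1 - price)
    # Round before ceil so float noise cannot add a spurious cent.
    return math.ceil(round(raw * 100, 6)) / 100


for contracts in (1, 100):
    taker = kalshi_fee(0.50, contracts, TAKER_RATE)
    maker = kalshi_fee(0.50, contracts, MAKER_RATE)
    print(f"{contracts:>3} contracts @ $0.50: taker ${taker:.2f}, maker ${maker:.2f}")
```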
---

### FIX 4: Extremized Aggregation (Replace Simple Calibration Multiply)
**Current:** `adj_prob = ens_prob * calibration_multiplier`
**Better:** Combine ensemble + NWS + base rates via log-odds with extremizing factor.
**Research:** Satopaa et al. (2014), Neyman & Roughgarden (2021) - optimal factor ~1.73 for robust aggregation.
**Implementation:**
```python
import math

def extremize_aggregate(probabilities, weights=None, factor=1.5):
    """Weighted log-odds average of the inputs, pushed away from 0.5 by `factor`."""
    if weights is None:
        weights = [1.0 / len(probabilities)] * len(probabilities)
    # Clamp so the log-odds stay finite for inputs of 0 or 1.
    clamped = [max(0.001, min(0.999, p)) for p in probabilities]
    log_odds = [math.log(p / (1 - p)) for p in clamped]
    # Weights are expected to sum to 1.
    avg_lo = sum(w * lo for w, lo in zip(weights, log_odds))
    ext_lo = avg_lo * factor
    return 1 / (1 + math.exp(-ext_lo))
```
**Notes:**
- Start with factor 1.5 (conservative - ensemble members share model physics, high info overlap)
- Weights: 0.5 ensemble, 0.3 NWS, 0.2 historical base rate
- Factor for weather should be lower than geopolitical (1.73) because ensemble members aren't independent

**Priority:** After 30 trades validate the current system works.

---

### FIX 5: Rain Ensemble Bias Correction
**Finding:** GFS ensemble over-forecasts light precipitation (false alarm rate too high). Raw member counting overestimates "any rain" probability.
**Source:** Zhu & Luo (2015), "Precipitation Calibration Based on the Frequency-Matching Method"
**Implementation:** Maintain rolling 30-day comparison of ensemble rain probability vs observed rain for each city. Apply frequency-matching correction.
**Priority:** After accumulating 20+ KXRAIN settlements to establish baseline bias.

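**Sketch (illustrative):** The rolling comparison could start as simply as the tracker below. Note that this is a plain multiplicative bias ratio (observed rain frequency divided by mean forecast probability), swapped in as a simpler stand-in; it is not the frequency-matching method itself. `RainBiasTracker` and its parameters are hypothetical, with one instance assumed per city.

```python
from collections import deque


class RainBiasTracker:
    """Rolling forecast-vs-observed record for one city (hypothetical helper)."""

    def __init__(self, window=30):
        # Each entry is (forecast_prob, 1.0 if it rained else 0.0).
        self.history = deque(maxlen=window)

    def record(self, forecast_prob, rained):
        self.history.append((forecast_prob, 1.0 if rained else 0.0))

    def corrected(self, forecast_prob, min_samples=20):
        """Scale the forecast by observed frequency / mean forecast probability."""
        if len(self.history) < min_samples:
            return forecast_prob  # not enough settlements yet: no correction
        n = len(self.history)
        mean_forecast = sum(f for f, _ in self.history) / n
        observed_freq = sum(o for _, o in self.history) / n
        if mean_forecast <= 0:
            return forecast_prob
        return max(0.0, min(1.0, forecast_prob * observed_freq / mean_forecast))


# Made-up history: the ensemble said 40% every day, but it rained on 9 of 30 days.
tracker = RainBiasTracker()
for day in range(30):
    tracker.record(0.40, rained=day < 9)
print(round(tracker.corrected(0.40), 2))
```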
---

### FIX 6: City-Specific Low Temp Adjustments
**Finding:** Overnight lows are harder to forecast than highs due to radiative cooling, inversions, and urban heat island (UHI) effects.
**Risk ranking:**

| City | Low Temp Risk | Reason |
|------|-------------|--------|
| Denver | HIGHEST | Altitude + inversions + dry air + DEN airport 24 mi from downtown |
| Chicago | HIGH | Lake effect + continental + inversion potential |
| NYC | MEDIUM | UHI 8°F+, airport vs Manhattan can differ 5°F+ at night |
| LA | MEDIUM | LAX coastal vs inland can differ 10-15°F on summer nights |
| Miami | LOWEST | Tropical maritime limits radiative cooling |

**Implementation:** Apply a higher minimum edge for KXLOWT than for KXHIGH. Possible values: 12% for lows vs 10% for highs, with Denver KXLOWT at 15%.
**Priority:** Can implement now as a constant, tune after data.

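**Sketch (illustrative):** As a constant-based first version, the thresholds could live in a small lookup. The dictionary names and `min_edge_for` are hypothetical; the numbers are the "possible" values proposed above and would need tuning once data accumulates.

```python
# Proposed starting values from this section; not tuned.
DEFAULT_MIN_EDGE = {"KXHIGH": 0.10, "KXLOWT": 0.12}
CITY_OVERRIDES = {("KXLOWT", "Denver"): 0.15}
FALLBACK_MIN_EDGE = 0.10  # current MIN_EDGE, used for series not listed above


def min_edge_for(series, city):
    """Return the minimum edge for a market series and city."""
    if (series, city) in CITY_OVERRIDES:
        return CITY_OVERRIDES[(series, city)]
    return DEFAULT_MIN_EDGE.get(series, FALLBACK_MIN_EDGE)


for series, city in [("KXHIGH", "Denver"), ("KXLOWT", "Miami"), ("KXLOWT", "Denver")]:
    print(series, city, min_edge_for(series, city))
```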
---

### FIX 7: Reduce Edge Threshold from 10% to 8%
**Finding:** The suislanchez bot uses 8% edge threshold and is profitable. With the variable fee fix (Fix 1), our true edge calculation becomes more accurate, so a lower threshold is justified.
**Current:** `MIN_EDGE = 0.10`
**Proposed:** `MIN_EDGE = 0.08` (matches competitor)
**Caveat:** Only after Fix 1 (variable fees) is implemented. With the flat $0.05 fee, 8% would let in bad trades.
**Priority:** Implement together with Fix 1.

---

### FIX 8: GFS Ensemble Release-Aware Scanning
**Finding:** GFS ensemble data becomes available ~3.5h after initialization:

| Run | Init (UTC) | Data Available (UTC) | Data Available (CDT) |
|-----|-----------|----------------|-----|
| 00Z | 00:00 | ~03:30 | ~10:30 PM (previous day) |
| 06Z | 06:00 | ~09:30 | ~4:30 AM |
| 12Z | 12:00 | ~15:30 | ~10:30 AM |
| 18Z | 18:00 | ~21:30 | ~4:30 PM |

**Implementation:** After switching to 5-minute scans, no special timing is needed - the bot naturally picks up new data. It could still log which GFS run each ensemble came from, for calibration purposes.
**Priority:** Low - 5-minute scans handle this implicitly.

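**Sketch (illustrative):** For the logging idea, the most recent run that should already be published can be derived from the scan time alone. This assumes the ~3.5h lag holds exactly, which real releases will not always do; `latest_available_gfs_run` and `run_label` are hypothetical helpers.

```python
from datetime import datetime, timedelta, timezone

AVAILABILITY_LAG = timedelta(hours=3, minutes=30)  # approximate, per the table


def latest_available_gfs_run(now_utc):
    """Init time of the newest 00Z/06Z/12Z/18Z run whose data should be out."""
    cutoff = now_utc - AVAILABILITY_LAG
    init_hour = (cutoff.hour // 6) * 6
    return cutoff.replace(hour=init_hour, minute=0, second=0, microsecond=0)


def run_label(init_time):
    """Compact tag for log lines, e.g. 20260403_06Z."""
    return init_time.strftime("%Y%m%d_%HZ")


scan_time = datetime(2026, 4, 3, 10, 0, tzinfo=timezone.utc)
print(run_label(latest_available_gfs_run(scan_time)))
```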
---

## Research Findings (Reference - No Code Changes Needed)

### FINDING 1: Kalshi Balance Earns 3.75-4% APY
Kalshi pays yield on total account balance. At $60 this is negligible ($2.25/year), but at $500+ it becomes a consideration - idle cash isn't fully idle.

### FINDING 2: Maker Fee History - Rounding Exploit Was Real
Before July 2025, maker fees were flat $0.0025/contract. On $0.02 contracts, this rounded to $0.01 - a 50% effective fee. Kalshi fixed this. The current variable formula eliminates the rounding issue.

### FINDING 3: Post-2024 Kalshi Regime Change
Before 2024 Q4, takers made money on average. After Kalshi's legal victory and volume explosion ($30M to $820M/quarter), professional market makers entered. Takers now lose on average. Our edge MUST come from better information (ensemble data), not from market structure.

### FINDING 4: Weather is in the "Other" Category at ~10% of Volume
Weather/climate is Kalshi's original niche but only ~10% of total notional volume. Lower volume = potentially wider spreads but also less competition from sophisticated market makers.

### FINDING 5: Longshot Bias Confirmed with Kalshi Data
A 72.1M-trade analysis confirms that contracts below 10% implied probability consistently underperform for buyers. Our $0.40 price floor already exploits this by forcing us to trade in the 40-99 cent range, where mispricing exists without the longshot trap.

### FINDING 6: Becker Dataset Available for Backtesting
Full Parquet dataset at github.com/Jon-Becker/prediction-market-analysis. Could filter to weather tickers and compute actual historical mispricing, time-of-day effects, and pre/post ensemble release patterns.

### FINDING 7: NBM May Be Superior to Raw GFS Ensemble
The National Blend of Models (NBM v4.3) already applies bias correction + quantile mapping to GFS/GEFS/HRRR/ECMWF. Could be used as a third probability source alongside ensemble and NWS for extremized aggregation.

### FINDING 8: Fan & van den Dool (2011) Key Results
- GFS ensemble 2m temp errors are dominated by large-scale spatial patterns
- 30-day mean forecast errors produce more robust bias corrections than 7-day means
- Cold season shows more removable bias than warm season
- ~60% of total error variance captured by leading EOF modes

### FINDING 9: UHI Effect Is Larger at Night
Urban Heat Island effect is 2-5°F warmer at night (more than daytime 1-7°F). Since Kalshi settles on airport METAR stations (often outside the UHI), the model grid cell may include urban warming the airport doesn't see. This creates a systematic warm bias in low temp ensemble forecasts for urban stations.

### FINDING 10: GEFS Reforecast Data Available on AWS
`s3://noaa-gefs-retrospective/GEFSv12/reforecast/` - could build city-specific MAE/bias tables from 2000-present instead of waiting for live trade data. 11 ensemble members (vs 31 operational) but large historical sample.

---

## Open-Source Bot Comparison

| Aspect | suislanchez Bot | Hermes v4 |
|--------|----------------|-----------|
| Profit | $1,325+ confirmed | $0 (just deployed) |
| Edge threshold | 8% | 10% (should lower to 8%) |
| Scan interval | 5 minutes | 30 minutes (should lower to 5) |
| Kelly fraction | 15% (0.15x) | Variable (0.125-0.375x by confidence) |
| Markets | KXHIGH only | KXHIGH + KXLOWT + KXRAIN |
| Fee model | Unknown | Variable formula (pending fix) |
| NWS cross-check | No | Yes (high, low, rain) |
| Sonnet veto gate | No | Yes |
| Correlation guard | No | Yes (2 per city/date) |
| Calibration | Brier only | Per-type per-city per-season |

---

## Implementation Order

| Phase | Fixes | When |
|-------|-------|------|
| **Now** | Fix 1 (variable fees) + Fix 7 (lower threshold to 8%) | Immediate |
| **This week** | Fix 2 (5-min scans) + Fix 6 (city-specific low temp adjustments) | After Fix 1 validated |
| **After 30 trades** | Fix 4 (extremized aggregation) + Fix 5 (rain bias correction) | Need data first |
| **After 50 trades** | Fix 3 (maker orders) + Fix 8 (release-aware logging) | Optimization phase |