Layer-by-layer anatomy of how Wyman Cove's forecast gets built — and how accurate it's turning out.
Right now — what the pipeline is doing
Temp correction
—
vs raw model
Humidity correction
—
vs raw model
Confidence
—
— stations reporting
Briefing source
—
—
Status — where we are
Last curated: 2026-06-29 v0.6.257
Since last curation (2026-06-29, v0.6.249 → v0.6.257)
✓L5 attribution fix — initial ship absorbed L5 into the L4 column (same bug shape as the earlier L6-into-L2). Now preserves direct_radiation_post_l4 and emits sr_l5 as its own snapshot column. v0.6.249
✓L5 chart + badge wired — sr card now shows an amber L5 line + column + "L5 ✓ synoptic" badge. v0.6.250
✓L5 placeholder section — Layer 5 header + summary + 5a–5d placeholders landed between L4 and L6 in this curation pass. Full subsection build deferred to the specialists refactor. v0.6.251
⚙Cove/L6 disable-gate CLEARED 2/2 on 2026-06-29 — r5_cove_analysis second consecutive HOLD verdict. Morning offshore cooling regime mean Δ = -0.67°F vs -1.0°F threshold; sea-breeze regime still passing strongly at +1.69°F. Holding L6 enabled pending investigation (likely seasonal attenuation of morning offshore signal as solstice approaches).
⚙Walk-forward L3/L4 read #4 (2026-06-29): L3 wants to drop pp + ws (2/7 — ws is new today, streak reset); L4 wants to drop cc (5/7 — 2 reads to clear). No whitelist edits.
✓L3 regime × lead-band analysis — new analysis/l3_regime_lead_analysis.py. Reveals the dominant ws/wg L3 failure is calm forecast wind (<3 mph), not lead distance. L3 LOSES -20% to -69% on the calm cell; WINS +5% to +47% everywhere else. Auto-picks up in the daily digest. v0.6.252
⏸Gated calm-wind L3 skip in decay_apply.py — CALM_GATE_ENABLED=False. When flipped, ws/wg L3 corrections zero out at any lead where forecast wind is below 3.0 mph. Standard Stage 2 promotion: audit a few digest cycles before flipping. v0.6.252
✓L4 regime × lead-band analysis — new analysis/l4_regime_lead_analysis.py applied to L4 fields (ch, cc). ch is unambiguously good (27 WIN / 5 flat / 0 LOSES across every cell). cc has a specific frontal-regime weakness (LOSES at frontal × {6-11h, 12-23h, 24-47h} and ne_flow × 0-5h) but WIN or flat in every other regime — the walk-forward's flat-drop-cc gate at 5/7 is reading regime-specific weakness, not field-wide failure. v0.6.254
✓cm L3 reframing — same diagnostic shows cm L3 is clear WIN at long leads (7 WIN / 1 flat at 24-47h) with regime-specific losses at frontal × {6-11h, 12-23h} and ne_flow × 0-5h. The 06-24 walk-forward "all windows OFF for cm" verdict is contradicted; a per-(regime, lead_band) whitelist is the right move, same shape as ws/wg/cc.
⚙C1 calibration re-curate sanity check (2026-06-29): Re-ran c1_confidence_calibration.py + c1_curate_confidence_table.py locally. Pass rate moved 47.92% → 61.36% — confirms re-curating absorbs real drift. Still under 75% threshold so verdict stays HOLD. Fresh curated tables (32 SHIP / 12 MARGINAL / 12 SKIP) staged for next collector tick. v0.6.254
✗C1d KILLED by orthogonality check (2026-06-29): New h_cloud_disagreement_orthogonality.py ran the σ-HIGH vs σ-LOW comparison while holding C1a (transition) fixed. The σ signal inverted (σ_HIGH/σ_LOW ratio < 1.0 in 3 of 4 cells that cleared the n floor) — meaning yesterday's SMOKE_ALIVE was capturing the regime-transition signal C1a already encodes. C1d shouldn't promote. v0.6.256
Pipeline delta: L5 shipped + audited. L6 held under cleared disable-gate (decision pending). C1d KILLED by orthogonality (signal redundant with C1a). Calm-wind L3 skip gated and ready for flip after Stage 2 audit (~2026-07-02). Pattern emerging: regime × lead_band analyses on ws/wg/cc/cm all show the walk-forward validator's flat-drop verdicts hide regime-specific weakness — per-(field, regime, lead_band) whitelist work queued for Tuesday/Wednesday and supersedes the flat drop-cc and drop-ws paths. C1 calibration moved 47.92% → 61.36% with re-curate; still HOLD. L5 first clean 7-day audit window closes ~2026-07-05. L6 full 7-day window clean by 2026-07-03.
One-line summary: 6-layer correction stack in production. L1 raw HRRR → L2 aggregate-bias correction (Kalman-blended local network: temp, humidity, pressure, wind, cloud) → L3 lead-decay correction (per-lead-hour) → L4 diurnal correction (hour-of-day) → L5 synoptic-regime correction (solar only; shipped 2026-06-28 v0.6.248) → L6 microclimate correction (temperature, conditional on wind octant × sea-breeze × hour). Every "current conditions" card reads the L1→L6-corrected hourly[0] value. C1 confidence layer (first non-MAE-reducing output; named C-class to distinguish it from L-class correction layers that change forecast values) built with multi-axis support: C1a regime transition, C1b cluster spread, C1c pressure tendency, C1f precip-forecast presence. Gated pending calibration audit. R2 state-stratified accuracy and R6 regime-transition penalty audits run every Fitter cycle. R3 (derived humidity), R4 (HRRR-GFS spread), R5 (cove cross-current global formulation) retired.
✅ Production stack
L1 → L2 → L3 → L4 → L5 (solar only) → L6 (temperature only) pipeline. L2 applies to t / dp / h / pr / cc (additive bias with Kalman gain) and ws / wg (direct selection).
L6 — microclimate correction (temperature, shipped 2026-06-26). Per-lead Δ°F applied to corrected_temperature: each forecast lead gets the Δ for its own projected regime (forecast wind dir + parsed local hour + heuristic sb_active). Indexed by (wind octant × sea-breeze active × hour-of-day). Built off a 12-day waterfront-vs-inland spatial gradient log (n=1,732) — the first layer trained on a within-network spatial signal rather than aggregate forecast error. Two PASS reads on r5_cove_analysis.py cleared the post-build confirmation gate (sea-breeze +1.55°F warming, morning offshore −1.21°F cooling). Module: weather_collector/processors/cove_correction.py, ENABLED = True.
L3 whitelist: ws, wg, ch, cm, pp. Each field remains enabled only if it continues to beat the layer below on the rolling 7-window held-out audit.
L4 whitelist: ch, cc. cc added 2026-06-24 v0.6.214 after a 2-read ≥3% gate cleared. Most other fields lack stable hour-of-day structure.
L2 lead-decay τ refit live by the Fitter with a per-field guardrail (fitted τ adopted only if it beats default on held-out RMSE and lands within 0.25×–4× of default; otherwise fall back).
L2 wind selection: per-octant max → median across octants for the local network. KBOS+KBVY authoritative-source floor overrides when the airports' median speed exceeds the octant pick by >1.4×. Direction-outlier guardrail rejects chosen direction if it's >60° off the airport+buoy+Tempest consensus. Physical sanity floor enforces gust ≥ wind on the final values.
L2 cloud blend: KBOS+KBVY METAR sky-condition obs (cloud_cover + L/M/H splits) Kalman-blended against HRRR hourly[0]. Cloud-tuned gain: K=0.90 (σ<20pp), K=0.70 (σ<40pp), K=0.50 (σ≥40pp), K=0.35 (single source). Treats disagreement between KBOS (Boston) and KBVY (Beverly) as a real coastal spatial gradient rather than sensor noise. Same K applied to total + splits to keep them self-consistent.
Current conditions: The current-conditions card is populated only after every correction layer has run, by copying the corrected hourly[0] into weather_data["current"], so the displayed "now" exactly matches the corrected forecast pipeline. The copy preserves condition_source from upstream observation overrides and re-derives weather_code from corrected cloud + precip.
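The τ guardrail in the L2 lead-decay item above can be sketched as follows. This is a minimal illustration, not the Fitter's code: the function names and the DEFAULT_TAU_H table are hypothetical, while the default τ values, the 0.25×–4× band, the ≥100 test-pair floor and the ≥0% held-out improvement rule are the ones stated in the Layer 2 description on this page.

```python
import math

# Hardcoded default taus quoted on this page (hours); the dict itself is illustrative.
DEFAULT_TAU_H = {"t": 4.0, "h": 240.0, "pr": 12.0}

def choose_tau(field, fitted_tau, rmse_improvement_pct, n_test_pairs):
    """Adopt the fitted tau only if it passes every guardrail check, else fall back."""
    default = DEFAULT_TAU_H[field]
    beats_default = rmse_improvement_pct >= 0.0 and n_test_pairs >= 100
    in_band = 0.25 * default <= fitted_tau <= 4.0 * default
    return fitted_tau if (beats_default and in_band) else default

def bias_applied(current_bias, lead_h, tau_h):
    """bias_applied(lead) = current_bias * exp(-lead / tau)."""
    return current_bias * math.exp(-lead_h / tau_h)

# Made-up fit result: tau=6h for temperature, inside the 1h-16h band.
tau_t = choose_tau("t", fitted_tau=6.0, rmse_improvement_pct=1.2, n_test_pairs=450)
for lead in (0, 6, 24):
    print(lead, round(bias_applied(1.5, lead, tau_t), 3))
```

A fitted τ of 20h for temperature would be rejected here because it falls outside 0.25×–4× of the 4h default, even if it scored well on the held-out days.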
⏸ Gated off — built, not applied
[✅ Shipped 2026-06-28 v0.6.248] L5 — synoptic-regime correction (solar) moved out of this section. Gate cleared 7/7 ship days; solar_correction.ENABLED=True. Live in production.
C1 — multi-axis confidence layer. ENABLED=False in confidence_layer.py. First non-MAE-reducing layer in the stack. Three axes: regime-synoptic transition (state_fc vs state_obs), cluster-spread quartile (Q1/Q23/Q4 from KMAMARBL/KMASALEM/KMASWAMP temp medians), and pressure-tendency bin (falling_fast / falling / flat / rising from state_fc.pressure_trend_hpa_3h). Each (field, lead_band) cell carries per-axis displayed MAE; the runtime classifies live axes each tick, looks up the matching cell in c1_confidence_curated_v2.json, and falls back to the legacy single-axis cell on a miss. C1 is evaluated by calibration, not forecast error.
📡 Stage 2 — auto-wired audits (logged every Fitter cycle)
L5 — synoptic-regime correction (solar). Verdict in conditional_audits.l5; recency-weighted MAE since v0.6.178 (was unweighted 30-day average, now exp(-age_days/14) matching the rest of the Fitter). Promotes when 7 consecutive Fitter-cycle days return SHIP; trailing gate auto-computed in l5_gate_history.json since v0.6.180 and surfaced beneath the L5 row in S1 below.
R6 — regime-transition penalty. Verdict in conditional_audits.r6; recency-weighted since v0.6.179 (same fix as L5). Watch-signal for C1a — flags if transition-penalty magnitudes drift.
Marine-layer Stage 2.5 watch. Verdict in conditional_audits.marine_layer_watch since v0.6.182. Per-cycle log of NE+morn (wd 45-105°, hour 4-9 EDT) cc bias; daily visibility on the Sun-morning weekly Stage 2 read.
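The promotion and disable gates on this page count consecutive same-direction verdicts and reset on a miss. A minimal sketch of that counter, assuming verdicts arrive as a chronological list of strings; trailing_streak and gate_cleared are hypothetical names, and the list is not the actual layout of l5_gate_history.json.

```python
def trailing_streak(verdicts, target="SHIP"):
    """Count the most recent verdicts equal to `target`, stopping at the first miss."""
    streak = 0
    for verdict in reversed(verdicts):
        if verdict != target:
            break
        streak += 1
    return streak

def gate_cleared(verdicts, target="SHIP", required=7):
    """True once the trailing streak reaches the required length."""
    return trailing_streak(verdicts, target) >= required

# Made-up history: one HOLD followed by five SHIP reads.
history = ["HOLD", "SHIP", "SHIP", "SHIP", "SHIP", "SHIP"]
print(trailing_streak(history), gate_cleared(history))  # 5 False
```

The same counter covers the 2-read gates by passing required=2, and the disable direction by passing target="HOLD".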
🗄 Retired — ruled out, kept as institutional memory
R5 — cove cross-current. Verdict HOLD at −20.58% MAE on 32,816 pairs. L2's waterfront-weighted station blend already captures the signal; R5 double-counts. Script analysis/r5_audit.py kept for quarterly re-checks.
R4 — HRRR vs GFS spread. Verdict CLOSE on 112,877 joined pairs; max |ρ|=0.012 across all fields. Spread is not a useful uncertainty signal.
R3 — derived humidity. Magnus(T_corrected, T_d_corrected) equivalent to L2 network-blended humidity within noise (27k triples). Derived path kept for internal consistency; the hypothesis "derivation is more accurate" is closed.
2026-06-27 (Sat) — DONE: Walk-forward read on cc/cl L3/L4 inclusion under KBVY-blended obs. No additions; validator instead wants to drop existing fields. (See 06-29 read below for latest counts.)
2026-06-27 (Sat) — DONE: C1 calibration audit (single + v2). Legacy HOLD at 47.92% pass rate. v2 multi-axis DEFERRED to ~2026-07-04. confidence_layer.ENABLED stays False.
2026-06-28 (Sun) — DONE: C1d (KBOS-vs-KBVY cloud disagreement) returned SMOKE_ALIVE after only 24h of post-wiring rows. Orthogonality follow-up landed 2026-06-29 (next entry).
2026-06-29 (Mon) — DONE: h_cloud_disagreement_orthogonality.py returned KILL C1d. Holding C1a (transition) fixed, the σ_HIGH/σ_LOW MAE ratio inverts to <1.0 in 3 of 4 cells that cleared the n floor, so yesterday's SMOKE_ALIVE was capturing the regime-transition signal C1a already encodes. C1e check insufficient (n=0). C1d does not promote.
2026-06-29 (Mon) — DONE: Walk-forward L3/L4 read #4. L3 wants to drop pp + ws (gate 2/7 — ws is new today, streak reset). L4 wants to drop cc (gate 5/7 — 2 reads to clear). No whitelist edits.
2026-06-29 (Mon) — HELD: Cove/L6 disable-gate cleared 2/2 (r5_cove_analysis second consecutive HOLD). Holding L6 enabled while investigating; morning offshore regime weakening (-0.67°F vs -1.0°F threshold) likely seasonal, sea-breeze regime still strong (+1.69°F).
2026-07-02 (Thu) — UPCOMING: Earliest reasonable flip date for CALM_GATE_ENABLED=True in decay_apply.py (calm-wind ws/wg L3 skip). Needs 2–3 daily digest cycles of l3_regime_lead_analysis showing the calm-cell L3 LOSES verdict is stable. If flipped, supersedes the flat drop-ws path the walk-forward gate is currently counting toward.
2026-07-03 (Fri) — UPCOMING: Earliest possible L4 drop-cc gate clear (5/7 + 2 more SHIP-direction reads). Also: L6 full 7-day pair window clean of broken-impl rows; L6 disable-gate maturity check if held this long.
2026-07-04 (Sat) — UPCOMING: C1 v2 multi-axis calibration audit first eligible (cluster_spread log accumulates over ~7 days from build).
2026-07-05 (Sun) — UPCOMING: L5 first clean 7-day audit window closes — first real Fitter audit of L5 vs L4 on solar.
Watch: L6 cove disable-gate. Held but in a cleared state; each subsequent r5_cove_analysis HOLD verdict reinforces the case; a single SHIP flips the gate back. Decision point if signal stays weak through ~2026-07-03.
🧪 Open architectural questions
L2-as-observation-only: Remove L2 from forecast pipeline; keep L2 only as training target for L3/L4/L5. Open question; not today's work. See memory note.
ws L3 long-lead regression — REFRAMED 2026-06-29: New l3_regime_lead_analysis.py reveals the dominant L3 failure mode is calm forecast wind, not lead distance. ws/wg L3 LOSES -20% to -69% MAE when forecast wind <3 mph; WINS +5% to +47% when ≥3 mph. Decay-apply now carries a gated calm-wind skip (CALM_GATE_ENABLED=False in decay_apply.py, v0.6.252) — Stage 2 candidate, audit a few cycles of shadow data before flipping. If gate-flip lands cleanly, this supersedes the flat drop-ws path the walk-forward gate (2/7) is currently counting toward.
Per-regime gating for L6 (cove): Today's r5_cove_analysis (2026-06-29) showed sea-breeze regime passing strongly (+1.69°F, n=353) while morning offshore failed (-0.67°F vs -1.0°F threshold). Disable-gate cleared 2/2 in the disable direction but we're holding because killing the whole layer kills the sea-breeze win too. Open: should L6 apply per-regime instead of all-or-nothing, and should the lookup table refit with recency weighting so seasonal drift (morning marine cooling attenuating toward solstice) gets absorbed instead of fighting the audit? Investigation deferred to post-2026-06-30.
Specialists vs layers naming: L5 and L6 are field-specific specialists (sr-only, t-only), not general-purpose layers like L2/L3/L4. The "Layer N" naming will grow noisy with every new specialist. Plan: group field-specific correctors under a parent "Field-specific corrections (specialists)" section; keep L_N internally for stack order but use plain-language names in headings. Restructure deferred to post-2026-06-30. See project_specialists_vs_layers memory note.
Per-(field, regime, lead_band) whitelist for L3 + L4 — a meta-pattern (2026-06-29): The (regime × lead_band) cross-cut has now been run on four fields (ws, wg, cc, cm) and each time the walk-forward validator's flat-drop verdict would have killed wins in most regimes to fix a regime-specific weakness in one. Discovery rule: always run l3_regime_lead_analysis / l4_regime_lead_analysis before acting on any walk-forward drop recommendation. Implementation question: should decay_apply.py grow a per-(field, regime, lead_band) skip table (analog to the calm-wind gate but conditioned on observed regime instead of forecast wind speed)? That's a larger architectural commit than the calm-wind gate — defer to Tuesday/Wednesday.
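One possible shape for the skip table raised in the last question, shown only to make it concrete. Everything structural here is hypothetical: SKIP_CELLS, lead_band and gated_correction are illustrative names and are not taken from decay_apply.py. The 3.0 mph calm threshold, the lead bands and the losing cells for cc and cm are the ones quoted on this page.

```python
CALM_GATE_ENABLED = False   # mirrors the gated flag described on this page
CALM_WIND_MPH = 3.0

# Hypothetical skip cells, seeded from the losing (regime, lead_band) cells
# reported for cc (L4) and cm (L3) in the 2026-06-29 analyses.
SKIP_CELLS = {
    ("cc", "frontal", "6-11h"), ("cc", "frontal", "12-23h"),
    ("cc", "frontal", "24-47h"), ("cc", "ne_flow", "0-5h"),
    ("cm", "frontal", "6-11h"), ("cm", "frontal", "12-23h"),
    ("cm", "ne_flow", "0-5h"),
}

def lead_band(lead_h):
    """Map a lead hour (0-47) to its band label."""
    for upper, name in ((5, "0-5h"), (11, "6-11h"), (23, "12-23h"), (47, "24-47h")):
        if lead_h <= upper:
            return name
    raise ValueError(f"lead {lead_h}h is outside the 48-hour horizon")

def gated_correction(field, regime, lead_h, raw_correction, forecast_wind_mph,
                     calm_gate=CALM_GATE_ENABLED):
    """Zero the correction in calm-wind or skip-listed cells; otherwise pass it through."""
    if calm_gate and field in ("ws", "wg") and forecast_wind_mph < CALM_WIND_MPH:
        return 0.0
    if (field, regime, lead_band(lead_h)) in SKIP_CELLS:
        return 0.0
    return raw_correction

print(gated_correction("cc", "frontal", 8, -5.0, 10.0))   # skipped cell
print(gated_correction("cc", "sw_flow", 8, -5.0, 10.0))   # correction kept
```

A table like this keeps the wins in every other regime, which is the argument against the flat drop. Whether the regime key should be the forecast or the observed regime is exactly the open question above, and the sketch does not settle it.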
How to read the rest of this page.
Forecast accuracy section is the bottom-line MAE per layer per field (the green box on each chart = what users actually see). Layer sections (L1–L4 general-purpose, plus L5 solar and L6 temperature as field-specific specialists) each show the live state of that layer and the diagnostics behind it. L5's section is a placeholder for now — see "Where we are" inside its header — with full subsections coming after the specialists refactor. Research & Diagnostics (bottom) holds R2 (state-stratified accuracy) and R6 (regime-transition penalty, C1a's input signal) under "Active hypotheses"; the curated Stage-1 Backlog list (8 candidates grouped A/B/C); operational tools G1/S1/B1/F1 (gated candidates, shadow whitelist, backtest sweep, frontal events); and a Retired collapsible (R3 derived humidity, R4 HRRR/GFS spread, R5 cove cross-current — all ruled out by audits). The shadow tuner now carries a 7-cycle promotion-gate counter (v0.6.147) — when an unchanged recommendation holds 7 Fitter cycles, it's eligible to weigh into the next walk-forward read.
Conventions: L = correction layer (applied to forecast). R = research hypothesis (logged, not applied). S = shadow tuner (what auto-tuner would do). G = guardrail check / candidate stamp. F = failure/diagnostic. B = backtest. D = drill-down / teaching view.
Forecast accuracy — how accurate is it?
For each field, this shows how far off the forecast tends to be at each future hour — and how much each correction layer narrows that gap vs the bare raw model.
How to read these charts
What you're looking at. One small chart per weather field. The Y-axis is how wrong the forecast tends to be (in that field's units: °F, %, mph, etc.). The X-axis is hours into the future. Lower lines = more accurate. Lines climbing to the right are normal: forecasts get worse the further out you predict.
The lines are cumulative correction layers. L1–L4 stack across every field; L6 only stacks on the temperature card. Lowest line = lowest typical error = what users actually see.
Raw model (gray, dashed): the bare HRRR/GFS forecast for this exact coordinate. Knows nothing about Wyman Cove specifically. The starting point.
Aggregate bias (orange): adds Layer 2 — what 40+ nearby weather stations say the model is currently getting wrong, blended in by distance.
Lead decay (light blue): adds Layer 3 — a historical per-lead bias correction learned from the pair log. Different bias for each lead hour: "this model tends to be 2°F too cool at lead 6, 1°F too warm at lead 12," etc.
Diurnal (bright blue): adds Layer 4, an hour-of-day bias correction. Different bias for each hour of the day (6am forecasts vs 2pm forecasts get different corrections, regardless of lead time). For every field except temperature and solar radiation, this is the final line.
Synoptic-regime (amber): adds Layer 5 — a per-regime W/m² delta on direct solar radiation. The classifier looks at current wind direction, speed, pressure trend, hour, and temp; the regime label keys into a per-regime calibrated delta. Only present on the solar (sr) card; other fields have no L5 line by construction. For solar, this is the final line. Shipped 2026-06-28 v0.6.248 after a 7/7 gate-clear streak; the L5 MAE window is still catching up to L3/L4 until ~2026-07-05 so comparisons against the Diurnal line aren't apples-to-apples until then.
Microclimate (mint green): adds Layer 6 — a per-lead Δ°F that captures the cove's microclimate via a waterfront-vs-inland spatial gradient. Only present on the temperature card; other fields have no L6 line by construction. For temperature, this is the final line — what users actually see. Note: L6 shipped 2026-06-26, so its MAE averages over a shorter window than L3/L4 until ~2026-07-03 — comparisons against the Diurnal line aren't apples-to-apples until then.
Why lines often lie on top of each other. Two different reasons, and the badges above each chart tell you which:
Layer not applied to this field (badge shows L3 off or similar). The walk-forward validator showed this layer makes things worse for that field, so we turned it off. The line for that layer is identical to the one below it by construction.
Layer is applied but the correction is tiny (badge shows L3 ✓). The raw model is already very close, so the correction barely moves the needle. Lines visually overlap but they're not identical.
L2 badge variants: L2 ✓ additive means a bias is added (temperature, humidity, pressure). L2 ✓ direct means we replace the model wind with what the stations are reading. L2 n/a means no station network applies a bias for this field (clouds, solar, precip). Note for clouds: L2 is still n/a as a forecast correction, but obs truth for cc/cl/cm/ch now comes from a KBOS + KBVY METAR blend (v0.6.134), which feeds the joiner so L3/L4 can be evaluated, not applied at L2.
The table beneath each chart shows the same data as the chart but averaged into lead bands (0-5h, 6-11h, etc.). Numbers in field units. Compare a row across columns to see how each correction layer changes the error in that lead band — this is exactly the bucketing the walk-forward validator uses to decide what gets shipped.
Lead 0 is circular by construction (forecast for now compared against the same-moment mesonet, so L2 ≈ 0); look at lead 1+ for real signal. Source: time_series_diagnostic.json::per_layer_mae_by_lead, 7-day window.
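The lead-band table can be reproduced from per-lead MAE with a simple bucketed mean. A minimal sketch, assuming the per-lead values arrive as a dict keyed by lead hour; the real layout of per_layer_mae_by_lead is not documented here, and the 12-23h and 24-47h band edges are borrowed from the regime analyses quoted earlier on this page. The skip_lead_zero switch is an illustrative option for leaving the circular lead 0 out of the first band.

```python
from statistics import mean

LEAD_BANDS = [("0-5h", 0, 5), ("6-11h", 6, 11), ("12-23h", 12, 23), ("24-47h", 24, 47)]

def band_means(mae_by_lead, skip_lead_zero=False):
    """Average per-lead MAE into lead bands; a band with no data maps to None."""
    out = {}
    for name, lo, hi in LEAD_BANDS:
        vals = [v for lead, v in mae_by_lead.items()
                if lo <= lead <= hi and not (skip_lead_zero and lead == 0)]
        out[name] = mean(vals) if vals else None
    return out

# Made-up error growth, for illustration only.
demo = {lead: 1.0 + 0.05 * lead for lead in range(48)}
print(band_means(demo, skip_lead_zero=True))
```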
Layer 1 — Raw model (HRRR / GFS)
The bare government weather model. Knows nothing about Wyman Cove specifically. Everything below corrects what it gets wrong here.
About the raw model
The starting point. Open-Meteo's HRRR (next 48h) and GFS (days 3–7) numerical weather models. Multi-kilometer grid resolution; the model has no specific knowledge of Wyman Cove. Every layer below corrects what the raw model gets wrong locally.
Curves below show the current raw model forecast per field — same dotted line that appears in the drill-down above.
Layer 2 — Aggregate-bias correction (local station network)
What our 66 nearby weather stations say the model is getting wrong right now — and how much of that signal we trust.
How the network correction is built
Two parallel aggregation paths under this layer, each suited to the noise behavior of its metric:
Temp / humidity / pressure (additive bias): each station's reading first calibrated against its own chronic offset (Kalman-tracked, rolling 48h, see 2a accordion). Per-octant 1/distance² × exp(-|elev_diff|/30)-weighted mean of (station − model) bias. Final network bias = unweighted mean across non-empty octants. Temperature and humidity additionally get scaled by their own network Kalman gain K (sec 2c) before being added to the raw model — separate K functions per field since temp scatter is in °F and humidity scatter is in %. Pressure is applied at full strength (station consensus matches model after altitude offsets).
Lead-decay: the bias is not applied flat across all 48 lead hours. Each field gets bias_applied(lead) = current_bias × exp(-lead/τ) with a per-field τ. The Fitter (twice daily at 03:xx and 15:xx local) refits τ on a train/test split — pairs older than the last 2 days fit τ; the last 2 days score it. Both fitted τ and held-out RMSE deltas are written to l2_decay.json. The loader applies a per-field guardrail: adopt the fitted τ only if it beat the hardcoded default on held-out RMSE (≥0% improvement, ≥100 test pairs) AND the fitted τ is within 0.25×–4× of the default. Otherwise fall back to the default. Hardcoded defaults: τ_t=4h (temp bias decays fast — useful at short leads, near-zero by 24h), τ_h=240h (humidity bias persists across the whole horizon, essentially flat), τ_pr=12h (pressure ~half-life 8h). Fields without a τ (wind, clouds, solar, etc.) get flat L2 application. See sec 2d below for the live curves and per-field adoption status.
Wind / gust (direct selection, no bias): no per-station calibration (per-station wind biases are too noisy to track meaningfully) and no additive bias term. Per-octant MAX gust → MEDIAN across populated octants. The chosen value is blended into the next 24h of the hourly forecast with a linear-decay weight (100% at hour 0, 0% at hour 24). KBOS+KBVY authoritative-source floor: when both airport METARs agree on a wind speed >1.4× the octant-median pick, defer to their median. Mirrors the WU_CAP guardrail in the opposite direction. Direction-outlier guardrail rejects the chosen direction if >60° off the airport+buoy+Tempest consensus. Gust override allows single-source (METAR omits gust when wind is steady). Physical sanity floor enforces gust ≥ wind on the final values.
Cloud cover (Kalman-blended METAR override): KBOS + KBVY METAR sky-condition obs (BKN/SCT/FEW/OVC translated to cloud_cover_pct + L/M/H splits) blended against HRRR hourly[0] using _kalman_gain_cloud(n_sources, bias_std). Cloud-tuned: K=0.90 when airports agree within 20pp, K=0.70 with 20-40pp disagreement (treated as real spatial gradient, not sensor noise), K=0.50 with >40pp disagreement, K=0.35 when only one source is present. Same K applies to total cloud and L/M/H splits to keep them self-consistent. See cloud_l2_meta on each tick's weather_data.hourly for live K, σ, blended values.
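The cloud gain schedule above can be sketched as a small step function plus a standard Kalman-style update. This is an illustration under stated assumptions: the function bodies are not the contents of _kalman_gain_cloud, the zero-source branch and the blend form model + K × (obs − model) are assumptions, and the band edges follow the production-stack summary (σ<20pp, σ<40pp, σ≥40pp).

```python
def kalman_gain_cloud(n_sources, bias_std_pp):
    """Gain K from source count and inter-airport disagreement (percentage points)."""
    if n_sources <= 0:
        return 0.0          # assumption: no observation, keep the model value
    if n_sources == 1:
        return 0.35
    if bias_std_pp < 20:
        return 0.90
    if bias_std_pp < 40:
        return 0.70
    return 0.50

def blend(model_pct, obs_pct, k):
    """Move the model value toward the observation by fraction k."""
    return model_pct + k * (obs_pct - model_pct)

# Made-up tick: two airports disagreeing by 25pp, so K = 0.70.
k = kalman_gain_cloud(n_sources=2, bias_std_pp=25.0)
model = {"total": 80.0, "low": 60.0, "mid": 20.0, "high": 10.0}
obs = {"total": 40.0, "low": 30.0, "mid": 10.0, "high": 0.0}
print({name: round(blend(model[name], obs[name], k), 1) for name in model})
```

Using one K for the total and for each of the L/M/H splits is what keeps the blended layers consistent with the blended total.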
2a. Octant coverage — where this tick's stations came from
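The per-octant weighting behind this coverage view (the additive-bias path described above) can be sketched as follows. A minimal sketch under assumptions: the station dict layout, the function names and the octant convention (eight 45° sectors centred on N, NE, E, ...) are hypothetical; the weight 1/distance² × exp(-|elev_diff|/30) and the unweighted mean across non-empty octants come from the text. Distances must be positive, and units are whatever the pipeline uses.

```python
import math
from collections import defaultdict

def octant_of(bearing_deg):
    """Map a bearing to an octant index 0-7 (0 = centred on north, clockwise)."""
    return int(((bearing_deg + 22.5) % 360) // 45)

def network_bias(stations):
    """stations: iterable of dicts with bearing_deg, distance, elev_diff, bias.

    bias is the calibrated (station - model) difference for one field.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for s in stations:
        weight = (1.0 / s["distance"] ** 2) * math.exp(-abs(s["elev_diff"]) / 30.0)
        octant = octant_of(s["bearing_deg"])
        num[octant] += weight * s["bias"]
        den[octant] += weight
    octant_means = [num[o] / den[o] for o in den if den[o] > 0]
    # Unweighted mean across populated octants, so a dense cluster in one
    # direction cannot dominate the network bias.
    return sum(octant_means) / len(octant_means) if octant_means else 0.0

# Made-up stations: two to the north, one to the south.
demo = [
    {"bearing_deg": 0.0, "distance": 1.0, "elev_diff": 0.0, "bias": 1.0},
    {"bearing_deg": 10.0, "distance": 1.0, "elev_diff": 0.0, "bias": 3.0},
    {"bearing_deg": 180.0, "distance": 5.0, "elev_diff": 0.0, "bias": 0.0},
]
print(network_bias(demo))
```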
2e. Post-aggregate-bias forecast — what gets passed to Layer 3
Layer 3 — Lead-decay correction
How the model's error tends to grow with each hour into the future — and the per-field nudge we apply to push back against that drift.
How decay correction works
For each field and each lead hour, we learn from millions of historical pairs how far off the model usually is, then bend the forecast back toward truth by that amount. Some fields (wind, gusts, high & mid cloud, POP) genuinely benefit; others are net-negative on held-out data and are paused (see banner). Bias is tracked over a 30-day window with exponential recency weighting: default τ=14 days, with per-field overrides for fields where analysis/decay_tau_tuning.py shows ≥5% MAE improvement vs the default. Current overrides: pp (POP) at τ=28d (+11.1% held-out, 2026-06-21 v0.6.167) and pa (precip amount) at τ=28d (+9.4% held-out, 2026-06-22 v0.6.195). Both precip fields preferred a smoother bias estimate than the τ=14 default.
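The recency weighting can be sketched as a weighted mean of signed errors per lead hour. A minimal sketch: the pair-tuple layout and the function name are hypothetical, and the sign convention (forecast minus observation, to be subtracted from the forecast) is an assumption; the 30-day window and the exp(-age_days/τ) weight with τ=14 days by default follow the text.

```python
import math
from collections import defaultdict

def fit_lead_bias(pairs, tau_days=14.0, window_days=30.0):
    """pairs: iterable of (lead_h, age_days, forecast, observed).

    Returns {lead_h: recency-weighted mean of (forecast - observed)}.
    """
    num = defaultdict(float)
    den = defaultdict(float)
    for lead_h, age_days, forecast, observed in pairs:
        if age_days > window_days:
            continue                      # outside the tracking window
        weight = math.exp(-age_days / tau_days)
        num[lead_h] += weight * (forecast - observed)
        den[lead_h] += weight
    return {lead: num[lead] / den[lead] for lead in den}

# Made-up pairs at lead 6: a fresh +2.0 error, a 14-day-old 0.0 error,
# and a 45-day-old pair that falls outside the window.
pairs = [(6, 0.0, 12.0, 10.0), (6, 14.0, 10.0, 10.0), (6, 45.0, 100.0, 0.0)]
print(fit_lead_bias(pairs))
```

Raising τ to 28 days, as the two precip overrides do, flattens the weights so older pairs count for more and the bias estimate moves more slowly.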
⏸ Layer 3 — per-field whitelist.
Held-out MAE audit picks the fields where L3 actually beats L2 — currently wind speed, gusts, high cloud, mid cloud. Other fields stay disabled because the correction at best ties (temperature, pressure) or actively hurts (humidity, dew point, solar, low cloud, precip). POP is the special case: it's evaluated by Brier score, not MAE, so the audit's MAE-based ⚠ rule is suppressed for it — the v0.6.20 calibration analysis showed flat-additive correction cuts Brier 5%. Currently applied: —. Brier-evaluated: —.
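For reference, the Brier score used for POP is the mean squared difference between the forecast probability (as a fraction from 0 to 1) and the 0/1 outcome; lower is better. A minimal sketch with made-up numbers that are not pair-log data:

```python
def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and the same length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Three illustrative hours: rain forecast at 90%, 10% and 60%; it rained only in the first.
print(round(brier_score([0.9, 0.1, 0.6], [1, 0, 0]), 4))
```

Because the score is bounded between 0 and 1 and rewards calibrated probabilities, an MAE-style percentage comparison is not the right yardstick for POP, which is why the MAE-based warning is suppressed for that field.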
3a. Fitted correction curves — what is being applied per lead hour
3b. Live forecast — with vs without decay correction
3c. Decay curves over time — historical fits
Layer 4 — Diurnal correction
A separate correction for the part of model error that follows the sun, e.g., bias that's different at 3 AM than at 3 PM. Currently active only for high cloud (ch) and cloud cover (cc).
How diurnal correction works
Bins historical errors by hour-of-day (0–23) and fits the persistent pattern at each hour. Most fields don't have a clean diurnal signal once L2/L3 have run, so this layer is whitelisted to just high cloud (ch) and cloud cover (cc, added 2026-06-24 v0.6.214), the only fields where it consistently beats L3 alone on held-out data. Disabled-field fits are still computed below for diagnostic purposes.
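The hour-of-day binning can be sketched as a per-hour mean of post-L3 residuals, applied only to whitelisted fields. A minimal sketch under assumptions: the function names are hypothetical, the fit is a plain unweighted mean (the production fit may weight by recency), and the sign convention is forecast minus observation. The whitelist is the L4 whitelist stated in the production-stack summary (ch, cc).

```python
from collections import defaultdict

L4_WHITELIST = {"ch", "cc"}

def fit_diurnal_bias(residuals):
    """residuals: iterable of (local_hour, post_l3_forecast - observed).

    Returns {hour: mean residual} for every hour that has data.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, err in residuals:
        sums[hour % 24] += err
        counts[hour % 24] += 1
    return {h: sums[h] / counts[h] for h in counts}

def apply_l4(field, value, local_hour, table):
    """Subtract the hour-of-day bias for whitelisted fields; pass others through."""
    if field not in L4_WHITELIST:
        return value
    return value - table.get(local_hour % 24, 0.0)

# Made-up residuals: high cloud runs +5pp too cloudy at 15:00 local.
table = fit_diurnal_bias([(15, 4.0), (15, 6.0), (3, -2.0)])
print(apply_l4("ch", 50.0, 15, table), apply_l4("t", 70.0, 15, table))  # 45.0 70.0
```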
⏸ Layer 4 — per-field whitelist.
L4 corrects the portion of forecast error that repeats with time of day. It is the hardest layer to earn because the same hour-of-day bias must recur consistently across many days. Most fields fail that test because their dominant errors are driven by changing weather regimes (air mass, cloud regime, frontal timing, etc.) rather than the clock. Cloud cover (cc) and high cloud (ch) are the two exceptions, showing a sufficiently stable diurnal signal to pass the held-out audit and earn promotion. The remaining plots are retained as diagnostics to watch for new hour-of-day structure that may justify future promotion.
4a. Diurnal correction curves over time — historical fits
Layer 5 — Synoptic-regime correction (solar)
A per-regime W/m² delta applied to direct solar radiation. The classifier reads current wind direction, speed, pressure trend, hour, and temp; the regime label keys into a calibrated per-regime delta. L1–L4 are trained on general bias correction; L5 is the first layer trained on synoptic-state stratification — different "kinds of weather" get different corrections.
How L5 works (compact summary)
Each tick, solar_correction.stamp_solar_correction() classifies the current synoptic regime (one of nw_flow / ne_flow / sw_flow / se_flow / sea_breeze / frontal / pre_frontal / nor_easter / calm) using regime_classifier.classify_synoptic_regime(). It then looks up the calibrated delta for that regime and applies it to every lead's direct_radiation where the lead's raw value is above SUN_UP_THRESHOLD (50 W/m²). The live moment's raw also gates the delta — if it's pre-sunrise locally, delta = 0 regardless of regime.
Live snapshot reads via weather_data["solar_correction"], which exposes candidate_delta_wm2, applied, and the classified regime metadata. Per-layer attribution lives in direct_radiation_post_l4 (pre-L5 array, preserved before mutation) and live direct_radiation (post-L5). The snapshot writer emits sr_l4 and sr_l5 separately so the Fitter can audit L5 vs L4 cleanly.
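The gating and attribution described above can be sketched as follows. This is not the body of stamp_solar_correction(): apply_l5 and EXAMPLE_DELTAS are hypothetical, the delta values are placeholders (the calibrated ones live in solar_correction.REGIME_DELTAS and are not reproduced here), and both the additive sign and the clamp at zero are assumptions. The 50 W/m² threshold, the live-moment gate and the preserved pre-L5 array follow the text.

```python
SUN_UP_THRESHOLD = 50.0   # W/m², as stated above

# Placeholder deltas for illustration only.
EXAMPLE_DELTAS = {"sea_breeze": -40.0, "nw_flow": 15.0}

def apply_l5(direct_radiation, regime, deltas=EXAMPLE_DELTAS):
    """Return (post_l4, post_l5, delta) for one tick's hourly direct radiation."""
    post_l4 = list(direct_radiation)          # preserved before mutation
    delta = deltas.get(regime, 0.0)
    if not post_l4 or post_l4[0] <= SUN_UP_THRESHOLD:
        delta = 0.0                           # live moment is dark: no correction
    post_l5 = [max(0.0, v + delta) if v > SUN_UP_THRESHOLD else v for v in post_l4]
    return post_l4, post_l5, delta

post_l4, post_l5, delta = apply_l5([400.0, 30.0, 600.0], "sea_breeze")
print(post_l4, post_l5, delta)
```

Keeping post_l4 as a separate copy is what lets a snapshot writer emit the L4 and L5 columns independently, which is the attribution bug the v0.6.249 fix addressed.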
Where we are (2026-06-29):
Shipped 2026-06-28 v0.6.248 — ENABLED=True after the L5 gate cleared 7/7 ship days (12-cycle SHIP streak).
Attribution fix v0.6.249 — initial ship was silently absorbing L5 into the L4 column (same bug shape as the L6-absorbed-into-L2 issue). Fixed by preserving direct_radiation_post_l4 before mutation and adding sr_l5 to the snapshot writer.
Chart wiring v0.6.250 — sr card now shows an amber L5 line + column + "L5 ✓ synoptic" badge. _layersFor() filters the L5 entry to the sr card only.
Subsections 5a–5d (live state, regime classifier output, per-regime delta table, audit panel) — placeholders below; full build coming with the specialists refactor planned post-2026-06-30.
5a. Live correction — what is being applied right now
Placeholder — populated with the specialists refactor (post-2026-06-30). Today, see weather_data["solar_correction"] in the snapshot for live state.
5b. Regime classifier — current state breakdown
Placeholder — populated with the specialists refactor. Today, the classifier state is logged each tick in solar_correction.regime.
5c. Per-regime delta table
Placeholder — populated with the specialists refactor. Today, see solar_correction.REGIME_DELTAS for the calibrated per-regime W/m² values.
5d. L5 vs L4 audit (held-out MAE)
Placeholder — populated with the specialists refactor. Today, the per-layer Forecast Accuracy chart's sr card is the canonical audit; first clean read ~2026-07-05.
Layer 6 — Microclimate correction (temperature)
A small Δ°F added to the temperature forecast based on a within-network spatial signal: how much the waterfront stations (Willow Rd, Neptune Rd) typically diverge from the inland-network median under different wind / sea-breeze / hour combinations. L1–L5 are trained on forecast-vs-aggregated-obs errors; L6 is the first layer trained on a spatial differential between station subgroups. Two physical regimes: cove warms a few °F under S/SE/SW sea-breeze (peninsula-lee heating) and cools a few °F during 09–16 EDT when the sea breeze is inactive (peak −3.7 °F around 12:00 EDT; a cool marine pool over Salem Sound persists while inland warms).
How L6 works
Each forecast lead gets its own Δ°F. For lead i, the collector reads the forecast wind direction (hourly.wind_direction[i]), parses the local hour from hourly.times[i], and applies a heuristic sb_active (on during 13–18 EDT with S-half wind, off otherwise — coarser than the live sb detector but the only fields we have forecast for). It then looks up a Δ°F from one of two tables — (sb_active, wind_octant) when sb is on, (hour_of_day) when sb is off — and adds it to that lead's corrected_temperature cell. Tables built from a 12-day waterfront-vs-inland gradient log (cove_gradient_log.json, n=1,732); cleared a 2-read confirmation gate on r5_cove_analysis.py (2026-06-25 SHIP + 2026-06-26 SHIP, both regime tests PASS).
Why per-lead matters. The first ship (v0.6.231) applied the current-tick Δ to all 48 leads — wrong by 3–5°F at distant leads when the table swing crossed zero (e.g. applying noon's −3.7°F to a midnight lead). Per-lead projection (v0.6.237) fixed this. Across the 48-hour horizon the per-lead Δ now ranges roughly −3.7 to +2.0°F with a near-zero mean, exactly as expected from the lookup-table shape.
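The per-lead lookup can be sketched as follows. A minimal sketch under assumptions: the function names and table layouts are hypothetical, "S-half wind" is read as 90°–270°, the 13–18 window is treated as inclusive, and every table value except the −3.7 °F noon entry quoted on this page is a placeholder (the calibrated tables are the ones listed in section 6b).

```python
OCTANTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def octant_name(wind_dir_deg):
    """Map a wind direction to one of 8 octant labels (assumed 45-degree sectors)."""
    return OCTANTS[int(((wind_dir_deg + 22.5) % 360) // 45)]

def heuristic_sb_active(local_hour, wind_dir_deg):
    """Coarse sea-breeze flag: afternoon hours with a southerly-half wind."""
    return 13 <= local_hour <= 18 and 90.0 <= wind_dir_deg % 360 <= 270.0

# Placeholder tables; only the noon value is taken from this page.
SB_ON_DELTA_F = {"SE": 1.5, "S": 2.0, "SW": 1.5}
SB_OFF_DELTA_F = {12: -3.7, 0: 0.3}

def l6_deltas(local_hours, wind_dirs):
    """One delta per lead, keyed on that lead's own projected regime."""
    out = []
    for hour, wind_dir in zip(local_hours, wind_dirs):
        if heuristic_sb_active(hour, wind_dir):
            out.append(SB_ON_DELTA_F.get(octant_name(wind_dir), 0.0))
        else:
            out.append(SB_OFF_DELTA_F.get(hour, 0.0))
    return out

def apply_l6(corrected_temperature, local_hours, wind_dirs):
    """Add each lead's delta to that lead's corrected temperature."""
    deltas = l6_deltas(local_hours, wind_dirs)
    return [t + d for t, d in zip(corrected_temperature, deltas)]

# Noon offshore, mid-afternoon southerly, midnight offshore.
print(l6_deltas([12, 15, 0], [300.0, 180.0, 300.0]))  # [-3.7, 2.0, 0.3]
```

The broken first ship corresponds to computing one delta for the current tick and adding it to every lead, which is how noon's −3.7 °F ended up on midnight leads.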
Where L6 is evaluated. Two places: (1) 6d below uses cove_gradient_log.json directly (cove-specific obs vs displayed temp); (2) the Forecast Accuracy chart's temperature card has a green L6 line built from pair-log rows with error_l6. Pre-deploy pair rows (06-26 ~08:00 → 06-26 17:19) used the broken uniform-Δ implementation and are filtered out of the Fitter's L6 aggregation — the L6 line will populate only with post-v0.6.237 rows. Full-window clean read by 2026-07-03 once the bad rows age out naturally.
6a. Live correction — what is being applied right now
6b. Lookup tables — full per-regime Δ°F catalog
6c. Waterfront-vs-inland Δ over time
6d. L6 evaluation — cove obs vs (L4 only) vs (L4 + L6)
Research & Diagnostics — experimental signals + audit views (not applied to live forecast)
Diagnostics
R0. L3/L4 audit table — is each layer earning its keep?
What this is: the headline diagnostic for the correction stack, showing average held-out MAE per field, per layer (across leads 1–47). Recomputed every Fitter cycle. Each MAE cell shows the size of the typical error; the dim subtext next to it is the bias (signed mean error), revealing systematic offsets MAE alone can hide. The Δ columns compare each layer to the one below: green means it beats AND it's applied; amber means it beats but is NOT applied (missed opportunity, matches the orange banner color); red means it loses; gray "0.00" means tie. The Applied? columns are color-coded Yes (green) / No (red) / "—" (gray, n/a for L2 fields that don't have an obs network). Two banners watch for trouble. The red one fires if any enabled layer is losing by >3% in some lead band (1–6h / 6–24h / 24–47h), which catches regressions hidden by overall averages. The orange one fires if any disabled layer is winning by >3% in some band, which catches opportunities we're leaving on the table. Both list exactly which (field, layer, band) combinations triggered, so the alerts are actionable.
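The color and banner rules above can be sketched in a few lines of Python. This is a sketch under assumptions: the row schema (field, layer, band, mae_layer, mae_below, applied) and function names are hypothetical, a tie is taken to mean the Δ rounds to 0.00, and "red" is assumed to apply to any losing layer whether or not it is enabled.

```python
def delta_color(mae_layer: float, mae_below: float, applied: bool) -> str:
    """Color for one delta cell, comparing a layer to the layer below it."""
    diff = mae_layer - mae_below
    if round(diff, 2) == 0:          # displays as "0.00"
        return "gray"
    if diff < 0:                     # lower MAE: the layer beats the one below
        return "green" if applied else "amber"
    return "red"


def banners(rows: list, threshold: float = 0.03) -> tuple:
    """Return (red, orange) lists of (field, layer, band) triggers.

    red: an enabled layer loses by more than `threshold` in a lead band.
    orange: a disabled layer wins by more than `threshold` in a lead band.
    """
    red, orange = [], []
    for r in rows:
        if r["mae_below"] <= 0:
            continue  # relative gain is undefined for this cell
        gain = (r["mae_below"] - r["mae_layer"]) / r["mae_below"]
        key = (r["field"], r["layer"], r["band"])
        if r["applied"] and gain < -threshold:
            red.append(key)
        elif not r["applied"] and gain > threshold:
            orange.append(key)
    return red, orange
```

Evaluating per lead band rather than on the overall average is what lets a banner fire even when a loss in one band is offset by a win in another.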
D1. Drill-down — see each correction layer build up (teaching view)
What this is: a visual build-up of the live forecast, layer by layer. Pick a field, pick layers, hit Play to watch L1 → L2 → L3 → L4 animate onto the chart. For fields where the L3/L4 whitelist disables a layer, that layer's line sits exactly on top of the layer below — there's no correction being applied, so nothing changes visually. Useful for sanity-checking a specific field's stack when something looks off elsewhere.
Fields
Layers
Active hypotheses
R2. State-stratified accuracy — which regimes does the model fail in? (active)
What this is: per-field MAE sliced by atmospheric regime (wind direction, wind speed, cloud cover, pressure trend, flow regime, synoptic pattern). When a field's MAE varies a lot across regime bins, that's a sign a regime-aware correction layer could help — different regimes need different corrections, and a one-size-fits-all decay correction (L3) misses the structure. Where the current ranking points: solar dominates the top 5 opportunities — synoptic regime gives a ~144 W/m² bias spread between best and worst bins. L5 (regime-aware solar) in solar_correction.py was built off this signal (gated off; see G1). Re-fit twice daily by the Fitter, published to state_stratified_accuracy.json.
R6. Regime-transition penalty (Stage 2 — auto-wired in Fitter)
Hypothesis: pairs where state_fc.regime_synoptic (regime the model predicted for the obs hour) differs from state_obs.regime_synoptic (regime that actually materialized) show materially worse MAE than "stable" pairs where the regimes agree. If true, the system should widen confidence bands when the model itself signals a regime transition in the forecast window. Data: pair log + state metadata.
✓ Promotion gate passed (7-window agreement)
Promotion gate run via analysis/simulate_windows.py across 7 trailing daily cutoffs on 7-day windows. All 7 cutoffs returned SHIP. ~25 of 56 (field × lead band) buckets show ≥10% transition penalty. Strongest effects: wind speed +73% at 0-5h, wind direction +45–72% across all bands, wind gust +63% at 0-5h, temperature +12–24% at 0-23h. ~40% of pairs are "transition" pairs. Solar at 12-23h improves by ~19% on transition pairs (transitioning-to-clear is easier than stable-cloudy).
Wired into the stack: R6 is C1a (regime-synoptic transition axis) — drives C1's confidence-widening table. Verdict logged under conditional_audits.r6 on every Fitter cycle; surfaced via S1 alongside L5.
Manual script: analysis/regime_transition_audit.py (single-window detail). Promotion gate: analysis/simulate_windows.py (7-window agreement test). Live verdict appears in S1 below.
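For illustration, the transition penalty in the R6 hypothesis can be computed roughly as follows. This is a sketch, not the contents of analysis/regime_transition_audit.py: the nested state_fc / state_obs regime_synoptic fields come from the text, while the row keys "field", "band", and "error" and the function name are assumptions.

```python
from collections import defaultdict


def transition_penalty(pairs: list) -> dict:
    """Percent MAE penalty of transition pairs vs stable pairs per (field, band).

    A pair is a "transition" pair when the forecast regime differs from the
    regime that actually materialized, and "stable" when they agree.
    """
    # (field, band) -> kind -> [sum of |error|, count]
    acc = defaultdict(lambda: {"transition": [0.0, 0], "stable": [0.0, 0]})
    for p in pairs:
        same = p["state_fc"]["regime_synoptic"] == p["state_obs"]["regime_synoptic"]
        cell = acc[(p["field"], p["band"])]["stable" if same else "transition"]
        cell[0] += abs(p["error"])
        cell[1] += 1

    out = {}
    for key, kinds in acc.items():
        t_sum, t_n = kinds["transition"]
        s_sum, s_n = kinds["stable"]
        if t_n == 0 or s_n == 0 or s_sum == 0:
            continue  # bucket cannot be compared
        out[key] = 100.0 * ((t_sum / t_n) / (s_sum / s_n) - 1.0)
    return out
```

A positive value means transition pairs are worse than stable ones; a negative value (as reported for solar at 12-23h) means transition pairs are easier.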
Backlog — Stage 1 candidates (curated text, not yet running)
What these are: hypothesis ideas at Stage 1 of the promotion pipeline — written down, not yet wired to a script. Promotion path: Stage 1 (this list) → Stage 2 (one-off analysis/* script + verdict) → Stage 3 (auto-wired in Fitter, logged per cycle) → Stage 4 (shipped layer or confidence widening).
Framing (2026-06-20): the 8 ideas Joe listed don't peer-rank cleanly as "8 future layers." Most surviving hypotheses measure forecast uncertainty, not forecast bias — they belong as axes of C1, not as standalone L7/L8. The two real bias-correction candidates (marine layer, radiational cooling) overlap heavily with L2 (waterfront capture) and L4 (diurnal). Grouped below by how they should actually be worked.
Group A — C1 multi-axis confidence extension
C1 v1 widens/narrows confidence on a single axis (regime-synoptic transition, shipped 06-19). v2 multi-axis plumbing (cluster-spread + pressure-tendency) shipped 06-20 in v0.6.151. KBOS-vs-KBVY cloud disagreement is the third axis candidate, waiting on dual-source data. All feed the C1 Stage 3.5 calibration audit on 2026-06-26.
[🟢 Auto-wired · confidence_layer.py v2]★ Station-cluster disagreement (Joe top-3) — STAGE 3 (auto-wired 2026-06-20 v0.6.149). Smoke test on 2-day overlap: 18/20 (field, band) combos Q4/Q1 MAE ratio ≥1.20 (temp 0-5h hit 3.14×). Orthogonality vs R6 transition flag: 16/20 cells ORTHOGONAL — P(transition | high spread) only +9.5 pp above P(transition | low spread), so spread is a genuinely independent signal. Persistent logger live, ~144 ticks/day, axis is now classified live in confidence_layer.py v2.
[🟢 Auto-wired · confidence_layer.py v2]★ Pressure-tendency regime — STAGE 2 PROMOTE (verdict 2026-06-20). dP/dt (falling_fast / falling / flat / rising) as orthogonal regime axis. No logger needed; state_fc.pressure_trend_hpa_3h already on every pair-log row, 30-day history available. Verdict on 745k pairs: 14 ORTHOGONAL, 16 REDUNDANT, 2 CONFOUNDED. Signal is concentrated at short leads (0-11h) — long leads don't care about current pressure trend, physically intuitive. Most striking: P(R6 transition | falling_fast) = 19.9% vs P(R6 transition | flat) = 62.4% — the two signals anti-correlate, confirming independence. Axis now classified live in confidence_layer.py v2.
KBOS-vs-KBVY cloud disagreement. Same idea applied to v0.6.134 blended cloud obs. Smoke test gated to 2026-06-26 (need 7 days of dual-source pair-log data). Companion to cluster-spread; if alive → 4th C1 axis (C1d).
[🟢 Auto-wired · confidence_layer.py v3] [STAGE 2 SHIPPED 2026-06-24]★ Forecast precip_fc>0 as C1f axis, broad scope (SHIPPED 2026-06-24 v0.6.215). analysis/h_forecast_coherence.py classified each pair by (precip_fc, cc_obs) cross-tab. The "incoherent_dry_obs" cell (model says rain but obs reports clear sky, n=257) showed extreme MAE elevation on every field (cl +959%, pa +674%, cm +547%, cc +139%, t +89%, h +76%), but on a small sample. Wider finding from coherent_wet cell (precip_fc>0, cc_obs≥30%, n=5,520): ch +257%, cm +423%, cl +736%, the same direction at large sample. Generalization: state_fc.precip_in>0 alone is a confidence-widening axis independent of regime/transition. Orthogonality check (analysis/h_precip_fc_orthogonality.py) confirmed: 13 ORTHOGONAL cells vs C1a, 8 ORTHOGONAL vs C1e, 21 orthogonal cells total across t/h/ws/wg/cl/cm/ch. 2026-06-24 re-run: 14 vs C1a + 9 vs C1e = 23 ortho cells (↑2 from 06-23). Verdict strengthened on fresh window. cl is the standout (3.5-3.7× elevation, clean across both checks). cc is REDUNDANT (definitionally correlated with precip_fc, which makes sense). Strongest C1 axis result of today's exploration. Implementation (SHIPPED 2026-06-24 v0.6.215): 4th axis added to analysis/c1_confidence_calibration_v2.py (state_fc.precip_in > 0 → "p1" else "p0") and confidence_layer.py (live per-band lookup of hourly.precipitation across each band's lead window). Regenerated curated v3 table on 14-day window (1.29M pairs scanned, 296,898 multi-axis pairs joined): 296 SHIP / 42 MARGINAL / 1048 SKIP across 39 axis-keys. Top SHIP-bearing keys: Q23::rising::transition::p0 (43 cells), Q23::rising::stable::p0 (41), Q1::rising::transition::p0 (41). p1 cells are sparser by construction (precip_fc > 0 ~5-10% of time), so most p1 cells SKIP on sample floor. ENABLED=False, so bands stamp but the UI doesn't yet consume them; the Stage 4 gate is the UI calibration audit. Re-confirm 2026-06-29 alongside walk-forward #4.
[✗ KILLED 2026-06-24 v0.6.217 · moved to Retired] [KILLED]★ C1g — RH ≥95% fog regime axis (KILLED 2026-06-24 via orthogonality). Stage 0 (analysis/h_rh_saturation.py) stratified MAE by observed humidity bin. When state_obs.humidity ≥95% (atmosphere at saturation, fog regime): cm +158%, ch +139%, pa +4649% MAE elevation; cloud cover saturates (-59%, model less wrong because clouds are usually there); t/dp converge (fog clamps temp toward dewpoint). Wind unaffected. 2026-06-24 re-run: cm +134% (was +158), ch +149%, pa +2893%, cl saturating +67%. Magnitudes within 20% of 06-23 reads — direction-stable. cl saturating positive is a new flag. Why physically meaningful: saturation regime has different radiative + microphysical behavior than dry air; the model treats it as continuous with humid air but the actual variance structure changes. Architectural slot: obs-keyed axis in confidence_layer.py v3 alongside C1a/C1b/C1c/C1f. 2026-06-24 orthogonality check (h_c1g_orthogonality.py) KILLED the hypothesis: 1 ortho / 69 redundant / 0 confounded / 2 ambiguous across 72 cells (9 fields × 4 bands × 2 axes-to-test). The Stage 0 cm/ch elevation was a sampling artifact — fog (obs_humidity≥95) heavily co-occurs with both rain-forecast (C1f) and cc_fc≥95 (cc-saturation). When you marginalize over the unused axis and inspect the F=False or S=False subsets, fog rows actually have smaller error than non-fog (ratio 0.02–0.25× across cl/cm/ch). The independent signal isn't there. Moved to Retired per canon rule.
[⚫ Manual · last run 2026-06-24] [Tier 3]★ C1h: Forecast trend-direction widening (Stage 1, NEW 2026-06-23). Stage 0 (analysis/h_trend_direction.py) compared MAE under three forecast-trend states: rising (Δ>threshold over next 6h), falling, stable. Result: when model commits to a sharp 0→6h change, accuracy collapses. cl rising +1030% MAE vs stable baseline (n=190, small but striking), cm rising +315%, ch rising +91%, cc rising +64%, t rising +51%. 2026-06-24 re-run: cl rising +999% (n=194), cm +299%, ch +78%, cc +64%, t +53%. Two-read direction-stable; cl, cc, and t magnitudes are within ~4% of the 06-23 read, while cm and ch eased by ~5% and ~14%. n still small per class, 30d-window confirmation still needed. Stable forecasts are dramatically better than either trending direction. Falling trends slightly less extreme than rising for clouds. Physically: the model is calibrated to confidently extrapolate persistence; committing to a change is a higher-uncertainty call. Caveat: only 2 reads, one day apart; sample sizes small per class. Magnitudes need 30d-window confirmation before architectural commitment. Architectural slot: compute Δ between forecast at lead 0 and lead 6 (or per-hour rolling); widen C1 bands when |Δ|>threshold. Open question: orthogonality vs C1f (precip_fc>0 often coincides with rising cloud forecast) and vs C1e (post-frontal periods involve cloud transitions). Re-confirm via 30d window in 2-3 weeks.
[⚫ Manual · last run 2026-06-24] [Tier 2→3? weakening on 06-24 re-run]★ Time-relative-to-front as C1e axis: bidirectional, narrow scope (Stage 1, NEW 2026-06-23, refined 06-23). `analysis/h_hours_since_front.py` joined 306,612 pair-log rows with `frontal_events_log.json` (6 passages spanning 06-17 to 06-22): post-frontal 24h-window cloud MAE elevated (ch +253-281%, cc +94/+117/+54/+7%, cm +60/+47/+43/+21%, h +23/+32/+16/+3%, t mixed, wind counter-intuitively LOWER). Orthogonality check (Stage 1.5) ran same-day: analysis/h_hsf_orthogonality.py cross-tabbed each (field, band) by hsf_group × C1a transition flag. Result: 6 ORTHOGONAL / 23 REDUNDANT / 4 CONFOUNDED / 3 AMBIGUOUS. 2026-06-24 re-run weakened: 3 ORTHO / 22 REDUNDANT / 6 CONFOUNDED / 5 AMBIGUOUS. Only ch (3/4 bands) holds orthogonality; cc lost it. cm gained CONFOUNDED. Signal is degrading as the 06-17→-22 frontal cluster ages out of the window; it needs more frontal passages. The 6 ORTHOGONAL cells from the first (06-23) read are tightly concentrated: ch (all 4 bands, stable ratio 2.08-2.91×) and cc (12-23h, 24-47h, stable 1.45-1.62×). Everything else (temp, wind, humidity, dewpoint, cl, cm, short-lead cc) is redundant with C1a (the regime-transition signal already captures whatever post-frontal effect those fields have). Wind being "lower" post-frontal was the redundancy giving itself away. Verdict: PROMOTE as narrow C1e covering only ch (all bands) + cc (long-lead). Not a generic axis. Also notable: when both axes fire (post-frontal AND C1a transition), ch MAE hits 6.64× baseline at 6-11h, so the two signals compound. Re-confirm 2026-06-29 alongside walk-forward #4 and the other Group A/D Stage 1 candidates. Caveat: only 6 frontal passages in sample; magnitudes are direction-stable but size needs more passages. If signal holds, Stage 2 wires hsf into confidence_layer.py as a third (ch) or fourth (cc) axis cell with the same axis-key composition pattern as C1b/C1c.
Pre-frontal companion (Stage 1, NEW 2026-06-23): analysis/h_pre_frontal.py measured hours-UNTIL-next-front (approaching wind shift). Pattern is the mirror of post-frontal: pre-frontal hits wind hardest (ws +143% at 3-6h, wg +138%) while temp/humidity MAE DROPS (-30 to -47%). Different physics: approaching wind not yet modeled. Orthogonality check (analysis/h_pre_front_orthogonality.py) returned 8 ortho cells (4 vs C1a, 4 vs C1e), but all 8 land in the 24-47h lead band (short-lead × pre-frontal × no-transition × no-post is too sparse to evaluate). Narrow promote: extend C1e to bidirectional via a `time_to_nearest_front_h` signed value; widen wind/cm bands at long-lead when |Δt|<24h. Re-confirm 2026-06-29.
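The Group A entries above refer to composite axis-keys such as Q23::rising::transition::p0 (Q × pt × trans × c1f). A small Python sketch of how such a key could be composed is below. The key layout and the "p1"/"p0" rule come from the text; the function name, the quartile input, and the merging of the two middle quartiles into "Q23" are assumptions inferred from the key names, not confidence_layer.py's actual logic.

```python
PRESSURE_TRENDS = {"falling_fast", "falling", "flat", "rising"}


def axis_key(spread_quartile: int, pressure_trend: str,
             is_transition: bool, precip_in: float) -> str:
    """Compose a 4-axis C1 lookup key: Q x pt x trans x c1f."""
    if spread_quartile not in (1, 2, 3, 4):
        raise ValueError("spread_quartile must be 1..4")
    if pressure_trend not in PRESSURE_TRENDS:
        raise ValueError(f"unknown pressure trend: {pressure_trend}")
    # Assumption: the two middle quartiles share one bin, as "Q23" suggests.
    q = "Q23" if spread_quartile in (2, 3) else f"Q{spread_quartile}"
    trans = "transition" if is_transition else "stable"
    c1f = "p1" if precip_in > 0 else "p0"
    return "::".join([q, pressure_trend, trans, c1f])
```

A string key keeps the curated table a flat dictionary, so adding an axis only changes how the key is built, not how it is looked up.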
Group B — Bias candidates (paced, individually)
★ Cloud-ceiling regime correction (Joe top-3). cc/cl/cm/ch L3/L4 walk-forward read is already gated to ~2026-06-26 — runs alongside it. Test whether errors differ by low-cloud-present vs mid/high-only and by ceiling-trend direction. May come out as a bias or as a confidence axis; the same script answers both.
[🟡 Hybrid · 🔒 sandbox stamps every tick in collector.py, ENABLED=False · 🟢 Stage 2.5 daily watch auto-wired in conditional_audits.marine_layer_watch]★ Marine-layer / harbor inversion correction (Joe top-3) — STAGE 3 SANDBOX live (v0.6.199, field-name bugfix v0.6.201, ENABLED=False); STAGE 2.5 daily watch parallel (v0.6.182). Stage 3 sandbox = weather_collector/processors/marine_layer_correction.py stamps weather_data["marine_layer_correction"] every tick with per-lead candidate deltas (-18.8% at 6-11h, -31.8% at 12-23h, -35.5% at 24-47h, 0-5h skipped). Gate: wd ∈ [45°, 105°) AND hour_local ∈ [4, 9), applied per forecast lead. Cap: 40% magnitude. Code is one-flag-flip from live; ENABLED stays False until the weekly Sun-morning re-reads (06-28 / 07-05 / 07-12) confirm. Stage 2.5 daily-watch in conditional_audits.marine_layer_watch continues to log the bias every Fitter cycle for daily visibility on the same signal. Pilot stratification: NE flow (wd 45-105) × morning hours (4-9 EDT). Finding: cloud-cover forecast over-calls by +28.1 mean / +25.0 median in NE+morn (n=3,119) vs near-zero bias in all other strata. Temp/dewpoint/humidity NOT elevated — this is a cloud-skill bias, not a temp bias. Invisible to global L3/L4 walk-forward (consistent L2-only-recommend for cc/cl) because it lives in ~3% of conditions. Stage 2 verifications: bias is robust to bin perturbation (±15° wd, ±1h hour all in +20 to +29 range), strongly lead-dependent (−2.0 at 0-5h → +35.5 at 24-47h), and temporally non-stationary (ISO-week trend: W23 +11.6 → W24 +32.5 → W25 +38.0). Open question: seasonal stabilization or episode-driven? If W26-W28 stay in +25-+40, flip marine_layer_correction.ENABLED=True by mid-July. Architectural slot: sibling to L5 (regime-conditional bias correction); L7 vs generalize-L5 deferred until promotion.
Clear-night radiational cooling. Bins are physical and clean: clear sky + low wind + dry air. Narrow window (only nights, only ~30% of nights match) — sample size may be thin after stratification. Reasonable Q3 candidate.
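The marine-layer sandbox gate described in Group B can be sketched as follows. The gate bounds (wd in [45°, 105°), local hour in [4, 9)), the per-band deltas, the 40 cap, and the ENABLED flag come from the text; reading the deltas as percentage-point shifts to forecast cloud cover, the band edges as half-open hour ranges, and all function names are assumptions made for illustration.

```python
ENABLED = False          # sandbox: candidates are stamped, not applied
CAP = 40.0               # maximum correction magnitude

# (lead_lo, lead_hi) in hours, half-open; the 0-5h band is skipped.
BAND_DELTAS = [((6, 12), -18.8), ((12, 24), -31.8), ((24, 48), -35.5)]


def marine_layer_delta(lead_h: int, wd: float, hour_local: int) -> float:
    """Candidate delta for one forecast lead; 0.0 when the gate is closed."""
    in_gate = 45 <= (wd % 360) < 105 and 4 <= hour_local < 9
    if not in_gate:
        return 0.0
    for (lo, hi), delta in BAND_DELTAS:
        if lo <= lead_h < hi:
            return max(-CAP, min(CAP, delta))
    return 0.0


def stamp(cc_fc: float, lead_h: int, wd: float, hour_local: int) -> dict:
    """Record what the correction would do; only apply it when ENABLED."""
    delta = marine_layer_delta(lead_h, wd, hour_local)
    candidate = max(0.0, min(100.0, cc_fc + delta))
    return {
        "delta": delta,
        "candidate": candidate,
        "value": candidate if ENABLED else cc_fc,
    }
```

Stamping both the candidate and the unchanged value is what makes the flag flip a one-line change once the weekly re-reads confirm.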
Group C — Lower priority (dominated by existing layers)
Wind-direction sector correction for gusts. L3 already does heavy lifting on ws/wg (in production whitelist). The bar is "does directional structure survive after L3 has eaten the bulk signal?" — high bar. Worth revisiting if Group A/B come up empty.
Sea-breeze onset/decay timing (5-phase). Overlaps marine layer (#1) on onshore-flow hours and L4 diurnal on timing. If marine layer pilots out, this folds in; if it doesn't, this probably doesn't either.
📊 Stage 1 candidates — 2026-06-24 prioritization
Joe's call (2026-06-23): the Stage 1 queue is deep enough that not all candidates can ship at once. Manual re-runs over the next 2-4 weeks will decide which graduate to Stage 2 implementation. Tier system below ranks by expected value (probability × user-visible impact × implementation effort). 2026-06-24 batch re-run: all 8 manuals re-fired with fresh data; cc→L4 recovered to +5.0% (2nd ≥3% read), C1f +2 ortho, K-taper held, dp depression added nor_easter +3.79°F, C1e post weakened (6→3 ortho).
STAGE 2 SHIPPED 2026-06-24 v0.6.214 — cc added to L4_FIELDS in decay_apply.py:70. Last sim read: cc +5.0%, cm +3.0% (06-24). cm rides along candidate; confirm 06-29.
✓ promoted; live next collector tick
2 | Cloud saturation-unbiasing | ⚫ Manual | 2026-06-24 (h_cloud_floor_ceiling.py): cl 95-100 +57.5pp (was +63.4); cc +31.2, cm +55.1, ch +49.9. Direction-stable. | t-vs-dp attribution clear + signal holds (✓ direction-stable)
KILLED | Wind shift rate (Δwd_3h) | retired | 2026-06-24 (h_wind_shift_rate_orthogonality.py): 1 ortho / 22 redundant / 2 confounded / 11 ambiguous; C1a captures the signal | ✗ killed, moved to Retired section
Legend: 🟢 Auto-wired (runs every Fitter cycle or tick) · 🟡 Hybrid (partially wired, e.g. stamp lives but verdict needs manual replay) · ⚫ Manual (Stage 0/1 — decisions wait on hand-run scripts) · 🔒 Gated off (built + deployed, ENABLED=False).
Total in pipeline: 6 active candidates (was 10 this morning). Today's deltas: cc→L4 SHIPPED Stage 2 (v0.6.214), C1f SHIPPED Stage 2 (v0.6.215), wind_shift_rate KILLED (v0.6.216), C1g KILLED (v0.6.217), Humidity K-taper SHIPPED Stage 2 (v0.6.218). Tier 2 candidates promote to Tier 1 when their criterion holds across at least 2 manual re-reads spaced 3+ days apart. Tier 3 candidates need more evidence before deserving production-stack architectural commitments.
Group D — Methodological refinements (modify existing layers, not new ones)
[🟢 Auto-wired · corrected_hourly.py v0.6.218] [STAGE 2 SHIPPED 2026-06-24]★ Lead-conditional L2 Kalman gain, humidity only (Joe top-3, SHIPPED 2026-06-24 v0.6.218). Additive L2 bias on (t, h, pr) was applied with flat K across all 48 forecast leads before this change. Initial Stage 0 (analysis/h_lead_l2.py, 3 cutoffs 06-15/-18/-22) showed dramatic per-lead gain decay: h +45/+47/+45% at 0-5h → -2.9 to +12% at 24-47h; t +16/+22% at 0-5h → ~0 at 24-47h; pr +10-16% at 0-5h → ~0 at 24-47h. But the actual K-taper simulation (analysis/h_lead_l2_ktaper_sim.py, modeling new_forecast = L1 + (L2-L1)×ramp(lead)) on today's window shows the only field that actually ships as net-win is humidity: h gains +7.75% with soft_ramp (taper 100% at lead 0 → 40% floor at lead 24); t and pr show ≤0.3% drift across all ramp shapes, so flat K is optimal for them. 2026-06-24 re-run: h soft_ramp +6.60% (was +7.75% on 06-22; a slight regression but still well above the 5% ship floor); t/pr still flat (≤0.5% drift). Direction-stable across two reads. SHIPPED 2026-06-24 v0.6.218: soft_ramp wired in corrected_hourly.py via new _soft_ramp_factors() helper. Curve: K(0h)=1.0 → K(6h)=0.85 → K(12h)=0.70 → K(18h)=0.55 → K(24h+)=0.40. The prior `corrected_humidity` decay used `exp(-lead/240)` which was effectively flat (~0.90 at lead 24, ~0.82 at lead 48); the soft_ramp pulls L2 humidity bias toward zero at long leads where the station-network signal is stale. t and pr stay flat-K (≤0.5% drift). l2_meta on the debug page now includes a `humidity_shape` block exposing the curve. Re-confirm 2026-06-29 walk-forward #4. Monitor the live audit table: if h L4 doesn't beat L3 by ≥3% over 7 days post-ship, revert. Lesson preserved: lead-band MAE shape ≠ shippable gain; always simulate the actual modification before promoting.
[🟢 Auto-wired · decay_apply.py:70 v0.6.214] [STAGE 2 SHIPPED 2026-06-24]★ Cloud cover → L4 whitelist (Joe top-3, SHIPPED 2026-06-24 v0.6.214). L4 (diurnal hour-of-day correction) currently excludes all cloud fields. Stage 0 stratification (analysis/h_cloud_diurnal.py) showed strong signed bias spreads across the 24-hour cycle: cc 33pp, ch 51pp, cl 24pp, cm 25pp. Clean sinusoidal shapes, textbook L4-catchable. Train/test simulation (analysis/h_cloud_l4_sim.py, 70/30 split): 06-22: cc +5.0%, ch +2.8%, cm +1.8%, cl -0.8%. 06-23: cc +2.7% (below 3% floor), ch +4.5%, cm +2.7%, cl -2.0%. 2026-06-24: cc +5.0% RECOVERED, cm +3.0%, ch +1.9%, cl -1.1%. cc passes 2-read ≥3% gate (06-22 +5.0 + 06-24 +5.0). cm reached the 3% floor only on the 06-24 read (+2.7% on 06-23), so it is close but not yet qualifying. ch dropped from +4.5 to +1.9, so the day-23 spike was an artifact. Verdict: SHIPPED 2026-06-24 v0.6.214, L4_FIELDS = {"ch", "cc"} in decay_apply.py:70. cc correction goes live next collector tick. cm rides along as secondary whitelist candidate (3.0% sits at the edge; confirm 06-29 before adding). cl stays disqualified. Monitor cc per-layer MAE on the live audit table over the next 7 days; if cc L4 MAE doesn't beat L3 by ≥3% in production, revert.
[⚫ Manual · last run 2026-06-24] [Tier 2]★ Cloud saturation-unbiasing (Stage 1, NEW 2026-06-23). `analysis/h_cloud_floor_ceiling.py` stratified cloud forecasts by forecast-value bin and measured signed bias against observation. Result: **dramatic asymmetry at the saturation extremes.** When model forecasts cc in 95-100% bin, observed cc averages **32.7 pp lower**. When it forecasts cl in 95-100%, observed averages **63.4 pp lower** (n=13,391). cm 95-100: -54.3 pp. ch 95-100: -47.5 pp. Floor side smaller but present (cc 0-5%: obs averages 21.5 pp higher). 2026-06-24 re-run: cl 95-100 bias +57.5pp (was +63.4), cc +31.2, cm +55.1, ch +49.9. All magnitudes within 5pp of 06-23 reads — direction-stable across both reads. cl 95-100 still ≤-50pp (2/3 needed for gate). **Architectural significance:** L3 (decay) conditions on (field, lead) and L4 (diurnal) conditions on (field, hour) — neither conditions on the forecast VALUE itself. The current correction stack structurally cannot fix saturation bias because every layer treats forecast value as a free variable, not as a stratification axis. **Stage 2 implementation candidate:** "saturation unbiasing" pre-correction applied before L3/L4, learning a per-forecast-value-bin signed shift that pulls 95-100 forecasts toward the observed mean and pushes 0-5 forecasts up. Could ship as L2.5 (between mesonet and decay) or as an L3-axis extension. Estimated impact based on bias magnitudes: 5-15% MAE reduction on cloud fields at the high-confidence-extremes where users care most ("is it definitely raining tomorrow?"). Re-confirm 2026-06-29; first read on whether the floor/ceiling shape is stable across windows.
[⚫ Manual · last run 2026-06-24] [Tier 3]Regime-conditional dewpoint depression correction (Stage 1, NEW 2026-06-22). Even when t and dp are individually corrected by L2/L3/L4, the depression (t − dp) — which drives fog formation, heat index, and comfort score — can still be systematically off because t and dp errors don't perfectly cancel. analysis/h_dewpoint_depression.py joined t-rows and dp-rows at each obs_time (n=121,922 joined pairs); overall depression |err|=4.36°F with near-zero overall bias, but stratified by observed regime: frontal regime shows -2.19°F depression bias (n=5,243), sea_breeze +1.45°F (n=5,119), sw_flow +1.41°F (n=15,661). Frontal: forecast says drier than reality (depression too large). Sea_breeze + SW flow: forecast says wetter than reality. Direction is physically sensible — fronts bring moist air the model doesn't fully resolve; sea breeze brings drier inland air through faster than the model assumes. 2026-06-24 re-run: frontal -1.98°F (was -2.19), sea_breeze +1.45 (exact match), sw_flow +1.40, plus NEW: nor_easter +3.79°F★ (n=279) — extreme but small-n. All three established regimes direction-stable; nor_easter is a new high-magnitude flag worth a third read to confirm. Re-confirm date: 2026-06-29. If signal holds, Stage 2 needs a sub-analysis: is the residual bias coming mostly from t or dp? L2/L3/L4 already correct t aggressively, so the residual likely lives in dp. Implementation = a small regime-conditional dp shift table (L5-shape, dp-only). Affects fog probability, feels-like, humidity-comfort scoring downstream — high user-visible impact even at modest correction magnitudes.
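The humidity K-taper in the first Group D entry can be illustrated with a short Python sketch. The knot values, the floor, the exp(-lead/240) prior, and the blend new_forecast = L1 + (L2 − L1) × ramp(lead) come from the text; linear interpolation between knots and the function names are assumptions, and this is not the shipped _soft_ramp_factors() helper.

```python
import math

# Knots stated in the text: (lead in hours, gain K).
KNOTS = [(0, 1.0), (6, 0.85), (12, 0.70), (18, 0.55), (24, 0.40)]


def soft_ramp(lead_h: float) -> float:
    """Gain K at a given lead, linearly interpolated between the knots."""
    if lead_h <= KNOTS[0][0]:
        return KNOTS[0][1]
    if lead_h >= KNOTS[-1][0]:
        return KNOTS[-1][1]      # 0.40 floor from lead 24 onward
    for (x0, y0), (x1, y1) in zip(KNOTS, KNOTS[1:]):
        if x0 <= lead_h <= x1:
            return y0 + (y1 - y0) * (lead_h - x0) / (x1 - x0)
    raise AssertionError("unreachable: knots cover [0, 24]")


def prior_decay(lead_h: float) -> float:
    """The earlier, nearly flat humidity decay."""
    return math.exp(-lead_h / 240)


def tapered_forecast(l1: float, l2: float, lead_h: float) -> float:
    """Blend the L2 correction into L1 with a lead-dependent gain."""
    return l1 + (l2 - l1) * soft_ramp(lead_h)
```

At lead 24 the prior decay still keeps about 90% of the L2 bias while the soft ramp keeps 40%, which is the difference the simulation was measuring.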
Promotion rule: write a single-shot script in analysis/. Stage 2 verdict must hold across at least 2 reads spaced 3+ days apart. Group A candidates promote to C1 axes (C1a, C1b, C1c, ... — no R-number); Group B candidates earn R-numbers if they survive the 7-window walk-forward gate.
Surviving Stage 0 outputs: design seeds for future hypotheses, breadcrumbs pointing to promoted Stage 1 candidates above, data-limitation flags, and one open bug. Kills + methodological nulls live in the Retired section below (single source of truth).
Asymmetric L3 (over- vs under-call partition). Wind L3 wins +57% on over-calls but loses -120 to -166% on under-calls; ch +24% vs -13%. Real asymmetry but not directly actionable — you can't predict which side a future pair falls on. Logged as design seed for a future "L3-with-confidence-gate" hypothesis. (h_asymmetric_l3.py)
Run-time issuance bias. HRRR initialization hour shows clean sinusoidal pattern on humidity L1 MAE (8.0 at run_h=0-2Z → 10.5 at run_h=12-13Z, 33% spread); t and cm also ≥10% spread. But potentially confounded with valid-time-of-day (each run_h correlates with specific obs distributions). Needs a controlled follow-up that holds valid_time fixed. Design seed; not promoted. (h_run_time_bias.py)
Hours-since-front × MAE. Initial run (06-22) hit Cloudflare-blocked default urllib UA → HTTP 403. Re-run 06-23 with curl UA header — SUCCEEDED. Big finding: ch MAE 3.5× baseline for 24h post-passage; cc/cm also elevated. Promoted to Stage 1 as C1e candidate (see Group A above). (h_hours_since_front.py)
Pre-frontal hours-until-next-front × MAE. Mirror of post-frontal — but hits WIND hardest instead of clouds (ws +143% at 3-6h before, wg +138%, cm +98%, cl +62%). Temp and humidity LOWER pre-frontal. Orthogonality vs C1a + C1e: 8 ortho cells at 24-47h (short-lead × pre-frontal sparse). Promoted as bidirectional extension of C1e (see Group A above). (h_pre_frontal.py, h_pre_front_orthogonality.py)
Forecast self-coherence (precip_fc × cc_obs). Model forecasts rain while obs reports clear sky → extreme MAE elevation on every field (cl +959%, pa +674%, cm +547%, t +89%, ...). Generalized: state_fc.precip_in>0 alone is a confidence axis. Orthogonality vs C1a + C1e: 21 ortho cells. Promoted as C1f (see Group A above) — broadest scope of the day's findings. (h_forecast_coherence.py, h_precip_fc_orthogonality.py)
Front-type asymmetry (cold vs warm). frontal_events_log.json records a type field but the detector only classifies sea_breeze (1 instance); the other 5 passages return "unknown". No cold/warm separation possible. DATA LIMITATION. Infrastructure improvement: extend frontal_detection.py to classify by wind-direction rotation (clockwise → cold, counterclockwise → warm) and pressure-tendency shape. Then re-test. (h_front_type.py)
Cloud floor/ceiling truncation. Massive saturation asymmetry: cc fc=95-100% averages -32.7pp signed bias, cl 95-100% averages -63.4pp, cm -54.3pp, ch -47.5pp. Promoted to Stage 1 as "saturation unbiasing" candidate (see Group D above) — architecturally distinct from L3/L4 because no existing layer conditions on the forecast VALUE bin. (h_cloud_floor_ceiling.py)
precip_obs > 0 as obs-keyed confidence mirror. Stratified by (precip_fc, precip_obs) quadrant. Worst case is fc-only (false alarm, n=2,904): cl +283%, wg +57%, h +50%, cm +107%. Real signal but architecturally tricky — at current tick we have precip_obs[now], not precip_obs[future]. Different mechanism than C1f, possible extension via "if it's currently raining, widen future confidence on clouds/wind." Design seed, needs more thought before promoting. (h_precip_obs.py)
Cloud composition (single layer vs layered). wg +65% on three-layer skies, ws +25% on two-layer, t/h ~18-35% — real but small. ch +106-131% with layers but partly tautological (composition variable includes the field being tested). Weak design seed — wind portion not strong enough alone, cloud portion confounded with cc magnitude. (h_cloud_composition.py)
RH ≥95% (fog regime) × MAE. Saturating-humidity rows show cm +158%, ch +139%, pa +4649% MAE elevation in unconditioned data. Promoted as C1g 2026-06-23, then KILLED 2026-06-24 by orthogonality check (1 ortho / 69 redundant). The elevation was sampling-driven — fog co-occurs with C1f + cc-saturation. Moved to Retired. (h_rh_saturation.py, h_c1g_orthogonality.py)
Forecast trend direction (rising/falling/stable). When model commits to a sharp 0→6h cloud change, accuracy collapses: cl rising +1030%, cm rising +315%, ch rising +91%, cc rising +64%. Stable forecasts are dramatically better. Promoted as C1h (Tier 3 — only 1 read; magnitudes need 30d window confirmation + ortho vs C1f/C1e). (h_trend_direction.py)
Wind speed bin × wind direction error. Script bug: the join between wd-row and ws-row at matching (run_time, lead, obs_time) returned empty. Investigate later; the per-pair key structure may not align correctly across fields. (h_ws_wd_error.py)
Wind direction shift rate (|Δwd_3h|) × MAE. Stage 0 ran 2026-06-24 — rotating ≥80° class showed ch +33%, cm +24%, cc +15% MAE elevation. Promoted to Stage 1 as alt-transition axis candidate, then KILLED same day by orthogonality check: h_wind_shift_rate_orthogonality.py returned 1 ortho / 22 redundant / 2 confounded / 11 ambiguous. C1a (regime transition) already captures the same physical pattern — wind shifts and regime transitions co-occur. Only ch at 24-47h was independently orthogonal, too narrow to ship alone. Moved to Retired section. (h_wind_shift_rate.py, h_wind_shift_rate_orthogonality.py)
Lightning proximity × MAE. Stage 0 attempted 2026-06-24. Pair log doesn't carry lightning data per row (Tempest lightning_strike_last_3hr + lightning_strike_last_distance live in station_history.json, not forecast_error_log.jsonl). INFRASTRUCTURE GAP — not killed. Re-test once a lightning_proximity_km field is added to pair-log rows in forecast_error_log.py. (h_lightning_proximity.py)
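The front-type item in the list above proposes classifying passages by wind-direction rotation (clockwise → cold, counterclockwise → warm). A minimal Python sketch of just the rotation part is below; the function names and the 30° minimum-shift threshold are hypothetical, and the pressure-tendency shape mentioned in the text is left out.

```python
def signed_rotation(wd_before: float, wd_after: float) -> float:
    """Shortest signed rotation in degrees; positive means clockwise.

    Handles the 360-degree wraparound (e.g. 350 -> 20 is +30, not -330).
    An exact reversal of 180 degrees is reported as -180.
    """
    return (wd_after - wd_before + 180) % 360 - 180


def classify_front(wd_before: float, wd_after: float,
                   min_shift: float = 30.0) -> str:
    """Label a passage from its wind rotation alone."""
    rot = signed_rotation(wd_before, wd_after)
    if abs(rot) < min_shift:
        return "unknown"          # shift too small to call
    return "cold" if rot > 0 else "warm"
```

Taking the shortest signed angle avoids misreading a small veer across north as a near-full counterclockwise turn.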
Operational tools — live audits & shadow tracking
G1. Gated correction candidates — what L5 / C1 would do right now
What this is: two corrections are sketched in code with ENABLED = False. They stamp their candidate values on weather_data every tick so we can observe what they would do without actually modifying the forecast.
R5 cove correction — RETIRED 2026-06-17. (wind_octant × sb_active × hour) conditional Δ°F. Re-audit on 2026-06-17 (n=32,816) confirmed HOLD: R5+L4 makes cove temperature −20.58% worse. L2's station weighting (Willow Rd + Neptune Rd at ~0.1–0.2 mi) already captures the waterfront signal. R5 Stage 2 wiring stripped from decay_fit.py in v0.6.124. Standalone script analysis/r5_audit.py kept for quarterly re-checks.
L5 solar correction — Stage 2, gate trending clear (4 SHIP / 0 HOLD over trailing 7d as of 2026-06-25). Biases refit 06-21 v0.6.168 (frontal -169.2 → -81.1, se_flow -27.3 → -114.9); Fitter L5 audit recency-weighted v0.6.178. Trailing-7-day gate auto-computed each Fitter cycle since v0.6.180 (l5_gate_history.json) and shown beneath the L5 verdict row in S1 — no need to run analysis/simulate_windows.py manually. Don't re-refit biases mid-trajectory or simulator reads become uninterpretable. Earliest plausible promotion now late-June if the SHIP streak holds.
[🟡 Hybrid · 🔒 stamps every tick in collector.py via confidence_layer.py, ENABLED=False · 🟢 Stage 4 audit script lives at analysis/c1_stage4_audit.py]
C1 confidence layer (multi-axis v3): Stage 3 stamping live; v3 table (4-axis with C1f) shipped 2026-06-24 v0.6.215; Stage 4 audit built same day v0.6.216. Every tick stamps weather_data["confidence"] with per-(field, band) displayed bands + a live_axes block. applied=False until the audit gate clears.
Curated v3 table: 296 SHIP / 42 MARGINAL / 1048 SKIP across 39 axis-keys (Q × pt × trans × c1f); 1.29M pairs scanned, 296,898 multi-axis pairs joined on 14-day window. Multi-axis now firing live: 53/56 hits this tick at axis Q23::rising::stable::p0 + adjacent.
Stage 4 audit v0.6.216: compares calibrated MAE on a 7d calib window vs a 7d recent holdout window; PASS ≤20% drift, WATCH ≤40%, FAIL >40%. First read 2026-06-24: legacy axis (transition × stable, 62 cells) returned 17 PASS / 20 WATCH / 25 FAIL, i.e. NOT READY, dominated by pp (Brier-evaluated, wrong yardstick) + pa (precip amount, bursty by nature). Multi-axis (296 SHIP cells) DEFERRED: cluster_spread_log only goes back to 06-20 (4d), needs 14d to reach back into calib window. First multi-axis audit ETA ~2026-07-04.
Stage 4b in-card ± preview shipped v0.6.181 on right_now + wind cards reading the same cells.
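The Stage 4 PASS / WATCH / FAIL thresholds can be illustrated with a minimal sketch. It assumes "drift" means the relative absolute change in MAE between the calib window and the holdout window; the function name classify_drift is hypothetical and not taken from analysis/c1_stage4_audit.py.

```python
def classify_drift(calib_mae: float, holdout_mae: float) -> str:
    """Label one cell from the relative MAE drift between two windows.

    Assumption: drift = |holdout - calib| / calib. The real audit script
    may define drift differently (for example, signed drift).
    """
    if calib_mae <= 0:
        raise ValueError("calib_mae must be positive")
    drift = abs(holdout_mae - calib_mae) / calib_mae
    if drift <= 0.20:
        return "PASS"
    if drift <= 0.40:
        return "WATCH"
    return "FAIL"


def tally(cells):
    """Count labels over an iterable of (calib_mae, holdout_mae) tuples."""
    counts = {"PASS": 0, "WATCH": 0, "FAIL": 0}
    for calib_mae, holdout_mae in cells:
        counts[classify_drift(calib_mae, holdout_mae)] += 1
    return counts
```

A tally over all cells of an axis gives a summary in the same shape as the "17 PASS / 20 WATCH / 25 FAIL" read above.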
S1. Shadow whitelist tuner — what would auto-tuner have chosen?
What this is: per Fitter cycle, log what whitelist sets a naive MAE-based auto-tuner would recommend, alongside the current production whitelist. Threshold: a field is recommended ON if its layer beats the layer below by ≥3% in any lead band AND bias is no worse. Why it's here: the precondition for considering automation is "shadow tracks human decisions consistently." After 90+ days we can evaluate agreement rate. Until then, mismatches are informative (the pp case is a known Brier-blindspot), not actionable.
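The recommendation rule can be sketched as follows. This is an illustration only: the name recommend_on and the input layout are hypothetical, and it assumes "bias is no worse" is judged on absolute bias within the same lead band that shows the MAE gain.

```python
def recommend_on(layer, below, min_gain=0.03):
    """Return True if a field's layer should be recommended ON.

    layer / below: dict mapping lead band -> (mae, bias) for the layer
    under test and for the layer beneath it.
    Assumption: the bias check uses |bias| in the same lead band.
    """
    for band, (mae, bias) in layer.items():
        if band not in below:
            continue
        base_mae, base_bias = below[band]
        if base_mae <= 0:
            continue  # cannot compute a relative gain
        gain = (base_mae - mae) / base_mae
        if gain >= min_gain and abs(bias) <= abs(base_bias):
            return True
    return False
```

Because the rule looks only at MAE and bias, it would misjudge a field scored with Brier, which is the pp blind spot noted above.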
B1. Backtest sweep — alternative L3/L4 configs vs production
What this is: live A/B comparison of candidate L3/L4 whitelists against production, computed by replaying the held-out pair log under each enable config. Lets us see what MAE would have looked like under any candidate config without waiting for a redeploy. Current production is L3 = {ws, wg, ch, cm, pp}, L4 = {ch}. Performance note: sweep run via python3 -m backtest.sweep --write-gcs (use --local-file ~/.cache/myweather/forecast_error_log.jsonl for fast iteration). Results below are from the last manual run.
What this is: live readout of detected frontal passages from frontal_events_log.json. The detector runs every tick in frontal_detection.py, reads a 90-minute rolling window from frontal_obs_log.json, and declares a passage when at least 2 of 3 signals fire: dewpoint drop > 8°F, wind direction shift > 60°, pressure inflection (local min then ≥0.02″ rise). Type classification uses the wind-shift target octant and pressure trend. Confidence is 67% with 2 signals and 100% with 3. Why it's here: sanity-check whether real fronts are being caught (and whether noise is being mistaken for fronts) before letting the briefing AI rely on the cause-attribution line. If the detector misses an obvious front or fires on a non-event, thresholds in frontal_detection.py are tunable.
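The 2-of-3 vote described above could look roughly like this. It is a sketch under stated assumptions, not the code in frontal_detection.py: the function names are hypothetical, and the pressure inflection is simplified to "an interior minimum in the window followed by a rise of at least 0.02 inHg".

```python
def angular_shift(from_deg: float, to_deg: float) -> float:
    """Smallest absolute angle between two wind directions, in degrees."""
    return abs((to_deg - from_deg + 180.0) % 360.0 - 180.0)


def has_pressure_inflection(pressures_inhg, min_rise=0.02):
    """True if pressure reaches an interior minimum and then rises by min_rise."""
    if len(pressures_inhg) < 3:
        return False
    low = min(range(len(pressures_inhg)), key=lambda i: pressures_inhg[i])
    if low == 0 or low == len(pressures_inhg) - 1:
        return False  # minimum sits at the window edge, so no inflection seen
    return max(pressures_inhg[low + 1:]) - pressures_inhg[low] >= min_rise


def detect_passage(dewpoint_drop_f, wind_shift_deg, pressure_inflection):
    """Declare a passage when at least 2 of the 3 signals fire."""
    fired = sum([
        dewpoint_drop_f > 8.0,      # dewpoint drop > 8°F
        wind_shift_deg > 60.0,      # wind direction shift > 60°
        bool(pressure_inflection),  # local min then >= 0.02 inHg rise
    ])
    if fired < 2:
        return None
    return {"signals": fired, "confidence": 1.0 if fired == 3 else 0.67}
```

Wrapping the wind shift through angular_shift keeps a 350° to 30° change at 40° rather than 320°, so a small shift across north does not trigger the wind signal.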
Retired — hypotheses ruled out & settled tunings (collapsed; click to expand)
What these are: things we built, ran, and stopped running. Two kinds get mixed here: hypotheses (a real question we tested and the data answered no) and settled tunings (a parameter sweep that concluded "current value is fine"). Each entry is tagged. Kept as institutional memory so future-Joe doesn't burn a day re-inventing them. Standalone scripts in analysis/ can be re-run at any time if conditions might have shifted.
Recently ruled out — 2026-06-22 to 06-24 Stage 0 kills
Compact entries — these were one-shot smoke tests that landed at "no signal" or "captured by an existing axis." No charts kept. Re-run the script in 2-3 months if the seasonal regime shifts substantially.
[HYPOTHESIS] Regime-conditional L3 efficacy. Killed 2026-06-22. L3 wins in every regime for ws/wg/ch/cm; pp loses everywhere but that's the documented Brier exception. Current whitelist is correctly tuned per-regime. (analysis/h_regime_l3.py)
[HYPOTHESIS] L3 regime-mismatch gating. Killed 2026-06-22. When state_fc.regime ≠ state_obs.regime, ws L3 win drops from +50% (match) to +40% (mismatch), still big wins on both sides. Gating L3 to regime-agreement would lose the +40% to clean up marginal noise. (analysis/h_l3_regime_mismatch.py)
[HYPOTHESIS] Lead-h × C1a transition interaction. Killed 2026-06-23. Hypothesis: C1a penalty grows with lead. Result: mostly flat. Only ch shows monotonic growth (+0.31 short→long); cm marginal. ch is already in C1e, so redundant. No need for lead-conditional C1a bands. (analysis/h_lead_c1a.py)
[HYPOTHESIS] Solar zenith × cloud MAE. Killed 2026-06-23 (duplicate). Strong cloud-MAE spread by solar bin (cl 60%, cm 43%, ch 36%) but this is the same day/night cloud bias the cc→L4 Stage 1 hypothesis already addresses. Same signal sliced differently. (analysis/h_solar_cloud_selfcheck.py)
[HYPOTHESIS] Weekday vs weekend anthropogenic bias. Killed 2026-06-22 (likely artifact). Apparent gaps (t +1.14°F, h -2.65%, ws +1.34 mph between weekday and weekend) but the 21-day window contains only 3 weekends, so one unusual Saturday biases the whole number. Day-of-week × weather correlation high at small N. Revisit with ≥6 months data (earliest April 2027). (analysis/h_weekday_temp.py)
[TUNING] L4 fit-window size sweep (7/14/21/30d). Settled 2026-06-23. All window sizes essentially tie for every L4-enabled field; 7d marginally worse for ch/cm. Current 30d is correct. (analysis/h_l4_window_size.py)
[TUNING] Lead-bin granularity (10 fine bands). Settled 2026-06-23. Finer bins reveal expected lead-decay shapes (wind L2 ramps from full at lead 1h to zero at lead 24h) but no hidden ship gains. Current 4-band [0-5/6-11/12-23/24-47] structure isn't hiding signal. (analysis/h_lead_granularity.py)
[HYPOTHESIS] Mesonet confidence regime × MAE. Killed 2026-06-24 (null). Used state_obs.regime_synoptic (sea_breeze + frontal = high-scatter; calm + nw_flow + sw_flow = low-scatter) as proxy for mesonet confidence label. High/low scatter MAE ratios across all 9 fields: 0.93×–1.22× (only cm hit ⚠ at 1.22×). No meaningful spread; regime classification doesn't carry through as a confidence axis on its own. (analysis/h_mesonet_conf.py)
[HYPOTHESIS] C1g: RH ≥95% (fog) as confidence axis. Killed 2026-06-24 (orthogonality check). Stage 0 (h_rh_saturation.py) showed cm +134%, ch +149%, cl saturating +67% MAE elevation. Promoted to Stage 1 Tier 2, then killed same week by h_c1g_orthogonality.py: 1 ortho / 69 redundant / 0 confounded / 2 ambiguous. The elevation was sampling-driven: fog (obs_humidity ≥95) heavily co-occurs with both rain-forecast (C1f) and cc_fc ≥95 (cc-saturation). In the F=False or S=False subsets, fog rows show ratios 0.02–0.25× (smaller MAE than non-fog), i.e. C1g either reduces error (when conditions disagree with model) or merely tracks F/S (when conditions agree). No independent widening signal. (analysis/h_rh_saturation.py, analysis/h_c1g_orthogonality.py)
[HYPOTHESIS] Wind shift rate (|Δwd_3h|) as alt-transition axis. Killed 2026-06-24 (same-day kill via orthogonality). Stage 0 showed rotating ≥80° wind class elevates cloud MAE (ch +33%, cm +24%, cc +15%). But the orthogonality check vs C1a (regime transition flag) returned 1 ortho / 22 redundant / 2 confounded / 11 ambiguous across 9 fields × 4 lead bands. Wind shifts and regime transitions co-occur; C1a already captures the signal. Only ch at 24-47h shows independent signal, too narrow to ship as a standalone axis. (analysis/h_wind_shift_rate.py, analysis/h_wind_shift_rate_orthogonality.py)
[HYPOTHESIS] Naive persistence baseline vs HRRR. Killed 2026-06-24 (expected behavior, not a bug). Compared error_l1 vs "use current obs as constant forecast": persistence wins ws/wg at all leads (+95% to +526%), cc at short leads. Initial alarm: possible mph→m/s unit mismatch on forecast_l1. Investigated and ruled out: both fc and obs are mph (config.py:46 sets wind_speed_unit=mph for all Open-Meteo calls; wind_blend.py converts METAR knots → mph). The 2× ratio is a real model over-prediction for our sheltered coastal location, and L2+L3 already correct it (raw ws L1 MAE 4.17 mph → L3 2.44 mph). Live frontend already does persistence-style blending via blend_observed_into_hourly (BLEND_HOURS=24, weight=1.0 at current hour). Pair log captures the pre-blend raw, which is why error_l1 looks bad. Nothing actionable. (analysis/h_persistence.py)
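For reference, a persistence comparison of the kind described in the last entry can be sketched in a few lines. The row keys (obs_at_issue, forecast_l1, obs_at_valid) are hypothetical names chosen for illustration and are not the pair-log schema.

```python
def persistence_vs_model(rows):
    """Compare raw model MAE with a naive persistence forecast.

    rows: list of dicts with hypothetical keys
      obs_at_issue  observation when the forecast was issued
      forecast_l1   raw model forecast for the valid time
      obs_at_valid  observation at the valid time
    Persistence simply carries obs_at_issue forward unchanged.
    """
    n = len(rows)
    model_mae = sum(abs(r["forecast_l1"] - r["obs_at_valid"]) for r in rows) / n
    persist_mae = sum(abs(r["obs_at_issue"] - r["obs_at_valid"]) for r in rows) / n
    ratio = model_mae / persist_mae if persist_mae > 0 else float("inf")
    return {
        "model_mae": model_mae,
        "persistence_mae": persist_mae,
        "model_over_persistence": ratio,
    }
```

A ratio above 1 means persistence beats the raw model for that field and lead, which is the pattern reported for ws/wg.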
[HYPOTHESIS] Tide-phase corrections — does forecast error track the tide cycle? (RETIRED 2026-06-08)
Verdict: weak signal, mostly entangled with diurnal cycle. Per-field tide-phase curves were tracked across weeks; the signal that survived stratification was hard to distinguish from hour-of-day patterns we're already correcting in L4. Cost of keeping it running (NOAA fetch + 12-bin accumulator + GCS history per Fitter pass) wasn't justified. Analysis: analysis/tide_hypothesis.py. Fitter module flag: RUN_TIDE_TRACKING. Frozen charts below show the final state at retirement; they will not update.
Companion view — error vs tide elevation over time (frozen)
Time-domain rendering of the same data as the bucketed chart above. Different angle on the same retired hypothesis. The frozen state below is the last fit before tide tracking was disabled.
Verdict: equivalent. Tested whether deriving humidity from corrected temperature + corrected dew point via Magnus outperforms the L2 network-blended humidity. 27k triples, identical MAE within noise. We kept the derived path anyway because it keeps the (T, T_d, RH, AH) quadruple internally consistent — but the hypothesis "derivation is more accurate" is closed. Analysis: analysis/derived_humidity.py.
Verdict: τ=14 days is fine within noise. Not a hypothesis — a parameter sweep over the Fitter's recency-weighting τ (how much old pairs count when fitting decay curves). Not the L2 lead-decay τ added in v0.6.44, which controls how a current bias is spread across forecast leads (see sec 2d). Tested τ ∈ {7, 14, 21} days across four reports. Held-out MAE differences under 2%, well below run-to-run variance. τ=14d stays. With L3/L4 mostly disabled in v0.6.45, this knob barely matters anymore. Analysis: analysis/decay_tau_tuning.py.
[HYPOTHESIS] R4 — HRRR vs GFS spread as confidence signal (RETIRED 2026-06-17, verdict: CLOSE)
Hypothesis: when HRRR and GFS disagree at a given forecast hour, actual error magnitude tends to be higher — i.e. |HRRR − GFS| per (field, lead) predicts |forecast − obs|. If true, the spread becomes a free uncertainty number that can widen displayed intervals and feed Gemini hedge language ("models disagree on tomorrow's high"). Data collection: HRRR L1 already in forecast_log.json; gfs_l1_log.json captures GFS L1 per tick for the same 0-48h window. Decision rule was: ship if median Spearman ρ > 0.25 for ≥3 fields, consistent across lead bands.
CLOSE verdict (2026-06-17, 112,877 joined pairs):
0 of 6 fields above the 0.25 ρ threshold. Maximum observed |ρ| = 0.012 (wind speed at 1-6h), essentially zero correlation. HRRR vs GFS spread does NOT predict forecast error magnitude. Retired without auto-wiring. Manual script: analysis/r4_spread_analysis.py (re-run quarterly or after a model release).
[HYPOTHESIS] R5 — Cove warming — sea breeze across the peninsula heats Wyman Cove vs inland (RETIRED 2026-06-17, verdict: HOLD — L2 already captures it)
Reframed hypothesis (2026-06-13): Wyman Cove sits in the lee of the Marblehead peninsula on a S/SE/SW sea breeze. Marine air crosses ~2 miles of sun-heated land before reaching the cove, picking up surface heat in transit. Expected pattern: delta_wf_inland = waterfront_median − inland_median goes positive (cove warmer) when wind is from the S half AND sea breeze is active, with magnitude scaling to solar input (peaking ~12-14 EDT). Should flatten to zero when wind is from N/NE (cove is windward of peninsula) or after sunset (no surface heating). Original hypothesis ("waterfront cools during sea breeze") was geographically backwards and is closed.
Day-12 refit (1,732 ticks through 2026-06-24): matches the reframed model; magnitudes tightened further as the sample grew. NW flipped from neutral to weakly negative; E cooled further.
Wind | Sea breeze | n | mean Δ°F
S | active | 186 | +1.5
SE | active | 88 | +2.0
SW | active | 79 | +1.1
N | inactive | 378 | −1.0
NE | inactive | 103 | −1.0
E | inactive | 86 | −1.3
NW | inactive | 459 | −0.9
Diurnal curve under offshore/calm conditions shows clean morning-marine-cooling: trough around −3.7°F at 12:00 EDT (refit 06-24, n=1,732 entries over 12 days; cool air pool over Salem Sound persists; inland warms fast with sun while cove stays anchored to marine boundary). Both signals are physically coherent with the lee-warming model.
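A table like the one above can be produced by bucketing per-tick deltas. The sketch below assumes each tick carries lists of waterfront and inland temperatures plus a wind octant and sea-breeze flag; the key names and the function gradient_table are hypothetical, not the layout of cove_gradient_log.json.

```python
from collections import defaultdict
from statistics import mean, median


def tick_delta(waterfront_f, inland_f):
    """delta_wf_inland = waterfront_median - inland_median for one tick."""
    return median(waterfront_f) - median(inland_f)


def gradient_table(ticks):
    """Group ticks by (wind octant, sea-breeze flag) and average the delta.

    Returns {(octant, sb_active): (n, mean_delta_f)}.
    """
    buckets = defaultdict(list)
    for tick in ticks:
        key = (tick["wind_octant"], tick["sb_active"])
        buckets[key].append(tick_delta(tick["waterfront_f"], tick["inland_f"]))
    return {key: (len(deltas), mean(deltas)) for key, deltas in buckets.items()}
```

Adding the local hour to the bucket key would give the (wind × sb × hour) lookup referenced elsewhere in this section.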
Data collection: cove_gradient_log.json captures waterfront-tagged Tempest median (Willow Rd, Neptune Rd; both at cliff-edge elevations on the harbor, confirmed by Joe), inland Tempest median (~18 stations), ambient T, wind dir/speed, salem_water_temp_f, buoy_water_temp_f, sb_active, sb_likelihood per tick (14-day retention).
Two-step plan:
Step 1 — measurement is stable (analysis/r5_cove_analysis.py). Confirms the (wind × sb × hour) lookup table reflects a real, repeatable physical signal. Day-4 already passes both regime tests; 7-day re-run on 06-19 just confirms stability.
Step 2 — held-out MAE audit (analysis/r5_audit.py). The actual ship question: does APPLYING the correction improve cove temperature forecast accuracy? Joins the pair log against the cove log, computes error_l4 + R5_delta and error_l1 + R5_delta, compares MAE against the existing L4-corrected baseline.
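The Step 2 comparison can be sketched as below. It assumes errors are signed as forecast minus observation, so adding R5_delta to an error is the same as adding it to the forecast; the function name r5_audit_summary and the dictionary keys are hypothetical, not the interface of analysis/r5_audit.py.

```python
def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)


def r5_audit_summary(pairs):
    """pairs: list of dicts with error_l4, error_l1 and r5_delta (all in °F).

    Assumption: error = forecast - obs, so (error + r5_delta) is the error
    the forecast would have had with the R5 correction applied.
    """
    baseline = mae([p["error_l4"] for p in pairs])
    on_top = mae([p["error_l4"] + p["r5_delta"] for p in pairs])
    replacing = mae([p["error_l1"] + p["r5_delta"] for p in pairs])

    def pct_change(candidate):
        # Positive = improvement over baseline, negative = worse.
        return (baseline - candidate) / baseline * 100.0

    return {
        "baseline_mae": baseline,
        "r5_on_l4_mae": on_top,
        "r5_on_l4_pct": pct_change(on_top),
        "r5_replacing_mae": replacing,
        "r5_replacing_pct": pct_change(replacing),
    }
```

A negative percentage means the candidate raises MAE relative to the L4-corrected baseline.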
Step 2 verdict: HOLD (run 2026-06-16, n=29,444 matched pairs)
Baseline (existing L4-corrected, which for temperature = L2-corrected since L3/L4 are off): 2.547°F MAE
R5 added on top of L4: 3.045°F MAE (−19.58% — significantly worse)
R5 replacing the entire stack: 3.066°F MAE (−20.39% — also worse)
The L2-overlap hypothesis was empirically confirmed. L2's 1/distance² × elevation station weighting for the cove is dominated by the two waterfront Tempests (Willow Rd, Neptune Rd at ~0.1–0.2 mi). L2's "cove bias" is effectively "waterfront bias" by construction. Layering R5's (waterfront − inland) delta on top double-counts the same signal — the cove obs is already waterfront-influenced via L2, so adding more waterfront delta pushes the forecast AWAY from the obs.
Decision (global R5, retired 2026-06-17): r5_audit.py's held-out test of R5 applied across the full pipeline showed it makes cove temp 20–22% worse; L2's waterfront-weighted station blend already captures the signal, so layering R5 on top double-counts. That global formulation stays retired.
Current status (L6 microclimate correction, shipped 2026-06-26): cove_correction.py is live. It applies a per-lead Δ°F to each forecast lead of corrected_temperature, with the Δ for lead i looked up against the forecast regime at lead i (forecast wind direction, parsed local hour, heuristic sb_active). Per-lead replaced an initial v0.6.231 implementation that applied the current-tick Δ to all 48 leads (wrong at distant leads when the table swing crossed zero); the fix shipped in v0.6.237. Lookup tables remain the bidirectional gradient (positive on S-half sea-breeze; negative 09–16 EDT when the sea breeze is inactive, peak −3.7 °F around noon). Cleared a 2-read confirmation gate on r5_cove_analysis.py (06-25 SHIP + 06-26 SHIP, both regime tests PASS). ENABLED = True as of v0.6.231. Full diagnostics in Layer 6.
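The per-lead lookup can be illustrated with a minimal sketch. It assumes the table is keyed on (wind octant, sb_active, local hour) and that a missing key means no correction; the names wind_octant and apply_per_lead are hypothetical and not taken from cove_correction.py.

```python
OCTANTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def wind_octant(direction_deg: float) -> str:
    """Map a wind direction in degrees to one of 8 compass octants."""
    return OCTANTS[int(((direction_deg % 360) + 22.5) // 45) % 8]


def apply_per_lead(temps_f, wind_dirs_deg, hours_local, sb_active, table):
    """Correct each lead using the regime forecast for that same lead.

    table: {(octant, sb_active, hour): delta_f}; missing keys add 0.0
    (an assumption made for this sketch).
    """
    corrected = []
    for temp, wd, hour, sb in zip(temps_f, wind_dirs_deg, hours_local, sb_active):
        delta = table.get((wind_octant(wd), sb, hour), 0.0)
        corrected.append(temp + delta)
    return corrected
```

Looking up the key per lead is what separates this from the earlier behaviour, where a single Δ taken from the current tick was added to every lead regardless of the regime forecast there.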
One niche subtlety in the breakdown: long-lead (24-47h) sea-breeze forecasts get +7.85% MAE improvement with R5. At long leads, L2's τ=4h decay has long since faded, so R5 has something L2 doesn't. Not worth shipping a conditional correction for, but documented.