Forecast pipeline

Layer-by-layer anatomy of how Wyman Cove's forecast gets built — and how accurate it's turning out.
Right now — what the pipeline is doing
Temp correction
vs raw model
Humidity correction
vs raw model
Confidence
— stations reporting
Briefing source

Status — where we are

Last curated: 2026-06-29 v0.6.257
Since last curation (2026-06-29, v0.6.249 → v0.6.257)
Pipeline delta: L5 shipped + audited. L6 held under cleared disable-gate (decision pending). C1d KILLED by orthogonality (signal redundant with C1a). Calm-wind L3 skip gated and ready for flip after Stage 2 audit (~2026-07-02). Pattern emerging: regime × lead_band analyses on ws/wg/cc/cm all show the walk-forward validator's flat-drop verdicts hide regime-specific weakness — per-(field, regime, lead_band) whitelist work queued for Tuesday/Wednesday and supersedes the flat drop-cc and drop-ws paths. C1 calibration moved 47.92% → 61.36% with re-curate; still HOLD. L5 first clean 7-day audit window closes ~2026-07-05. L6 full 7-day window clean by 2026-07-03.
One-line summary: 6-layer correction stack in production. L1 raw HRRR → L2 aggregate-bias correction (Kalman-blended local network: temp, humidity, pressure, wind, cloud) → L3 lead-decay correction (per-lead-hour) → L4 diurnal correction (hour-of-day) → L5 synoptic-regime correction (solar, live since 2026-06-28 v0.6.248) → L6 microclimate correction (temperature, conditional on wind octant × sea-breeze × hour). Every "current conditions" card reads the L1→L6-corrected hourly[0] value. C1 confidence layer (first non-MAE-reducing output; named C-class to distinguish it from L-class correction layers that change forecast values) built with multi-axis support: C1a regime transition, C1b cluster spread, C1c pressure tendency, C1f precip-forecast presence. C1 stays gated pending calibration audit. R2 state-stratified accuracy and R6 regime-transition penalty audits run every Fitter cycle. R3 (derived humidity), R4 (HRRR-GFS spread), R5 (cove cross-current global formulation) retired.
✅ Production stack
  • L1 → L2 → L3 → L4 → L5 → L6 pipeline (L5 touches solar only, L6 temperature only). L2 applies to t / dp / h / pr / cc (additive bias with Kalman gain) and ws / wg (direct selection).
  • L6 — microclimate correction (temperature, shipped 2026-06-26). Per-lead Δ°F applied to corrected_temperature: each forecast lead gets the Δ for its own projected regime (forecast wind dir + parsed local hour + heuristic sb_active). Indexed by (wind octant × sea-breeze active × hour-of-day). Built off a 12-day waterfront-vs-inland spatial gradient log (n=1,732) — the first layer trained on a within-network spatial signal rather than aggregate forecast error. Two PASS reads on r5_cove_analysis.py cleared the post-build confirmation gate (sea-breeze +1.55°F warming, morning offshore −1.21°F cooling). Module: weather_collector/processors/cove_correction.py, ENABLED = True.
  • L3 whitelist: ws, wg, ch, cm, pp. Each field remains enabled only if it continues to beat the layer below on the rolling 7-window held-out audit.
  • L4 whitelist: ch, cc. cc added 2026-06-24 v0.6.214 after a 2-read ≥3% gate cleared. Most other fields lack stable hour-of-day structure.
  • L2 lead-decay τ refit live by the Fitter with a per-field guardrail (fitted τ adopted only if it beats default on held-out RMSE and lands within 0.25×–4× of default; otherwise fall back).
  • L2 wind selection: per-octant max → median across octants for the local network. KBOS+KBVY authoritative-source floor overrides when the airports' median speed exceeds the octant pick by >1.4×. Direction-outlier guardrail rejects chosen direction if it's >60° off the airport+buoy+Tempest consensus. Physical sanity floor enforces gust ≥ wind on the final values.
  • L2 cloud blend: KBOS+KBVY METAR sky-condition obs (cloud_cover + L/M/H splits) Kalman-blended against HRRR hourly[0]. Cloud-tuned gain: K=0.90 (σ<20pp), K=0.70 (σ<40pp), K=0.50 (σ≥40pp), K=0.35 (single source). Treats disagreement between KBOS (Boston) and KBVY (Beverly) as a real coastal spatial gradient rather than sensor noise. Same K applied to total + splits to keep them self-consistent.
  • Current conditions: the current-conditions card is populated only after every correction layer has run, by copying the corrected hourly[0] into weather_data["current"], so the displayed "now" exactly matches the corrected forecast pipeline. Preserves condition_source from upstream observation overrides; re-derives weather_code from corrected cloud + precip.
⏸ Gated off — built, not applied
  • [✅ Shipped 2026-06-28 v0.6.248] L5 — synoptic-regime correction (solar) moved out of this section. Gate cleared 7/7 ship days; solar_correction.ENABLED=True. Live in production.
  • C1 — multi-axis confidence layer. ENABLED=False in confidence_layer.py. First non-MAE-reducing layer in the stack. Four axes: regime-synoptic transition (state_fc vs state_obs), cluster-spread quartile (Q1/Q23/Q4 from KMAMARBL/KMASALEM/KMASWAMP temp medians), pressure-tendency bin (falling_fast / falling / flat / rising from state_fc.pressure_trend_hpa_3h), and precip-forecast presence (C1f, precip_fc>0, wired 2026-06-24 v0.6.215). Each (field, lead_band) cell carries per-axis displayed MAE; the runtime classifies live axes each tick, looks up the matching cell in c1_confidence_curated_v2.json, and falls back to the legacy single-axis cell on a miss. C1 is evaluated by calibration, not forecast error.
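The lookup-with-fallback behavior described for C1 can be sketched roughly as follows. This is only an illustration: the table layout, the key format, the MAE numbers, and the name lookup_displayed_mae are all assumptions, not the contents of confidence_layer.py or the curated JSON.

```python
from typing import Optional

# Hypothetical curated table: (field, lead_band) -> axis key -> displayed MAE.
# The real curated file layout is not shown on this page; values are made up.
CURATED = {
    ("t", "0-5h"): {
        "transition=yes|spread=Q4|ptend=falling_fast": 2.9,
        "legacy:transition=yes": 2.4,
        "legacy:transition=no": 1.6,
    },
}


def lookup_displayed_mae(field: str, lead_band: str, transition: bool,
                         spread_q: str, ptend: str) -> Optional[float]:
    """Return the multi-axis cell if present, else the legacy single-axis cell."""
    cell = CURATED.get((field, lead_band))
    if cell is None:
        return None  # nothing curated for this (field, lead_band)
    t = "yes" if transition else "no"
    multi_key = f"transition={t}|spread={spread_q}|ptend={ptend}"
    if multi_key in cell:
        return cell[multi_key]
    # Miss on the multi-axis key: fall back to the transition-only cell.
    return cell.get(f"legacy:transition={t}")
```

A miss at either level returns None, which a caller could treat as "show no confidence band" rather than guessing one.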
📡 Stage 2 — auto-wired audits (logged every Fitter cycle)
  • L5 — synoptic-regime correction (solar). Verdict in conditional_audits.l5; recency-weighted MAE since v0.6.178 (was unweighted 30-day average, now exp(-age_days/14) matching the rest of the Fitter). Promotes when 7 consecutive Fitter-cycle days return SHIP; trailing gate auto-computed in l5_gate_history.json since v0.6.180 and surfaced beneath the L5 row in S1 below.
  • R6 — regime-transition penalty. Verdict in conditional_audits.r6; recency-weighted since v0.6.179 (same fix as L5). Watch-signal for C1a — flags if transition-penalty magnitudes drift.
  • Marine-layer Stage 2.5 watch. Verdict in conditional_audits.marine_layer_watch since v0.6.182. Per-cycle log of NE+morn (wd 45-105°, hour 4-9 EDT) cc bias; daily visibility on the Sun-morning weekly Stage 2 read.
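Two mechanics recur in the Stage 2 audits above: recency-weighted MAE with weight exp(-age_days/14), and a promotion gate that needs 7 consecutive SHIP verdicts. A minimal sketch of both, assuming one verdict per Fitter-cycle day and hypothetical function names:

```python
import math


def recency_weighted_mae(errors_with_age, tau_days=14.0):
    """errors_with_age: iterable of (signed_error, age_days) pairs."""
    num = den = 0.0
    for err, age_days in errors_with_age:
        w = math.exp(-age_days / tau_days)  # newer pairs count more
        num += w * abs(err)
        den += w
    return num / den if den else float("nan")


def trailing_ship_streak(verdicts):
    """Count consecutive SHIP verdicts at the end of a chronological list."""
    streak = 0
    for v in reversed(verdicts):
        if v != "SHIP":
            break
        streak += 1
    return streak


def gate_cleared(verdicts, required=7):
    """True once the trailing SHIP streak reaches the required length."""
    return trailing_ship_streak(verdicts) >= required
```

Any non-SHIP verdict resets the streak to zero, which matches the "streak reset" wording used in the dated decisions further down.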
🗄 Retired — ruled out, kept as institutional memory
  • R5 — cove cross-current. Verdict HOLD at −20.58% MAE on 32,816 pairs. L2's waterfront-weighted station blend already captures the signal; R5 double-counts. Script analysis/r5_audit.py kept for quarterly re-checks.
  • R4 — HRRR vs GFS spread. Verdict CLOSE on 112,877 joined pairs; max |ρ|=0.012 across all fields. Spread is not a useful uncertainty signal.
  • R3 — derived humidity. Magnus(T_corrected, T_d_corrected) equivalent to L2 network-blended humidity within noise (27k triples). Derived path kept for internal consistency; the hypothesis "derivation is more accurate" is closed.
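For reference, the derived-humidity path in R3 follows the Magnus approximation. The sketch below assumes the Alduchov and Eskridge coefficients and Fahrenheit inputs; the page does not say which coefficients or units the pipeline uses, and the function names are illustrative.

```python
import math

# Magnus coefficients (assumed; the pipeline's actual constants are not shown).
A = 17.625
B = 243.04  # degrees Celsius


def f_to_c(temp_f):
    """Convert Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def rh_from_t_td(temp_f, dewpoint_f):
    """Relative humidity (%) from temperature and dewpoint via Magnus."""
    t_c, td_c = f_to_c(temp_f), f_to_c(dewpoint_f)
    e = math.exp(A * td_c / (B + td_c))   # proportional to actual vapor pressure
    es = math.exp(A * t_c / (B + t_c))    # proportional to saturation vapor pressure
    return 100.0 * e / es
```

Because both vapor pressures share the same leading constant, it cancels in the ratio, so only A and B matter here.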
⏳ Next scheduled decisions (dated)
  • 2026-06-28 (Sun) — DONE: L5 promotion gate cleared 7/7 ship days; solar_correction.ENABLED=True. Attribution fix + chart wiring + placeholder section landed by 2026-06-29.
  • 2026-06-27 (Sat) — DONE: Walk-forward read on cc/cl L3/L4 inclusion under KBVY-blended obs. No additions; validator instead wants to drop existing fields. (See 06-29 read below for latest counts.)
  • 2026-06-27 (Sat) — DONE: C1 calibration audit (single + v2). Legacy HOLD at 47.92% pass rate. v2 multi-axis DEFERRED to ~2026-07-04. confidence_layer.ENABLED stays False.
  • 2026-06-28 (Sun) — DONE: C1d (KBOS-vs-KBVY cloud disagreement) returned SMOKE_ALIVE after only 24h of post-wiring rows. Orthogonality follow-up landed 2026-06-29 (next entry).
  • 2026-06-29 (Mon) — DONE: h_cloud_disagreement_orthogonality.py returned KILL C1d. Holding C1a (transition) fixed, the σ_HIGH/σ_LOW MAE ratio inverts to <1.0 in 3 of 4 cells that cleared the n floor — yesterday's SMOKE_ALIVE was capturing the regime-transition signal C1a already encodes. C1e check insufficient (n=0). C1d does not promote.
  • 2026-06-29 (Mon) — DONE: Walk-forward L3/L4 read #4. L3 wants to drop pp + ws (gate 2/7 — ws is new today, streak reset). L4 wants to drop cc (gate 5/7 — 2 reads to clear). No whitelist edits.
  • 2026-06-29 (Mon) — HELD: Cove/L6 disable-gate cleared 2/2 (r5_cove_analysis second consecutive HOLD). Holding L6 enabled while investigating; morning offshore regime weakening (-0.67°F vs -1.0°F threshold) likely seasonal, sea-breeze regime still strong (+1.69°F).
  • 2026-07-02 (Thu) — UPCOMING: Earliest reasonable flip date for CALM_GATE_ENABLED=True in decay_apply.py (calm-wind ws/wg L3 skip). Needs 2–3 daily digest cycles of l3_regime_lead_analysis showing the calm-cell L3 LOSES verdict is stable. If flipped, supersedes the flat drop-ws path the walk-forward gate is currently counting toward.
  • 2026-07-03 (Fri) — UPCOMING: Earliest possible L4 drop-cc gate clear (5/7 + 2 more SHIP-direction reads). Also: L6 full 7-day pair window clean of broken-impl rows; L6 disable-gate maturity check if held this long.
  • 2026-07-04 (Sat) — UPCOMING: C1 v2 multi-axis calibration audit first eligible (cluster_spread log accumulates over ~7 days from build).
  • 2026-07-05 (Sun) — UPCOMING: L5 first clean 7-day audit window closes — first real Fitter audit of L5 vs L4 on solar.
  • 2026-07-06 (Mon) — UPCOMING: Earliest possible L3 drop-pp+ws gate clear (currently 2/7).
  • Watch: L6 cove disable-gate. Held but in a cleared state; each subsequent r5_cove_analysis HOLD verdict reinforces the case; a single SHIP flips the gate back. Decision point if signal stays weak through ~2026-07-03.
🧪 Open architectural questions
  • L2-as-observation-only: Remove L2 from forecast pipeline; keep L2 only as training target for L3/L4/L5. Open question; not today's work. See memory note.
  • ws L3 long-lead regression — REFRAMED 2026-06-29: New l3_regime_lead_analysis.py reveals the dominant L3 failure mode is calm forecast wind, not lead distance. ws/wg L3 LOSES -20% to -69% MAE when forecast wind <3 mph; WINS +5% to +47% when ≥3 mph. Decay-apply now carries a gated calm-wind skip (CALM_GATE_ENABLED=False in decay_apply.py, v0.6.252) — Stage 2 candidate, audit a few cycles of shadow data before flipping. If gate-flip lands cleanly, this supersedes the flat drop-ws path the walk-forward gate (2/7) is currently counting toward.
  • Per-regime gating for L6 (cove): Today's r5_cove_analysis (2026-06-29) showed sea-breeze regime passing strongly (+1.69°F, n=353) while morning offshore failed (-0.67°F vs -1.0°F threshold). Disable-gate cleared 2/2 in the disable direction but we're holding because killing the whole layer kills the sea-breeze win too. Open: should L6 apply per-regime instead of all-or-nothing, and should the lookup table refit with recency weighting so seasonal drift (morning marine cooling attenuating toward solstice) gets absorbed instead of fighting the audit? Investigation deferred to post-2026-06-30.
  • Specialists vs layers naming: L5 and L6 are field-specific specialists (sr-only, t-only), not general-purpose layers like L2/L3/L4. The "Layer N" naming will grow noisy with every new specialist. Plan: group field-specific correctors under a parent "Field-specific corrections (specialists)" section; keep L_N internally for stack order but use plain-language names in headings. Restructure deferred to post-2026-06-30. See project_specialists_vs_layers memory note.
  • Per-(field, regime, lead_band) whitelist for L3 + L4 — a meta-pattern (2026-06-29): The (regime × lead_band) cross-cut has now been run on four fields (ws, wg, cc, cm) and each time the walk-forward validator's flat-drop verdict would have killed wins in most regimes to fix a regime-specific weakness in one. Discovery rule: always run l3_regime_lead_analysis / l4_regime_lead_analysis before acting on any walk-forward drop recommendation. Implementation question: should decay_apply.py grow a per-(field, regime, lead_band) skip table (analog to the calm-wind gate but conditioned on observed regime instead of forecast wind speed)? That's a larger architectural commit than the calm-wind gate — defer to Tuesday/Wednesday.
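One way to picture the calm-wind gate and the proposed per-(field, regime, lead_band) skip table side by side is the sketch below. It is not the code in decay_apply.py: the function name, the skip-table entry, and the argument shapes are hypothetical, and only the 3 mph threshold and the ws/wg scope come from the notes above.

```python
CALM_GATE_ENABLED = False          # mirrors the flag named above; still off
CALM_THRESHOLD_MPH = 3.0           # "forecast wind <3 mph" from the analysis
CALM_GATED_FIELDS = {"ws", "wg"}

# Hypothetical skip table; this entry is invented for illustration only.
EXAMPLE_SKIP_TABLE = {("cc", "ne_flow", "0-5h")}


def should_apply_l3(field, forecast_wind_mph, regime, lead_band,
                    calm_gate=CALM_GATE_ENABLED, skip_table=frozenset()):
    """Return True if the L3 correction should be applied to this cell."""
    if (calm_gate and field in CALM_GATED_FIELDS
            and forecast_wind_mph < CALM_THRESHOLD_MPH):
        return False  # calm forecast wind: L3 tends to lose on ws/wg
    if (field, regime, lead_band) in skip_table:
        return False  # regime-specific weakness flagged by the cross-cut
    return True
```

With both mechanisms expressed as skips, a field stays on the flat whitelist and only the losing cells are carved out.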
How to read the rest of this page. Forecast accuracy section is the bottom-line MAE per layer per field (the green box on each chart = what users actually see). Layer sections (L1–L4 general-purpose, plus L5 solar and L6 temperature as field-specific specialists) each show the live state of that layer and the diagnostics behind it. L5's section is a placeholder for now — see "Where we are" inside its header — with full subsections coming after the specialists refactor. Research & Diagnostics (bottom) holds R2 (state-stratified accuracy) and R6 (regime-transition penalty, C1a's input signal) under "Active hypotheses"; the curated Stage-1 Backlog list (8 candidates grouped A/B/C); operational tools G1/S1/B1/F1 (gated candidates, shadow whitelist, backtest sweep, frontal events); and a Retired collapsible (R3 derived humidity, R4 HRRR/GFS spread, R5 cove cross-current — all ruled out by audits). The shadow tuner now carries a 7-cycle promotion-gate counter (v0.6.147) — when an unchanged recommendation holds 7 Fitter cycles, it's eligible to weigh into the next walk-forward read.
Conventions: L = correction layer (applied to forecast). C = confidence layer (does not change forecast values). R = research hypothesis (logged, not applied). S = shadow tuner (what auto-tuner would do). G = guardrail check / candidate stamp. F = failure/diagnostic. B = backtest. D = drill-down / teaching view.

Forecast accuracy — how accurate is it?

For each field, this shows how far off the forecast tends to be at each future hour — and how much each correction layer narrows that gap vs the bare raw model.

How to read these charts
What you're looking at. One small chart per weather field. The Y-axis is how wrong the forecast tends to be (in that field's units — °F, %, mph, etc.). The X-axis is hours into the future. Lower lines = more accurate. Lines climbing to the right is normal — forecasts get worse the further out you predict.
The lines are cumulative correction layers. L1–L4 stack across every field; L6 only stacks on the temperature card. Lowest line = lowest typical error = what users actually see.
Why lines often lie on top of each other. Two different reasons, and the badges above each chart tell you which:
  • Layer not applied to this field (badge shows L3 off or similar). The walk-forward validator showed this layer makes things worse for that field, so we turned it off. The line for that layer is identical to the one below it by construction.
  • Layer is applied but the correction is tiny (badge shows L3 ✓). The raw model is already very close, so the correction barely moves the needle. Lines visually overlap but they're not identical.
L2 badge variants: L2 ✓ additive means a bias is added (temperature, humidity, pressure). L2 ✓ direct means we replace the model wind with what the stations are reading. L2 n/a means no station network applies a bias for this field (clouds, solar, precip). Note for clouds: L2 is still n/a as a forecast correction, but obs truth for cc/cl/cm/ch now comes from a KBOS + KBVY METAR blend (v0.6.134) — feeds the joiner so L3/L4 can be evaluated, not applied at L2.
The table beneath each chart shows the same data as the chart but averaged into lead bands (0-5h, 6-11h, etc.). Numbers in field units. Compare a row across columns to see how each correction layer changes the error in that lead band — this is exactly the bucketing the walk-forward validator uses to decide what gets shipped.
Lead 0 is circular by construction (the forecast for "now" is compared against the same-moment mesonet, so the L2 error is ≈ 0); look at lead 1+ for real signal. Source: time_series_diagnostic.json::per_layer_mae_by_lead, 7-day window.
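The lead-band table can be thought of as a simple bucketing of the per-lead curve. In the sketch below, the 0-5h, 6-11h, and 12-23h bands appear on this page, while the 24-47h band, the input shape, and the choice to average per-lead MAE values (rather than pooling raw pairs) are assumptions.

```python
from collections import defaultdict

# Band edges in hours, inclusive. The last band is an assumption.
BANDS = [(0, 5), (6, 11), (12, 23), (24, 47)]


def band_label(lead_h):
    """Return the band label for a lead hour, or None if out of range."""
    for lo, hi in BANDS:
        if lo <= lead_h <= hi:
            return f"{lo}-{hi}h"
    return None


def mae_by_band(per_lead_mae, skip_lead0=True):
    """per_lead_mae: {lead_hour: mae}. Returns {band_label: mean mae}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for lead, mae in per_lead_mae.items():
        if skip_lead0 and lead == 0:
            continue  # lead 0 is circular by construction
        label = band_label(lead)
        if label is None:
            continue
        sums[label] += mae
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}
```

Dropping lead 0 by default keeps the circular comparison from flattering the 0-5h band.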

Layer 1 — Raw model (HRRR / GFS)

The bare government weather model. Knows nothing about Wyman Cove specifically. Everything below corrects what it gets wrong here.

About the raw model
The starting point. Open-Meteo's HRRR (next 48h) and GFS (days 3–7) numerical weather models. Multi-kilometer grid resolution; the model has no specific knowledge of Wyman Cove. Every layer below corrects what the raw model gets wrong locally.
Curves below show the current raw model forecast per field — same dotted line that appears in the drill-down above.

Layer 2 — Aggregate-bias correction (local station network)

What our 66 nearby weather stations say the model is getting wrong right now — and how much of that signal we trust.

How the network correction is built
Three parallel aggregation paths under this layer, each suited to the noise behavior of its metric:
Temp / humidity / pressure (additive bias): each station's reading first calibrated against its own chronic offset (Kalman-tracked, rolling 48h, see 2a accordion). Per-octant 1/distance² × exp(-|elev_diff|/30)-weighted mean of (station − model) bias. Final network bias = unweighted mean across non-empty octants. Temperature and humidity additionally get scaled by their own network Kalman gain K (sec 2c) before being added to the raw model — separate K functions per field since temp scatter is in °F and humidity scatter is in %. Pressure is applied at full strength (station consensus matches model after altitude offsets).
Lead-decay: the bias is not applied flat across all 48 lead hours. Each field gets bias_applied(lead) = current_bias × exp(-lead/τ) with a per-field τ. The Fitter (twice daily at 03:xx and 15:xx local) refits τ on a train/test split — pairs older than the last 2 days fit τ; the last 2 days score it. Both fitted τ and held-out RMSE deltas are written to l2_decay.json. The loader applies a per-field guardrail: adopt the fitted τ only if it beat the hardcoded default on held-out RMSE (≥0% improvement, ≥100 test pairs) AND the fitted τ is within 0.25×–4× of the default. Otherwise fall back to the default. Hardcoded defaults: τ_t=4h (temp bias decays fast — useful at short leads, near-zero by 24h), τ_h=240h (humidity bias persists across the whole horizon, essentially flat), τ_pr=12h (pressure ~half-life 8h). Fields without a τ (wind, clouds, solar, etc.) get flat L2 application. See sec 2d below for the live curves and per-field adoption status.
Wind / gust (direct selection, no bias): no per-station calibration (per-station wind biases are too noisy to track meaningfully) and no additive bias term. Per-octant MAX gust → MEDIAN across populated octants. The chosen value is blended into the next 24h of the hourly forecast with a linear-decay weight (100% at hour 0, 0% at hour 24). KBOS+KBVY authoritative-source floor: when both airport METARs agree on a wind speed >1.4× the octant-median pick, defer to their median. Mirrors the WU_CAP guardrail in the opposite direction. Direction-outlier guardrail rejects the chosen direction if >60° off the airport+buoy+Tempest consensus. Gust override allows single-source (METAR omits gust when wind is steady). Physical sanity floor enforces gust ≥ wind on the final values.
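The wind path above can be sketched as follows. This is a simplified reading, not the production code: the function names are hypothetical, "both airports agree" is interpreted as both speeds exceeding 1.4× the pick, and the direction-outlier and gust-override rules are left out.

```python
from statistics import median


def select_wind(octant_values, airport_speeds, floor_ratio=1.4):
    """octant_values: {octant: [mph, ...]}; airport_speeds: KBOS/KBVY mph."""
    per_octant_max = [max(v) for v in octant_values.values() if v]
    pick = median(per_octant_max)  # max within octant, median across octants
    # Authoritative-source floor (assumed reading: both airports exceed it).
    if len(airport_speeds) >= 2 and all(s > floor_ratio * pick for s in airport_speeds):
        pick = median(airport_speeds)
    return pick


def blend_into_forecast(model_hourly, observed_now, horizon_h=24):
    """Linear-decay blend: 100% observed at hour 0, 0% at hour 24."""
    out = list(model_hourly)
    for lead in range(min(horizon_h, len(out))):
        w = 1.0 - lead / horizon_h
        out[lead] = w * observed_now + (1.0 - w) * out[lead]
    return out


def enforce_gust_floor(wind, gust):
    """Physical sanity floor: gust can never be below sustained wind."""
    return wind, max(gust, wind)
```

Taking the median across octants keeps one exposed station from setting the network value, while the airport floor guards against the opposite failure where sheltered stations all read low.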
Cloud cover (Kalman-blended METAR override): KBOS + KBVY METAR sky-condition obs (BKN/SCT/FEW/OVC translated to cloud_cover_pct + L/M/H splits) blended against HRRR hourly[0] using _kalman_gain_cloud(n_sources, bias_std). Cloud-tuned: K=0.90 when airports agree within 20pp, K=0.70 with 20-40pp disagreement (treated as real spatial gradient, not sensor noise), K=0.50 with >40pp disagreement, K=0.35 when only one source is present. Same K applies to total cloud and L/M/H splits to keep them self-consistent. See cloud_l2_meta on each tick's weather_data.hourly for live K, σ, blended values.
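The gain schedule above amounts to a small step function. The sketch below uses the thresholds from the Production stack summary (σ<20pp, σ<40pp, σ≥40pp, single source); the function names, the zero-source case, and the 0-100 clamp are assumptions, and this is not the body of _kalman_gain_cloud.

```python
def cloud_gain_sketch(n_sources, sigma_pp):
    """Gain applied to the METAR-vs-model difference for cloud fields."""
    if n_sources <= 0:
        return 0.0      # assumed: no observation, keep the model value
    if n_sources == 1:
        return 0.35     # single airport: trust it less
    if sigma_pp < 20:
        return 0.90     # airports agree closely
    if sigma_pp < 40:
        return 0.70     # moderate disagreement, treated as a real gradient
    return 0.50         # strong disagreement


def blend_cloud(model_pct, obs_pct, k):
    """Move the model value toward the observation by gain k, clamped to 0-100."""
    value = model_pct + k * (obs_pct - model_pct)
    return max(0.0, min(100.0, value))
```

Applying one gain to total cloud and to each of the low/mid/high splits moves them by the same fraction, which is what keeps them mutually consistent.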

2a. Octant coverage — where this tick's stations came from

Per-station detail — mesonet map & Kalman-tracked offsets table
Per-station uptime — fetch success rates (rolling window)

2b. Network bias estimate (full, un-confidence-scaled)

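The network bias estimate follows the octant weighting described under "How the network correction is built". A minimal sketch, assuming distances in km, elevation differences in m, and station biases that have already had their chronic offsets removed; field names and function names are hypothetical:

```python
import math
from collections import defaultdict


def octant_of(bearing_deg):
    """Map a bearing from the site to a station onto octant 0..7 (0 = N)."""
    return int(((bearing_deg % 360) + 22.5) // 45) % 8


def network_bias(stations):
    """stations: dicts with bearing_deg, dist_km, elev_diff_m, bias (station - model)."""
    num = defaultdict(float)
    den = defaultdict(float)
    for s in stations:
        # Weight = 1/distance^2 * exp(-|elev_diff|/30), as described in the text.
        w = (1.0 / s["dist_km"] ** 2) * math.exp(-abs(s["elev_diff_m"]) / 30.0)
        o = octant_of(s["bearing_deg"])
        num[o] += w * s["bias"]
        den[o] += w
    per_octant = [num[o] / den[o] for o in num if den[o] > 0]
    # Unweighted mean across non-empty octants, so a crowded octant cannot dominate.
    return sum(per_octant) / len(per_octant) if per_octant else 0.0
```

The two-stage average is the point of the design: weighting happens inside an octant, but each populated direction gets one equal vote in the final bias.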

2c. Network confidence (Kalman gain K)


2d. Lead-decay applied to L2 bias (v0.6.44)

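The lead-decay formula and the loader guardrail can be sketched together. The defaults and thresholds come from the description of this layer; the function names and the argument shapes are assumptions rather than the l2_decay.json schema.

```python
import math

# Hardcoded defaults from the text, in hours.
DEFAULT_TAU_H = {"t": 4.0, "h": 240.0, "pr": 12.0}


def choose_tau(field, fitted_tau, rmse_improvement_pct, n_test_pairs):
    """Adopt the fitted tau only if it passes every guardrail, else the default."""
    default = DEFAULT_TAU_H[field]
    ok = (
        fitted_tau is not None
        and rmse_improvement_pct >= 0.0      # must not lose on held-out RMSE
        and n_test_pairs >= 100              # enough held-out pairs to trust it
        and 0.25 * default <= fitted_tau <= 4.0 * default
    )
    return fitted_tau if ok else default


def bias_applied(current_bias, lead_h, tau_h):
    """bias_applied(lead) = current_bias * exp(-lead / tau)."""
    return current_bias * math.exp(-lead_h / tau_h)
```

With τ_pr = 12h the bias halves at 12·ln 2 ≈ 8.3h, which is where the "~half-life 8h" note for pressure comes from.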

2e. Post-aggregate-bias forecast — what gets passed to Layer 3

Layer 3 — Lead-decay correction

How the model's error tends to grow with each hour into the future — and the per-field nudge we apply to push back against that drift.

How decay correction works
For each field and each lead hour, we learn from millions of historical pairs how far off the model usually is, then bend the forecast back toward truth by that amount. Some fields (wind, gusts, high & mid cloud, POP) genuinely benefit; others are net-negative on held-out data and are paused (see banner). Errors are tracked over a 30-day window with exponential recency weighting: default τ=14 days, with per-field overrides where analysis/decay_tau_tuning.py shows ≥5% MAE improvement vs the default. Current overrides: pp (POP) at τ=28d (+11.1% held-out, 2026-06-21 v0.6.167) and pa (precip amount) at τ=28d (+9.4% held-out, 2026-06-22 v0.6.195). Both precip fields preferred a smoother bias estimate than the τ=14 default.
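A per-lead bias fit with recency weighting might look like the sketch below. The 30-day window, the τ=14d default, and the τ=28d overrides for pp and pa come from the paragraph above; the pair tuple shape, the function names, and the use of a weighted mean error as "the correction" are assumptions.

```python
import math
from collections import defaultdict

DEFAULT_TAU_DAYS = 14.0
TAU_OVERRIDES_DAYS = {"pp": 28.0, "pa": 28.0}


def fit_lead_bias(pairs, field, window_days=30.0):
    """pairs: (lead_h, forecast, observed, age_days). Returns {lead_h: weighted mean error}."""
    tau = TAU_OVERRIDES_DAYS.get(field, DEFAULT_TAU_DAYS)
    num = defaultdict(float)
    den = defaultdict(float)
    for lead, fc, obs, age_days in pairs:
        if age_days > window_days:
            continue  # outside the tracking window
        w = math.exp(-age_days / tau)
        num[lead] += w * (fc - obs)
        den[lead] += w
    return {lead: num[lead] / den[lead] for lead in num}


def subtract_lead_bias(forecast_by_lead, bias_by_lead):
    """Bend each lead back toward truth by its learned mean error."""
    return {lead: v - bias_by_lead.get(lead, 0.0)
            for lead, v in forecast_by_lead.items()}
```

A longer τ spreads weight over more days, which is consistent with the note that the precip fields preferred a smoother bias estimate.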

3a. Fitted correction curves — what is being applied per lead hour

3b. Live forecast — with vs without decay correction

3c. Decay curves over time — historical fits

Layer 4 — Diurnal correction

A separate correction for the part of model error that follows the sun, e.g. bias that's different at 3 AM than at 3 PM. Currently active for high cloud (ch) and total cloud cover (cc).

How diurnal correction works
Bins historical errors by hour-of-day (0–23) and fits the persistent pattern at each hour. Most fields don't have a clean diurnal signal once L2/L3 have run, so this layer is whitelisted to two fields: high cloud (ch), which consistently beats L3 alone on held-out data, and total cloud cover (cc), added 2026-06-24 (v0.6.214) after a 2-read ≥3% gate cleared. Disabled-field fits are still computed below for diagnostic purposes.
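The hour-of-day binning can be sketched as below. The whitelist mirrors the L4 entry in the Production stack summary (ch, cc); the unweighted mean, the residual tuple shape, and the function names are simplifying assumptions.

```python
from collections import defaultdict

L4_WHITELIST_SKETCH = {"ch", "cc"}


def fit_hourly_bias(residuals):
    """residuals: (hour_of_day 0-23, post-L3 forecast minus observed)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for hour, err in residuals:
        sums[hour] += err
        counts[hour] += 1
    return {hour: sums[hour] / counts[hour] for hour in sums}


def apply_l4(field, value, hour, hourly_bias):
    """Subtract the hour's mean error, but only for whitelisted fields."""
    if field not in L4_WHITELIST_SKETCH:
        return value  # fit may exist for diagnostics, but it is not applied
    return value - hourly_bias.get(hour, 0.0)
```

Fits for non-whitelisted fields can still be computed and charted; the whitelist only controls whether they touch the forecast.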

4a. Diurnal correction curves over time — historical fits

Layer 5 — Synoptic-regime correction (solar)

A per-regime W/m² delta applied to direct solar radiation. The classifier reads current wind direction, speed, pressure trend, hour, and temp; the regime label keys into a calibrated per-regime delta. L1–L4 are trained on general bias correction; L5 is the first layer trained on synoptic-state stratification — different "kinds of weather" get different corrections.

How L5 works (compact summary)
Each tick, solar_correction.stamp_solar_correction() classifies the current synoptic regime (one of nw_flow / ne_flow / sw_flow / se_flow / sea_breeze / frontal / pre_frontal / nor_easter / calm) using regime_classifier.classify_synoptic_regime(). It then looks up the calibrated delta for that regime and applies it to every lead's direct_radiation where the lead's raw value is above SUN_UP_THRESHOLD (50 W/m²). The live moment's raw also gates the delta — if it's pre-sunrise locally, delta = 0 regardless of regime.
Live snapshot reads via weather_data["solar_correction"] — exposes candidate_delta_wm2, applied, and the classified regime metadata. Per-layer attribution lives in direct_radiation_post_l4 (pre-L5 array, preserved before mutation) and live direct_radiation (post-L5). The snapshot writer emits sr_l4 and sr_l5 separately so the Fitter can audit L5 vs L4 cleanly.
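The gating rules in the L5 summary can be illustrated with a short sketch. The 50 W/m² threshold is from the text; the delta values, the clamp at zero, and the function name are assumptions, and the real calibrated values live in solar_correction.REGIME_DELTAS.

```python
SUN_UP_THRESHOLD = 50.0  # W/m², from the text

# Hypothetical per-regime deltas, for illustration only.
REGIME_DELTAS_SKETCH = {"nw_flow": 30.0, "sea_breeze": -20.0}


def apply_solar_delta(direct_radiation, regime, deltas=REGIME_DELTAS_SKETCH):
    """Return (pre_l5, post_l5) lists; index 0 is the live moment."""
    pre = list(direct_radiation)  # preserved copy, like direct_radiation_post_l4
    delta = deltas.get(regime, 0.0)
    if not pre or pre[0] <= SUN_UP_THRESHOLD:
        delta = 0.0  # live moment is dark: no correction regardless of regime
    post = [
        max(0.0, v + delta) if v > SUN_UP_THRESHOLD else v
        for v in pre
    ]
    return pre, post
```

Returning both arrays is what lets a downstream audit compare the pre-L5 and post-L5 values without re-deriving either one.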
Where we are (2026-06-29):

5a. Live correction — what is being applied right now

Placeholder — populated with the specialists refactor (post-2026-06-30). Today, see weather_data["solar_correction"] in the snapshot for live state.

5b. Regime classifier — current state breakdown

Placeholder — populated with the specialists refactor. Today, the classifier state is logged each tick in solar_correction.regime.

5c. Per-regime delta table

Placeholder — populated with the specialists refactor. Today, see solar_correction.REGIME_DELTAS for the calibrated per-regime W/m² values.

5d. L5 vs L4 audit (held-out MAE)

Placeholder — populated with the specialists refactor. Today, the per-layer Forecast Accuracy chart's sr card is the canonical audit; first clean read ~2026-07-05.

Layer 6 — Microclimate correction (temperature)

A small Δ°F added to the temperature forecast based on a within-network spatial signal: how much the waterfront stations (Willow Rd, Neptune Rd) typically diverge from the inland-network median under different wind / sea-breeze / hour combinations. L1–L5 are trained on forecast-vs-aggregated-obs errors; L6 is the first layer trained on a spatial differential between station subgroups. Two physical regimes: cove warms a few °F under S/SE/SW sea-breeze (peninsula-lee heating) and cools a few °F during 09–16 EDT when the sea breeze is inactive (peak −3.7 °F around 12:00 EDT; a cool marine pool over Salem Sound persists while inland warms).

How L6 works
Each forecast lead gets its own Δ°F. For lead i, the collector reads the forecast wind direction (hourly.wind_direction[i]), parses the local hour from hourly.times[i], and applies a heuristic sb_active (on during 13–18 EDT with S-half wind, off otherwise). The heuristic is coarser than the live sb detector, but wind direction and hour are the only inputs available in the forecast. The collector then looks up a Δ°F from one of two tables, (sb_active, wind_octant) when sb is on and (hour_of_day) when sb is off, and adds it to that lead's corrected_temperature cell. Tables were built from a 12-day waterfront-vs-inland gradient log (cove_gradient_log.json, n=1,732) and cleared a 2-read confirmation gate on r5_cove_analysis.py (2026-06-25 SHIP + 2026-06-26 SHIP, both regime tests PASS).
Why per-lead matters. The first ship (v0.6.231) applied the current-tick Δ to all 48 leads — wrong by 3–5°F at distant leads when the table swing crossed zero (e.g. applying noon's −3.7°F to a midnight lead). Per-lead projection (v0.6.237) fixed this. Across the 48-hour horizon the per-lead Δ now ranges roughly −3.7 to +2.0°F with a near-zero mean, exactly as expected from the lookup-table shape.
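The per-lead projection can be sketched as a lookup done once per forecast hour. Here "S-half wind" is read as the SE/S/SW octants and 13–18 as inclusive hours, both of which are assumptions, as are the function names and table shapes; this is not the code in cove_correction.py.

```python
OCTANTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
S_HALF = {"SE", "S", "SW"}  # assumed meaning of "S-half wind"


def octant(wind_dir_deg):
    """Compass octant for a wind direction in degrees."""
    return OCTANTS[int(((wind_dir_deg % 360) + 22.5) // 45) % 8]


def heuristic_sb_active(hour_local, wind_dir_deg):
    """Coarse sea-breeze flag built only from forecast hour and wind direction."""
    return 13 <= hour_local <= 18 and octant(wind_dir_deg) in S_HALF


def l6_deltas(wind_dirs, hours_local, sb_table, hour_table):
    """One delta per lead, each from that lead's own projected regime."""
    deltas = []
    for wd, hour in zip(wind_dirs, hours_local):
        if heuristic_sb_active(hour, wd):
            deltas.append(sb_table.get(octant(wd), 0.0))
        else:
            deltas.append(hour_table.get(hour, 0.0))
    return deltas
```

Because each lead is classified independently, a noon lead and a midnight lead in the same forecast pick up different entries, which is exactly what the uniform-Δ first ship got wrong.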
Where L6 is evaluated. Two places: (1) 6d below uses cove_gradient_log.json directly (cove-specific obs vs displayed temp); (2) the Forecast Accuracy chart's temperature card has a green L6 line built from pair-log rows with error_l6. Pre-deploy pair rows (06-26 ~08:00 → 06-26 17:19) used the broken uniform-Δ implementation and are filtered out of the Fitter's L6 aggregation — the L6 line will populate only with post-v0.6.237 rows. Full-window clean read by 2026-07-03 once the bad rows age out naturally.

6a. Live correction — what is being applied right now

6b. Lookup tables — full per-regime Δ°F catalog

6c. Waterfront-vs-inland Δ over time

6d. L6 evaluation — cove obs vs (L4 only) vs (L4 + L6)

Research & Diagnostics — experimental signals + audit views (not applied to live forecast)

Diagnostics

R0. L3/L4 audit table — is each layer earning its keep?
What this is: the headline diagnostic for the correction stack, showing average held-out MAE per field, per layer (across leads 1–47). Recomputed every Fitter cycle. Each MAE cell shows the size of the typical error; the dim subtext next to it is the bias (signed mean error), revealing systematic offsets MAE alone can hide. The Δ columns compare each layer to the one below: green means it beats AND it's applied; amber means it beats but is NOT applied (missed opportunity, matches the orange banner color); red means it loses; gray "0.00" means tie. The Applied? columns are color-coded Yes (green) / No (red) / — (gray, n/a for L2 fields that don't have an obs network). Two banners watch for trouble. The red one fires if any enabled layer is losing by >3% in some lead band (1–6h / 6–24h / 24–47h), which catches regressions hidden by overall averages. The orange one fires if any disabled layer is winning by >3% in some band, which catches opportunities we're leaving on the table. Both list exactly which (field, layer, band) combinations triggered, so the alerts are actionable.
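The two banner rules reduce to a pair of threshold checks over the audit rows. A minimal sketch, assuming each row carries a percentage delta where positive means the layer beats the one below it; the row schema and function name are hypothetical:

```python
def audit_banners(rows, threshold_pct=3.0):
    """rows: dicts with field, layer, band, enabled, delta_pct.

    Returns (red, orange) lists of (field, layer, band) triggers.
    """
    red, orange = [], []
    for r in rows:
        key = (r["field"], r["layer"], r["band"])
        if r["enabled"] and r["delta_pct"] < -threshold_pct:
            red.append(key)       # applied layer is losing in this band
        elif not r["enabled"] and r["delta_pct"] > threshold_pct:
            orange.append(key)    # disabled layer would have won in this band
    return red, orange
```

Checking per band rather than on the all-lead average is what lets a short-lead regression surface even when long leads look fine.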
D1. Drill-down — see each correction layer build up (teaching view)
What this is: a visual build-up of the live forecast, layer by layer. Pick a field, pick layers, hit Play to watch L1 → L2 → L3 → L4 animate onto the chart. For fields where the L3/L4 whitelist disables a layer, that layer's line sits exactly on top of the layer below — there's no correction being applied, so nothing changes visually. Useful for sanity-checking a specific field's stack when something looks off elsewhere.

Active hypotheses

R2. State-stratified accuracy — which regimes does the model fail in? (active)
What this is: per-field MAE sliced by atmospheric regime (wind direction, wind speed, cloud cover, pressure trend, flow regime, synoptic pattern). When a field's MAE varies a lot across regime bins, that's a sign a regime-aware correction layer could help: different regimes need different corrections, and a one-size-fits-all decay correction (L3) misses the structure. Where the current ranking points: solar dominates the top 5 opportunities, with synoptic regime giving a ~144 W/m² bias spread between best and worst bins. L5 (regime-aware solar) in solar_correction.py was built off this signal and has been live since 2026-06-28 (v0.6.248). Re-fit twice daily by the Fitter, published to state_stratified_accuracy.json.
R6. Regime-transition penalty (Stage 2 — auto-wired in Fitter)
Hypothesis: pairs where state_fc.regime_synoptic (regime the model predicted for the obs hour) differs from state_obs.regime_synoptic (regime that actually materialized) show materially worse MAE than "stable" pairs where the regimes agree. If true, the system should widen confidence bands when the model itself signals a regime transition in the forecast window. Data: pair log + state metadata.
✓ Promotion gate passed (7-window agreement)
Promotion gate run via analysis/simulate_windows.py across 7 trailing daily cutoffs on 7-day windows. All 7 cutoffs returned SHIP. ~25 of 56 (field × lead band) buckets show ≥10% transition penalty. Strongest effects: wind speed +73% at 0-5h, wind direction +45–72% across all bands, wind gust +63% at 0-5h, temperature +12–24% at 0-23h. ~40% of pairs are "transition" pairs. Solar at 12-23h improves by ~19% on transition pairs (transitioning-to-clear is easier than stable-cloudy).
Wired into the stack: R6 is C1a (regime-synoptic transition axis) — drives C1's confidence-widening table. Verdict logged under conditional_audits.r6 on every Fitter cycle; surfaced via S1 alongside L5.
Manual script: analysis/regime_transition_audit.py (single-window detail). Promotion gate: analysis/simulate_windows.py (7-window agreement test). Live verdict appears in S1 below.
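The R6 penalty for a single (field, lead band) bucket can be computed along these lines. This is a sketch of the idea only, not analysis/regime_transition_audit.py; the tuple shape and function name are assumptions.

```python
def transition_penalty_pct(pairs):
    """pairs: (regime_fc, regime_obs, abs_error).

    Returns the percent MAE increase on transition pairs relative to
    stable pairs, or None if either group is empty.
    """
    stable = [e for fc, obs, e in pairs if fc == obs]
    transition = [e for fc, obs, e in pairs if fc != obs]
    if not stable or not transition:
        return None
    mae_stable = sum(stable) / len(stable)
    mae_transition = sum(transition) / len(transition)
    if mae_stable == 0:
        return None  # avoid dividing by zero on a degenerate bucket
    return 100.0 * (mae_transition - mae_stable) / mae_stable
```

A negative result is possible and meaningful: it corresponds to buckets like solar at 12-23h, where transition pairs turn out easier than stable ones.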
Backlog — Stage 1 candidates (curated text, not yet running)
What these are: hypothesis ideas at Stage 1 of the promotion pipeline — written down, not yet wired to a script. Promotion path: Stage 1 (this list) → Stage 2 (one-off analysis/* script + verdict) → Stage 3 (auto-wired in Fitter, logged per cycle) → Stage 4 (shipped layer or confidence widening).
Framing (2026-06-20): the 8 ideas Joe listed don't peer-rank cleanly as "8 future layers." Most surviving hypotheses measure forecast uncertainty, not forecast bias — they belong as axes of C1, not as standalone L7/L8. The two real bias-correction candidates (marine layer, radiational cooling) overlap heavily with L2 (waterfront capture) and L4 (diurnal). Grouped below by how they should actually be worked.

Group A — C1 multi-axis confidence extension

C1 v1 widens/narrows confidence on a single axis (regime-synoptic transition, shipped 06-19). v2 multi-axis plumbing (cluster-spread + pressure-tendency) shipped 06-20 in v0.6.151. KBOS-vs-KBVY cloud disagreement is the third axis candidate, waiting on dual-source data. All feed the C1 Stage 3.5 calibration audit on 2026-06-26.

Group B — Bias candidates (paced, individually)

Group C — Lower priority (dominated by existing layers)

📊 Stage 1 candidates — 2026-06-24 prioritization
Joe's call (2026-06-23): the Stage 1 queue is deep enough that not all candidates can ship at once. Manual re-runs over the next 2-4 weeks will decide which graduate to Stage 2 implementation. Tier system below ranks by expected value (probability × user-visible impact, weighed against implementation effort). 2026-06-24 batch re-run: all 8 manuals re-fired with fresh data; cc→L4 recovered to +5.0% (2nd ≥3% read), C1f +2 ortho, K-taper held, dp depression added nor_easter +3.79°F, C1e post weakened (6→3 ortho).
Tier | Candidate | Wired | Last manual run | Promote criterion
SHIPPED | C1f precip_fc>0 | 🟢 Auto-wired | STAGE 2 SHIPPED 2026-06-24 v0.6.215: 4th axis added to confidence_layer.py. Curated v3 table: 296 SHIP / 42 MARGINAL / 1048 SKIP across 39 axis-keys. 14d window, 1.29M pairs. ENABLED=False (Stage 4 gates UI consumption). | ✓ promoted; live next collector tick
SHIPPED | Humidity K-taper | 🟢 Auto-wired | STAGE 2 SHIPPED 2026-06-24 v0.6.218: soft_ramp wired in corrected_hourly.py: K(0h)=1.0 → K(24h)=0.4, floor 0.4 for leads 24-47. h_lead_l2_ktaper_sim.py reads: +7.75% (06-22), +6.60% (06-24). | ✓ promoted; live next collector tick
SHIPPED | cc → L4 | 🟢 Auto-wired | STAGE 2 SHIPPED 2026-06-24 v0.6.214: cc added to L4_FIELDS in decay_apply.py:70. Last sim read: cc +5.0%, cm +3.0% (06-24). cm rides along candidate; confirm 06-29. | ✓ promoted; live next collector tick
2 | Cloud saturation-unbiasing | ⚫ Manual | 2026-06-24 (h_cloud_floor_ceiling.py): cl 95-100 +57.5pp (was +63.4); cc +31.2, cm +55.1, ch +49.9. Direction-stable. | cl 95-100 stays ≤-50pp ×3 reads (1/3; 2nd ≥50 confirmed)
2→3? | C1e bidirectional | ⚫ Manual | 2026-06-24 (h_hsf_orthogonality.py, h_pre_front_orthogonality.py): post 3 ortho ↓ from 6, pre 8 long-lead (flat) | post weakening; ch still holds (3/4 ortho); de-prioritized
KILLED | C1g RH≥95% fog | retired | 2026-06-24 (h_c1g_orthogonality.py): 1 ortho / 69 redundant / 2 ambiguous across 72 cells. Stage 0 +134% cm elevation was sampling-driven (fog co-occurs with C1f + cc-sat). | ✗ killed; moved to Retired
3 | C1h trend-direction | ⚫ Manual | 2026-06-24 (h_trend_direction.py): cl rising +999% (n=194; small, stable from 06-23) | cl rising ≥+500% on 30d window + ortho
3 | dp depression regime | ⚫ Manual | 2026-06-24 (h_dewpoint_depression.py): frontal -1.98°F, sea_breeze +1.45, sw_flow +1.40, nor_easter +3.79 ★ NEW (n=279) | t-vs-dp attribution clear + signal holds (✓ direction-stable)
KILLED | Wind shift rate (Δwd_3h) | retired | 2026-06-24 (h_wind_shift_rate_orthogonality.py): 1 ortho / 22 redundant / 2 confounded / 11 ambiguous; C1a captures the signal | ✗ killed; moved to Retired section
Legend: 🟢 Auto-wired (runs every Fitter cycle or tick) · 🟡 Hybrid (partially wired, e.g. stamp lives but verdict needs manual replay) · ⚫ Manual (Stage 0/1 — decisions wait on hand-run scripts) · 🔒 Gated off (built + deployed, ENABLED=False).
Total in pipeline: 6 active candidates (was 10 this morning). Today's deltas: cc→L4 SHIPPED Stage 2 (v0.6.214), C1f SHIPPED Stage 2 (v0.6.215), wind_shift_rate KILLED (v0.6.216), C1g KILLED (v0.6.217), Humidity K-taper SHIPPED Stage 2 (v0.6.218). Tier 2 candidates promote to Tier 1 when their criterion holds across at least 2 manual re-reads spaced 3+ days apart. Tier 3 candidates need more evidence before deserving production-stack architectural commitments.

Group D — Methodological refinements (modify existing layers, not new ones)

Promotion rule: write a single-shot script in analysis/. Stage 2 verdict must hold across at least 2 reads spaced 3+ days apart. Group A candidates promote to C1 axes (C1a, C1b, C1c, ... — no R-number); Group B candidates earn R-numbers if they survive the 7-window walk-forward gate.
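The two-read rule above can be checked mechanically. The following is a minimal sketch, assuming each manual read is recorded as a (date, passed) pair; the function name holds_across_reads and the record shape are illustrative and not taken from analysis/. It also assumes a failing read simply does not count, rather than resetting the tally.

```python
from datetime import date

def holds_across_reads(reads, min_reads=2, min_gap_days=3):
    """Return True if at least `min_reads` passing reads exist, each one
    at least `min_gap_days` after the previously counted passing read.

    reads: iterable of (date, passed) tuples (hypothetical record shape).
    """
    passing = sorted(d for d, passed in reads if passed)
    counted = 0
    last = None
    for d in passing:
        # Greedy from the earliest read: count a read only if it is far
        # enough from the last counted one.
        if last is None or (d - last).days >= min_gap_days:
            counted += 1
            last = d
    return counted >= min_reads

# Usage: two passing reads three days apart satisfy the rule.
ok = holds_across_reads([(date(2026, 6, 22), True), (date(2026, 6, 25), True)])
```

Reads on 06-22 and 06-24 would not qualify under this sketch because they are only two days apart.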

Stage 0 explorations — completed, design seeds + breadcrumbs

Surviving Stage 0 outputs: design seeds for future hypotheses, breadcrumbs pointing to promoted Stage 1 candidates above, data-limitation flags, and one open bug. Kills + methodological nulls live in the Retired section below (single source of truth).

Operational tools — live audits & shadow tracking

G1. Gated correction candidates — what L5 / C1 would do right now
What this is: two corrections are sketched in code with ENABLED = False. They stamp their candidate values on weather_data every tick so we can observe what they would do without actually modifying the forecast.
S1. Shadow whitelist tuner — what would auto-tuner have chosen?
What this is: per Fitter cycle, log what whitelist sets a naive MAE-based auto-tuner would recommend, alongside the current production whitelist. Threshold: a field is recommended ON if its layer beats the layer below by ≥3% in any lead band AND bias is no worse. Why it's here: the precondition for considering automation is "shadow tracks human decisions consistently." After 90+ days we can evaluate agreement rate. Until then, mismatches are informative (the pp case is a known Brier-blindspot), not actionable.
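The threshold described above can be written as a small predicate. This is a sketch under stated assumptions, not the Fitter's actual code: the BandStats fields and the recommend_on name are hypothetical, "bias is no worse" is read as a comparison of absolute bias within the same lead band, and the ≥3% gain is measured relative to the MAE of the layer below.

```python
from dataclasses import dataclass

@dataclass
class BandStats:
    mae_layer: float   # MAE with this layer applied
    mae_below: float   # MAE of the layer below
    bias_layer: float  # signed mean error with this layer applied
    bias_below: float  # signed mean error of the layer below

def recommend_on(bands, min_gain=0.03):
    """Recommend a field ON if any lead band shows a relative MAE gain of
    at least `min_gain` and the absolute bias in that band is no worse."""
    for stats in bands.values():
        if stats.mae_below <= 0:
            continue  # no baseline error to improve on
        gain = (stats.mae_below - stats.mae_layer) / stats.mae_below
        if gain >= min_gain and abs(stats.bias_layer) <= abs(stats.bias_below):
            return True
    return False

# Usage with made-up numbers: a 5% gain in one band with smaller bias.
example = {"1-6h": BandStats(1.90, 2.00, 0.10, 0.20),
           "7-24h": BandStats(2.50, 2.50, 0.30, 0.30)}
recommended = recommend_on(example)
```

Because a single qualifying band is enough, a predicate like this can recommend a field that is weak in other bands, which is one reason mismatches with human decisions are treated as informative rather than actionable.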
B1. Backtest sweep — alternative L3/L4 configs vs production
What this is: live A/B comparison of candidate L3/L4 whitelists against production, computed by replaying the held-out pair log under each enable config. Lets us see what MAE would have looked like under any candidate config without waiting for a redeploy. Current production is L3 = {ws, wg, ch, cm, pp}, L4 = {ch}. Performance note: sweep run via python3 -m backtest.sweep --write-gcs (use --local-file ~/.cache/myweather/forecast_error_log.jsonl for fast iteration). Results below are from the last manual run.
F1. Frontal passage log — detector instrumentation (last 14 days)
What this is: live readout of detected frontal passages from frontal_events_log.json. The detector runs every tick in frontal_detection.py, reads a 90-minute rolling window from frontal_obs_log.json, and declares a passage when at least 2 of 3 signals fire: dewpoint drop > 8°F, wind direction shift > 60°, pressure inflection (local min then ≥0.02″ rise). Type classification uses the wind-shift target octant and pressure trend. Confidence is 67% with 2 signals and 100% with 3. Why it's here: sanity-check whether real fronts are being caught (and whether noise is being mistaken for fronts) before letting the briefing AI rely on the cause-attribution line. If the detector misses an obvious front or fires on a non-event, thresholds in frontal_detection.py are tunable.
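The 2-of-3 vote can be sketched as follows. This is an illustration of the rule as described, not the code in frontal_detection.py: the function names and argument shapes are hypothetical, the wind shift is taken as the smallest angle between two directions, and the pressure signal is assumed to mean the largest rise after an interior minimum in the window.

```python
def wind_shift_deg(dir_start, dir_end):
    """Smallest angular difference between two wind directions, in degrees."""
    d = abs(dir_start - dir_end) % 360
    return min(d, 360 - d)

def pressure_rise_after_min(pressures):
    """Rise (inHg) from the window minimum to the highest later reading,
    or None if the minimum sits at the window edge (no inflection)."""
    if len(pressures) < 3:
        return None
    i = min(range(len(pressures)), key=pressures.__getitem__)
    if i == 0 or i == len(pressures) - 1:
        return None
    return max(pressures[i + 1:]) - pressures[i]

def classify_passage(dewpoint_drop_f, shift_deg, pressure_rise_inhg):
    """Declare a passage when at least 2 of the 3 signals fire."""
    signals = [
        dewpoint_drop_f > 8.0,
        shift_deg > 60.0,
        pressure_rise_inhg is not None and pressure_rise_inhg >= 0.02,
    ]
    fired = sum(signals)
    if fired < 2:
        return None
    # Confidence as stated above: 67% with 2 signals, 100% with 3.
    return {"signals": fired, "confidence_pct": 100 if fired == 3 else 67}

# Usage with made-up window values.
rise = pressure_rise_after_min([29.95, 29.90, 29.88, 29.93])
result = classify_passage(9.0, wind_shift_deg(200, 300), rise)
```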
Retired — hypotheses ruled out & settled tunings
What these are: things we built, ran, and stopped running. Two kinds get mixed here: hypotheses (a real question we tested and the data answered no) and settled tunings (a parameter sweep that concluded "current value is fine"). Each entry is tagged. Kept as institutional memory so future-Joe doesn't burn a day re-inventing them. Standalone scripts in analysis/ can be re-run at any time if conditions might have shifted.
Recently ruled out — 2026-06-22 to 06-24 Stage 0 kills
Compact entries — these were one-shot smoke tests that landed at "no signal" or "captured by an existing axis." No charts kept. Re-run the script in 2-3 months if the seasonal regime shifts substantially.
[HYPOTHESIS] Tide-phase corrections — does forecast error track the tide cycle? (RETIRED 2026-06-08)
Verdict: weak signal, mostly entangled with diurnal cycle. Per-field tide-phase curves were tracked across weeks; the signal that survived stratification was hard to distinguish from hour-of-day patterns we're already correcting in L4. Cost of keeping it running (NOAA fetch + 12-bin accumulator + GCS history per Fitter pass) wasn't justified. Analysis: analysis/tide_hypothesis.py. Fitter module flag: RUN_TIDE_TRACKING. Frozen charts below show the final state at retirement; they will not update.
Companion view — error vs tide elevation over time (frozen)
Time-domain rendering of the same data as the bucketed chart above. Different angle on the same retired hypothesis. The frozen state below is the last fit before tide tracking was disabled.
Higher leads = forecast made further ahead. Switch to see if the tide pattern is lead-specific.
[HYPOTHESIS] Derived humidity — Magnus(T_corrected, T_d_corrected) vs network-blended humidity (RETIRED 2026-06-08)
Verdict: equivalent. Tested whether deriving humidity from corrected temperature + corrected dew point via Magnus outperforms the L2 network-blended humidity. 27k triples, identical MAE within noise. We kept the derived path anyway because it keeps the (T, T_d, RH, AH) quadruple internally consistent — but the hypothesis "derivation is more accurate" is closed. Analysis: analysis/derived_humidity.py.
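For reference, a Magnus-style derivation of RH and AH from temperature and dew point looks roughly like this. It is a sketch only: the coefficients (6.112 hPa, 17.62, 243.12 °C) are one common parameterization and are an assumption here, since the text does not say which constants analysis/derived_humidity.py uses, and the function names are illustrative.

```python
import math

def f_to_c(t_f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (t_f - 32.0) * 5.0 / 9.0

def saturation_vapor_pressure_hpa(t_c):
    """Magnus approximation of saturation vapor pressure over water (hPa).
    Coefficients are an assumed, commonly used set."""
    return 6.112 * math.exp(17.62 * t_c / (243.12 + t_c))

def derived_humidity(t_f, td_f):
    """Return (RH in %, AH in g/m^3) from temperature and dew point in °F."""
    t_c, td_c = f_to_c(t_f), f_to_c(td_f)
    e = saturation_vapor_pressure_hpa(td_c)  # actual vapor pressure
    rh = min(100.0 * e / saturation_vapor_pressure_hpa(t_c), 100.0)
    ah = 216.7 * e / (t_c + 273.15)  # ideal-gas conversion, e in hPa
    return rh, ah

# Usage: 68°F air with a 50°F dew point.
rh, ah = derived_humidity(68.0, 50.0)
```

Deriving RH and AH from the same (T, T_d) pair is what keeps the quadruple internally consistent: all four values come from one thermodynamic state instead of two independently corrected fields.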
[SETTLED TUNING] L3/L4 recency window τ — sweep over fit-window decay constant (RETIRED 2026-06-08)
Verdict: τ=14 days is fine within noise. Not a hypothesis — a parameter sweep over the Fitter's recency-weighting τ (how much old pairs count when fitting decay curves). Not the L2 lead-decay τ added in v0.6.44, which controls how a current bias is spread across forecast leads (see sec 2d). Tested τ ∈ {7, 14, 21} days across four reports. Held-out MAE differences under 2%, well below run-to-run variance. τ=14d stays. With L3/L4 mostly disabled in v0.6.45, this knob barely matters anymore. Analysis: analysis/decay_tau_tuning.py.
[HYPOTHESIS] R4 — HRRR vs GFS spread as confidence signal (RETIRED 2026-06-17, verdict: CLOSE)
Hypothesis: when HRRR and GFS disagree at a given forecast hour, actual error magnitude tends to be higher — i.e. |HRRR − GFS| per (field, lead) predicts |forecast − obs|. If true, the spread becomes a free uncertainty number that can widen displayed intervals and feed Gemini hedge language ("models disagree on tomorrow's high"). Data collection: HRRR L1 already in forecast_log.json; gfs_l1_log.json captures GFS L1 per tick for the same 0-48h window. Decision rule was: ship if median Spearman ρ > 0.25 for ≥3 fields, consistent across lead bands.
CLOSE verdict (2026-06-17, 112,877 joined pairs):
0 of 6 fields above the 0.25 ρ threshold. Maximum observed |ρ| = 0.012 (wind speed at 1-6h) — essentially zero correlation. HRRR vs GFS spread does NOT predict forecast error magnitude. Retired without auto-wiring. Manual script: analysis/r4_spread_analysis.py — re-run quarterly or after a model release.
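The decision rule can be expressed compactly. This is a hedged sketch, not the logic in analysis/r4_spread_analysis.py: it assumes ρ has already been computed per (field, lead band), takes the per-field median across lead bands, and does not model the "consistent across lead bands" clause. The function name and all ρ values in the usage example are illustrative, except that none exceeds the reported maximum of 0.012.

```python
from statistics import median

def r4_verdict(rho_by_field, rho_min=0.25, fields_needed=3):
    """Return "SHIP" if enough fields have a median Spearman rho above
    `rho_min`, else "CLOSE".

    rho_by_field: {field: {lead_band: rho}} (hypothetical structure).
    """
    passing = [
        field
        for field, by_band in rho_by_field.items()
        if median(by_band.values()) > rho_min
    ]
    return "SHIP" if len(passing) >= fields_needed else "CLOSE"

# Usage with illustrative near-zero correlations.
verdict = r4_verdict({
    "ws": {"1-6h": 0.012, "7-24h": 0.004},
    "t": {"1-6h": 0.003, "7-24h": -0.006},
    "h": {"1-6h": 0.001, "7-24h": 0.008},
})
```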
[HYPOTHESIS] R5 — Cove warming — sea breeze across the peninsula heats Wyman Cove vs inland (RETIRED 2026-06-17, verdict: HOLD — L2 already captures it)
Reframed hypothesis (2026-06-13): Wyman Cove sits in the lee of the Marblehead peninsula on a S/SE/SW sea breeze. Marine air crosses ~2 miles of sun-heated land before reaching the cove, picking up surface heat in transit. Expected pattern: delta_wf_inland = waterfront_median − inland_median goes positive (cove warmer) when wind is from the S half AND sea breeze is active, with magnitude scaling to solar input (peaking ~12-14 EDT). Should flatten to zero when wind is from N/NE (cove is windward of peninsula) or after sunset (no surface heating). Original hypothesis ("waterfront cools during sea breeze") was geographically backwards and is closed.
Day-12 refit (1,732 ticks through 2026-06-24): matches the reframed model; magnitudes tightened further as the sample grew. NW flipped from neutral to weakly negative; E cooled further.
Wind | Sea breeze | n | mean Δ°F
S | active | 186 | +1.5
SE | active | 88 | +2.0
SW | active | 79 | +1.1
N | inactive | 378 | −1.0
NE | inactive | 103 | −1.0
E | inactive | 86 | −1.3
NW | inactive | 459 | −0.9
Diurnal curve under offshore/calm conditions shows clean morning-marine-cooling: trough around −3.7°F at 12:00 EDT (refit 06-24, n=1,732 entries over 12 days; cool air pool over Salem Sound persists; inland warms fast with sun while cove stays anchored to marine boundary). Both signals are physically coherent with the lee-warming model.
Data collection: cove_gradient_log.json captures waterfront-tagged Tempest median (Willow Rd, Neptune Rd — both at cliff-edge elevations on the harbor, confirmed by Joe), inland Tempest median (~18 stations), ambient T, wind dir/speed, salem_water_temp_f, buoy_water_temp_f, sb_active, sb_likelihood per tick (14-day retention).
Two-step plan:
  • Step 1 — measurement is stable (analysis/r5_cove_analysis.py). Confirms the (wind × sb × hour) lookup table reflects a real, repeatable physical signal. Day-4 already passes both regime tests; 7-day re-run on 06-19 just confirms stability.
  • Step 2 — held-out MAE audit (analysis/r5_audit.py). The actual ship question: does APPLYING the correction improve cove temperature forecast accuracy? Joins the pair log against the cove log, computes error_l4 + R5_delta and error_l1 + R5_delta, compares MAE against the existing L4-corrected baseline.
Step 2 verdict: HOLD (run 2026-06-16, n=29,444 matched pairs)
  • Baseline (existing L4-corrected, which for temperature = L2-corrected since L3/L4 are off): 2.547°F MAE
  • R5 added on top of L4: 3.045°F MAE (−19.58% — significantly worse)
  • R5 replacing the entire stack: 3.066°F MAE (−20.39% — also worse)
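The comparison behind these numbers can be sketched as below. This is an assumption-laden illustration rather than the contents of analysis/r5_audit.py: the record keys (error_l4, error_l1, r5_delta) follow the wording of Step 2, errors are assumed to be forecast minus obs in °F, and the function names are hypothetical.

```python
def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def improvement_pct(baseline_mae, candidate_mae):
    """Positive when the candidate has lower MAE than the baseline."""
    return 100.0 * (baseline_mae - candidate_mae) / baseline_mae

def audit(pairs):
    """Compare the L4 baseline against R5 stacked on L4 and R5 on raw L1."""
    baseline = mae([p["error_l4"] for p in pairs])
    stacked = mae([p["error_l4"] + p["r5_delta"] for p in pairs])
    replaced = mae([p["error_l1"] + p["r5_delta"] for p in pairs])
    return {
        "baseline_mae": baseline,
        "stacked_mae": stacked,
        "stacked_pct": improvement_pct(baseline, stacked),
        "replaced_mae": replaced,
        "replaced_pct": improvement_pct(baseline, replaced),
    }

# Usage with two made-up pairs where the delta pushes errors outward.
report = audit([
    {"error_l4": 1.0, "error_l1": 2.0, "r5_delta": 1.0},
    {"error_l4": -1.0, "error_l1": -2.0, "r5_delta": -1.0},
])
```

Fed the rounded MAEs listed above, improvement_pct(2.547, 3.045) comes out near −19.55, close to the reported −19.58%; the small gap is presumably rounding in the displayed MAEs.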
The L2-overlap hypothesis was empirically confirmed. L2's 1/distance² × elevation station weighting for the cove is dominated by the two waterfront Tempests (Willow Rd, Neptune Rd at ~0.1–0.2 mi). L2's "cove bias" is effectively "waterfront bias" by construction. Layering R5's (waterfront − inland) delta on top double-counts the same signal — the cove obs is already waterfront-influenced via L2, so adding more waterfront delta pushes the forecast AWAY from the obs.
Decision (global R5, retired 2026-06-17): r5_audit.py's held-out test of R5 applied across the full pipeline showed it makes cove temp roughly 20% worse (19.58% when layered on L4, 20.39% when replacing the stack). L2's waterfront-weighted station blend already captures the signal, so layering R5 on top double-counts. That global formulation stays retired.
Current status (L6 — microclimate correction, shipped 2026-06-26): cove_correction.py is live — applies a per-lead Δ°F to each forecast lead of corrected_temperature, with the Δ for lead i looked up against the forecast regime at lead i (forecast wind direction, parsed local hour, heuristic sb_active). Per-lead replaced an initial v0.6.231 implementation that applied the current-tick Δ to all 48 leads (wrong at distant leads when the table swing crossed zero); the fix shipped in v0.6.237. Lookup tables remain the bidirectional gradient (positive on S-half sea-breeze; negative 09–16 EDT when the sea breeze is inactive, peak −3.7 °F around noon). Cleared a 2-read confirmation gate on r5_cove_analysis.py (06-25 SHIP + 06-26 SHIP, both regime tests PASS). ENABLED = True as of v0.6.231. Full diagnostics in Layer 6.
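The per-lead lookup described above can be sketched like this. It is not the code in cove_correction.py: the table keys, the octant helper, the lead record shape, and the zero default for missing cells are all assumptions, and the two table cells merely reuse magnitudes quoted in this section (+1.5 °F, −3.7 °F) for illustration.

```python
OCTANTS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def octant(wind_dir_deg):
    """Map a wind direction in degrees to one of 8 compass octants."""
    return OCTANTS[int(((wind_dir_deg % 360) + 22.5) // 45) % 8]

# Hypothetical lookup: (octant, sb_active, local_hour) -> delta in °F.
DELTA_TABLE = {
    ("S", True, 14): 1.5,
    ("N", False, 12): -3.7,
}

def apply_per_lead(temps_f, leads, table=DELTA_TABLE):
    """Add to each lead the delta for that lead's own forecast regime,
    instead of reusing the current-tick delta for every lead."""
    corrected = []
    for temp, lead in zip(temps_f, leads):
        key = (octant(lead["wind_dir_deg"]), lead["sb_active"], lead["local_hour"])
        corrected.append(temp + table.get(key, 0.0))  # unknown regime: no change
    return corrected

# Usage: three leads whose regimes map to different table cells.
out = apply_per_lead(
    [70.0, 70.0, 70.0],
    [
        {"wind_dir_deg": 180, "sb_active": True, "local_hour": 14},
        {"wind_dir_deg": 350, "sb_active": False, "local_hour": 12},
        {"wind_dir_deg": 270, "sb_active": False, "local_hour": 3},
    ],
)
```

In this example the first two leads receive corrections of opposite sign, which is the case a single current-tick Δ applied to all leads would get wrong.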
One niche subtlety in the breakdown: long-lead (24-47h) sea-breeze forecasts get +7.85% MAE improvement with R5. At long leads, L2's τ=4h decay has long since faded, so R5 has something L2 doesn't. Not worth shipping a conditional correction for, but documented.