`/status` — System Health Dashboard¶

Live per-layer health for every system the warehouse depends on.

URL: https://warehouse.caseymanos.com/status (Access-gated)
Polling: HTMX auto-refresh every 30s (only the cards partial is swapped)
JSON: https://warehouse.caseymanos.com/freshness for scripting
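For scripting against the JSON endpoint, a minimal standard-library sketch might look like the following. The payload shape is not documented here, so the code only decodes and returns it. It also assumes that, if `/freshness` sits behind Cloudflare Access like `/status` does, a service token is supplied through the `CF-Access-Client-Id` and `CF-Access-Client-Secret` headers; the environment variable names and the function names are hypothetical.

```python
import json
import os
import urllib.request

FRESHNESS_URL = "https://warehouse.caseymanos.com/freshness"


def build_request(url=FRESHNESS_URL, env=os.environ):
    """Build a GET request, adding Access service-token headers if configured."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    # Hypothetical env var names. The headers are only needed if the
    # endpoint is Access-gated (an assumption, not confirmed by this page).
    client_id = env.get("CF_ACCESS_CLIENT_ID")
    client_secret = env.get("CF_ACCESS_CLIENT_SECRET")
    if client_id and client_secret:
        req.add_header("CF-Access-Client-Id", client_id)
        req.add_header("CF-Access-Client-Secret", client_secret)
    return req


def fetch_freshness(url=FRESHNESS_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return the decoded JSON payload; raises on network or HTTP errors."""
    with opener(build_request(url), timeout=timeout) as resp:
        return json.load(resp)
```

Calling `fetch_freshness()` from a cron job or shell wrapper and pretty-printing the result with `json.dumps(..., indent=2)` is usually enough for a quick check. The `opener` parameter exists only so the function can be exercised without a network.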

What it shows¶

Seven cards, each one layer of the stack:

Card	What it probes	Healthy means
Cloudflare Worker (otq-checkin)	`GET /health` + `GET /stats` on the deployed Worker	Worker is reachable; checkins/replies/summaries counters present
Telegram webhook	`getWebhookInfo` from Telegram Bot API	Webhook URL set, 0-5 pending updates, no recent error
Completion log (local)	`completion_log.jsonl` mtime + line count + sources	File exists, ≤36h old, populated
GarminDB	`garmin_activities.db` row count + last activity timestamp	DB present, last activity ≤36h old
Pace cache	`warehouse_cache.db` algo version + last-computed time	Cache populated, last build ≤72h old
KB (research corpus)	`kb.duckdb` row counts (findings, claims, studies)	DB present, file ≤14d old
intervals.icu CSV	`~/HealthData/icu_activities.csv` mtime + line count	CSV present, ≤48h old

Each card has a status dot (green / amber / red / grey-unknown), a one-line summary, key/value metrics, and the last-error string if a probe surfaced one. The Worker card links through to the deployed Worker URL.
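The card fields listed above suggest a dict along the following lines. This is a sketch only: the key names (`name`, `status`, `summary`, `metrics`, `last_error`) and the function `probe_file` are assumptions, and the real shape lives in `ui/status_probes.py`. The example mimics a file-based card such as the completion log (exists, populated, no older than a threshold). Treating a stale file as amber rather than red is also an assumption.

```python
import os
import time


def probe_file(name, path, max_age_hours, now=None):
    """Return a card dict for a file that should exist, be populated, and be recent."""
    now = time.time() if now is None else now
    card = {"name": name, "status": "grey", "summary": "",
            "metrics": {}, "last_error": None}
    try:
        age_hours = (now - os.path.getmtime(path)) / 3600
        with open(path, "rb") as fh:
            line_count = sum(1 for _ in fh)
    except OSError as exc:
        # Missing or unreadable file: surface the error string on the card.
        card.update(status="red", summary="file missing or unreadable",
                    last_error=str(exc))
        return card

    card["metrics"] = {"age_hours": round(age_hours, 1), "lines": line_count}
    if line_count == 0:
        card.update(status="red", summary="file is empty")
    elif age_hours > max_age_hours:
        # Assumption: staleness warns (amber) instead of alerting (red).
        card.update(status="amber",
                    summary=f"stale: {age_hours:.1f}h old (limit {max_age_hours}h)")
    else:
        card.update(status="green",
                    summary=f"{line_count} lines, {age_hours:.1f}h old")
    return card
```

For the completion log this would be called as `probe_file("Completion log (local)", "completion_log.jsonl", 36)`.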

The page header shows an overall band: green only if all cards are green, amber if any card warns, red if any card alerts, and grey if every card is unknown.
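The roll-up rule can be sketched as a small function. Two details are not spelled out in the rule and are assumptions here: red takes precedence over amber when both are present, and a mix of green and unknown cards (with no amber or red) falls back to grey. The function name and the status strings are illustrative.

```python
def overall_band(statuses):
    """Collapse per-card statuses into one header band."""
    statuses = list(statuses)
    if not statuses or all(s == "grey" for s in statuses):
        return "grey"
    if "red" in statuses:      # assumption: an alert outranks a warning
        return "red"
    if "amber" in statuses:
        return "amber"
    if all(s == "green" for s in statuses):
        return "green"
    # Assumption: green mixed with unknown is not reported as healthy.
    return "grey"
```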

Architecture¶

The page is served by a FastAPI route and a JSON endpoint, both reading the same in-memory cache (30s TTL). The implementation lives in:

ui/status_probes.py — one function per probe, returns a standard dict
ui/app.py — /status (HTML) + /status/cards (HTMX partial) routes
ui/templates/status.html — page shell
ui/templates/_partials/status_cards.html — cards grid (auto-polled)
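The shared 30s in-memory cache could be sketched as follows. This is not the code in `ui/app.py`; the class name `TTLCache` and its interface are hypothetical. It is a minimal time-to-live cache that re-runs a loader once the stored value has expired.

```python
import time


class TTLCache:
    """Cache a single value and recompute it once it is older than the TTL."""

    def __init__(self, loader, ttl_seconds=30.0, clock=time.monotonic):
        self._loader = loader          # callable that runs all probes
        self._ttl = ttl_seconds
        self._clock = clock            # injectable for testing
        self._value = None
        self._expires_at = None        # None means nothing is cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With a cache like this, both the HTML route and the JSON endpoint would call `get()` on one shared instance, so a burst of requests inside the TTL window triggers at most one round of probes. The sketch takes no lock, so concurrent requests arriving exactly at expiry could each run the loader once.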

Each probe wraps its upstream call in a 3s timeout + try/except. If one upstream is unreachable, that one card goes red; the rest of the page still renders.
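One way to get this isolation for arbitrary probe functions is to run each one in a worker thread and bound the wait. This is a sketch under assumptions: the real probes may simply pass a timeout to their HTTP or database client, and `run_probe`, `error_card`, and the dict keys are hypothetical names.

```python
import concurrent.futures

# Shared pool so that a timed-out probe does not block the caller.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=8)


def error_card(name, message):
    """Card returned when a probe fails; key names are illustrative."""
    return {"name": name, "status": "red", "summary": "probe failed",
            "metrics": {}, "last_error": message}


def run_probe(name, probe_fn, timeout_s=3.0):
    """Run one probe; convert a timeout or any exception into a red card."""
    future = _POOL.submit(probe_fn)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return error_card(name, f"timed out after {timeout_s}s")
    except Exception as exc:
        return error_card(name, f"{type(exc).__name__}: {exc}")
```

A timed-out probe keeps running in its worker thread until the underlying call returns, so upstream calls should still carry their own client-side timeouts where the library supports them.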

Adding a new probe¶

Three steps:

Write a function in ui/status_probes.py returning the standard dict (see existing probes for the shape).
Append it to ALL_PROBES at the bottom of that file.
Reload uvicorn: launchctl kickstart -k gui/$(id -u)/com.casey.warehouse-ui

No template changes — cards render generically from the dict shape.
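As an illustration of the first two steps, here is a hypothetical disk-space probe. The probe itself, its thresholds, and the dict keys are assumptions made for the example; copy the actual shape from an existing probe in `ui/status_probes.py`. The `ALL_PROBES` list is redefined here only so the snippet runs on its own, and it is assumed to hold probe functions.

```python
import shutil

# Stand-in so the sketch is self-contained; the real list already exists
# at the bottom of ui/status_probes.py.
ALL_PROBES = []


def probe_disk_space(path="/", warn_below_gb=20.0, alert_below_gb=5.0):
    """Hypothetical probe: free space on the volume containing `path`."""
    free_gb = shutil.disk_usage(path).free / 1e9
    if free_gb < alert_below_gb:
        status = "red"
    elif free_gb < warn_below_gb:
        status = "amber"
    else:
        status = "green"
    return {
        "name": "Disk space",
        "status": status,
        "summary": f"{free_gb:.1f} GB free on {path}",
        "metrics": {"free_gb": round(free_gb, 1)},
        "last_error": None,
    }


# Step 2: register the probe so the cards grid picks it up.
ALL_PROBES.append(probe_disk_space)
```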

Relationship to the freshness widget¶

The bottom-of-every-page freshness widget (ui/freshness.py) is a different view of different data. It reads logs/runs.jsonl to surface last-run-age for the gated layers (daily_sync, kb_load, kb_embed, ui). It's an at-a-glance "are scheduled jobs firing" indicator.

/status is a deeper, present-tense per-layer health view that probes everything live. The two complement each other: the widget answers "did the overnight work happen?", while /status answers "is everything up right now?"

Each freshness-widget dot links to /status for the deeper view.

When `/status` itself is the problem¶

If /status returns 500 or some probe consistently crashes:

Check ~/garmin-warehouse/logs/uvicorn.err.log for tracebacks
Request the /freshness JSON to check whether the freshness layer is at fault
Force-reload uvicorn: launchctl kickstart -k gui/$(id -u)/com.casey.warehouse-ui

If a single probe is the culprit, comment it out of ALL_PROBES in status_probes.py and reload. The page will render with the remaining probes; fix the broken one separately.

Related¶

runbooks/tunnel-recovery.md — when the whole site is down (vs one layer red on /status)
reference/cron-schedules.md — the scheduled jobs whose freshness this dashboard reflects
systems/otq-checkin-worker.md — the Worker the Cloudflare card probes
