/status: System Health Dashboard
Live per-layer health for every system the warehouse depends on.
- URL: https://warehouse.caseymanos.com/status (Access-gated)
- Polling: HTMX auto-refresh every 30s (only the cards are swapped)
- JSON: https://warehouse.caseymanos.com/freshness for scripting
What it shows
Seven cards, each one layer of the stack:
| Card | What it probes | Healthy means |
|---|---|---|
| Cloudflare Worker (otq-checkin) | `GET /health` + `GET /stats` on the deployed Worker | Worker is reachable; checkins/replies/summaries counters present |
| Telegram webhook | `getWebhookInfo` from Telegram Bot API | Webhook URL set, 0-5 pending updates, no recent error |
| Completion log (local) | `completion_log.jsonl` mtime + line count + sources | File exists, ≤36h old, populated |
| GarminDB | `garmin_activities.db` row count + last activity timestamp | DB present, last activity ≤36h old |
| Pace cache | `warehouse_cache.db` algo version + last-computed time | Cache populated, last build ≤72h old |
| KB (research corpus) | `kb.duckdb` row counts (findings, claims, studies) | DB present, file ≤14d old |
| intervals.icu CSV | `~/HealthData/icu_activities.csv` mtime + line count | CSV present, ≤48h old |
Each card has a status dot (green / amber / red / grey-unknown), a one-line summary, key/value metrics, and the last-error string if a probe surfaced one. The Worker card links through to the deployed Worker URL.
The page header shows an overall band: green only if all cards are green, amber if any card warns, red if any card alerts, and grey if every card is unknown.
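The roll-up rule can be sketched as a small function. This is an illustration only, not the code in `ui/status_probes.py`: the status strings, the function name `overall_band`, the red-over-amber precedence, and the handling of a green/unknown mix are all assumptions, since the rule above does not spell them out.

```python
def overall_band(statuses):
    """Collapse per-card statuses into one header band (hypothetical helper)."""
    # Grey only when there is nothing known at all.
    if not statuses or all(s == "grey" for s in statuses):
        return "grey"
    # Assumption: an alert outranks a warning when both are present.
    if "red" in statuses:
        return "red"
    if "amber" in statuses:
        return "amber"
    # Green requires every card to be green.
    if all(s == "green" for s in statuses):
        return "green"
    # Assumption: a mix of green and unknown cards is shown as amber.
    return "amber"
```

For example, `overall_band(["green", "amber", "red"])` yields `"red"` under the assumed precedence.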
Architecture
The page is a single FastAPI route + a JSON endpoint, both reading the same in-memory cache (TTL 30s). Implemented in:
- `ui/status_probes.py`: one function per probe, returns a standard dict
- `ui/app.py`: `/status` (HTML) + `/status/cards` (HTMX partial) routes
- `ui/templates/status.html`: page shell
- `ui/templates/_partials/status_cards.html`: cards grid (auto-polled)
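The shared in-memory cache with a 30s TTL might look roughly like the following. This is a minimal sketch under stated assumptions: the class name `TTLCache`, its `loader` callback, and the injectable `clock` are hypothetical and are not taken from the actual `ui/` code.

```python
import time


class TTLCache:
    """Hold one computed value and recompute it once it is older than the TTL."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader      # callable that runs all probes (hypothetical)
        self._clock = clock        # injectable so the expiry logic can be tested
        self._value = None
        self._expires_at = None    # None means "never loaded"

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # Stale or empty: recompute and restart the TTL window.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With one such object shared by the HTML and JSON handlers, a burst of requests within 30 seconds triggers only one round of probes.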
Each probe wraps its upstream call in a 3s timeout + try/except. If one upstream is unreachable, that one card goes red; the rest of the page still renders.
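One way to get this isolation is a wrapper that runs a probe with a deadline and converts any failure into a red result. The sketch below is an assumption about the pattern, not the existing implementation: `run_guarded` and the result keys (`name`, `status`, `error`) are hypothetical, and the real probes may simply pass a timeout to their HTTP client instead.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_guarded(name, probe, timeout_s=3.0):
    """Run one probe; a timeout or exception becomes a red result for that card."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return {"name": name, "status": "red",
                "error": f"timed out after {timeout_s}s"}
    except Exception as exc:
        # Any probe failure is contained here so other cards still render.
        return {"name": name, "status": "red",
                "error": f"{type(exc).__name__}: {exc}"}
    finally:
        # Do not block the caller on a probe that is still running.
        pool.shutdown(wait=False)
```

Because each probe is wrapped separately, a hung upstream costs at most the timeout for its own card.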
Adding a new probe
Three steps:
- Write a function in `ui/status_probes.py` returning the standard dict (see existing probes for the shape).
- Append it to `ALL_PROBES` at the bottom of that file.
- Reload uvicorn: `launchctl kickstart -k gui/$(id -u)/com.casey.warehouse-ui`
No template changes — cards render generically from the dict shape.
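As an illustration of the first two steps, here is a hypothetical file-freshness probe. The dict keys (`name`, `status`, `summary`, `metrics`, `error`), the function name, the default path, and the choice of amber for a stale file are all assumptions; copy the real shape from an existing probe in `ui/status_probes.py`.

```python
import os
import time


def probe_example_file(path="example.csv", max_age_h=48.0):
    """Hypothetical probe: green when the file exists and is recent enough."""
    try:
        age_h = (time.time() - os.path.getmtime(path)) / 3600
    except OSError as exc:
        # Missing or unreadable file: report red with the error string.
        return {"name": "Example file", "status": "red",
                "summary": "file missing", "metrics": {}, "error": str(exc)}
    # Assumption: a stale file is a warning (amber), not an alert.
    status = "green" if age_h <= max_age_h else "amber"
    return {"name": "Example file", "status": status,
            "summary": f"{age_h:.1f}h old",
            "metrics": {"age_hours": round(age_h, 1)}, "error": None}


# Stand-in for the existing list at the bottom of ui/status_probes.py.
ALL_PROBES = []
ALL_PROBES.append(probe_example_file)
```

In the real file, only the function and the `append` line would be added; `ALL_PROBES` already exists there.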
Relationship to the freshness widget
The bottom-of-every-page freshness widget (ui/freshness.py) is a
different view of different data. It reads logs/runs.jsonl to surface
last-run-age for the gated layers (daily_sync, kb_load, kb_embed,
ui). It's an at-a-glance "are scheduled jobs firing" indicator.
/status is a deeper, present-tense per-layer health view that probes
everything live. The two complement each other: the widget for "did the
overnight work happen?", /status for "is everything up right now?"
Each freshness-widget dot links to /status for the deeper view.
When /status itself is the problem
If /status returns 500 or some probe consistently crashes:
- Check `~/garmin-warehouse/logs/uvicorn.err.log` for tracebacks
- Hit `/freshness` JSON to see if the freshness layer is the issue
- Force-reload uvicorn: `launchctl kickstart -k gui/$(id -u)/com.casey.warehouse-ui`
If a single probe is the culprit, comment it out of ALL_PROBES in
status_probes.py and reload. The page will render with the remaining
probes; fix the broken one separately.
Related
- `runbooks/tunnel-recovery.md`: when the whole site is down (vs one layer red on `/status`)
- `reference/cron-schedules.md`: the scheduled jobs whose freshness this dashboard reflects
- `systems/otq-checkin-worker.md`: the Worker the Cloudflare card probes