/status — System Health Dashboard

Live per-layer health for every system the warehouse depends on.

  • URL: https://warehouse.caseymanos.com/status (Access-gated)
  • Polling: HTMX auto-refresh every 30s (only the cards are swapped, not the whole page)
  • JSON: https://warehouse.caseymanos.com/freshness for scripting
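For scripting against the JSON endpoint, a minimal sketch follows. It assumes the endpoint sits behind the same gate as /status, that the gate is Cloudflare Access, and that a service token is available in two environment variables (the variable names are hypothetical). The response schema is not documented here, so the helper only decodes whatever JSON comes back.

```python
import json
import os
import urllib.request

FRESHNESS_URL = "https://warehouse.caseymanos.com/freshness"


def build_request(url: str = FRESHNESS_URL) -> urllib.request.Request:
    """Build a GET request, attaching Access service-token headers if set."""
    headers = {"Accept": "application/json"}
    # Assumption: the gate is Cloudflare Access with a service token.
    # The environment variable names below are hypothetical.
    client_id = os.environ.get("CF_ACCESS_CLIENT_ID")
    client_secret = os.environ.get("CF_ACCESS_CLIENT_SECRET")
    if client_id and client_secret:
        headers["CF-Access-Client-Id"] = client_id
        headers["CF-Access-Client-Secret"] = client_secret
    return urllib.request.Request(url, headers=headers, method="GET")


def fetch_freshness(timeout: float = 5.0):
    """Fetch and decode the JSON payload; no particular schema is assumed."""
    with urllib.request.urlopen(build_request(), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `fetch_freshness()` from a cron job or shell wrapper and printing the result with `json.dumps(..., indent=2)` is usually enough for a quick check.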

What it shows

Seven cards, each one layer of the stack:

Card | What it probes | Healthy means
Cloudflare Worker (otq-checkin) | GET /health + GET /stats on the deployed Worker | Worker is reachable; checkins/replies/summaries counters present
Telegram webhook | getWebhookInfo from Telegram Bot API | Webhook URL set, 0-5 pending updates, no recent error
Completion log (local) | completion_log.jsonl mtime + line count + sources | File exists, ≤36h old, populated
GarminDB | garmin_activities.db row count + last activity timestamp | DB present, last activity ≤36h old
Pace cache | warehouse_cache.db algo version + last-computed time | Cache populated, last build ≤72h old
KB (research corpus) | kb.duckdb row counts (findings, claims, studies) | DB present, file ≤14d old
intervals.icu CSV | ~/HealthData/icu_activities.csv mtime + line count | CSV present, ≤48h old

Each card has a status dot (green / amber / red / grey-unknown), a one-line summary, key/value metrics, and the last-error string if a probe surfaced one. The Worker card links through to the deployed Worker URL.

The page header shows an overall band: green only if all cards are green; amber if any card warns; red if any card alerts; grey if every card is unknown.
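The roll-up rule can be sketched as a small function. This is an illustration, not the code in ui/app.py: the status strings are assumed, red is assumed to take precedence over amber, and a mix of green and unknown cards is treated as amber because the rule above does not cover that case.

```python
from typing import Iterable

# Assumption: probes report one of these four status strings.
GREEN, AMBER, RED, GREY = "green", "amber", "red", "grey"


def overall_band(statuses: Iterable[str]) -> str:
    """Collapse per-card statuses into one header band."""
    statuses = list(statuses)
    if not statuses or all(s == GREY for s in statuses):
        return GREY
    if RED in statuses:  # assumed precedence: any alert wins
        return RED
    if AMBER in statuses:
        return AMBER
    if all(s == GREEN for s in statuses):
        return GREEN
    # Mix of green and unknown cards: treated as amber here (assumption).
    return AMBER
```

For example, `overall_band(["green", "amber", "red"])` yields "red", and seven green cards yield "green".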

Architecture

The page is a single FastAPI route + a JSON endpoint, both reading the same in-memory cache (TTL 30s). Implemented in:

  • ui/status_probes.py: one function per probe, returns a standard dict
  • ui/app.py: /status (HTML) + /status/cards (HTMX partial) routes
  • ui/templates/status.html: page shell
  • ui/templates/_partials/status_cards.html: cards grid (auto-polled)
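The shared in-memory cache with a 30s TTL could look roughly like the sketch below. The class name and the injectable clock are hypothetical, added so the behaviour is easy to test; the actual cache in ui/app.py may be structured differently.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Hold one computed value and recompute it once ttl seconds have passed."""

    def __init__(self, compute: Callable[[], Any], ttl: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._value: Any = None
        self._stamp: Optional[float] = None  # None means "never computed"

    def get(self) -> Any:
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._ttl:
            # Cache is empty or expired: run all probes again.
            self._value = self._compute()
            self._stamp = now
        return self._value
```

Both the HTML route and the JSON endpoint would call `get()` on the same instance, so a burst of requests inside one 30s window triggers at most one round of probes. The sketch is not thread-safe; a lock around `get()` would be needed if requests are served from multiple threads.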

Each probe wraps its upstream call in a 3s timeout + try/except. If one upstream is unreachable, that one card goes red; the rest of the page still renders.
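One way to get this isolation is a wrapper that runs each probe under a time budget and converts any failure into a red card. The sketch below uses a worker thread for the timeout; the real probes may instead pass a timeout to their HTTP client. The dict keys shown are assumed, not taken from status_probes.py.

```python
import concurrent.futures
from typing import Any, Callable, Dict

PROBE_TIMEOUT_S = 3.0  # matches the 3s budget described above


def run_probe(name: str, fn: Callable[[], Dict[str, Any]],
              timeout: float = PROBE_TIMEOUT_S) -> Dict[str, Any]:
    """Run one probe; turn a timeout or exception into a red card."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return {"name": name, "status": "red",
                "summary": "probe timed out",
                "last_error": f"timeout after {timeout}s"}
    except Exception as exc:  # any probe failure stays local to its card
        return {"name": name, "status": "red",
                "summary": "probe failed", "last_error": repr(exc)}
    finally:
        # Do not block the page on a hung probe; its thread ends on its own.
        pool.shutdown(wait=False)
```

Because every failure is reduced to a dict, the template never sees an exception and the remaining cards render normally.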

Adding a new probe

Three steps:

  1. Write a function in ui/status_probes.py returning the standard dict (see existing probes for the shape).
  2. Append it to ALL_PROBES at the bottom of that file.
  3. Reload uvicorn: launchctl kickstart -k gui/$(id -u)/com.casey.warehouse-ui

No template changes — cards render generically from the dict shape.
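As an illustration of step 1, here is a sketch of a file-freshness probe. The file path, the threshold, and the dict keys are all hypothetical; copy the real shape from an existing probe in ui/status_probes.py.

```python
import os
import time
from typing import Any, Dict

# Hypothetical path and threshold, for illustration only.
EXAMPLE_PATH = os.path.expanduser("~/HealthData/example_export.csv")
MAX_AGE_HOURS = 48


def probe_example_file(path: str = EXAMPLE_PATH,
                       max_age_hours: float = MAX_AGE_HOURS) -> Dict[str, Any]:
    """Report whether a file exists and was modified recently enough."""
    card: Dict[str, Any] = {"name": "Example file", "status": "grey",
                            "summary": "", "metrics": {},
                            "last_error": None}  # assumed dict shape
    try:
        age_hours = (time.time() - os.path.getmtime(path)) / 3600
    except OSError as exc:
        card.update(status="red", summary="file missing", last_error=str(exc))
        return card
    card["metrics"] = {"age_hours": round(age_hours, 1)}
    if age_hours <= max_age_hours:
        card.update(status="green", summary="fresh")
    else:
        card.update(status="amber", summary="stale")
    return card


# Step 2, in ui/status_probes.py (assuming ALL_PROBES is a list):
# ALL_PROBES.append(probe_example_file)
```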

Relationship to the freshness widget

The bottom-of-every-page freshness widget (ui/freshness.py) is a different view of different data. It reads logs/runs.jsonl to surface last-run-age for the gated layers (daily_sync, kb_load, kb_embed, ui). It's an at-a-glance "are scheduled jobs firing" indicator.
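A rough sketch of the last-run-age calculation follows. The record fields ("layer", "finished_at") are hypothetical, since the format of logs/runs.jsonl is not described here, and this is not the code in ui/freshness.py.

```python
import json
from datetime import datetime
from typing import Dict, Iterable

LAYERS = ("daily_sync", "kb_load", "kb_embed", "ui")


def last_run_ages(lines: Iterable[str], now: datetime) -> Dict[str, float]:
    """Return hours since the most recent run of each layer found in lines."""
    latest: Dict[str, datetime] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        record = json.loads(line)
        # Hypothetical field names: "layer" and "finished_at" (ISO 8601).
        layer = record.get("layer")
        if layer not in LAYERS:
            continue
        finished = datetime.fromisoformat(record["finished_at"])
        if layer not in latest or finished > latest[layer]:
            latest[layer] = finished
    return {layer: (now - ts).total_seconds() / 3600
            for layer, ts in latest.items()}
```

Pass the open file object for logs/runs.jsonl as `lines` and a timezone-aware `now`, such as `datetime.now(timezone.utc)`, so the subtraction against offset-bearing timestamps is valid.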

/status is a deeper, present-tense per-layer health view that probes everything live. The two complement each other: the widget answers "did the overnight work happen?", while /status answers "is everything up right now?"

Each freshness-widget dot links to /status for the deeper view.

When /status itself is the problem

If /status returns 500 or some probe consistently crashes:

  1. Check ~/garmin-warehouse/logs/uvicorn.err.log for tracebacks
  2. Hit /freshness JSON to see if the freshness layer is the issue
  3. Force-reload uvicorn: launchctl kickstart -k gui/$(id -u)/com.casey.warehouse-ui

If a single probe is the culprit, comment it out of ALL_PROBES in status_probes.py and reload. The page will render with the remaining probes; fix the broken one separately.