Observability

Logging

  • Adopt structured logs with a correlation/request ID propagated across services.
  • Include client_id, job_id, and provider in each log entry; add user_id only where appropriate and only as a non-PII identifier.
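
One way to wire up the two points above is sketched below with Python's standard logging and contextvars modules. The names correlation_id, JsonFormatter, and start_request are hypothetical, and the JSON-per-line layout is an assumption rather than a format this project prescribes.

```python
import contextvars
import json
import logging
import uuid

# Correlation ID for the current request or job (hypothetical name).
correlation_id = contextvars.ContextVar("correlation_id", default=None)

# Context fields named in the list above; fields that are not set are omitted.
CONTEXT_FIELDS = ("client_id", "job_id", "user_id", "provider")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def start_request(incoming_id=None):
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("app")
    log.addHandler(handler)
    log.setLevel(logging.INFO)

    # In a real service the ID would come from an inbound header (assumed).
    start_request("req-123")
    log.info("job queued", extra={"client_id": "c-42", "job_id": "j-7"})
```

When calling another service, the same ID would be forwarded (for example in a request header) so the downstream service can pass it to start_request and its logs join the same trace.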

Metrics (examples)

  • HTTP: req/sec, p95 latency, error rate (4xx/5xx) per endpoint.
  • Jobs: queue depth, time-to-complete, success/failure ratio.
  • Redis: command latency, error rate, CPU/memory.
  • Docs build: build time, success/failure counts.
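
To make the HTTP metrics concrete, the sketch below derives req/sec, p95 latency, and 4xx/5xx rates per endpoint from raw request samples. It assumes samples arrive as (endpoint, status_code, latency_ms) tuples and uses a nearest-rank percentile; a real deployment would more likely rely on a metrics library's histograms, and the function names here are illustrative only.

```python
from collections import defaultdict


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # ceil(0.95 * n) computed in integer arithmetic to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Aggregate (endpoint, status_code, latency_ms) tuples per endpoint."""
    by_endpoint = defaultdict(list)
    for endpoint, status, latency_ms in requests:
        by_endpoint[endpoint].append((status, latency_ms))

    summary = {}
    for endpoint, rows in by_endpoint.items():
        total = len(rows)
        summary[endpoint] = {
            "req_per_sec": total / window_seconds,
            "p95_latency_ms": p95([latency for _, latency in rows]),
            "rate_4xx": sum(400 <= status < 500 for status, _ in rows) / total,
            "rate_5xx": sum(status >= 500 for status, _ in rows) / total,
        }
    return summary
```

The same shape applies to the job metrics: replace latency with time-to-complete and the status classes with success/failure outcomes.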

Alerts

  • CPM/External API: 5xx rate > 2% of requests, sustained for 5 min.
  • Redis: p95 latency > 100ms for 5 min; error rate > 1%.
  • Docs: build workflow failures on main/tags.
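
The "for 5 min" part of these rules means the condition must hold continuously before the alert fires, so a single spike does not page anyone. A minimal sketch of that logic follows; SustainedThreshold is a hypothetical helper, and in practice an alerting system would usually evaluate this for you.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SustainedThreshold:
    """Fires once a value has stayed above `threshold` for `for_seconds`."""

    threshold: float
    for_seconds: float
    breach_started: Optional[float] = None

    def observe(self, timestamp, value):
        """Record one sample; return True if the alert should be firing."""
        if value <= self.threshold:
            # Condition cleared, so the sustained window starts over.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.for_seconds


# Thresholds taken from the list above; timestamps are in seconds.
api_5xx = SustainedThreshold(threshold=0.02, for_seconds=300)
redis_p95_ms = SustainedThreshold(threshold=100, for_seconds=300)
```

Each evaluation cycle would call observe with the current timestamp and the latest measured value, and notify when it returns True.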

Dashboards

  • API overview, Job lifecycle, Redis health, Deployment status.