Observability
Logging
- Adopt structured logs with a correlation/request ID propagated across services.
- Include client_id, job_id, and provider; add user_id only where appropriate and only if it is a non-PII identifier.
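
A minimal sketch of the logging points above, using only the Python standard library. It assumes one JSON object per log line and a correlation ID held in a context variable. The names `JsonFormatter`, `start_request`, and `build_logger` are hypothetical and not part of any existing codebase.

```python
import contextvars
import json
import logging
import uuid

# Holds the correlation ID for the current request or job context.
correlation_id = contextvars.ContextVar("correlation_id", default=None)

# Context fields named in the list above; each is emitted only when supplied.
CONTEXT_FIELDS = ("client_id", "job_id", "user_id", "provider")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


def start_request(incoming_id=None):
    """Reuse the caller's ID if one was sent, otherwise mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def build_logger(name="app", stream=None):
    """Return a logger that writes JSON lines to the stream (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A caller would typically run `start_request(...)` with the ID taken from an inbound header (the header name is a per-service choice), then log with `logger.info("job queued", extra={"client_id": ..., "job_id": ..., "provider": ...})`. Forwarding the same ID on outbound calls is what lets log lines from different services be joined.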
Metrics (examples)
- HTTP: req/sec, p95 latency, error rate (4xx/5xx) per endpoint.
- Jobs: queue depth, time-to-complete, success/failure ratio.
- Redis: command latency, error rate, CPU/memory.
- Docs build: build time, success/failure counts.
Alerts
- CPM/External API: 5xx rate > 2% for 5 min.
- Redis: p95 latency > 100 ms for 5 min; error rate > 1%.
- Docs: build workflow failures on main/tags.
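
The "for 5 min" conditions above can be read as: the value stays over the threshold for every sample in the trailing window. The sketch below shows one way to evaluate that, assuming samples arrive as `(timestamp_seconds, value)` pairs, oldest first. The rule names and the function `sustained_breach` are hypothetical, and a real alerting system would express this in its own rule language.

```python
# Thresholds taken from the list above; the keys are illustrative names.
RULES = {
    "api_5xx_rate": {"threshold": 0.02, "window_s": 300},
    "redis_p95_latency_ms": {"threshold": 100, "window_s": 300},
}


def sustained_breach(samples, threshold, window_s, now):
    """True if every sample in the trailing window exceeds the threshold.

    samples: list of (timestamp_s, value) pairs, oldest first.
    Returns False when history is shorter than the window or the window
    contains no samples, so a data gap does not fire the alert.
    """
    recent = [(t, v) for t, v in samples if t <= now]
    if not recent or now - recent[0][0] < window_s:
        return False
    in_window = [v for t, v in recent if t >= now - window_s]
    return bool(in_window) and all(v > threshold for v in in_window)
```

Treating missing data as "no alert" is a design choice in this sketch; some teams prefer a separate alert for absent metrics.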
Dashboards
- API overview, Job lifecycle, Redis health, Deployment status.