Service Level Objectives (SLOs)
This document outlines our defined SLOs, alerting thresholds, and how we validate system performance under load.
SLO Definitions
- Run Start Latency (p95)
  - Target: ≤ 2s to enqueue a workflow run after API `POST /runs`.
  - Alert: p95 latency > 5s sustained for 5 minutes.
- Run Success Rate
  - Target: ≥ 99% of runs succeed end-to-end.
  - Alert: success rate < 95% for 10 minutes.
- Stream Gaps (SSE)
  - Target: zero dropped or out-of-order events.
  - Alert: >10 SSE gaps/minute per instance.
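To make the three indicators concrete, here is a minimal sketch of how each could be computed from raw samples. The function names are illustrative, the percentile uses the nearest-rank method (one of several common definitions), and the gap counter assumes every SSE event carries a sequential integer ID, which the SLO itself does not specify.

```python
import math


def p95(latencies_s):
    """Nearest-rank 95th percentile of a non-empty list of latencies (seconds)."""
    ordered = sorted(latencies_s)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based nearest-rank position
    return ordered[rank - 1]


def success_rate(outcomes):
    """Fraction of runs that succeeded; `outcomes` is a non-empty list of booleans."""
    return sum(outcomes) / len(outcomes)


def count_stream_gaps(event_ids):
    """Count consecutive event pairs where the ID is not exactly previous + 1.

    Assumption of this sketch: IDs are integers that increase by one per event,
    so a dropped or out-of-order event shows up as a broken step.
    """
    gaps = 0
    for prev, cur in zip(event_ids, event_ids[1:]):
        if cur != prev + 1:
            gaps += 1
    return gaps
```

A window meets the targets above when `p95(...) <= 2.0`, `success_rate(...) >= 0.99`, and `count_stream_gaps(...) == 0`.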
Alerts & Thresholds
- Run start latency: critical > 5s, warning > 3s.
- Run success rate: critical < 95%, warning < 98%.
- Stream gap rate: critical > 50/min, warning > 10/min.
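The threshold pairs can be read as a small lookup table. The sketch below classifies a single measurement as `ok`, `warning`, or `critical`. The metric keys and table layout are hypothetical, and it deliberately ignores the "sustained for N minutes" condition, which an alerting system would evaluate over a time window.

```python
# Thresholds copied from the list above; metric names and layout are illustrative.
THRESHOLDS = {
    # metric: (breach direction, warning limit, critical limit)
    "run_start_latency_p95_s": ("above", 3.0, 5.0),
    "run_success_rate_pct": ("below", 98.0, 95.0),
    "stream_gaps_per_min": ("above", 10.0, 50.0),
}


def severity(metric, value):
    """Return 'critical', 'warning', or 'ok' for one measurement of `metric`."""
    direction, warning, critical = THRESHOLDS[metric]

    def breached(limit):
        # Limits are strict: a value exactly on the limit does not breach it.
        return value > limit if direction == "above" else value < limit

    if breached(critical):
        return "critical"
    if breached(warning):
        return "warning"
    return "ok"
```

For example, a p95 start latency of 4s classifies as `warning`, while a success rate of 94.9% classifies as `critical`.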
Load Testing
We use load tests (e.g. Locust, k6) to validate SLOs at scale:
- Ramp up to 1000 concurrent runs.
- Measure latency distribution, stream continuity, and worker throughput.
- Compare against defined targets.
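The ramp-and-measure shape of such a test can be sketched with the standard library alone. This is not a Locust or k6 script and is no substitute for either; it only illustrates stepping concurrency up to a target and collecting per-call latencies. `start_run` is a hypothetical coroutine standing in for a client call to `POST /runs`, and `fake_start_run` is a stub so the sketch runs without a server.

```python
import asyncio
import time


async def run_stage(start_run, concurrency):
    """Issue `concurrency` simultaneous calls; return each call's latency in seconds."""

    async def timed_call():
        started = time.perf_counter()
        await start_run()
        return time.perf_counter() - started

    return await asyncio.gather(*(timed_call() for _ in range(concurrency)))


async def ramp(start_run, target=1000, stages=5):
    """Step concurrency up to `target` in equal stages; return {concurrency: latencies}."""
    results = {}
    for stage in range(1, stages + 1):
        concurrency = target * stage // stages
        results[concurrency] = await run_stage(start_run, concurrency)
    return results


async def fake_start_run():
    """Stub for illustration only; a real test would call the API here."""
    await asyncio.sleep(0)
```

Running `asyncio.run(ramp(fake_start_run))` yields one latency list per stage (200, 400, 600, 800, and 1000 concurrent calls with the defaults), which can then be compared against the targets.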
Backpressure Behavior
When the system is overloaded:
- API returns `429 Too Many Requests` with a `Retry-After` header.
- Clients must back off and retry after the suggested interval.
- This ensures Temporal queues and worker pools are not overwhelmed.
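On the client side, the back-off contract might look like the sketch below. `send` is a hypothetical callable returning a `(status_code, headers)` pair, so no particular HTTP library is implied. The sketch honors `Retry-After` only in its delay-in-seconds form; the HTTP-date form and any missing header fall back to exponential back-off, which is an assumption rather than documented client behavior.

```python
import time


def retry_delay_s(headers, fallback_s):
    """Return the Retry-After delay in seconds, or `fallback_s` if it is unusable.

    Only the integer-seconds form is parsed; an HTTP-date value falls back.
    The lookup is case-sensitive because `headers` is a plain dict here.
    """
    value = headers.get("Retry-After", "")
    try:
        return max(0, int(value))
    except ValueError:
        return fallback_s


def call_with_backoff(send, max_attempts=5, sleep=time.sleep, base_delay_s=1.0):
    """Call `send()` until it returns a non-429 status or attempts are exhausted.

    Returns the last status code seen.
    """
    for attempt in range(1, max_attempts + 1):
        status, headers = send()
        if status != 429 or attempt == max_attempts:
            return status
        # Prefer the server's suggested interval; otherwise double the delay each time.
        sleep(retry_delay_s(headers, base_delay_s * 2 ** (attempt - 1)))
```

Passing `sleep` as a parameter keeps the function testable without real waiting. A production client would typically also add jitter so that many clients do not retry in lockstep.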
Notes
- SLO compliance is reviewed quarterly.
- Dashboards and alerting rules in Prometheus/Grafana enforce these thresholds.
- SLO outcomes feed into error budgets for operational planning.
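As a worked illustration of the error-budget arithmetic, a 99% success target leaves 1% of runs as the allowed failure budget for a period. The sketch below is one common way to express that; the function name and return shape are assumptions, not an existing tool.

```python
def error_budget(total_runs, failed_runs, slo_target=0.99):
    """Summarize error-budget use for one review period.

    `slo_target` is the required success fraction (0.99 for the run success SLO).
    """
    allowed_failures = total_runs * (1 - slo_target)
    consumed = failed_runs / allowed_failures if allowed_failures > 0 else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_runs,
        "consumed_fraction": consumed,
    }
```

With 100,000 runs in a period, the budget is about 1,000 failed runs; 250 failures would consume roughly a quarter of it, leaving about 750.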