Runbooks
This document provides runbooks for common operational procedures including releases, incidents, backups, and upgrades.
Release Train Cut
We follow a structured release train approach:
- Trigger Release Please
  - Merge `main` into release branch.
  - Let Release Please generate changelog + bump version.
- Tag & Artifacts
  - CI tags repo with version (e.g. `v1.2.3`).
  - CI builds container images and pushes to registry.
- Docs Versioning
  - Use `mike` to version and publish docs: `mike deploy 1.2 latest`, then `mike set-default latest`, then `git push origin gh-pages`.
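The docs version passed to mike (1.2) looks like the MAJOR.MINOR part of the release tag (v1.2.3). A minimal sketch of that mapping, assuming the convention holds and that mike and git are on the PATH. The names `run`, `docs_version`, `publish_docs` and the `DRY_RUN` switch are illustrative and not part of the project:

```shell
# Print commands instead of running them when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Map a release tag such as v1.2.3 to a docs version such as 1.2.
# Assumption: docs are versioned per MAJOR.MINOR.
docs_version() (
  tag="${1#v}"            # drop a leading "v"
  major="${tag%%.*}"      # text before the first dot
  rest="${tag#*.}"        # text after the first dot
  minor="${rest%%.*}"     # text before the next dot
  printf '%s.%s\n' "$major" "$minor"
)

# Publish docs for a release tag using the mike commands from this runbook.
publish_docs() {
  run mike deploy "$(docs_version "$1")" latest &&
    run mike set-default latest &&
    run git push origin gh-pages
}
```

Running `publish_docs v1.2.3` with `DRY_RUN=1` set prints the three commands without executing them, which is a cheap way to review them before a release.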
Incident Response
SSE Stream Stuck
- Check API logs for errors around `/stream`.
- Restart affected API pods.
- Validate Temporal queue health.
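The three checks above might be scripted roughly as follows. This is a sketch under assumptions: the namespace `prod`, the deployment name `api`, the task queue name `default`, and the error keywords are placeholders, and the Temporal CLI (`temporal`) is assumed to be installed and pointed at the right cluster.

```shell
NAMESPACE="${NAMESPACE:-prod}"            # hypothetical namespace
API_DEPLOYMENT="${API_DEPLOYMENT:-api}"   # hypothetical deployment name
TASK_QUEUE="${TASK_QUEUE:-default}"       # hypothetical task queue name

# Print commands instead of running them when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Keep only log lines that mention /stream and look like failures (reads stdin).
filter_stream_errors() {
  grep '/stream' | grep -iE 'error|exception|timeout'
}

# Step 1: scan recent API logs. For "deployment/NAME", kubectl reads from one
# pod of the deployment, so repeat per pod if the first one looks clean.
check_stream_logs() {
  kubectl -n "$NAMESPACE" logs "deployment/$API_DEPLOYMENT" --since=30m |
    filter_stream_errors
}

# Step 2: rolling restart of the API pods, then wait for the rollout.
restart_api() {
  run kubectl -n "$NAMESPACE" rollout restart "deployment/$API_DEPLOYMENT" &&
    run kubectl -n "$NAMESPACE" rollout status "deployment/$API_DEPLOYMENT" --timeout=120s
}

# Step 3: inspect the Temporal task queue.
check_task_queue() {
  run temporal task-queue describe --task-queue "$TASK_QUEUE"
}
```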
Worker Crash
- Inspect worker pod/container logs.
- Restart pod in Kubernetes.
- Verify reconnection to Temporal task queue.
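A possible way to find and recycle crashed workers is sketched below. The namespace `prod` and the pod label `app=worker` are assumptions, as are the function names. The filter treats any pod that is not `Running` or that has a non-zero restart count as suspect.

```shell
NAMESPACE="${NAMESPACE:-prod}"                     # hypothetical namespace
WORKER_SELECTOR="${WORKER_SELECTOR:-app=worker}"   # hypothetical pod label

# Print commands instead of running them when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Reads `kubectl get pods --no-headers` output on stdin
# (columns: NAME READY STATUS RESTARTS AGE) and prints the names of pods
# that are not Running or have restarted at least once.
unhealthy_pods() {
  awk '$3 != "Running" || $4 > 0 { print $1 }'
}

list_unhealthy_workers() {
  kubectl -n "$NAMESPACE" get pods -l "$WORKER_SELECTOR" --no-headers | unhealthy_pods
}

# Show logs from the previous (crashed) container instance, then delete the
# pod so that its controller creates a replacement.
inspect_and_restart() {
  run kubectl -n "$NAMESPACE" logs "$1" --previous --tail=200
  run kubectl -n "$NAMESPACE" delete pod "$1"
}
```

To verify reconnection afterwards, `temporal task-queue describe --task-queue <name>` lists the pollers on a queue; the replacement worker should show up there once it is polling again.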
Queue Backlog
- Monitor Temporal task queue metrics.
- Scale worker deployments (HPA or manual).
- Prioritize critical task queues during incident.
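For the manual scaling path, a small guard around `kubectl scale` can keep an on-call engineer from overshooting cluster capacity. The sketch below assumes a deployment named `worker` in namespace `prod` and bounds of 2 and 20 replicas; all four values are placeholders.

```shell
NAMESPACE="${NAMESPACE:-prod}"                     # hypothetical namespace
WORKER_DEPLOYMENT="${WORKER_DEPLOYMENT:-worker}"   # hypothetical deployment name
MIN_REPLICAS="${MIN_REPLICAS:-2}"                  # assumed lower bound
MAX_REPLICAS="${MAX_REPLICAS:-20}"                 # assumed upper bound

# Print commands instead of running them when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# clamp_replicas WANT MIN MAX: print WANT limited to the range [MIN, MAX].
clamp_replicas() (
  want="$1"
  if [ "$want" -lt "$2" ]; then want="$2"; fi
  if [ "$want" -gt "$3" ]; then want="$3"; fi
  echo "$want"
)

# Scale the worker deployment to the requested size, within the bounds.
scale_workers() {
  run kubectl -n "$NAMESPACE" scale "deployment/$WORKER_DEPLOYMENT" \
    --replicas="$(clamp_replicas "$1" "$MIN_REPLICAS" "$MAX_REPLICAS")"
}
```

If an HPA manages the deployment, it will reconcile the replica count back to its own target, so in that setup raising the HPA's `minReplicas` is likely the more durable lever.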
Backups & Restores
Postgres
- Backups: cron `pg_dump` to S3.
- Restore: `psql < dump.sql` into new instance.
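A cron-driven backup and the matching restore could look roughly like this. It is a sketch, not the project's actual job: the bucket name, the key layout, the use of the AWS CLI for uploads, and the `DATABASE_URL` / `NEW_DATABASE_URL` variables are all assumptions.

```shell
BACKUP_BUCKET="${BACKUP_BUCKET:-example-db-backups}"   # hypothetical bucket

# backup_key DBNAME [TIMESTAMP]: print the object key for a dump.
# The timestamp defaults to the current UTC time.
backup_key() (
  ts="${2:-$(date -u +%Y%m%dT%H%M%SZ)}"
  printf 'postgres/%s/%s.sql\n' "$1" "$ts"
)

# backup_db DBNAME: dump to a temp file first so that a failed pg_dump
# never uploads a partial file, then copy the dump to S3.
backup_db() (
  tmp="$(mktemp)" || exit 1
  trap 'rm -f "$tmp"' EXIT
  pg_dump --dbname="$DATABASE_URL" --file="$tmp" &&
    aws s3 cp "$tmp" "s3://$BACKUP_BUCKET/$(backup_key "$1")"
)

# restore_db KEY: stream a dump from S3 into a new instance and stop at
# the first SQL error instead of continuing with a half-restored database.
restore_db() {
  aws s3 cp "s3://$BACKUP_BUCKET/$1" - |
    psql --set ON_ERROR_STOP=1 "$NEW_DATABASE_URL"
}
```

A crontab entry would then source this file and call `backup_db` with the database name. Restoring into a fresh instance, as the runbook says, avoids conflicts with existing objects.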
MinIO (S3)
- Backups: versioned buckets enabled.
- Restore: use `mc` (MinIO client) to copy objects back by version.
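The restore-by-version step might be scripted as below. The alias `minio` (as configured with `mc alias set`), the bucket `artifacts`, and the function names are placeholders; the version ID comes from the listing step.

```shell
MC_ALIAS="${MC_ALIAS:-minio}"    # hypothetical alias configured via `mc alias set`
BUCKET="${BUCKET:-artifacts}"    # hypothetical bucket name

# Print commands instead of running them when DRY_RUN=1 (illustrative helper).
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Confirm that versioning is actually enabled on the bucket.
check_versioning() {
  run mc version info "$MC_ALIAS/$BUCKET"
}

# list_versions KEY: show every stored version of an object.
list_versions() {
  run mc ls --versions "$MC_ALIAS/$BUCKET/$1"
}

# restore_version KEY VERSION_ID DEST: copy one specific version to DEST,
# which can be a local path or another alias/bucket/key location.
restore_version() {
  run mc cp --version-id "$2" "$MC_ALIAS/$BUCKET/$1" "$3"
}
```

Copying to a separate destination first allows the recovered object to be inspected before it replaces anything.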
Temporal Upgrade / Versioning Checklist
- Upgrade Temporal cluster version-by-version (no skips).
- Validate worker build-id versioning to allow rolling upgrades:
  - Always register new build-id.
  - Maintain backwards-compatible workflows.
- Run canaries to validate new version.
- Update CLI tools and SDKs in sync with Temporal release.
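The "no skips" rule can be enforced with a small pre-flight check before an upgrade is started. The sketch below assumes "no skips" means moving at most one minor version per upgrade within the same major version; that reading and the function name `is_single_step` are assumptions, not something this checklist spells out.

```shell
# is_single_step CURRENT TARGET
# Succeeds when TARGET has the same major version as CURRENT and is either
# the same minor version or exactly one minor version ahead.
is_single_step() (
  cur="${1#v}"; tgt="${2#v}"                       # tolerate a leading "v"
  cur_major="${cur%%.*}"; tgt_major="${tgt%%.*}"
  cur_rest="${cur#*.}";   tgt_rest="${tgt#*.}"
  cur_minor="${cur_rest%%.*}"; tgt_minor="${tgt_rest%%.*}"
  [ "$cur_major" -eq "$tgt_major" ] || exit 1      # major changes need manual review
  diff=$((tgt_minor - cur_minor))
  [ "$diff" -ge 0 ] && [ "$diff" -le 1 ]           # no downgrades, no skipped minors
)
```

For example, `is_single_step 1.21.4 1.22.0` succeeds, while `is_single_step 1.21.4 1.23.0` fails because it would skip 1.22.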