Runbooks

This document provides runbooks for common operational procedures including releases, incidents, backups, and upgrades.


Release Train Cut

We follow a structured release train approach:

  1. Trigger Release Please

    • Merge main into the release branch.
    • Let Release Please generate the changelog and bump the version.
  2. Tag & Artifacts

    • CI tags the repo with the version (e.g. v1.2.3).
    • CI builds the container images and pushes them to the registry.
  3. Docs Versioning

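    • Use mike to version and publish docs (--update-aliases lets the latest alias move from the previous version to the new one):
      mike deploy --update-aliases 1.2 latest
      mike set-default latest
      git push origin gh-pages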
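
The docs step can be scripted so that the mike version always follows the release tag. This is a minimal sketch, assuming tags look like v1.2.3 and docs are versioned by MAJOR.MINOR; docs_version and publish_docs are hypothetical helper names, not part of mike.

```shell
# Derive the docs version (MAJOR.MINOR) from a release tag such as v1.2.3.
docs_version() {
  tag="${1#v}"               # strip the leading "v", if present
  printf '%s\n' "${tag%.*}"  # drop the patch component
}

# Publish docs for a tag (hypothetical wrapper; defined here, not run on load).
publish_docs() {
  version="$(docs_version "$1")"
  # --update-aliases moves an existing "latest" alias to the new version;
  # --push pushes the gh-pages branch after mike commits to it.
  mike deploy --push --update-aliases "$version" latest
  mike set-default --push latest
}

# Usage: publish_docs v1.2.3
```

With --push, the separate git push of gh-pages is no longer needed.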

Incident Response

SSE Stream Stuck

  • Check API logs for errors around /stream.
  • Restart affected API pods.
  • Validate Temporal queue health.
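
The three checks above could be run as one triage function. This is a sketch under stated assumptions: the namespace, label, deployment name, and task queue name are placeholders, and the error keywords in the filter are a guess at what the API logs contain.

```shell
# All names below are assumptions; override them via the environment.
NAMESPACE="${NAMESPACE:-myapp}"
API_LABEL="${API_LABEL:-app=api}"
API_DEPLOYMENT="${API_DEPLOYMENT:-api}"
TASK_QUEUE="${TASK_QUEUE:-default}"

# Keep only log lines that mention /stream and look like failures.
filter_stream_errors() {
  grep '/stream' | grep -iE 'error|exception|timeout'
}

sse_triage() {
  # 1. Recent API logs from all matching pods, narrowed to /stream failures.
  kubectl -n "$NAMESPACE" logs -l "$API_LABEL" --since=15m --tail=1000 --prefix \
    | filter_stream_errors
  # 2. Rolling restart of every API pod, then wait for it to finish.
  #    To restart a single affected pod instead, delete that pod by name.
  kubectl -n "$NAMESPACE" rollout restart "deployment/$API_DEPLOYMENT"
  kubectl -n "$NAMESPACE" rollout status "deployment/$API_DEPLOYMENT" --timeout=5m
  # 3. Task queue health: shows the pollers attached to the queue.
  temporal task-queue describe --task-queue "$TASK_QUEUE"
}

# Usage: sse_triage
```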

Worker Crash

  • Inspect worker pod/container logs.
  • Restart pod in Kubernetes.
  • Verify reconnection to Temporal task queue.
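
For a crashed worker, the logs of the previous container instance are usually the useful ones. The sketch below assumes workers are managed by a Deployment (so a deleted pod is recreated) and uses placeholder namespace and task queue names; restart_worker is a hypothetical helper.

```shell
# Assumed names; override them via the environment.
NAMESPACE="${NAMESPACE:-myapp}"
TASK_QUEUE="${TASK_QUEUE:-default}"

# $1 = name of the crashed worker pod
restart_worker() {
  pod="$1"
  # Logs from the container instance that crashed; this fails if the
  # container has not restarted yet, so fall back to a short message.
  kubectl -n "$NAMESPACE" logs "$pod" --previous --tail=200 \
    || echo "no previous container logs for $pod"
  # Deleting the pod lets its controller create a fresh replacement.
  kubectl -n "$NAMESPACE" delete pod "$pod"
  # The new worker should appear in the poller list for the task queue.
  temporal task-queue describe --task-queue "$TASK_QUEUE"
}

# Usage: restart_worker worker-7c9d8f6b5-abcde
```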

Queue Backlog

  • Monitor Temporal task queue metrics.
  • Scale worker deployments (HPA or manual).
  • Prioritize critical task queues during the incident.
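
Manual scaling can be sized from the backlog rather than guessed. The sketch below assumes you already have a backlog count from your metrics and a rough per-worker throughput for the drain window; desired_replicas, scale_workers, and the deployment name are hypothetical.

```shell
# Assumed names; override them via the environment.
NAMESPACE="${NAMESPACE:-myapp}"
WORKER_DEPLOYMENT="${WORKER_DEPLOYMENT:-worker}"

# Replicas needed to drain $1 backlog tasks if each worker handles $2 tasks
# in the target window, clamped to the range 1..$3.
desired_replicas() {
  backlog="$1"; per_worker="$2"; max="$3"
  want=$(( (backlog + per_worker - 1) / per_worker ))  # ceiling division
  if [ "$want" -lt 1 ]; then want=1; fi
  if [ "$want" -gt "$max" ]; then want="$max"; fi
  echo "$want"
}

# Scale the worker deployment to the computed size.
# Note: if an HPA targets this deployment, it may override a manual count.
scale_workers() {
  kubectl -n "$NAMESPACE" scale "deployment/$WORKER_DEPLOYMENT" \
    --replicas="$(desired_replicas "$1" "$2" "$3")"
}

# Usage: scale_workers 950 100 20   # 950 queued, 100 per worker, cap of 20
```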

Backups & Restores

Postgres

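  • Backups: a cron job runs pg_dump and uploads the dump to S3.
  • Restore: load the dump into a new instance with psql < dump.sql. This works for plain-SQL dumps; custom-format dumps need pg_restore instead.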
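
A backup and restore pair might look like the sketch below. It assumes plain-SQL dumps, the AWS CLI for S3 access, and connection strings in SOURCE_URL and TARGET_URL; the bucket name, key layout, and helper names are all hypothetical. For an S3-compatible endpoint, the AWS CLI also needs --endpoint-url.

```shell
# Assumed bucket name; override via the environment.
BACKUP_BUCKET="${BACKUP_BUCKET:-myapp-backups}"

# Object key for a dump: postgres/<db>/<timestamp>.sql.gz
# $1 = database name, $2 = optional timestamp (defaults to current UTC time)
backup_key() {
  printf 'postgres/%s/%s.sql.gz\n' "$1" "${2:-$(date -u +%Y-%m-%dT%H%M%SZ)}"
}

# Dump to a temp file first so a failed pg_dump never uploads a partial file.
backup_postgres() {
  tmp="$(mktemp)"
  pg_dump --dbname="$SOURCE_URL" --file="$tmp" \
    && gzip -f "$tmp" \
    && aws s3 cp "$tmp.gz" "s3://$BACKUP_BUCKET/$(backup_key "$1")"
  status=$?
  rm -f "$tmp" "$tmp.gz"
  return "$status"
}

# Stream a dump from S3 into a new instance; stop at the first SQL error.
# $1 = object key of the dump
restore_postgres() {
  aws s3 cp "s3://$BACKUP_BUCKET/$1" - | gunzip \
    | psql --dbname="$TARGET_URL" -v ON_ERROR_STOP=1
}

# Usage (cron): backup_postgres appdb
# Usage (restore): restore_postgres postgres/appdb/2024-05-01.sql.gz
```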

MinIO (S3)

  • Backups: bucket versioning is enabled, so earlier object versions are retained.
  • Restore: use mc (the MinIO client) to copy the required object version back.
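
Restoring by version comes down to finding the version ID and copying that version over the current object. A minimal sketch, assuming an mc alias named minio and a placeholder bucket name; list_versions and restore_object_version are hypothetical helpers.

```shell
# Assumed alias and bucket; override them via the environment.
MC_ALIAS="${MC_ALIAS:-minio}"
BUCKET="${BUCKET:-myapp-data}"

# Show every stored version of an object, with version IDs.
# $1 = object path inside the bucket
list_versions() {
  mc ls --versions "$MC_ALIAS/$BUCKET/$1"
}

# Copy one specific version back over the same key. On a versioned bucket
# this writes a new latest version and leaves the history in place.
# $1 = object path inside the bucket, $2 = version ID from list_versions
restore_object_version() {
  mc cp --version-id "$2" "$MC_ALIAS/$BUCKET/$1" "$MC_ALIAS/$BUCKET/$1"
}

# Usage:
#   list_versions reports/a.csv
#   restore_object_version reports/a.csv <version-id>
```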

Temporal Upgrade / Versioning Checklist

  • Upgrade Temporal cluster version-by-version (no skips).
  • Validate worker build-id versioning so that workers can be upgraded in a rolling fashion:
    • Always register the new build-id.
    • Keep workflows backwards-compatible.
  • Run canaries to validate new version.
  • Update CLI tools and SDKs in sync with Temporal release.
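
The "no skips" rule can be checked before an upgrade starts. The sketch below assumes that "version-by-version" means moving at most one minor release at a time within the same major version; is_sequential_upgrade is a hypothetical helper, not a Temporal tool.

```shell
# Succeeds when $2 is the same minor release as $1 or the next one,
# within the same major version. Versions are MAJOR.MINOR.PATCH.
is_sequential_upgrade() {
  from_major="${1%%.*}"; rest="${1#*.}"; from_minor="${rest%%.*}"
  to_major="${2%%.*}";   rest="${2#*.}"; to_minor="${rest%%.*}"
  [ "$from_major" -eq "$to_major" ] || return 1
  step=$(( to_minor - from_minor ))
  [ "$step" -eq 0 ] || [ "$step" -eq 1 ]
}

# Usage:
#   is_sequential_upgrade 1.21.5 1.22.0 && echo "ok to upgrade"
#   is_sequential_upgrade 1.21.5 1.23.0 || echo "go through 1.22 first"
```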