Production Considerations
This document captures production hardening considerations for future Floe deployments. It is not an alpha-supported production runbook, and the patterns below have not been validated as part of the current release lane.
For the current alpha, see the Kubernetes Helm and Capability Status pages to distinguish supported deployment paths from planned production operations.
1. High Availability Considerations
```
+---------------------------------------------------------------------------+
|                          HIGH AVAILABILITY SETUP                          |
|                                                                           |
|  +---------------------------------------------------------------------+  |
|  |                            Load Balancer                            |  |
|  |  +-- health checks: /health                                         |  |
|  +----------------------------------+----------------------------------+  |
|                                     |                                     |
|            +------------------------+------------------------+            |
|            v                        v                        v            |
|     +--------------+         +--------------+         +--------------+    |
|     |  webserver   |         |  webserver   |         |  webserver   |    |
|     |  (zone-a)    |         |  (zone-b)    |         |  (zone-c)    |    |
|     +--------------+         +--------------+         +--------------+    |
|                                     |                                     |
|                                     v                                     |
|                      +-----------------------------+                      |
|                      |  PostgreSQL (Multi-AZ RDS)  |                      |
|                      |  +-- automatic failover     |                      |
|                      +-----------------------------+                      |
+---------------------------------------------------------------------------+
```

2. Scaling Considerations
| Workload | Scaling Strategy |
|---|---|
| Light (< 100 runs/day) | 1 webserver, 1 daemon, 2 workers |
| Medium (100-1000 runs/day) | 2 webservers, 1 daemon, 5 workers |
| Heavy (1000+ runs/day) | 3 webservers, 1 daemon, 10+ workers, queue partitioning |
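
As a rough illustration, the Medium row could translate into Helm values along the following lines. This is a sketch only: it assumes Floe passes values through to the Dagster subchart under the `dagster:` key and that the subchart exposes `dagsterWebserver.replicaCount`, as the upstream Dagster chart does. Verify both against the chart version you deploy. Worker counts are left out because how they are set depends on the run launcher in use.

```yaml
# Sketch for the Medium tier (100-1000 runs/day): 2 webservers, 1 daemon.
# Key names are assumptions based on the upstream Dagster chart, not a
# validated Floe values contract.
dagster:
  dagsterWebserver:
    replicaCount: 2   # assumed upstream key; confirm in your chart version
  dagsterDaemon:
    enabled: true     # every tier in the table runs a single daemon
```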
3. Backup Strategy Considerations
```yaml
# CronJob for PostgreSQL backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dagster-backup
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          # Job pods must set restartPolicy to OnFailure or Never
          restartPolicy: OnFailure
          containers:
            - name: backup
              # The stock postgres:16 image does not include the AWS CLI;
              # use an image that ships both pg_dump and aws
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h $PGHOST -U $PGUSER -d dagster | \
                    gzip | \
                    aws s3 cp - s3://backups/dagster/$(date +%Y%m%d-%H%M%S).sql.gz
              envFrom:
                - secretRef:
                    name: dagster-postgresql
```

4. Monitoring Considerations
```yaml
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dagster
spec:
  selector:
    matchLabels:
      app: dagster
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
```

Key Metrics:
| Metric | Alert Threshold |
|---|---|
| `dagster_runs_failed_total` | > 5 in 1 hour |
| `dagster_runs_duration_seconds` | p99 > 3600s |
| `dagster_daemon_heartbeat_age` | > 60s |
| `container_memory_usage_bytes` | > 90% limit |
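
The thresholds in the table could be expressed as Prometheus Operator alerting rules roughly as follows. This is a hedged sketch, not a validated rule set: the metric names are taken from the table as-is, while the rule name, alert names, and severity labels are hypothetical. The p99 rule assumes `dagster_runs_duration_seconds` is exported as a histogram, and the memory rule assumes cAdvisor's `container_spec_memory_limit_bytes` is being scraped.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dagster-key-metrics          # hypothetical name
spec:
  groups:
    - name: dagster.key-metrics
      rules:
        - alert: DagsterRunFailuresHigh
          # More than 5 failed runs over the last hour
          expr: increase(dagster_runs_failed_total[1h]) > 5
          labels:
            severity: warning        # assumed severity
        - alert: DagsterRunDurationP99High
          # Assumes a histogram with a _bucket series exists
          expr: |
            histogram_quantile(
              0.99,
              sum by (le) (rate(dagster_runs_duration_seconds_bucket[1h]))
            ) > 3600
          labels:
            severity: warning
        - alert: DagsterDaemonHeartbeatStale
          expr: dagster_daemon_heartbeat_age > 60
          for: 1m
          labels:
            severity: critical
        - alert: ContainerMemoryNearLimit
          # Containers without a limit report 0, so they are filtered out
          expr: |
            container_memory_usage_bytes{container!=""}
              / (container_spec_memory_limit_bytes{container!=""} > 0)
              > 0.9
          for: 5m
          labels:
            severity: warning
```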
5. Pod Disruption Budget Considerations
PodDisruptionBudgets (PDBs) keep a minimum number of pods running during voluntary disruptions such as node drains and cluster upgrades:
```yaml
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dagster-webserver-pdb
  namespace: floe
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dagster
      component: webserver
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dagster-worker-pdb
  namespace: floe
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: dagster
      component: worker
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: polaris-pdb
  namespace: floe
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: polaris
```

| Component | PDB Setting | Min Replicas | Notes |
|---|---|---|---|
| dagster-webserver | minAvailable: 1 | 2 | UI availability |
| dagster-daemon | None | 1 | See HA section below |
| dagster-worker | minAvailable: 2 | 3 | Maintains job throughput |
| polaris | minAvailable: 1 | 2 | Catalog availability |
| marquez | minAvailable: 1 | 2 | Lineage availability |
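
The table lists a `minAvailable: 1` budget for marquez, but the manifests above do not include one. A sketch that follows the same pattern as the polaris PDB might look like this; the resource name and the `app: marquez` selector label are assumptions and must match the labels your Marquez pods actually carry.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: marquez-pdb        # hypothetical name, mirroring polaris-pdb
  namespace: floe
spec:
  minAvailable: 1          # from the table row for marquez
  selector:
    matchLabels:
      app: marquez         # assumed label; check the deployed pod labels
```

With the minimum of 2 replicas from the table, this budget lets a node drain evict one Marquez pod at a time.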
6. Dagster Daemon High Availability Considerations
The current alpha chart deploys a single Dagster daemon through the Dagster subchart. Configure alpha daemon enablement and resources with the actual chart values:
```yaml
dagster:
  dagsterDaemon:
    enabled: true
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi
```

HA daemon operation and leader election are future/candidate production patterns. Floe does not currently expose a `daemon.mode` manifest contract or chart value for switching between single-daemon and HA modes.
Current Alpha: Single Daemon
Single daemon instance with fast recovery:
```
+---------------------------------------------------------------+
|  Daemon Pod (single instance)                                 |
|                                                               |
|  +---------------------------------------------------------+  |
|  |  dagster-daemon container                               |  |
|  |  * Runs scheduler, sensors, run launcher                |  |
|  |  * Heartbeat written to PostgreSQL every 30s            |  |
|  |  * K8s restarts pod on failure (< 60s recovery)         |  |
|  +---------------------------------------------------------+  |
|                                                               |
|  livenessProbe:                                               |
|    exec: ["dagster", "daemon", "liveness-check"]              |
|    periodSeconds: 30                                          |
|    failureThreshold: 2                                        |
+---------------------------------------------------------------+
```

Future Candidate: HA Leader Election
An active-passive configuration using K8s lease-based leader election is a candidate production hardening pattern. It has not been validated as part of the alpha chart and is not currently implemented as a Floe chart value.
```
+---------------------------------------------------------------+
|  Daemon Pods (2 replicas, 1 active)                           |
|                                                               |
|   +-------------------------+   +-------------------------+   |
|   |  dagster-daemon-0       |   |  dagster-daemon-1       |   |
|   |  (LEADER - active)      |   |  (STANDBY - idle)       |   |
|   |  * Holds K8s Lease      |   |  * Watches Lease        |   |
|   |  * Runs all services    |   |  * Ready to take over   |   |
|   +------------+------------+   +-------------------------+   |
|                |                                              |
|                v                                              |
|  +---------------------------------------------------------+  |
|  |  K8s Lease: dagster-daemon-leader                       |  |
|  |    holderIdentity: dagster-daemon-0                     |  |
|  |    leaseDurationSeconds: 15                             |  |
|  |    renewTime: 2026-01-03T10:30:00Z                      |  |
|  +---------------------------------------------------------+  |
+---------------------------------------------------------------+
```

Failover Behavior
| Event | Recovery Time | Behavior |
|---|---|---|
| Pod crash (single) | < 60s | K8s restarts pod, daemon resumes |
| Pod crash (HA) | < 15s | Standby acquires lease, becomes leader |
| Node drain (single) | During drain | Pod evicted, recreated on new node |
| Node drain (HA) | < 15s | Standby on different node takes over |
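
The sub-15s HA figures above correspond to the lease duration shown in the leader election diagram: a standby can only claim the lease after the holder has stopped renewing it for roughly `leaseDurationSeconds`. For reference, the Lease object from the diagram would look approximately like this as a Kubernetes resource. It is a sketch of the candidate pattern, not something the alpha chart creates; the namespace is an assumption, and in practice the object is written and renewed by the leader election client rather than applied by hand.

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: dagster-daemon-leader
  namespace: floe                        # assumed namespace
spec:
  holderIdentity: dagster-daemon-0       # pod currently acting as leader
  leaseDurationSeconds: 15               # standby waits this long before taking over
  renewTime: "2026-01-03T10:30:00.000000Z"  # last renewal by the holder
```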
Daemon State Persistence
The daemon persists all state to PostgreSQL, allowing recovery without data loss:
| State | Storage | Recovery |
|---|---|---|
| Schedules | PostgreSQL schedules table | Automatic on restart |
| Sensors | PostgreSQL instigators table | Automatic on restart |
| Run queue | PostgreSQL runs table | Resumes queued runs |
| Heartbeat | PostgreSQL daemon_heartbeats table | New heartbeat on startup |
Monitoring
```yaml
# Alert on daemon unavailability
- alert: DagsterDaemonUnavailable
  expr: dagster_daemon_heartbeat_age_seconds > 120
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Dagster daemon heartbeat stale"
```

Planning Matrix
| Environment | Mode | Rationale |
|---|---|---|
| Development | single daemon | Simpler, sufficient for dev |
| Staging | single daemon | Test production-like recovery |
| Future production (small) | single daemon | Candidate pattern; validate before adopting |
| Future production (critical) | HA leader election | Candidate pattern for sub-15s failover requirements |
Related Documentation
- Kubernetes Helm - Base Helm deployment
- Data Mesh - Multi-domain deployment
- Capability Status - Current alpha capability boundaries