
Production Considerations

This document captures production hardening considerations for future Floe deployments. It is not an alpha-supported production runbook, and the patterns below have not been validated as part of the current release lane.

For the current alpha, use Kubernetes Helm and Capability Status to distinguish supported deployment paths from planned production operations.


+---------------------------------------------------------------------------+
|                          HIGH AVAILABILITY SETUP                          |
|                                                                           |
|  +---------------------------------------------------------------------+  |
|  |                            Load Balancer                            |  |
|  |                            +-- health checks: /health               |  |
|  +----------------------------------+----------------------------------+  |
|                                     |                                     |
|            +------------------------+------------------------+            |
|            v                        v                        v            |
|     +--------------+         +--------------+         +--------------+    |
|     |  webserver   |         |  webserver   |         |  webserver   |    |
|     |   (zone-a)   |         |   (zone-b)   |         |   (zone-c)   |    |
|     +--------------+         +--------------+         +--------------+    |
|                                     |                                     |
|                                     v                                     |
|                      +-----------------------------+                      |
|                      |  PostgreSQL (Multi-AZ RDS)  |                      |
|                      |  +-- automatic failover     |                      |
|                      +-----------------------------+                      |
+---------------------------------------------------------------------------+
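
One way to approximate the per-zone webserver placement in the diagram is a topology spread constraint on the webserver pods. The fragment below is a minimal sketch of a pod template, not a validated Floe chart value. It assumes the webserver pods carry the `app: dagster` and `component: webserver` labels used by the PodDisruptionBudget selectors in this document, and it leaves open where the fragment would be set in the chart.

```yaml
# Pod template fragment (sketch): spread webserver replicas across zones.
# Assumption: pods are labeled app=dagster, component=webserver.
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                # zones may differ by at most one replica
      topologyKey: topology.kubernetes.io/zone  # well-known node label for the zone
      whenUnsatisfiable: ScheduleAnyway         # prefer spreading, but do not block scheduling
      labelSelector:
        matchLabels:
          app: dagster
          component: webserver
```

`ScheduleAnyway` keeps the constraint soft, so a zone outage does not leave replacement pods unschedulable. `DoNotSchedule` would enforce strict spreading at the cost of pending pods when a zone is unavailable.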

| Workload | Scaling Strategy |
|----------|------------------|
| Light (< 100 runs/day) | 1 webserver, 1 daemon, 2 workers |
| Medium (100-1000 runs/day) | 2 webservers, 1 daemon, 5 workers |
| Heavy (1000+ runs/day) | 3 webservers, 1 daemon, 10+ workers, queue partitioning |

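# CronJob for PostgreSQL backups
apiVersion: batch/v1
kind: CronJob
metadata:
  name: dagster-backup
spec:
  schedule: "0 */6 * * *" # Every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          # Job pods must set restartPolicy to OnFailure or Never
          restartPolicy: OnFailure
          containers:
            - name: backup
              # The image must provide both pg_dump and the AWS CLI;
              # the stock postgres image does not ship the AWS CLI
              image: postgres:16
              command:
                - /bin/sh
                - -c
                - |
                  pg_dump -h "$PGHOST" -U "$PGUSER" -d dagster | \
                    gzip | \
                    aws s3 cp - "s3://backups/dagster/$(date +%Y%m%d-%H%M%S).sql.gz"
              # Supplies PGHOST, PGUSER, and credentials to the container
              envFrom:
                - secretRef:
                    name: dagster-postgresql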

# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dagster
spec:
  selector:
    matchLabels:
      app: dagster
  endpoints:
    - port: http
      path: /metrics
      interval: 30s

Key Metrics:

| Metric | Alert Threshold |
|--------|-----------------|
| dagster_runs_failed_total | > 5 in 1 hour |
| dagster_runs_duration_seconds | p99 > 3600s |
| dagster_daemon_heartbeat_age | > 60s |
| container_memory_usage_bytes | > 90% limit |
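
The thresholds in the table could be expressed as Prometheus Operator alert rules along the following lines. This is a sketch under several assumptions: the `dagster_*` metrics are actually exported under the names in the table, `dagster_runs_duration_seconds` is a Prometheus histogram (so a `_bucket` series exists), and the rule name, alert names, and severity labels are illustrative. The daemon heartbeat alert is omitted here because the daemon section of this document covers it separately.

```yaml
# Sketch only: metric names come from the table; everything else is illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: dagster-alerts   # hypothetical name
spec:
  groups:
    - name: dagster
      rules:
        # More than 5 failed runs within the last hour
        - alert: DagsterRunFailures
          expr: increase(dagster_runs_failed_total[1h]) > 5
          labels:
            severity: warning
        # p99 run duration above one hour (assumes a histogram metric)
        - alert: DagsterRunDurationP99High
          expr: histogram_quantile(0.99, sum by (le) (rate(dagster_runs_duration_seconds_bucket[1h]))) > 3600
          labels:
            severity: warning
        # Memory usage above 90% of the container limit; containers
        # without a limit (limit reported as 0) are filtered out
        - alert: ContainerMemoryNearLimit
          expr: (container_memory_usage_bytes / (container_spec_memory_limit_bytes > 0)) > 0.9
          for: 5m
          labels:
            severity: warning
```

Filtering on a positive limit avoids division by zero for containers without a memory limit, which would otherwise fire the alert permanently.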

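PodDisruptionBudgets (PDBs) limit how many pods a voluntary disruption, such as a node drain, can evict at once, which helps keep services available during cluster maintenance: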

---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dagster-webserver-pdb
  namespace: floe
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: dagster
      component: webserver
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: dagster-worker-pdb
  namespace: floe
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: dagster
      component: worker
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: polaris-pdb
  namespace: floe
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: polaris

| Component | PDB Setting | Min Replicas | Notes |
|-----------|-------------|--------------|-------|
| dagster-webserver | minAvailable: 1 | 2 | UI availability |
| dagster-daemon | None | 1 | See HA section below |
| dagster-worker | minAvailable: 2 | 3 | Maintains job throughput |
| polaris | minAvailable: 1 | 2 | Catalog availability |
| marquez | minAvailable: 1 | 2 | Lineage availability |
marquezminAvailable: 12Lineage availability

6. Dagster Daemon High Availability Considerations


The current alpha chart deploys a single Dagster daemon through the Dagster subchart. Configure alpha daemon enablement and resources with the actual chart values:

# charts/floe-platform/values.yaml
dagster:
  dagsterDaemon:
    enabled: true
    resources:
      requests:
        cpu: 100m
        memory: 256Mi
      limits:
        cpu: 500m
        memory: 512Mi

HA daemon operation and leader election are future/candidate production patterns. Floe does not currently expose a daemon.mode manifest contract or chart value for switching between single-daemon and HA modes.

Single daemon instance with fast recovery:

+---------------------------------------------------------------+
| Daemon Pod (single instance)                                  |
|                                                               |
| +------------------------------------------------------------+|
| | dagster-daemon container                                   ||
| | * Runs scheduler, sensors, run launcher                    ||
| | * Heartbeat written to PostgreSQL every 30s                ||
| | * K8s restarts pod on failure (< 60s recovery)             ||
| +------------------------------------------------------------+|
|                                                               |
| livenessProbe:                                                |
|   exec: ["dagster", "daemon", "liveness-check"]               |
|   periodSeconds: 30                                           |
|   failureThreshold: 2                                         |
+---------------------------------------------------------------+
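
Written out as a Kubernetes container spec fragment, the probe in the diagram would look roughly like this. The command is copied from the diagram as-is and should be verified against the Dagster CLI in the deployed image. Where the fragment is configured in the chart is not specified here.

```yaml
# Container spec fragment (sketch) for the dagster-daemon container.
livenessProbe:
  exec:
    # Command taken from the diagram; verify against the Dagster version in use
    command: ["dagster", "daemon", "liveness-check"]
  periodSeconds: 30      # run the check every 30 seconds
  failureThreshold: 2    # restart the container after 2 consecutive failures
```

With these settings, a daemon that hangs without exiting is detected only after two failed probes, which can take up to about 60 seconds. A container that crashes outright is restarted by the kubelet without waiting for the probe.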

An active-passive configuration using K8s lease-based leader election is a candidate production hardening pattern. It has not been validated as part of the alpha chart and is not currently implemented as a Floe chart value.

+---------------------------------------------------------------+
| Daemon Pods (2 replicas, 1 active)                            |
|                                                               |
| +-------------------------+       +-------------------------+ |
| | dagster-daemon-0        |       | dagster-daemon-1        | |
| | (LEADER - active)       |       | (STANDBY - idle)        | |
| | * Holds K8s Lease       |       | * Watches Lease         | |
| | * Runs all services     |       | * Ready to take over    | |
| +------------+------------+       +-------------------------+ |
|              |                                                |
|              v                                                |
| +------------------------------------------------------------+|
| | K8s Lease: dagster-daemon-leader                           ||
| |   holderIdentity: dagster-daemon-0                         ||
| |   leaseDurationSeconds: 15                                 ||
| |   renewTime: 2026-01-03T10:30:00Z                          ||
| +------------------------------------------------------------+|
+---------------------------------------------------------------+

| Event | Recovery Time | Behavior |
|-------|---------------|----------|
| Pod crash (single) | < 60s | K8s restarts pod, daemon resumes |
| Pod crash (HA) | < 15s | Standby acquires lease, becomes leader |
| Node drain (single) | During drain | Pod evicted, recreated on new node |
| Node drain (HA) | < 15s | Standby on different node takes over |
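
Lease-based election uses the `coordination.k8s.io/v1` Lease API, so the daemon pods would need permission to read and update the lease named in the diagram. The Role below is a rough sketch of that requirement only. Its name and the `floe` namespace are assumptions, a RoleBinding to the daemon's service account is still needed, and the election logic itself (in the daemon or a sidecar) is not shown.

```yaml
# Sketch: RBAC for a lease-based leader election candidate pattern.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: dagster-daemon-leader-election   # hypothetical name
  namespace: floe                        # assumed, matching the PDB manifests
rules:
  # Read and renew the specific lease shown in the diagram
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    resourceNames: ["dagster-daemon-leader"]
    verbs: ["get", "update", "patch"]
  # Create requests cannot be restricted by resourceNames,
  # so creation is granted on leases in this namespace
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["create"]
```

Scoping the read and update verbs to a single lease name keeps the daemon from touching other leases in the namespace.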

The daemon persists all state to PostgreSQL, allowing recovery without data loss:

| State | Storage | Recovery |
|-------|---------|----------|
| Schedules | PostgreSQL schedules table | Automatic on restart |
| Sensors | PostgreSQL instigators table | Automatic on restart |
| Run queue | PostgreSQL runs table | Resumes queued runs |
| Heartbeat | PostgreSQL daemon_heartbeats table | New heartbeat on startup |

# Alert on daemon unavailability
- alert: DagsterDaemonUnavailable
  expr: dagster_daemon_heartbeat_age_seconds > 120
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Dagster daemon heartbeat stale"

| Environment | Mode | Rationale |
|-------------|------|-----------|
| Development | single daemon | Simpler, sufficient for dev |
| Staging | single daemon | Test production-like recovery |
| Future production (small) | single daemon | Candidate pattern; validate before adopting |
| Future production (critical) | HA leader election | Candidate pattern for sub-15s failover requirements |