
# ADR-0019: Platform Services Lifecycle

**Status:** Accepted

floe operates two fundamentally different types of workloads:

  1. **Long-lived services** - orchestrator UIs, catalog servers, semantic layer APIs
  2. **Ephemeral jobs** - dbt runs, pipeline executions, data quality checks

Without clear lifecycle boundaries:

  - Teams conflate deployment strategies
  - State management becomes unclear
  - Scaling approaches are inappropriate
  - Debugging is difficult

Define distinct lifecycle models for platform services vs pipeline jobs.

Platform services run continuously and are managed by the Platform Team.

| Characteristic | Value |
| --- | --- |
| K8s Resource | Deployment, StatefulSet |
| Lifecycle | Long-lived, upgraded in place |
| State | Stateful (databases, caches) |
| Scaling | Fixed replicas or HPA |
| Deployment | `floe platform deploy` |
| Upgrades | Rolling updates, blue-green |
| Owner | Platform Team |
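Rendered to Kubernetes, this profile might look like the following Deployment sketch. This is a hypothetical manifest: the resource name, labels, image tag, and port are illustrative assumptions, not floe's actual chart output.

```yaml
# Hypothetical Deployment for a platform service (here, the Polaris catalog).
# Long-lived profile: fixed replicas, rolling updates, upgraded in place.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: polaris-server            # illustrative name
  labels:
    app.kubernetes.io/component: catalog
spec:
  replicas: 2                     # fixed replicas (or managed by an HPA)
  strategy:
    type: RollingUpdate           # upgraded in place
    rollingUpdate:
      maxUnavailable: 0           # keep the catalog reachable during upgrades
  selector:
    matchLabels:
      app: polaris-server
  template:
    metadata:
      labels:
        app: polaris-server
    spec:
      containers:
        - name: polaris
          image: apache/polaris:1.0.0   # illustrative image tag
          ports:
            - containerPort: 8181
```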

Services:

| Service | K8s Resource | State | Purpose |
| --- | --- | --- | --- |
| Dagster webserver | Deployment | PostgreSQL | Orchestrator UI |
| Dagster daemon | Deployment | PostgreSQL | Job scheduling |
| Polaris server | Deployment | PostgreSQL | Iceberg catalog |
| Cube server | Deployment | Redis | Semantic layer API |
| OTLP Collector | Deployment | None | Telemetry collection |
| Prometheus | StatefulSet | PVC | Metrics storage |
| Grafana | Deployment | PVC | Dashboards |
| MinIO | StatefulSet | PVC | Object storage |

Pipeline jobs run to completion and are triggered by the orchestrator.

| Characteristic | Value |
| --- | --- |
| K8s Resource | Job |
| Lifecycle | Run-to-completion |
| State | Stateless |
| Scaling | One pod per execution |
| Deployment | Triggered by orchestrator |
| Retries | Handled by orchestrator |
| Owner | Data Team (execution), Platform Team (infrastructure) |
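A pipeline job's profile might render to something like this Job sketch. Again hypothetical: the Job name, image, and arguments are illustrative assumptions.

```yaml
# Hypothetical Job created by the orchestrator for one dbt run.
# Run-to-completion profile: no probes, no in-pod restarts, auto-cleanup.
apiVersion: batch/v1
kind: Job
metadata:
  name: dbt-run-abc123              # one Job per execution (illustrative name)
spec:
  backoffLimit: 0                   # fail fast; the orchestrator owns retries
  ttlSecondsAfterFinished: 3600     # clean up finished Jobs after 1 hour
  template:
    spec:
      restartPolicy: Never          # the exit code is the health signal
      containers:
        - name: dbt
          image: ghcr.io/example/dbt-runner:latest   # illustrative image
          args: ["dbt", "run", "--target", "prod"]   # illustrative args
```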

Jobs:

| Job Type | Trigger | Duration | Output |
| --- | --- | --- | --- |
| dbt run | Schedule/manual | Minutes | Iceberg tables |
| dbt test | Post-run | Seconds | Test results |
| dlt ingestion | Schedule | Minutes | Raw tables |
| Quality checks | Post-run | Seconds | Pass/fail |

Positive:

  - **Clear ownership** - Platform Team owns services, Data Team triggers jobs
  - **Appropriate scaling** - services use HPA, jobs scale per execution
  - **State clarity** - services manage state, jobs are stateless
  - **Debugging** - different tooling per layer (`kubectl logs` vs job history)

Negative:

  - **Complexity** - two deployment models to understand
  - **Coordination** - services must be up before jobs can run
  - **Resource planning** - different capacity planning per layer

Neutral:

  - The orchestrator bridges the gap (a service that manages jobs)
  - Both layers use standard K8s resources
```bash
# Deploy all platform services
floe platform deploy

# Deploy specific service
floe platform deploy --component=orchestrator

# Upgrade services
floe platform upgrade --version=1.3.0

# Check service health
floe platform status

# View service logs
floe platform logs orchestrator
floe platform logs catalog
```
```bash
# Trigger pipeline run (creates K8s Job)
floe run  # planned root data-team command; not alpha-supported yet

# View job status
floe status

# View job logs
floe logs --run-id=abc123

# Jobs are also visible via orchestrator UI
```
```
┌─────────────────────────────────────────────┐
│ PIPELINE JOBS (Layer 4) - Ephemeral         │
│                                             │
│  dbt-run-abc123 (Job)                       │
│  └─ Connects to: compute, catalog, OTLP     │
│                                             │
│  dlt-ingest-xyz789 (Job)                    │
│  └─ Connects to: catalog, OTLP              │
└──────────────────────┬──────────────────────┘
                       │ Requires
┌──────────────────────▼──────────────────────┐
│ PLATFORM SERVICES (Layer 3) - Long-lived    │
│                                             │
│  Orchestrator (Dagster): webserver + daemon │
│    └─ PostgreSQL                            │
│  Catalog (Polaris): server                  │
│    └─ PostgreSQL                            │
│  Semantic layer (Cube): server              │
│    └─ Redis                                 │
│  Observability (OTLP): collector            │
│    └─ Prometheus / Grafana                  │
│                                             │
│  STORAGE: MinIO (objects),                  │
│           OCI Reg (artifacts)               │
└─────────────────────────────────────────────┘
```

Platform services must start in dependency order:

`charts/floe-platform/templates/startup-order.yaml`:

```
1. Storage (MinIO, PostgreSQL instances)
   └─ Wait: StatefulSet ready
2. Catalog (Polaris)
   └─ Wait: Deployment ready, healthcheck passes
3. Observability (OTLP Collector, Prometheus)
   └─ Wait: Deployment ready
4. Orchestrator (Dagster webserver, daemon)
   └─ Wait: Deployment ready, code locations loaded
5. Semantic Layer (Cube)
   └─ Wait: Deployment ready, schema loaded
```
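One way to enforce this ordering inside the charts is an initContainer that blocks a later service until its upstream dependency answers a healthcheck. A sketch only: the `polaris-server` Service name, port 8181, the healthcheck path, and the curl image are all assumptions, not floe's actual templates.

```yaml
# Hypothetical init container on the Dagster webserver pod: block startup
# until the Polaris catalog (step 2) responds, before the orchestrator
# (step 4) comes up.
initContainers:
  - name: wait-for-catalog
    image: curlimages/curl:8.5.0        # illustrative image
    command:
      - sh
      - -c
      - |
        # Poll the (assumed) catalog healthcheck endpoint until it answers
        until curl -fsS http://polaris-server:8181/healthcheck; do
          echo "waiting for catalog..."
          sleep 5
        done
```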
```yaml
# Kubernetes probes for long-lived services
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```
```yaml
# Jobs don't use probes - they run to completion
# Health is determined by exit code
spec:
  backoffLimit: 0                # Fail fast, let orchestrator handle retries
  ttlSecondsAfterFinished: 3600  # Cleanup after 1 hour
```
```yaml
# Sized for steady-state operation
resources:
  requests:
    cpu: 500m
    memory: 512Mi
  limits:
    cpu: 1000m
    memory: 1Gi
```
```yaml
# Sized for workload, determined by ComputePlugin
resources:
  requests:
    cpu: 1000m    # Higher for computation
    memory: 2Gi   # Higher for data processing
  limits:
    cpu: 4000m
    memory: 8Gi
```