ADR-0019: Platform Services Lifecycle
Status
Accepted
Context
floe operates two fundamentally different types of workloads:
- Long-lived services - Orchestrator UIs, catalog servers, semantic layer APIs
- Ephemeral jobs - dbt runs, pipeline executions, data quality checks
Without clear lifecycle boundaries:
- Teams conflate deployment strategies
- State management becomes unclear
- Scaling approaches are mismatched to the workload
- Debugging is difficult
Decision
Define distinct lifecycle models for platform services vs pipeline jobs.
Layer 3: Platform Services (Long-lived)
Platform services run continuously and are managed by the Platform Team.
| Characteristic | Value |
|---|---|
| K8s Resource | Deployment, StatefulSet |
| Lifecycle | Long-lived, upgraded in place |
| State | Stateful (databases, caches) |
| Scaling | Fixed replicas or HPA |
| Deployment | floe platform deploy |
| Upgrades | Rolling updates, blue-green |
| Owner | Platform Team |
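Where a service scales with HPA rather than fixed replicas, the autoscaler simply targets the service's Deployment. A minimal sketch, assuming the Cube server scales on CPU (the target name and thresholds are illustrative, not a floe default):

```yaml
# Hypothetical HPA for a platform service; target name and thresholds are illustrative
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cube-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cube-server        # the long-lived Deployment to scale
  minReplicas: 2
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # scale out above 70% average CPU
```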
Services:
| Service | K8s Resource | State | Purpose |
|---|---|---|---|
| Dagster webserver | Deployment | PostgreSQL | Orchestrator UI |
| Dagster daemon | Deployment | PostgreSQL | Job scheduling |
| Polaris server | Deployment | PostgreSQL | Iceberg catalog |
| Cube server | Deployment | Redis | Semantic layer API |
| OTLP Collector | Deployment | None | Telemetry collection |
| Prometheus | StatefulSet | PVC | Metrics storage |
| Grafana | Deployment | PVC | Dashboards |
| MinIO | StatefulSet | PVC | Object storage |
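The table above implies the manifest shape: a Deployment with a rolling-update strategy so services are upgraded in place. A minimal sketch for the Polaris server, assuming the generic health port used later in this ADR (image tag, labels, and replica count are illustrative):

```yaml
# Hypothetical Deployment for the Polaris catalog server; image and labels are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: polaris-server
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # keep the catalog reachable during upgrades
      maxSurge: 1
  selector:
    matchLabels:
      app: polaris-server
  template:
    metadata:
      labels:
        app: polaris-server
    spec:
      containers:
        - name: polaris
          image: apache/polaris:latest   # illustrative tag
          ports:
            - containerPort: 8080        # matches the probe config below
```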
Layer 4: Pipeline Jobs (Ephemeral)
Pipeline jobs run to completion and are triggered by the orchestrator.
| Characteristic | Value |
|---|---|
| K8s Resource | Job |
| Lifecycle | Run-to-completion |
| State | Stateless |
| Scaling | One pod per execution |
| Deployment | Triggered by orchestrator |
| Retries | Handled by orchestrator |
| Owner | Data Team (execution), Platform Team (infrastructure) |
Jobs:
| Job Type | Trigger | Duration | Output |
|---|---|---|---|
| dbt run | Schedule/manual | Minutes | Iceberg tables |
| dbt test | Post-run | Seconds | Test results |
| dlt ingestion | Schedule | Minutes | Raw tables |
| Quality checks | Post-run | Seconds | Pass/fail |
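Concretely, each execution is a batch/v1 Job that the orchestrator creates and Kubernetes garbage-collects; the spec values mirror the job-health settings later in this ADR. A sketch for one dbt run (name, image, and command are illustrative):

```yaml
# Hypothetical Job created by the orchestrator for a single dbt run;
# name, image, and command are illustrative
apiVersion: batch/v1
kind: Job
metadata:
  name: dbt-run-abc123
spec:
  backoffLimit: 0               # fail fast; retries belong to the orchestrator
  ttlSecondsAfterFinished: 3600 # cleanup after 1 hour
  template:
    spec:
      restartPolicy: Never      # run-to-completion, no in-place restarts
      containers:
        - name: dbt
          image: ghcr.io/example/dbt-runner:latest   # illustrative image
          command: ["dbt", "run"]
```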
Consequences
Positive
- Clear ownership - Platform Team owns services, Data Team triggers jobs
- Appropriate scaling - Services use HPA, jobs scale per-execution
- State clarity - Services manage state, jobs are stateless
- Debugging - Different tooling per layer (`kubectl logs` vs job history); see the sketch below
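For example, service debugging goes through kubectl against long-lived pods, while job debugging inspects completed Job objects (resource names are illustrative):

```bash
# Layer 3: tail a long-lived service
kubectl logs deployment/polaris-server --follow

# Layer 4: inspect a finished job and its logs
kubectl get jobs
kubectl logs job/dbt-run-abc123
```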
Negative
- Complexity - Two deployment models to understand
- Coordination - Services must be up before jobs can run
- Resource planning - Different capacity planning per layer
Neutral
- Orchestrator bridges the gap (a long-lived service that manages ephemeral jobs)
- Both layers use standard K8s resources
Deployment Commands
Platform Services (Layer 3)
```bash
# Deploy all platform services
floe platform deploy

# Deploy specific service
floe platform deploy --component=orchestrator

# Upgrade services
floe platform upgrade --version=1.3.0

# Check service health
floe platform status

# View service logs
floe platform logs orchestrator
floe platform logs catalog
```

Pipeline Jobs (Layer 4)
```bash
# Trigger pipeline run (creates K8s Job)
floe run  # planned root data-team command; not alpha-supported yet

# View job status
floe status

# View job logs
floe logs --run-id=abc123

# Jobs are also visible via orchestrator UI
```

Service Dependencies
```
┌─────────────────────────────────────────────────────────────────────────┐
│ PIPELINE JOBS (Layer 4) - Ephemeral                                     │
│                                                                         │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │ dbt-run-abc123 (Job)                                              │  │
│  │  └─ Connects to: compute, catalog, OTLP                           │  │
│  │                                                                   │  │
│  │ dlt-ingest-xyz789 (Job)                                           │  │
│  │  └─ Connects to: catalog, OTLP                                    │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────┬────────────────────────────────────┘
                                     │ Requires
                                     ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ PLATFORM SERVICES (Layer 3) - Long-lived                                │
│                                                                         │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │ Orchestrator│  │ Catalog     │  │ Semantic    │  │Observability│     │
│  │ Dagster     │  │ Polaris     │  │ Cube        │  │ OTLP        │     │
│  │  ┌───────┐  │  │  ┌───────┐  │  │  ┌───────┐  │  │  ┌───────┐  │     │
│  │  │Websvr │  │  │  │Server │  │  │  │Server │  │  │  │Collect│  │     │
│  │  │Daemon │  │  │  │       │  │  │  │       │  │  │  │       │  │     │
│  │  └───┬───┘  │  │  └───┬───┘  │  │  └───┬───┘  │  │  └───┬───┘  │     │
│  │      │      │  │      │      │  │      │      │  │      │      │     │
│  │  ┌───▼───┐  │  │  ┌───▼───┐  │  │  ┌───▼───┐  │  │      │      │     │
│  │  │  PG   │  │  │  │  PG   │  │  │  │ Redis │  │  │      │      │     │
│  │  └───────┘  │  │  └───────┘  │  │  └───────┘  │  │      │      │     │
│  └─────────────┘  └─────────────┘  └─────────────┘  └──────│──────┘     │
│                                                            │            │
│  ┌─────────────────────────────────────────────────────────│─────────┐  │
│  │ STORAGE                                                 │         │  │
│  │  ┌─────────────┐  ┌─────────────┐                 ┌─────▼─────┐   │  │
│  │  │ MinIO       │  │ OCI Reg     │                 │Prometheus │   │  │
│  │  │ (objects)   │  │ (artifacts) │                 │ Grafana   │   │  │
│  │  └─────────────┘  └─────────────┘                 └───────────┘   │  │
│  └───────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────┘
```
Startup Order

Platform services must start in dependency order:
```
1. Storage (MinIO, PostgreSQL instances)
   └─ Wait: StatefulSet ready

2. Catalog (Polaris)
   └─ Wait: Deployment ready, healthcheck passes

3. Observability (OTLP Collector, Prometheus)
   └─ Wait: Deployment ready

4. Orchestrator (Dagster webserver, daemon)
   └─ Wait: Deployment ready, code locations loaded

5. Semantic Layer (Cube)
   └─ Wait: Deployment ready, schema loaded
```
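One way to enforce this ordering, assuming dependencies expose the health endpoints shown in the next section, is an init container in the pod spec that blocks until the dependency answers. A sketch for the Dagster webserver waiting on the catalog (service name and URL are illustrative):

```yaml
# Hypothetical init container on the Dagster webserver pod:
# block startup until the catalog health endpoint responds
initContainers:
  - name: wait-for-catalog
    image: curlimages/curl:8.8.0
    command:
      - sh
      - -c
      - until curl -sf http://polaris-server:8080/health; do sleep 5; done
```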
Health Checks

Service Health (Layer 3)
Section titled “Service Health (Layer 3)”# Kubernetes probes for long-lived serviceslivenessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 30 periodSeconds: 10
readinessProbe: httpGet: path: /ready port: 8080 initialDelaySeconds: 5 periodSeconds: 5Job Health (Layer 4)
```yaml
# Jobs don't use probes - they run to completion
# Health is determined by exit code
spec:
  backoffLimit: 0               # Fail fast, let orchestrator handle retries
  ttlSecondsAfterFinished: 3600 # Cleanup after 1 hour
```

Resource Allocation
Services (Continuous)
Section titled “Services (Continuous)”# Sized for steady-state operationresources: requests: cpu: 500m memory: 512Mi limits: cpu: 1000m memory: 1GiJobs (Burst)
Section titled “Jobs (Burst)”# Sized for workload, determined by ComputePluginresources: requests: cpu: 1000m # Higher for computation memory: 2Gi # Higher for data processing limits: cpu: 4000m memory: 8GiReferences
- ADR-0016: Platform Enforcement Architecture - Four-layer architecture
- ADR-0017: K8s Testing Infrastructure - Testing approach
- Deployment View - Deployment overview
- Production - Production deployment details