# ADR-0014: Flink for Streaming Workloads (Deferred)
## Status

Deferred to V2.0
## Context

Floe’s initial design focuses on batch-oriented data pipelines using dbt + warehouse compute (DuckDB, Snowflake, BigQuery). However, some use cases require near-real-time processing with sub-minute latency:
- CDC (Change Data Capture) streaming
- Real-time feature engineering for ML
- Event-driven transformations
- Continuous aggregations
Apache Flink is a leading open-source stream processing engine, offering:
- **Stateful processing**: windows, joins, aggregations
- **Exactly-once semantics**: checkpointing with distributed snapshots
- **SQL support**: Flink SQL for declarative stream transformations
- **Ecosystem**: integrates with Iceberg, Kafka, and dbt (via the dbt-flink adapter)
## Current Scope

**MVP (V1.0):** Batch processing only
- dbt compiles SQL to warehouse-native queries
- Dagster orchestrates scheduled runs
- No streaming workloads
**Future (V2.0+):** Streaming with Flink
- dbt SQL optionally compiles to Flink SQL
- Dagster manages Flink job lifecycle
- Continuous queries alongside batch
## Decision

Defer Flink streaming to V2.0 until:
- Customer demand for sub-minute latency emerges
- dbt-flink adapter matures (currently experimental)
- Team has capacity for Flink operational complexity
Design for extensibility now by:
- Keeping compute target abstraction flexible
- Designing Dagster integration to support both batch and streaming
- Documenting future Flink integration path
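One way to keep the compute target abstraction flexible is to have each backend declare which execution modes it supports, so a Flink streaming target can be added in V2.0 without changing the dispatch logic. A minimal sketch of that idea — all class and function names here (`ComputeTarget`, `DbtWarehouseTarget`, `dispatch`, the `CompiledArtifacts` fields) are hypothetical illustrations, not floe-core’s actual API:

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Protocol


class ExecutionMode(Enum):
    BATCH = auto()
    STREAMING = auto()


@dataclass
class CompiledArtifacts:
    """Simplified stand-in for the compiler's output."""
    sql: str
    mode: ExecutionMode


class ComputeTarget(Protocol):
    """A compute backend declares which execution modes it supports."""
    name: str
    supported_modes: frozenset

    def submit(self, artifacts: CompiledArtifacts) -> str: ...


@dataclass
class DbtWarehouseTarget:
    """Batch-only target; a V2.0 FlinkTarget would add STREAMING."""
    name: str = "dbt-duckdb"
    supported_modes: frozenset = frozenset({ExecutionMode.BATCH})

    def submit(self, artifacts: CompiledArtifacts) -> str:
        return f"{self.name}: ran batch query"


def dispatch(target: ComputeTarget, artifacts: CompiledArtifacts) -> str:
    """Reject unsupported modes instead of baking batch-only logic in."""
    if artifacts.mode not in target.supported_modes:
        raise ValueError(
            f"{target.name} does not support {artifacts.mode.name} execution"
        )
    return target.submit(artifacts)
```

With this shape, adding streaming later means registering a new target whose `supported_modes` includes `STREAMING`; existing batch pipelines are untouched.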
## Consequences

### Positive
- **Simpler MVP**: no Flink deployment, state management, or checkpointing
- **Reduced operational complexity**: batch workloads are easier to debug and operate
- **Faster time to market**: 90% of use cases covered by batch processing
- **Mature tooling**: dbt + warehouses are production-proven
### Negative

- **No real-time processing**: users needing sub-minute latency must use external tools
- **Future migration cost**: pipelines may need refactoring for streaming semantics
- **Competitive gap**: some competitors offer streaming out of the box
### Neutral

- Batch processing (hourly/daily) covers most analytics use cases
- Streaming can be added without breaking existing pipelines
## Future Design (V2.0)

### Integration Architecture
Section titled “Integration Architecture”floe.yaml ↓floe-core Compiler ↓CompiledArtifacts ↓┌─────────────────────┐│ OrchestratorPlugin ││ (e.g., Dagster) │└──────────┬──────────┘ │ ┌────┴─────┐ ▼ ▼ ┌──────┐ ┌────────┐ │ dbt │ │ Flink │ │Batch │ │Streaming│ └──────┘ └────────┘Open Questions (To Be Resolved in V2.0)
- **dbt SQL to Flink SQL compilation**
  - How do batch semantics map to streaming? (e.g., does `GROUP BY` become a tumbling window?)
  - Which dbt features are unsupported in streaming mode?
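To make the `GROUP BY` → tumbling window question concrete, a compiler pass might rewrite an hourly batch aggregation into Flink SQL’s `TUMBLE` table-valued function. A hypothetical sketch — the table, column names, and the `to_tumbling_window` helper are invented for illustration; only the `TUMBLE(...)` syntax in the rendered string is Flink SQL:

```python
def to_tumbling_window(table: str, key: str, time_col: str, size: str) -> str:
    """Render a Flink SQL tumbling-window aggregation that approximates a
    batch `SELECT key, COUNT(*) FROM table GROUP BY key` re-run every `size`."""
    return (
        f"SELECT {key}, window_start, window_end, COUNT(*) AS n\n"
        f"FROM TABLE(\n"
        f"  TUMBLE(TABLE {table}, DESCRIPTOR({time_col}), INTERVAL {size}))\n"
        f"GROUP BY {key}, window_start, window_end"
    )


# Batch form, as dbt would compile it today:
batch_sql = "SELECT user_id, COUNT(*) AS n FROM events GROUP BY user_id"

# Streaming form a V2.0 compiler pass might emit:
streaming_sql = to_tumbling_window("events", "user_id", "event_time", "'1' HOUR")
```

Note the rewrite is not purely mechanical: it requires an event-time column and a window size that have no counterpart in the batch query, which is exactly why this mapping is an open question.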
- **State management**
  - State backend: RocksDB vs. in-memory?
  - Savepoint strategy for GitOps deployments?
- **Deployment model**
  - Flink session cluster vs. job cluster?
  - Auto-scaling strategy for Flink TaskManagers?
- **Dagster integration**
  - Schedules (batch) vs. always-on sensors (streaming)?
  - How to handle streaming job restarts?
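For the restart question, one common pattern — independent of Dagster’s specific APIs — is a supervisor that restarts the streaming job with exponential backoff, resuming from the latest checkpoint. A minimal sketch, assuming the orchestrator exposes some callable that launches (or resumes) the Flink job; `supervise` and its parameters are hypothetical:

```python
import time
from typing import Callable


def supervise(run_job: Callable[[], None],
              max_restarts: int = 5,
              base_delay: float = 1.0) -> int:
    """Run a streaming job, restarting on failure with exponential
    backoff. Returns the number of restarts that were needed."""
    restarts = 0
    while True:
        try:
            run_job()  # e.g., resume the Flink job from its last checkpoint
            return restarts
        except RuntimeError:
            if restarts >= max_restarts:
                raise  # give up and surface the failure to the orchestrator
            time.sleep(base_delay * (2 ** restarts))
            restarts += 1
```

Whether this loop lives inside a Dagster sensor, a Kubernetes operator, or Flink’s own restart strategies is part of what V2.0 needs to decide.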
- **Observability**
  - Flink metrics → OpenTelemetry?
  - Watermark lag monitoring?
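Watermark lag is typically computed as the gap between wall-clock time and the operator’s current watermark; a growing gap means the pipeline is falling behind event time. A minimal sketch, assuming watermarks arrive as epoch milliseconds (as Flink’s watermark metrics report them) — the function names and the one-minute threshold are illustrative choices, not a Flink or floe API:

```python
def watermark_lag_ms(now_ms: int, watermark_ms: int) -> int:
    """Gap between wall-clock time and the current watermark, clamped at 0."""
    return max(0, now_ms - watermark_ms)


def should_alert(now_ms: int, watermark_ms: int,
                 threshold_ms: int = 60_000) -> bool:
    """Flag when the watermark trails wall-clock time by more than a minute."""
    return watermark_lag_ms(now_ms, watermark_ms) > threshold_ms
```

Exporting this as an OpenTelemetry gauge would unify it with the batch pipeline metrics Floe already emits.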