# ADR-0010: Multi-Compute Pipeline Architecture
## Status

Accepted (Revised)
## Context

Floe users need to run data transformations on various compute targets:
- DuckDB - Embedded analytics, cost-effective, production-ready
- Snowflake - Enterprise data warehouse
- BigQuery - Google Cloud analytics
- Redshift - AWS data warehouse
- Databricks - Unified analytics platform
- Spark - Distributed big data processing
- PostgreSQL - Operational analytics
A single pipeline may need multiple compute engines for different steps:
- Step 1: Spark cluster (process 10TB raw data)
- Step 2: DuckDB pod (build analytical metrics on 100GB result)
This is NOT about dev vs prod (environment drift). This is about heterogeneous compute within a single pipeline, where each step uses the SAME compute across all environments.
Options considered:
- Single compute per platform - Platform selects one, all pipelines inherit
- Multi-compute with platform approval - Platform approves N, data engineers select per-step (CHOSEN)
- Unrestricted selection - Data engineers can use any compute
## Decision

Multi-compute with hierarchical governance. Platform teams approve N compute targets; data engineers select per-transform from the approved list.
### Key Principles

- Platform approves, data engineers select - Governance without bottlenecks
- Per-transform selection - Different steps can use different compute engines
- Environment parity preserved - Each step uses the SAME compute across dev/staging/prod
- Hierarchical inheritance - Enterprise → Domain → Product (Data Mesh support)
### Environment Parity (No Drift)

```text
Step 1: dev=Spark,  staging=Spark,  prod=Spark   ✓ No drift
Step 2: dev=DuckDB, staging=DuckDB, prod=DuckDB  ✓ No drift
```

What you test is what you deploy. Each transform uses the same compute across all environments.
## Consequences

### Positive

- Flexible pipelines - Use the right compute for each step
- Cost optimization - Heavy processing on Spark, analytics on DuckDB
- Platform governance - Only approved computes can be used
- No environment drift - Each step consistent across environments
- Data Mesh-compatible governance primitives - Hierarchical governance (Enterprise → Domain → Product); validated Data Mesh operations remain planned. See Capability Status.
### Negative

- More configuration - Data engineers must specify compute (or use default)
- Multiple dbt profiles - Each compute needs profile configuration
- Complexity - Platform teams manage N compute configurations
### Neutral

- Connection credentials managed via secrets per compute
- dbt profiles.yml generated with multiple targets (see the sketch after this list)
- Compile-time validation ensures compute is in approved list
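For illustration, a generated profiles.yml carries one dbt output per approved compute (e.g., DuckDB and Spark). The shape below is a sketch expressed as the equivalent Python dict; the `floe` profile name and exact rendering are assumptions for illustration, while the `target`/`outputs` keys follow dbt's standard profile schema:

```python
# Hypothetical shape of a generated profiles.yml, expressed as the equivalent
# Python dict. The profile name "floe" is an illustrative assumption.
profiles = {
    "floe": {
        "target": "duckdb",  # the platform-selected fallback
        "outputs": {
            "duckdb": {"type": "duckdb", "path": "/data/warehouse.duckdb", "threads": 8},
            "spark": {
                "type": "spark",
                "method": "thrift",
                "host": "spark-thrift.floe-platform.svc.cluster.local",
                "port": 10001,
            },
        },
    }
}
```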
## Hierarchical Governance

### Central Platform (Non-Data Mesh)

```yaml
# manifest.yaml (Platform Team)
plugins:
  compute:
    approved:
      - name: duckdb
        config:
          threads: 8
          extensions: [httpfs, parquet]
      - name: spark
        config:
          cluster: spark-thrift.floe-platform.svc.cluster.local
      - name: snowflake
        config:
          warehouse: COMPUTE_WH
    default: duckdb  # Example platform-selected fallback when a transform doesn't specify
```
```yaml
# floe.yaml (Data Engineers)
platform:
  ref: oci://registry.example.com/floe-platform:v1.2.3

transforms:
  - type: dbt
    path: models/heavy_processing/
    compute: spark    # Select from approved list

  - type: dbt
    path: models/analytics/
    compute: duckdb   # Different compute for this step

  - type: dbt
    path: models/simple/
    # No compute specified -> uses the platform-selected fallback from this manifest
```

### Data Mesh (Enterprise → Domain → Product)
```yaml
# enterprise-manifest.yaml (Enterprise Team)
scope: enterprise
plugins:
  compute:
    approved: [duckdb, spark, snowflake, bigquery]
    default: duckdb
---
# sales-domain-manifest.yaml (Domain Team)
scope: domain
parent_manifest: oci://registry/enterprise-manifest:v1.0.0
plugins:
  compute:
    approved: [duckdb, spark]  # Subset of enterprise
    default: duckdb
---
# floe.yaml (Data Product)
platform:
  ref: oci://registry/sales-domain-manifest:v2.0.0

transforms:
  - type: dbt
    path: models/ingest/
    compute: spark   # Must be in domain's approved list

  - type: dbt
    path: models/mart/
    compute: duckdb
```

Validation Rules:
- Domain `approved` MUST be a subset of Enterprise `approved`
- Transform `compute` MUST be in the effective `approved` list
- Compile-time error if validation fails
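These checks are plain set operations. A minimal sketch of what the compile-time validation could look like (function name and signature are illustrative, not floe-core's actual API):

```python
def validate_compute_selection(
    parent_approved: set[str],
    approved: set[str],
    transforms: list[dict],
    default: str,
) -> None:
    """Illustrative compile-time enforcement of the three rules above."""
    # Rule 1: a child scope may only approve a subset of its parent's list.
    extra = approved - parent_approved
    if extra:
        raise ValueError(f"computes approved outside parent scope: {sorted(extra)}")

    # Rules 2-3: each transform's compute (or the fallback) must be approved,
    # and any violation is surfaced as a compile-time error.
    for transform in transforms:
        compute = transform.get("compute") or default
        if compute not in approved:
            raise ValueError(f"{transform['path']}: '{compute}' is not an approved compute")
```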
## CompiledArtifacts Pattern

```python
from typing import Any

from pydantic import BaseModel


class ComputeConfig(BaseModel):
    """Configuration for a single compute target."""

    name: str                          # duckdb, spark, snowflake, etc.
    connection_secret_ref: str | None  # K8s secret reference
    properties: dict[str, Any]         # Target-specific config


class ComputeRegistry(BaseModel):
    """All approved compute configurations."""

    configs: dict[str, ComputeConfig]  # name -> config
    default: str                       # Fallback when not specified


class TransformConfig(BaseModel):
    """Configuration for a single transform."""

    type: str            # dbt, python, etc.
    path: str
    compute: str | None  # Selected compute (None -> default)
    # ... other fields


class CompiledArtifacts(BaseModel):
    """Contract between floe-core and consumers."""

    compute_registry: ComputeRegistry  # All approved computes
    transforms: list[TransformConfig]  # Each may specify compute
```
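At deploy time, consumers resolve each transform's effective compute against the registry. A minimal sketch of that lookup (the helper is illustrative; floe-core's real API may differ):

```python
def resolve_compute(artifacts: CompiledArtifacts, transform: TransformConfig) -> ComputeConfig:
    """Illustrative: map a transform to its effective compute configuration."""
    registry = artifacts.compute_registry
    name = transform.compute or registry.default  # None -> platform fallback
    return registry.configs[name]  # membership guaranteed by compile-time validation
```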
## Supported Targets

| Target | dbt Adapter | Use Case |
|---|---|---|
| DuckDB | dbt-duckdb | Embedded analytics, cost-effective |
| Snowflake | dbt-snowflake | Enterprise data warehouse |
| BigQuery | dbt-bigquery | Google Cloud analytics |
| Redshift | dbt-redshift | AWS data warehouse |
| Databricks | dbt-databricks | Unified analytics |
| Spark | dbt-spark | Distributed big data |
| PostgreSQL | dbt-postgres | Operational analytics |
## What Floe Does NOT Do

- ❌ Allow per-environment compute selection (prevents drift)
- ❌ Prescribe which targets are “appropriate” for specific use cases
- ❌ Optimize queries for specific targets (dbt handles this)
- ❌ Allow unapproved compute targets
## What Floe DOES Do

- ✅ Platform teams approve N compute targets
- ✅ Data engineers select compute per transform
- ✅ Validate compute is in approved list (compile-time)
- ✅ Generate dbt profiles for all approved computes
- ✅ Securely inject connection credentials (see the sketch after this list)
- ✅ Enforce environment parity (same compute per step across envs)
- ✅ Support hierarchical governance (Data Mesh)
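Credential injection builds on the `env_var()` convention shown in the plugin implementations below: generated profiles reference environment variables, and the job pod maps them from the per-compute secret. As an illustrative pre-flight check (an assumption, not floe-core's confirmed behavior), a consumer could verify those variables are present before invoking dbt:

```python
import os
import re


def missing_env_vars(profile_output: dict) -> list[str]:
    """Illustrative pre-flight check: list env_var() references absent from the pod env."""
    referenced = re.findall(r"env_var\('([^']+)'\)", str(profile_output))
    return [var for var in referenced if var not in os.environ]
```

For the Snowflake profile shown later, this would flag `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, and `SNOWFLAKE_PASSWORD` if the corresponding secret was not mounted.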
## Example: Multi-Compute Pipeline

```yaml
# manifest.yaml (Platform Team)
plugins:
  compute:
    approved:
      - name: spark
        config:
          cluster: spark-thrift.floe-platform.svc.cluster.local
          port: 10001
      - name: duckdb
        config:
          threads: 8
          memory_limit: 4GB
    default: duckdb
```
```yaml
# floe.yaml (Data Engineer)
name: sales-analytics
version: "1.0.0"

transforms:
  # Step 1: Heavy processing on Spark cluster
  - type: dbt
    path: models/staging/
    compute: spark
    tags: [heavy]

  # Step 2: Analytical metrics on DuckDB
  - type: dbt
    path: models/marts/
    compute: duckdb
    tags: [analytics]

  # Step 3: Simple transforms can use the manifest fallback
  - type: dbt
    path: models/seeds/
    # compute omitted: uses the platform-selected fallback from the manifest
```

This pipeline runs identically in dev, staging, and production:

- `models/staging/` always runs on Spark
- `models/marts/` always runs on DuckDB
- No environment drift
## ComputePlugin Interface

All compute plugins implement the `ComputePlugin` abstract base class:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class ComputeConfig:
    """Configuration passed to compute plugins."""

    name: str                          # duckdb, snowflake, etc.
    connection_secret: str | None      # K8s secret reference
    properties: dict[str, Any] | None  # Target-specific properties


@dataclass
class ConnectionResult:
    """Result of connection validation."""

    success: bool
    message: str
    latency_ms: float | None


@dataclass
class ResourceSpec:
    """K8s resource requirements for dbt job pods."""

    cpu_request: str     # e.g., "500m"
    cpu_limit: str       # e.g., "2000m"
    memory_request: str  # e.g., "512Mi"
    memory_limit: str    # e.g., "2Gi"


class ComputePlugin(ABC):
    """Interface for compute engines where dbt transforms execute."""

    name: str             # e.g., "duckdb", "spark", "snowflake"
    version: str
    is_self_hosted: bool  # True for DuckDB/Spark, False for Snowflake/BigQuery

    @abstractmethod
    def generate_dbt_profile(self, config: ComputeConfig) -> dict:
        """Generate dbt profiles.yml configuration for this compute target.

        Returns a dict that becomes the target entry in profiles.yml.
        Uses dbt's env_var() for secrets - NEVER embed credentials.
        """

    @abstractmethod
    def get_required_dbt_packages(self) -> list[str]:
        """Return the list of dbt packages/adapters required.

        e.g., ["dbt-duckdb>=1.7.0"] or ["dbt-spark>=1.7.0"]
        """

    @abstractmethod
    def validate_connection(self, config: ComputeConfig) -> ConnectionResult:
        """Test the connection to the compute engine.

        For cloud: verify credentials and permissions.
        For self-hosted: verify the cluster is accessible.
        """

    @abstractmethod
    def get_resource_requirements(self, workload_size: str) -> ResourceSpec:
        """Return K8s resource requirements for dbt job pods.

        workload_size: "small" | "medium" | "large"
        Returns CPU/memory requests/limits for the job pod.
        """
```
## Plugin Implementations

### DuckDB (Self-Hosted, Embedded)

```python
class DuckDBComputePlugin(ComputePlugin):
    name = "duckdb"
    version = "1.0.0"
    is_self_hosted = True  # Runs embedded in dbt pod

    def generate_dbt_profile(self, config: ComputeConfig) -> dict:
        return {
            "type": "duckdb",
            "path": config.properties.get("path", "/data/warehouse.duckdb"),
            "threads": config.properties.get("threads", 4),
            "extensions": config.properties.get("extensions", ["httpfs", "parquet"]),
        }

    def get_required_dbt_packages(self) -> list[str]:
        return ["dbt-duckdb>=1.7.0"]

    def get_resource_requirements(self, workload_size: str) -> ResourceSpec:
        specs = {
            "small": ResourceSpec("500m", "1000m", "1Gi", "2Gi"),
            "medium": ResourceSpec("1000m", "2000m", "2Gi", "4Gi"),
            "large": ResourceSpec("2000m", "4000m", "4Gi", "8Gi"),
        }
        return specs.get(workload_size, specs["medium"])
```
### Spark (Self-Hosted, Distributed)

```python
class SparkComputePlugin(ComputePlugin):
    name = "spark"
    version = "1.0.0"
    is_self_hosted = True  # Requires Spark cluster deployment

    def generate_dbt_profile(self, config: ComputeConfig) -> dict:
        return {
            "type": "spark",
            "method": "thrift",
            "host": config.properties.get("cluster", "spark-thrift.floe-platform.svc.cluster.local"),
            "port": config.properties.get("port", 10001),
            "schema": config.properties.get("schema", "default"),
        }

    def get_required_dbt_packages(self) -> list[str]:
        return ["dbt-spark>=1.7.0"]

    def get_resource_requirements(self, workload_size: str) -> ResourceSpec:
        # Spark runs on the cluster; the dbt pod only needs moderate resources
        return ResourceSpec("500m", "1000m", "512Mi", "1Gi")
```
### Snowflake (Cloud, Managed)

```python
class SnowflakeComputePlugin(ComputePlugin):
    name = "snowflake"
    version = "1.0.0"
    is_self_hosted = False  # Connects to cloud service

    def generate_dbt_profile(self, config: ComputeConfig) -> dict:
        return {
            "type": "snowflake",
            "account": "{{ env_var('SNOWFLAKE_ACCOUNT') }}",
            "user": "{{ env_var('SNOWFLAKE_USER') }}",
            "password": "{{ env_var('SNOWFLAKE_PASSWORD') }}",
            "warehouse": config.properties.get("warehouse", "COMPUTE_WH"),
            "database": config.properties.get("database"),
            "schema": config.properties.get("schema", "PUBLIC"),
            "role": config.properties.get("role"),
        }

    def get_required_dbt_packages(self) -> list[str]:
        return ["dbt-snowflake>=1.7.0"]

    def get_resource_requirements(self, workload_size: str) -> ResourceSpec:
        # Snowflake runs remotely; the pod only needs minimal resources
        return ResourceSpec("250m", "500m", "256Mi", "512Mi")
```
## References

- ADR-0016: Platform Enforcement Architecture
- ADR-0008: Repository Split - Plugin architecture
- ADR-0038: Data Mesh Architecture - Hierarchical governance
- dbt adapters
- DuckDB production use cases