
Opinionation Boundaries

This document defines what is enforced vs pluggable in floe.

floe balances strong opinions with flexibility:

  • ENFORCED: Core platform identity, non-negotiable standards
  • PLUGGABLE: Platform Team selects once, Data Engineers inherit

These standards define floe and cannot be changed:

| Component | Standard | Rationale |
| --- | --- | --- |
| Table Format | Apache Iceberg | Open, multi-engine, ACID, time-travel |
| Telemetry | OpenTelemetry | Vendor-neutral industry standard |
| Data Lineage | OpenLineage | Industry standard for lineage |
| Deployment | Kubernetes-native | Portable, declarative infrastructure |
| Configuration | Declarative YAML | Explicit over implicit |
| Transformation | dbt-centric | "dbt owns SQL": proven, target-agnostic |

Apache Iceberg

  • Provides open table format foundation
  • Enables multi-engine access (Spark, Trino, DuckDB)
  • ACID transactions and time-travel
  • Swapping for Delta Lake would fragment the ecosystem

OpenTelemetry

  • Vendor-neutral observability
  • Single SDK for traces, metrics, logs
  • W3C standard propagation
  • Custom telemetry would create lock-in
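
For illustration, a minimal OpenTelemetry Collector configuration along these lines accepts OTLP traffic and forwards it to a backend; the Jaeger endpoint below is a placeholder, not a floe default:

```yaml
# Minimal Collector sketch: receive OTLP, export to a Jaeger instance.
# Endpoints are illustrative placeholders.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```

Swapping the exporter block is all it takes to redirect traces to a different OTLP-compatible backend.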

OpenLineage

  • Industry standard for data lineage
  • Automatic propagation through pipeline
  • Integrates with Dagster, dbt, Spark
  • Custom lineage would limit interoperability

Kubernetes-native

  • Portable across cloud providers
  • Declarative infrastructure
  • Standard for container orchestration
  • Supporting Docker Compose would create testing-parity issues
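
As a sketch of what Kubernetes-native deployment implies, a pipeline step runs as an ordinary declarative resource; the image and names below are hypothetical, not part of floe:

```yaml
# Hypothetical Job running one dbt transform step on Kubernetes.
apiVersion: batch/v1
kind: Job
metadata:
  name: customer-analytics-transform
spec:
  template:
    spec:
      containers:
        - name: dbt
          image: registry.acme.com/floe-dbt-runner:latest  # hypothetical image
          args: ["dbt", "run", "--project-dir", "/work/models"]
      restartPolicy: Never
```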

dbt-centric

  • Proven transformation layer
  • Handles SQL dialect translation
  • Large ecosystem of packages
  • Building custom SQL handling duplicates effort
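
Target-agnosticism in practice: the models stay the same and only the dbt profile changes per compute target. A sketch using the dbt-duckdb adapter (paths and names are illustrative):

```yaml
# profiles.yml sketch: swapping this output for a Spark or Snowflake
# profile leaves the SQL models untouched.
floe_profile:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: /tmp/analytics.duckdb
      threads: 4
```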

Platform Team selects these once in manifest.yaml:

| Component | Alpha-Supported Reference Path | Implemented Alternatives | Planned or Ecosystem Examples |
| --- | --- | --- | --- |
| Compute | DuckDB | None validated as an alpha product path | Spark, Snowflake, Databricks, BigQuery, Redshift |
| Orchestration | Dagster | None validated as an alpha product path | Airflow 3.x, Prefect, Argo Workflows |
| Catalog | Polaris | None validated as an alpha product path | AWS Glue, Hive Metastore, Nessie |
| Storage | S3-compatible object storage through the implemented storage plugin; demo uses MinIO | S3-compatible backends where configured and validated by the platform team | GCS, Azure Blob, provider-native object storage |
| Telemetry Backend | Jaeger and console telemetry plugins | OTLP-compatible backends through standard OpenTelemetry configuration | Datadog, Grafana Cloud, AWS X-Ray |
| Lineage Backend | Marquez | None validated as an alpha product path | Atlan, OpenMetadata, Egeria |
| dbt Runtime | dbt Core | dbt Fusion plugin exists as an implementation path requiring explicit validation | dbt Cloud |
| Semantic Layer | Cube reference implementation | None validated as an alpha product path | dbt Semantic Layer |
| Ingestion | dlt plugin primitive | None validated as a full product path | Airbyte-style integrations |
| Data Quality Framework | dbt expectations and Great Expectations plugin primitives | None validated as a full product path | Soda, custom |
| Secrets | Kubernetes Secrets and Infisical plugin primitives | None validated as a full product path | Vault, External Secrets Operator |

Compute

  • Organizations have existing investments
  • Different scale requirements (DuckDB vs Spark)
  • Cost considerations (self-hosted vs cloud)
  • All compute targets produce Iceberg tables (enforced)

Orchestration

  • Many organizations already use Airflow
  • Different feature requirements
  • Operational familiarity matters
  • All orchestrators emit OpenLineage (enforced)

Catalog

  • Cloud provider preferences (AWS → Glue)
  • Existing infrastructure investments
  • Different feature requirements
  • All catalogs support Iceberg (enforced)
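
Because every catalog must speak Iceberg, swapping catalogs is a configuration change, not a re-architecture. A PyIceberg-style sketch (URIs are placeholders):

```yaml
# .pyiceberg.yaml sketch: a REST catalog such as Polaris...
catalog:
  default:
    type: rest
    uri: http://polaris:8181/api/catalog

# ...could instead point at AWS Glue by changing the catalog type:
# catalog:
#   default:
#     type: glue
```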

Ingestion

  • Different connector requirements
  • Existing Airbyte deployments
  • Scale and complexity tradeoffs
  • All ingestion writes to Iceberg (enforced)
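
A hypothetical ingestion block, purely for illustration (these keys are invented here and may not match floe's actual ingestion schema), showing a dlt source landing in Iceberg:

```yaml
# Hypothetical sketch only: field names are not a documented floe schema.
ingestion:
  - type: dlt
    source: postgres_orders
    destination: iceberg  # ENFORCED: all ingestion lands in Iceberg tables
```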

Storage

  • Cloud provider preferences (AWS S3 vs GCP GCS vs Azure Blob)
  • Data sovereignty requirements (on-prem MinIO, NetApp)
  • Multi-cloud strategies (S3 + GCS for disaster recovery)
  • Cost optimization (MinIO vs cloud object storage)
  • All storage via PyIceberg FileIO (enforced)
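
Since all storage flows through PyIceberg FileIO, pointing at a different S3-compatible backend is a matter of FileIO properties. A sketch with MinIO-style placeholder values:

```yaml
# .pyiceberg.yaml sketch: S3-compatible storage properties (placeholders).
catalog:
  default:
    type: rest
    uri: http://polaris:8181/api/catalog
    s3.endpoint: http://minio:9000
    s3.access-key-id: minioadmin
    s3.secret-access-key: minioadmin
    s3.region: us-east-1
```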

Telemetry Backend

  • Existing telemetry investments (Datadog APM, Grafana Cloud)
  • Cost considerations (self-hosted Jaeger vs SaaS backends)
  • Feature requirements (APM, distributed tracing, alerting, metrics visualization)
  • Compliance needs (data residency for telemetry data)
  • All telemetry via OpenTelemetry + OTLP Collector (enforced)
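
Retargeting telemetry is standard OpenTelemetry configuration rather than a code change; for example, the well-known SDK environment variables could be set on a pipeline container (the endpoint is a placeholder):

```yaml
# Container env sketch: standard OTel SDK environment variables.
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://otel-collector:4317
  - name: OTEL_SERVICE_NAME
    value: customer-analytics
```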

Lineage Backend

  • Existing lineage investments (Atlan, OpenMetadata)
  • Cost considerations (self-hosted Marquez vs SaaS data catalogs)
  • Feature requirements (impact analysis, column-level lineage, data governance)
  • Integration with existing data catalogs (Atlan, Collibra)
  • All lineage via OpenLineage HTTP transport (enforced)
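
Because lineage is emitted over the OpenLineage HTTP transport, retargeting the backend is a client-config change. A sketch of an OpenLineage client config pointing at Marquez (the URL is a placeholder):

```yaml
# openlineage.yml sketch: HTTP transport to a Marquez instance.
transport:
  type: http
  url: http://marquez:5000
```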

Data Quality Framework

  • Different quality check requirements (statistical vs rule-based)
  • Existing Great Expectations or Soda investments
  • Feature requirements (expectation suites vs YAML checks)
  • Integration preferences (Python API vs CLI)
  • All quality plugins via DataQualityPlugin interface (enforced)
  • dbt tests remain enforced (wrapped by DBTExpectationsPlugin for unified scoring)
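
The enforced dbt-test layer looks like ordinary dbt schema tests; the model and column names below are illustrative:

```yaml
# models/schema.yml sketch: these tests stay enforced and are wrapped
# by DBTExpectationsPlugin for unified scoring.
version: 2
models:
  - name: customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null
```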

A component is enforced when it meets criteria like these:

| Criteria | Example |
| --- | --- |
| Core platform identity | Iceberg table format |
| Cross-cutting concern | OpenTelemetry observability |
| Industry standard | OpenLineage lineage |
| Deployment model | Kubernetes-native |
| Significant re-architecture to swap | dbt transformation |

A component is pluggable when it meets criteria like these:

| Criteria | Example |
| --- | --- |
| Multiple valid options exist | Compute: DuckDB vs Snowflake |
| Organization already has a choice | Orchestration: existing Airflow |
| Different scale requirements | Spark vs DuckDB |
| Cloud provider preference | AWS Glue vs Polaris |
| Cost considerations | Managed vs self-hosted |

```yaml
# manifest.yaml (Platform Team)
apiVersion: floe.dev/v1
kind: Manifest
metadata:
  name: acme-platform
  version: "1.0.0"
  scope: enterprise
plugins:
  # PLUGGABLE: Platform Team selects from alpha-supported and validated options
  compute: duckdb            # Alpha-supported reference path
  orchestrator: dagster      # Alpha-supported reference path
  catalog: polaris           # Alpha-supported reference path
  storage: s3                # S3-compatible storage plugin; demo uses MinIO
  telemetry_backend: jaeger  # Alpha-supported telemetry backend
  lineage_backend: marquez   # Alpha-supported lineage backend
  semantic_layer: cube       # Reference implementation
  ingestion: dlt             # Plugin primitive

# ENFORCED: Cannot change
# - Iceberg (all tables are Iceberg)
# - OpenTelemetry (all telemetry via OTel)
# - OpenLineage (all lineage via OpenLineage)
# - dbt (all transforms via dbt)
# - K8s (all deployment via K8s)
```
```yaml
# floe.yaml (Data Team)
apiVersion: floe.dev/v1
kind: DataProduct
metadata:
  name: customer-analytics
  version: "1.0"
platform:
  ref: oci://registry.acme.com/floe-platform:v1.2.3
# Data Engineers inherit platform-approved choices and defaults.
# They may select compute per transform only from the approved list.
transforms:
  - type: dbt  # ENFORCED: must use dbt
    path: models/
    compute: duckdb
```

DO: Allow Approved Per-Transform Compute Selection


Platform Engineers approve compute targets and choose defaults. Data Engineers may select compute per transform only from that approved list.

```yaml
plugins:
  compute:
    approved:
      - name: duckdb
      - name: spark
    default: duckdb
transforms:
  - type: dbt
    path: models/staging
    compute: spark
  - type: dbt
    path: models/marts
    compute: duckdb
```

By contrast, a transform that selects a compute target outside the approved list fails validation:

```yaml
transforms:
  - type: dbt
    path: models/marts
    compute: unapproved-snowflake-account  # rejected: not in the approved list
```

DON’T: Create Per-Environment Compute Drift

Pinning different compute targets per environment means development no longer exercises what production runs, breaking testing parity:

```yaml
environments:
  development:
    compute: duckdb
  production:
    compute: snowflake
```

Because core components are enforced:

| Guarantee | How |
| --- | --- |
| All tables are Iceberg | Enforced table format |
| All telemetry is OTel | Enforced observability |
| All lineage is OpenLineage | Enforced lineage |
| All transforms use dbt | Enforced transformation |
| All deployment is K8s | Enforced infrastructure |

This enables:

  • Multi-engine queries (any engine can read Iceberg)
  • Unified observability (single dashboard for all pipelines)
  • Complete lineage (end-to-end data flow visibility)
  • Consistent testing (K8s in CI matches production)