ADR-0021: Data Architecture Patterns

Accepted

Data platforms use different architectural patterns for organizing data:

Data Modeling Patterns:

  1. Medallion Architecture (Bronze/Silver/Gold) - Popularized by Databricks
  2. Kimball Dimensional (Facts/Dimensions) - Classic data warehousing
  3. Data Vault (Hubs/Links/Satellites) - Enterprise scalability
  4. Hybrid - Combination of patterns

Organizational Patterns:

  5. Data Mesh - Federated, domain-oriented data ownership

floe needs to support these patterns while enforcing consistency within each pattern through naming conventions and quality gates.

Key Distinction: Data Mesh is an organizational pattern that defines HOW teams own data, while medallion/kimball/data_vault are modeling patterns that define HOW data is structured. A Data Mesh organization can use any modeling pattern within each domain.

Support multiple data architecture patterns via manifest.yaml, with medallion as the default. Additionally, support Data Mesh as an organizational pattern that can be layered on top of any modeling pattern.

Data Modeling Patterns:

| Pattern | Default Naming | Layers/Components | Use Case |
| --- | --- | --- | --- |
| medallion | bronze_*, silver_*, gold_* | Bronze, Silver, Gold | General purpose, modern data lakehouse |
| kimball | dim_*, fact_*, stg_* | Staging, Facts, Dimensions | Business intelligence, reporting |
| data_vault | hub_*, link_*, sat_* | Hubs, Links, Satellites | Enterprise, audit-heavy |
| hybrid | Custom per layer | Configurable | Complex requirements |
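
The default prefixes in the table above can be read as a simple lookup from model name to layer. The following is a minimal sketch, not floe's actual validator: `PATTERN_PREFIXES` and `infer_layer` are hypothetical names, and the prefix-to-layer mapping is copied from the table.

```python
from typing import Optional

# Hypothetical lookup built from the default-naming table; not floe's real API.
PATTERN_PREFIXES = {
    "medallion": {"bronze_": "bronze", "silver_": "silver", "gold_": "gold"},
    "kimball": {"stg_": "staging", "dim_": "dimensions", "fact_": "facts"},
    "data_vault": {"hub_": "hubs", "link_": "links", "sat_": "satellites"},
}


def infer_layer(model_name: str, pattern: str) -> Optional[str]:
    """Return the layer implied by the model's prefix, or None if no prefix matches."""
    for prefix, layer in PATTERN_PREFIXES[pattern].items():
        if model_name.startswith(prefix):
            return layer
    return None


print(infer_layer("silver_customers", "medallion"))  # silver
print(infer_layer("stg_orders", "medallion"))        # None (a naming violation)
print(infer_layer("stg_orders", "kimball"))          # staging
```

The same model name can be valid under one pattern and a violation under another, which is why the pattern is chosen once at platform level.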

Organizational Patterns:

| Pattern | Configuration | Key Resources | Use Case |
| --- | --- | --- | --- |
| centralized | Manifest (scope: enterprise) + DataProduct | DataProduct | Traditional centralized data team |
| data_mesh | Manifest chain (enterprise → domain) + DataProduct | DataProduct, DataContract | Federated domain ownership |
┌─────────────────────────────────────────────────────────────────────────┐
│ GOLD LAYER (Business-Ready) │
│ Prefix: gold_* │
│ Purpose: Aggregated, business-specific views │
│ Quality: Maximum (100% coverage, documentation required) │
│ │
│ Examples: gold_revenue_by_region, gold_customer_lifetime_value │
└─────────────────────────────────────────────────────────────────────────┘
│ Aggregates, joins
┌─────────────────────────────────────────────────────────────────────────┐
│ SILVER LAYER (Cleaned, Conformed) │
│ Prefix: silver_* │
│ Purpose: Cleaned, deduplicated, standardized data │
│ Quality: High (80% coverage, freshness checks) │
│ │
│ Examples: silver_customers, silver_orders, silver_products │
└─────────────────────────────────────────────────────────────────────────┘
│ Cleans, deduplicates
┌─────────────────────────────────────────────────────────────────────────┐
│ BRONZE LAYER (Raw) │
│ Prefix: bronze_* │
│ Purpose: Raw ingested data, minimal transformation │
│ Quality: Basic (primary key not null) │
│ │
│ Examples: bronze_salesforce_accounts, bronze_stripe_payments │
└─────────────────────────────────────────────────────────────────────────┘
Positive:

  • Consistency - All models follow the same pattern
  • Discoverability - Naming conventions make models easy to find
  • Quality alignment - Quality gates match layer expectations
  • Flexibility - Organizations can choose their preferred pattern

Negative:

  • Migration effort - Existing projects may need renaming
  • Learning curve - Teams must understand the chosen pattern
  • Enforcement overhead - Compile-time validation adds steps

Neutral:

  • Pattern choice is made once at platform level
  • Data engineers work within the chosen pattern
  • Pattern can be changed (with migration)

manifest.yaml
data_architecture:
  pattern: medallion # medallion | kimball | data_vault | hybrid
  # Medallion configuration
  layers:
    bronze:
      prefix: "bronze_"
      description: "Raw ingested data, minimal transformation"
      quality_gates:
        required_tests: [not_null_pk]
        minimum_coverage: 50
      catalog_namespace: bronze
    silver:
      prefix: "silver_"
      description: "Cleaned, conformed, deduplicated data"
      quality_gates:
        required_tests: [not_null_pk, unique_pk, freshness]
        minimum_coverage: 80
      catalog_namespace: silver
    gold:
      prefix: "gold_"
      description: "Business-ready aggregates and metrics"
      quality_gates:
        required_tests: [not_null_pk, unique_pk, freshness, documentation]
        minimum_coverage: 100
      catalog_namespace: gold
  naming:
    enforcement: strict # off | warn | strict
    case: snake_case # snake_case | camelCase | PascalCase

# Kimball configuration
data_architecture:
  pattern: kimball
  layers:
    staging:
      prefix: "stg_"
      description: "Staging area for raw data"
      quality_gates:
        required_tests: [not_null_pk]
        minimum_coverage: 50
    dimensions:
      prefix: "dim_"
      description: "Dimension tables (slowly changing)"
      quality_gates:
        required_tests: [not_null_pk, unique_pk, scd_validation]
        minimum_coverage: 100
      scd:
        types_allowed: [1, 2] # Type 1: Overwrite, Type 2: Add Row
        default_type: 2
    facts:
      prefix: "fact_"
      description: "Fact tables (transactional)"
      quality_gates:
        required_tests: [not_null_pk, referential_integrity]
        minimum_coverage: 100

Dimension tables require SCD handling for attributes that change over time:

| SCD Type | Behavior | Use Case |
| --- | --- | --- |
| Type 1 | Overwrite current value, no history | Current value only needed |
| Type 2 | Add new row with effective dates | Full audit trail required |

Type 2 Schema Requirements:

| Column | Type | Description |
| --- | --- | --- |
| surrogate_key | STRING | Unique per version (facts reference this) |
| {business_key} | * | Natural business identifier |
| effective_from | TIMESTAMP | When this version became active |
| effective_to | TIMESTAMP | When expired (9999-12-31 for current) |
| is_current | BOOLEAN | Flag for current record |
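
To make the Type 2 columns concrete, here is a minimal in-memory sketch of how a change to a tracked attribute expires the current row and appends a new version. It is illustrative only: `CustomerVersion`, `apply_type2_change`, and the `{customer_id}-v{n}` surrogate key format are assumptions, not part of floe or dbt.

```python
from dataclasses import dataclass, replace
from datetime import datetime
from typing import List

OPEN_END = datetime(9999, 12, 31)  # sentinel effective_to for the current row


@dataclass(frozen=True)
class CustomerVersion:
    surrogate_key: str       # unique per version
    customer_id: str         # business key
    email: str               # tracked attribute
    effective_from: datetime
    effective_to: datetime
    is_current: bool


def apply_type2_change(history: List[CustomerVersion], customer_id: str,
                       new_email: str, changed_at: datetime) -> List[CustomerVersion]:
    """Expire the current version of customer_id and append a new current version."""
    updated = []
    for row in history:
        if row.customer_id == customer_id and row.is_current:
            row = replace(row, effective_to=changed_at, is_current=False)
        updated.append(row)
    # Hypothetical surrogate key scheme: business key plus a version counter.
    version = sum(r.customer_id == customer_id for r in history) + 1
    updated.append(CustomerVersion(
        surrogate_key=f"{customer_id}-v{version}",
        customer_id=customer_id,
        email=new_email,
        effective_from=changed_at,
        effective_to=OPEN_END,
        is_current=True,
    ))
    return updated


history = [CustomerVersion("c1-v1", "c1", "a@example.com",
                           datetime(2024, 1, 1), OPEN_END, True)]
history = apply_type2_change(history, "c1", "b@example.com", datetime(2024, 6, 1))
for row in history:
    print(row.surrogate_key, row.email, row.is_current)
```

The expired row's `effective_to` equals the new row's `effective_from`, so the two validity intervals neither overlap nor leave a gap.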

dbt Snapshot Integration:

-- snapshots/snap_customer.sql
{% snapshot snap_customer %}
{{
    config(
        unique_key='customer_id',
        strategy='check',
        check_cols=['email', 'address', 'status']
    )
}}
SELECT * FROM {{ source('raw', 'customers') }}
{% endsnapshot %}

Fact Table Joins:

-- Fact tables join to dimension at point-in-time
SELECT f.*, d.*
FROM fact_orders f
JOIN dim_customer d
  ON f.customer_id = d.customer_id
 AND f.order_date >= d.effective_from
 AND f.order_date < d.effective_to
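
The join condition above uses a half-open interval: `effective_from` is inclusive and `effective_to` is exclusive, so an event that falls exactly on a version boundary matches only the newer version. A small sketch of the same lookup in Python, using made-up sample rows and a hypothetical `version_at` helper:

```python
from datetime import datetime
from typing import Optional

OPEN_END = datetime(9999, 12, 31)

# Made-up sample rows shaped like the Type 2 schema.
dim_customer = [
    {"surrogate_key": "c1-v1", "customer_id": "c1", "status": "trial",
     "effective_from": datetime(2024, 1, 1), "effective_to": datetime(2024, 6, 1)},
    {"surrogate_key": "c1-v2", "customer_id": "c1", "status": "active",
     "effective_from": datetime(2024, 6, 1), "effective_to": OPEN_END},
]


def version_at(customer_id: str, as_of: datetime) -> Optional[dict]:
    """Return the dimension row that was valid for customer_id at as_of."""
    for row in dim_customer:
        if (row["customer_id"] == customer_id
                and row["effective_from"] <= as_of < row["effective_to"]):
            return row
    return None


print(version_at("c1", datetime(2024, 3, 15))["status"])  # trial
print(version_at("c1", datetime(2024, 6, 1))["status"])   # active (boundary goes to the new row)
```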

See SCD Patterns Guide for complete implementation examples.

data_architecture:
  pattern: data_vault
  layers:
    hubs:
      prefix: "hub_"
      description: "Business keys"
      quality_gates:
        required_tests: [unique_business_key, hash_key]
        minimum_coverage: 100
    links:
      prefix: "link_"
      description: "Relationships between hubs"
      quality_gates:
        required_tests: [referential_integrity]
        minimum_coverage: 100
    satellites:
      prefix: "sat_"
      description: "Descriptive attributes"
      quality_gates:
        required_tests: [load_date, record_source]
        minimum_coverage: 100
$ floe compile # planned root data-team command; not alpha-supported yet
[1/4] Loading platform artifacts
  Pattern: medallion
  Enforcement: strict
[2/4] Validating naming conventions
  bronze_salesforce_accounts: valid (bronze layer)
  silver_customers: valid (silver layer)
  ERROR: 'stg_orders' violates naming convention
    Expected: bronze_*, silver_*, or gold_* prefix
    Model appears to be staging data; use bronze_orders
[3/4] Validating quality gates
  bronze layer: all models have not_null_pk test
  ERROR: silver_customers missing freshness test
    Required for silver layer: [not_null_pk, unique_pk, freshness]
[4/4] Compilation FAILED
Fix naming and quality violations, then re-run the planned root `floe compile` flow
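
The quality-gate step in this output amounts to comparing the tests declared on a model with the tests required for its layer. A minimal sketch of that comparison, assuming the medallion `required_tests` lists from manifest.yaml; `REQUIRED_TESTS` and `missing_tests` are hypothetical names, not floe's implementation:

```python
from typing import Iterable, List

# Required tests per layer, copied from the medallion manifest example.
REQUIRED_TESTS = {
    "bronze": ["not_null_pk"],
    "silver": ["not_null_pk", "unique_pk", "freshness"],
    "gold": ["not_null_pk", "unique_pk", "freshness", "documentation"],
}


def missing_tests(layer: str, declared_tests: Iterable[str]) -> List[str]:
    """Return the required tests for the layer that the model does not declare."""
    declared = set(declared_tests)
    return [test for test in REQUIRED_TESTS[layer] if test not in declared]


print(missing_tests("silver", ["not_null_pk", "unique_pk"]))  # ['freshness']
print(missing_tests("bronze", ["not_null_pk"]))               # []
```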

When changing patterns, a migration is required:

# 1. Update manifest.yaml
data_architecture:
  pattern: kimball # Changed from medallion
# 2. Run migration analysis
floe migrate analyze
# Output:
# bronze_customers → stg_customers
# silver_customers → dim_customer
# gold_revenue → fact_revenue
# 35 models require renaming
# 3. Generate migration script
floe migrate generate --output=migrate.sql
# 4. Review and apply
floe migrate apply --dry-run
floe migrate apply

Naming conventions are validated against dbt models:

models/bronze/bronze_salesforce_accounts.yml
version: 2
models:
  - name: bronze_salesforce_accounts
    description: "Raw Salesforce account data"
    meta:
      floe:
        layer: bronze # Optional: explicit layer declaration
    columns:
      - name: id
        tests:
          - not_null # Required for bronze layer

Each layer maps to a catalog namespace:

| Layer | Namespace | Access |
| --- | --- | --- |
| bronze | bronze | Data Engineers: READ_WRITE |
| silver | silver | Data Engineers: READ_WRITE, Analysts: READ |
| gold | gold | Analysts: READ, BI Tools: READ |
manifest.yaml
catalog_structure:
  namespaces:
    - name: bronze
      grants:
        - role: data_engineers
          access: READ_WRITE
    - name: silver
      grants:
        - role: data_engineers
          access: READ_WRITE
        - role: analysts
          access: READ
    - name: gold
      grants:
        - role: analysts
          access: READ
        - role: bi_tools
          access: READ
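
The grants above can be evaluated with a small lookup. This sketch makes two assumptions that the manifest does not state: access is denied when no grant exists, and `READ_WRITE` implies read access. `GRANTS` and `can` are hypothetical names.

```python
# Grants copied from the catalog_structure example, keyed by namespace then role.
GRANTS = {
    "bronze": {"data_engineers": "READ_WRITE"},
    "silver": {"data_engineers": "READ_WRITE", "analysts": "READ"},
    "gold": {"analysts": "READ", "bi_tools": "READ"},
}


def can(role: str, namespace: str, action: str) -> bool:
    """Check whether role may perform action ('read' or 'write') in namespace."""
    if action not in ("read", "write"):
        raise ValueError(f"unknown action: {action}")
    access = GRANTS.get(namespace, {}).get(role)
    if access is None:
        return False  # assumption: deny by default
    return action == "read" or access == "READ_WRITE"


print(can("analysts", "silver", "read"))    # True
print(can("analysts", "silver", "write"))   # False
print(can("analysts", "bronze", "read"))    # False (no grant on bronze)
```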

Data Mesh is an organizational architecture for analytical data that applies product thinking and domain-driven design principles. It fundamentally changes HOW data is owned and operated.

| Principle | Description | floe Support |
| --- | --- | --- |
| Domain Ownership | Domains own their data end-to-end | Domain-level platform manifests |
| Data as a Product | Data has clear owners, SLAs, contracts | DataProduct resource type |
| Self-Serve Platform | Platform enables domain autonomy | floe IS the platform |
| Federated Governance | Governance is automated, federated | Three-tier config inheritance |

Data Mesh uses floe’s unified Manifest type with inheritance via parent: reference:

┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 1: ENTERPRISE MANIFEST │
│ Owner: Platform Team (Central) │
│ Config: kind: Manifest, scope: enterprise │
│ Published: oci://registry/enterprise-platform:v1.0 │
│ │
│ Defines: │
│ • Global governance policies (PII handling, retention) │
│ • Approved plugin list (which computes, catalogs are allowed) │
│ • Minimum quality standards (all domains must meet) │
│ • Cross-domain interoperability rules │
│ • Enterprise-wide data classification scheme │
└───────────────────────────────────┬─────────────────────────────────────────┘
│ inherits via parent:
┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 2: DOMAIN MANIFESTS │
│ Owner: Domain Platform Team (per domain) │
│ Config: kind: Manifest, scope: domain, parent: ref to enterprise │
│ Published: oci://registry/domains/{domain}-platform:v1.0 │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ SALES │ │ MARKETING │ │ FINANCE │ │
│ │ DOMAIN │ │ DOMAIN │ │ DOMAIN │ │
│ │ │ │ │ │ │ │
│ │ compute: spark │ │ compute: duckdb │ │ compute: snowfl │ │
│ │ pattern: medal │ │ pattern: medal │ │ pattern: kimball│ │
│ │ │ │ │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Each domain can: │
│ • Choose compute from approved list │
│ • Choose data modeling pattern │
│ • Add domain-specific policies │
│ • Own domain namespace in catalog │
└───────────────────────────────────┬─────────────────────────────────────────┘
│ inherits via domain:
┌─────────────────────────────────────────────────────────────────────────────┐
│ TIER 3: DATA PRODUCTS │
│ Owner: Domain Data Product Team │
│ Config: kind: DataProduct, domain: ref to domain manifest │
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ customer-360 │ │ campaign-perf │ │ revenue-metrics │ │
│ │ (Sales) │ │ (Marketing) │ │ (Finance) │ │
│ │ │ │ │ │ │ │
│ │ Output ports: │ │ Output ports: │ │ Output ports: │ │
│ │ • customers │ │ • campaigns │ │ • revenue │ │
│ │ • segments │ │ • attribution │ │ • forecasts │ │
│ │ │ │ │ │ │ │
│ │ Input ports: │ │ Input ports: │ │ Input ports: │ │
│ │ • crm_raw │ │ • ad_platforms │ │ • sales.orders │ │
│ │ • support_tix │ │ • sales.custs │ │ • billing │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

The enterprise manifest defines global governance that ALL domains must follow:

enterprise-manifest.yaml
apiVersion: floe.dev/v1
kind: Manifest
metadata:
  name: acme-enterprise
  version: "1.0.0"
  scope: enterprise
# Approved plugins - domains choose from this list
approved_plugins:
  compute:
    - duckdb # Allowed for all domains
    - spark # Allowed for all domains
    - snowflake # Requires finance approval
  orchestrator:
    - dagster # Standard
  catalog:
    - polaris # Enterprise standard
  ingestion:
    - dlt # Standard
    - airbyte # Allowed for high-volume
# Global governance - applies to ALL domains
governance:
  classification:
    levels: [public, internal, confidential, pii, phi]
  pii:
    requires_encryption: true
    retention_max_days: 90
    requires_masking_in_non_prod: true
  quality_gates:
    minimum_test_coverage: 70 # All domains must meet this
    required_tests: [not_null_pk, unique_pk]
  interoperability:
    # Cross-domain data sharing rules
    data_contracts:
      required: true
      schema_registry: true
      sla_enforcement: true
# Domain registration
domains:
  - name: sales
    namespace: sales
    owners: [sales-data-team@acme.com]
  - name: marketing
    namespace: marketing
    owners: [marketing-data-team@acme.com]
  - name: finance
    namespace: finance
    owners: [finance-data-team@acme.com]
    # Finance has stricter requirements
    extra_policies:
      audit_logging: required
      compute_approved: [snowflake] # Finance must use Snowflake

Each domain has its own manifest that inherits from enterprise:

domains/sales/domain-manifest.yaml
apiVersion: floe.dev/v1
kind: Manifest
metadata:
  name: sales-domain
  version: "2.1.0"
  scope: domain
# Inherit from enterprise
parent:
  ref: oci://registry.acme.com/enterprise-platform:v1.0.0
# Domain identity
domain:
  name: sales
  namespace: sales
  owners:
    - team: sales-data-team
      email: sales-data-team@acme.com
      slack: "#sales-data"
# Domain plugin choices (from approved list)
plugins:
  compute:
    type: spark # Sales chose Spark for their volume
  # Inherits: orchestrator, catalog, ingestion from enterprise
# Domain data architecture
data_architecture:
  pattern: medallion # Sales uses medallion
  layers:
    bronze:
      prefix: "sales_bronze_"
      catalog_namespace: sales.bronze
    silver:
      prefix: "sales_silver_"
      catalog_namespace: sales.silver
    gold:
      prefix: "sales_gold_"
      catalog_namespace: sales.gold
# Domain-specific policies (additive to enterprise)
governance:
  quality_gates:
    minimum_test_coverage: 80 # Stricter than enterprise minimum of 70
    gold_layer:
      minimum_coverage: 100
      required_documentation: true
# Domain services (optional overrides)
services:
  compute:
    spark:
      executors:
        instances: 10 # Sales needs more Spark capacity
        memory: 8Gi
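
Two of the inheritance rules shown here are easy to state as checks: the domain's compute must come from the enterprise approved list, and the domain's minimum test coverage may raise the enterprise floor but not lower it. The sketch below works on plain dicts shaped like the two manifests; `validate_domain` and the exact key paths are assumptions for illustration, not floe's compiler.

```python
from typing import List


def validate_domain(enterprise: dict, domain: dict) -> List[str]:
    """Return a list of violations of enterprise policy found in the domain manifest."""
    errors = []

    compute = domain["plugins"]["compute"]["type"]
    approved = enterprise["approved_plugins"]["compute"]
    if compute not in approved:
        errors.append(f"compute '{compute}' is not in the approved list {approved}")

    floor = enterprise["governance"]["quality_gates"]["minimum_test_coverage"]
    coverage = domain["governance"]["quality_gates"]["minimum_test_coverage"]
    if coverage < floor:
        errors.append(f"minimum_test_coverage {coverage} is below enterprise minimum {floor}")

    return errors


enterprise = {
    "approved_plugins": {"compute": ["duckdb", "spark", "snowflake"]},
    "governance": {"quality_gates": {"minimum_test_coverage": 70}},
}
sales = {
    "plugins": {"compute": {"type": "spark"}},
    "governance": {"quality_gates": {"minimum_test_coverage": 80}},
}

print(validate_domain(enterprise, sales))  # []
```

A domain that picked an unapproved engine or set coverage to 60 would get one error string per violation instead of an empty list.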

Data Products are the unit of deployment in Data Mesh:

domains/sales/products/customer-360/floe.yaml
apiVersion: floe.dev/v1
kind: DataProduct
metadata:
  name: customer-360
  version: "3.2.1"
  description: "Unified view of customer across all touchpoints"
# Inherit from domain
domain:
  ref: oci://registry.acme.com/domains/sales-platform:v2.1.0
# Product ownership
ownership:
  team: sales-analytics
  email: sales-analytics@acme.com
  slack: "#customer-360-support"
# Output ports - what this product exposes to consumers
output_ports:
  - name: customers
    description: "Unified customer dimension"
    table: sales.gold.customers
    schema:
      ref: schemas/customers.json
    sla:
      freshness: 6h # Updated within 6 hours
      availability: 99.9% # Uptime SLA
    quality:
      completeness: 99%
      accuracy: 98%
    access:
      default: deny
      grants:
        - domain: marketing
          access: read
        - domain: finance
          access: read
        - role: analysts
          access: read
  - name: customer_segments
    description: "Customer segmentation for targeting"
    table: sales.gold.customer_segments
    sla:
      freshness: 24h
      availability: 99%
# Input ports - dependencies on other products/sources
input_ports:
  - name: crm_data
    description: "Raw CRM data from Salesforce"
    source:
      type: ingestion
      config:
        type: dlt
        source: salesforce
    freshness_requirement: 1h
  - name: support_tickets
    description: "Support ticket data from Zendesk"
    source:
      type: ingestion
      config:
        type: dlt
        source: zendesk
  - name: marketing_interactions
    description: "Customer interactions from Marketing domain"
    source:
      type: data_product
      ref: marketing.campaign-attribution.customer_interactions
      # Creates a dependency contract
# Transforms (dbt models within this product)
transforms:
  - type: dbt
    path: models/
# Schedule
schedule:
  cron: "0 */6 * * *" # Every 6 hours to meet SLA

When data products depend on each other, contracts are established:

# Automatically generated when marketing.campaign-attribution
# is declared as an input port for sales.customer-360
apiVersion: floe.dev/v1
kind: DataContract
metadata:
  name: marketing-to-sales-customer-interactions
  version: "1.0.0"
provider:
  domain: marketing
  product: campaign-attribution
  port: customer_interactions
consumer:
  domain: sales
  product: customer-360
schema:
  columns:
    - name: customer_id
      type: string
      required: true
    - name: interaction_type
      type: string
      enum: [email_open, email_click, ad_impression, ad_click]
    - name: interaction_timestamp
      type: timestamp
      required: true
    - name: campaign_id
      type: string
sla:
  freshness: 2h
  availability: 99%
# Contract validation at compile time
validation:
  schema_compatibility: backward # Provider can add columns, not remove
  breaking_change_notification: 7d # 7 day notice for breaking changes
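
The `schema_compatibility: backward` rule ("provider can add columns, not remove") can be sketched as a comparison of two column lists. This is an illustrative check only: `breaking_changes` is a hypothetical helper, and treating a changed column type as breaking is an assumption that goes beyond what the contract states.

```python
from typing import Dict, List


def breaking_changes(old_columns: List[Dict], new_columns: List[Dict]) -> List[str]:
    """List changes in new_columns that would break consumers of old_columns."""
    new_by_name = {col["name"]: col for col in new_columns}
    problems = []
    for col in old_columns:
        current = new_by_name.get(col["name"])
        if current is None:
            problems.append(f"column removed: {col['name']}")
        elif current["type"] != col["type"]:
            # Assumption: a type change is treated as breaking.
            problems.append(
                f"type changed: {col['name']} ({col['type']} -> {current['type']})"
            )
    return problems


v1 = [
    {"name": "customer_id", "type": "string"},
    {"name": "campaign_id", "type": "string"},
]
# Adding a column is allowed under backward compatibility.
v2_added = v1 + [{"name": "channel", "type": "string"}]
# Removing a column is not.
v2_removed = [{"name": "customer_id", "type": "string"}]

print(breaking_changes(v1, v2_added))    # []
print(breaking_changes(v1, v2_removed))  # ['column removed: campaign_id']
```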

The catalog organizes data by domain:

catalog/
├── enterprise/                      # Shared enterprise data
│   └── reference_data/
│       ├── countries
│       ├── currencies
│       └── fiscal_calendar
├── sales/                           # Sales domain
│   ├── bronze/
│   │   ├── salesforce_accounts
│   │   └── salesforce_opportunities
│   ├── silver/
│   │   ├── accounts
│   │   └── opportunities
│   └── gold/
│       ├── customers                # Output port: customer-360.customers
│       └── customer_segments        # Output port: customer-360.customer_segments
├── marketing/                       # Marketing domain
│   ├── bronze/
│   ├── silver/
│   └── gold/
│       └── customer_interactions    # Output port consumed by sales
└── finance/                         # Finance domain
    ├── staging/                     # Kimball pattern
    ├── dimensions/
    └── facts/

Data products are intended to be discoverable via the catalog in the target-state Data Mesh model.

Alpha caveat: Data Mesh discovery operations and product registration are planned and not alpha-supported. See Capability Status for current release boundaries. The commands below are target-state examples, not executable v0.1.0-alpha.1 workflows.

# List all data products
floe products list
# Output:
DOMAIN     PRODUCT               VERSION  OWNER                OUTPUT PORTS
sales      customer-360          3.2.1    sales-analytics      customers, customer_segments
sales      opportunity-pipeline  2.0.0    sales-ops            pipeline_stages, forecasts
marketing  campaign-attribution  1.5.0    marketing-analytics  customer_interactions, attribution
finance    revenue-metrics       4.1.0    finance-analytics    revenue, forecasts
# View product details
floe products describe sales.customer-360
# Output:
Product: customer-360
Domain: sales
Version: 3.2.1
Owner: sales-analytics@acme.com
Output Ports:
  customers (sales.gold.customers)
    SLA: freshness=6h, availability=99.9%
    Consumers: marketing.campaign-attribution, finance.revenue-metrics
  customer_segments (sales.gold.customer_segments)
    SLA: freshness=24h, availability=99%
    Consumers: marketing.targeting-engine
Input Ports:
  crm_data (ingestion: salesforce)
  support_tickets (ingestion: zendesk)
  marketing_interactions (marketing.campaign-attribution.customer_interactions)

Target-State Federated Governance Enforcement

Alpha caveat: This is a target-state Data Mesh governance example. The current alpha supports the documented Customer 360 path and platform compilation artifacts; federated Data Mesh operations remain planned.

$ floe compile # planned root data-team command; not alpha-supported yet
[1/6] Loading enterprise platform
  Enterprise: acme-enterprise v1.0.0
  Global policies loaded
[2/6] Loading domain platform
  Domain: sales v2.1.0
  Domain policies loaded (merged with enterprise)
[3/6] Validating data product
  Product: customer-360 v3.2.1
  Ownership: sales-analytics@acme.com
[4/6] Validating data contracts
  Input: marketing.campaign-attribution.customer_interactions
    Schema compatibility: OK
    SLA: provider=2h, required=4h: OK
  ERROR: Input port 'finance.revenue.invoices' not found
    The finance domain does not expose this port
[5/6] Validating governance
  PII columns: encrypted ✓, masked in non-prod
  Quality gates: coverage=92% (min=80%) ✓
  Domain namespace: sales.*
[6/6] Compilation FAILED
Fix data contract violation and re-run the planned root `floe compile` flow
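
The SLA line in step 4 (`provider=2h, required=4h: OK`) compares two freshness durations: the provider's guarantee must be at least as tight as what the consumer requires. A minimal sketch of that comparison, assuming durations are written as an integer followed by `m`, `h`, or `d` as in the examples in this document; `parse_duration` and `freshness_ok` are hypothetical helpers:

```python
import re
from datetime import timedelta

_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '2h', '30m', or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def freshness_ok(provider: str, required: str) -> bool:
    """True when the provider's freshness guarantee meets the consumer's requirement."""
    return parse_duration(provider) <= parse_duration(required)


print(freshness_ok("2h", "4h"))   # True
print(freshness_ok("24h", "6h"))  # False
```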

Each domain can choose its own data modeling pattern:

| Domain | Modeling Pattern | Rationale |
| --- | --- | --- |
| Sales | Medallion | High volume, streaming updates |
| Marketing | Medallion | Event-driven, time-series heavy |
| Finance | Kimball | BI-centric, regulatory reporting |
| Risk | Data Vault | Audit requirements, historical tracking |

This is configured per domain:

# sales uses medallion
data_architecture:
  pattern: medallion

# finance uses kimball
data_architecture:
  pattern: kimball

In Data Mesh, domains can have their own infrastructure:

┌─────────────────────────────────────────────────────────────────────────────┐
│ SHARED PLATFORM SERVICES │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Polaris │ │ OTLP │ │ Prometheus │ │ Grafana │ │
│ │ (Catalog) │ │ Collector │ │ (Metrics) │ │ (Dashboards)│ │
│ └─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ Shared by all domains - managed by Platform Team │
└─────────────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────────────┐
│ DOMAIN-SPECIFIC SERVICES │
│ │
│ ┌───────────────────────┐ ┌───────────────────────┐ │
│ │ SALES DOMAIN │ │ FINANCE DOMAIN │ │
│ │ │ │ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │ Dagster (sales) │ │ │ │Dagster (finance)│ │ │
│ │ └─────────────────┘ │ │ └─────────────────┘ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │ Spark Cluster │ │ │ │ Snowflake Conn │ │ │
│ │ │ (self-hosted) │ │ │ │ (cloud managed) │ │ │
│ │ └─────────────────┘ │ │ └─────────────────┘ │ │
│ │ ┌─────────────────┐ │ │ ┌─────────────────┐ │ │
│ │ │ Cube (sales) │ │ │ │ Cube (finance) │ │ │
│ │ └─────────────────┘ │ │ └─────────────────┘ │ │
│ │ │ │ │ │
│ │ Namespace: sales │ │ Namespace: finance │ │
│ │ Compute: Spark │ │ Compute: Snowflake │ │
│ │ Owner: sales-team │ │ Owner: finance-team │ │
│ └───────────────────────┘ └───────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

The following commands describe the intended Data Mesh operator and product-team experience. They are target-state examples, not supported alpha commands. Current alpha-supported paths are documented in Capability Status.

# Platform commands (Platform Team)
floe platform compile # Compile any Manifest (enterprise or domain)
floe platform publish v1.0.0 # Publish to OCI registry
floe platform deploy # Deploy platform services
# Data Product commands (Product Team)
floe init --platform=v1.0.0 # Planned root command: initialize with centralized manifest
floe init --domain=sales:v2.1.0 # Planned root command: initialize with domain manifest
floe compile # Planned root command: compile DataProduct
floe run # Planned root command: execute pipeline
# Discovery commands
floe products list
floe products describe sales.customer-360
floe contracts list
floe contracts describe marketing-to-sales-customer-interactions