
ADR-0012: Data Classification and Governance Architecture

Status: Accepted

Floe needs a coherent approach to data classification, access control, and governance that spans multiple layers:

  1. Data Classification - Identifying sensitive data (PII, PHI, financial, etc.)
  2. Storage Access Control - Who can read/write tables (Polaris RBAC)
  3. Query Access Control - Row-level security, column masking (Cube)
  4. API Access Control - Endpoint authentication (Cube APIs)

Security operates at multiple layers with different responsibilities:

| Layer | Technology | Capability |
|---|---|---|
| Metadata | dbt meta tags | Classification definition |
| Lineage | OpenLineage facets | Classification propagation |
| Storage | Polaris RBAC | Table/namespace access |
| Compute | Iceberg (via engines) | Row/column filtering (engine-dependent) |
| Consumption | Cube queryRewrite | Row-level security, column masking |

This ADR answers three questions:

  1. Where should classification metadata live?
  2. How should it propagate through the pipeline?
  3. How is enforcement handled at compile time vs runtime?

1. Classification Source of Truth: dbt meta tags


Data classification is defined in dbt model YAML using a floe: namespace in the meta field:

models/staging/stg_customers.yml
```yaml
models:
  - name: stg_customers
    columns:
      - name: email
        meta:
          floe:
            classification: pii
            pii_type: email
            sensitivity: high
      - name: revenue
        meta:
          floe:
            classification: financial
            sensitivity: medium
```

Rationale:

  • dbt is already the source of truth for data models
  • meta field is designed for exactly this purpose
  • Classification travels with the model definition
  • No separate governance file to maintain
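
A minimal sketch of how a compile step might collect these tags from dbt's compiled manifest.json (the helper name is illustrative, not part of the spec; column meta lives under each node's columns entry):

```python
import json
from pathlib import Path


def load_classifications(manifest_path: str) -> dict[str, dict[str, dict]]:
    """Collect floe classification tags per model column from a dbt manifest.json."""
    manifest = json.loads(Path(manifest_path).read_text())
    classified: dict[str, dict[str, dict]] = {}
    for node_id, node in manifest.get("nodes", {}).items():
        if node.get("resource_type") != "model":
            continue
        for col_name, col in node.get("columns", {}).items():
            floe_meta = col.get("meta", {}).get("floe")
            if floe_meta:  # e.g. {"classification": "pii", "pii_type": "email", ...}
                classified.setdefault(node_id, {})[col_name] = floe_meta
    return classified


# e.g. {"model.analytics.stg_customers": {"email": {"classification": "pii", ...}}}
classifications = load_classifications("target/manifest.json")
```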

2. FloeSpec Governance Section: Policy Definition


floe.yaml defines policies for how to handle classified data:

```yaml
governance:
  classification:
    source: dbt_meta            # Where to read classifications from
  policies:
    pii:
      production:
        action: restrict        # Cube RLS enforces access
      non_production:
        action: synthesize      # Generate synthetic data
    high_sensitivity:
      production:
        action: restrict
        requires_role: [data_owner, compliance]
      non_production:
        action: redact
```

Environment Type Mapping:

The system supports 4 environment types (development, preview, staging, production) but governance policies use a binary classification for simplicity:

| Environment Type | Governance Category | Rationale |
|---|---|---|
| development | non_production | Local development, ephemeral |
| preview | non_production | PR environments, ephemeral |
| staging | non_production | Pre-production testing; production-like controls but should not contain unmanaged sensitive data |
| production | production | Live customer data, full restrictions apply |

Note: While staging is production-like in terms of infrastructure and access controls, it is classified as non_production for data governance purposes. This means staging environments should use synthetic data or sanitized datasets rather than copies of production data. This reduces exfiltration risk and simplifies compliance.
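
A small sketch of how this mapping might be applied when resolving the pii policy above (the constant and function names are illustrative):

```python
# Illustrative mapping of the four environment types onto the two governance
# categories used by policies in floe.yaml.
GOVERNANCE_CATEGORY = {
    "development": "non_production",
    "preview": "non_production",
    "staging": "non_production",
    "production": "production",
}

# The pii policy from the example above, as the compiler would see it.
PII_POLICY = {
    "production": {"action": "restrict"},
    "non_production": {"action": "synthesize"},
}


def pii_action(environment_type: str) -> str:
    """Resolve the pii action by first collapsing the environment to a category."""
    return PII_POLICY[GOVERNANCE_CATEGORY[environment_type]]["action"]


assert pii_action("staging") == "synthesize"    # staging is non_production for governance
assert pii_action("production") == "restrict"   # live data stays behind Cube RLS
```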

Rationale:

  • Policies are project-level concerns (belong in floe.yaml)
  • Separates “what is sensitive” (dbt meta) from “what to do about it” (policies)
  • Enables different behaviors per environment
  • Binary production/non_production split simplifies policy configuration while covering the common case

3. Classification Propagation: OpenLineage Facets


Classifications flow through the pipeline via OpenLineage custom facets:

```python
from dataclasses import dataclass

from openlineage.client.facet import BaseFacet  # openlineage-python client


@dataclass
class ColumnClassification:
    classification: str   # pii, financial, identifier, public
    pii_type: str | None  # email, phone, ssn, address, name
    sensitivity: str      # low, medium, high, critical


class FloeClassificationFacet(BaseFacet):
    """Custom OpenLineage facet for data classification."""
    _schemaURL = "https://floe.dev/spec/facets/ClassificationFacet.json"
    columns: dict[str, ColumnClassification]
```
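
For illustration, a hedged sketch of how this facet might appear on the wire, attached under an output dataset's facets in an emitted event; the floeClassification key and the producer URI are assumptions, not finalized names:

```python
# Illustrative fragment of an emitted OpenLineage event: the classification
# facet travels under the output dataset's "facets" map.
output_dataset = {
    "namespace": "iceberg://warehouse",
    "name": "staging.stg_customers",
    "facets": {
        "floeClassification": {
            "_producer": "https://github.com/floe",  # assumed producer URI
            "_schemaURL": "https://floe.dev/spec/facets/ClassificationFacet.json",
            "columns": {
                "email": {"classification": "pii", "pii_type": "email", "sensitivity": "high"},
                "revenue": {"classification": "financial", "pii_type": None, "sensitivity": "medium"},
            },
        }
    },
}
```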

Rationale:

  • OpenLineage is already used for lineage tracking
  • Custom facets are the standard extension mechanism
  • Enables downstream consumers (Marquez, DataHub) to visualize classifications
  • Classification propagates automatically with lineage

4. Enforcement at Compile Time and Runtime

| Capability | Compile Time | Runtime |
|---|---|---|
| Classification Validation | Check dbt meta tags exist | N/A |
| Policy Compliance | Validate policies against classification | N/A |
| Storage RBAC (Polaris) | Validate namespace configuration | Enforce table access |
| Query RLS (Cube) | Validate security rules | Enforce row-level security |
| Column Masking | Validate masking configuration | Apply masking in queries |

Consequences

Positive:

  • Single source of truth - Classification lives with the data model in dbt
  • Automatic propagation - OpenLineage carries classification through the pipeline
  • Compile-time validation - Policies validated before runtime
  • Audit trail - Lineage + classification = compliance evidence

Negative:

  • Manual configuration - Users must set up Polaris RBAC and Cube security
  • dbt dependency - Classification requires dbt meta tags (no alternative for non-dbt transforms)
  • Facet adoption - Downstream tools must understand Floe’s custom facets

Notes on enforcement layering:

  • Iceberg itself has no native FGAC - enforcement happens at the compute/consumption layer
  • Polaris RBAC provides table-level access but not row/column level
  • Row/column security is Cube’s responsibility in the Floe stack

Implementation

Specification and documentation:

  1. Define floe: meta schema for dbt models
  2. Add governance: section to Manifest/DataProduct
  3. Emit FloeClassificationFacet in OpenLineage events
  4. Document Cube security configuration

Compiler responsibilities (see the sketch below):

  1. Read classification from dbt manifest
  2. Generate Cube security rules from classification + policies
  3. Validate policy compliance at compile time
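
A hedged sketch of these compiler steps combined (function name and the intermediate "plan" shape are illustrative; a later step would translate the plan into Cube RLS/masking configuration):

```python
def plan_enforcement(
    classifications: dict[str, dict],  # column -> floe meta from the dbt manifest
    policies: dict,                    # governance policies from floe.yaml
    governance_category: str,          # "production" or "non_production"
) -> dict[str, str]:
    """Hypothetical compile-time helper: decide the action per classified column."""
    plan: dict[str, str] = {}
    for column, meta in classifications.items():
        classification = meta["classification"]
        policy = policies.get(classification)
        if policy is None:
            if classification != "public":
                # Policy compliance failure: classified data with no governing policy.
                raise ValueError(f"no policy covers '{classification}' on column {column}")
            continue
        plan[column] = policy[governance_category]["action"]  # restrict | synthesize | redact
    return plan


# pii in production resolves to "restrict", enforced downstream as Cube RLS/masking.
plan = plan_enforcement(
    {"email": {"classification": "pii", "pii_type": "email", "sensitivity": "high"}},
    {"pii": {"production": {"action": "restrict"},
             "non_production": {"action": "synthesize"}}},
    "production",
)
assert plan == {"email": "restrict"}
```
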
Classification types:

| Type | Description | Example Columns |
|---|---|---|
| pii | Personally Identifiable Information | email, phone, ssn |
| phi | Protected Health Information | diagnosis, prescription |
| financial | Financial data | revenue, salary, account_number |
| identifier | Business identifiers | customer_id, order_id |
| public | Non-sensitive data | product_name, category |

PII subtypes:

| Subtype | Description | Synthetic Generator |
|---|---|---|
| email | Email addresses | Faker.email() |
| phone | Phone numbers | Faker.phone_number() |
| name | Person names | Faker.name() |
| address | Physical addresses | Faker.address() |
| ssn | Social Security Numbers | Format-preserving hash |
| dob | Date of birth | Age-range preserving |
| ip_address | IP addresses | Subnet-preserving |
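
As a sketch of how the synthesize action could use these generators (assuming the Faker package; the mapping and the format-preserving scheme shown are illustrative, not a finalized spec):

```python
import hashlib

from faker import Faker

fake = Faker()


def _hash_digits(value: str, groups: tuple[int, ...]) -> list[str]:
    """Derive stable digit groups from a hash so the fake keeps the original shape."""
    digest = hashlib.sha256(value.encode()).hexdigest()
    digits = "".join(c for c in digest if c.isdigit()).ljust(sum(groups), "0")
    out, pos = [], 0
    for width in groups:
        out.append(digits[pos:pos + width])
        pos += width
    return out


# Illustrative generator mapping for the "synthesize" action.
SYNTHETIC_GENERATORS = {
    "email": lambda value: fake.email(),
    "phone": lambda value: fake.phone_number(),
    "name": lambda value: fake.name(),
    "address": lambda value: fake.address(),
    "ssn": lambda value: "-".join(_hash_digits(value, (3, 2, 4))),
}


def synthesize(value: str, pii_type: str) -> str:
    """Replace a sensitive value with synthetic data, falling back to redaction."""
    generator = SYNTHETIC_GENERATORS.get(pii_type)
    return generator(value) if generator else "REDACTED"


synthesize("jane@example.com", "email")  # -> a random but realistic-looking address
synthesize("123-45-6789", "ssn")         # -> same NNN-NN-NNNN shape, different digits
```
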
Sensitivity levels:

| Level | Description | Default Policy |
|---|---|---|
| low | Minimal risk if exposed | No restrictions |
| medium | Business-sensitive | Role-based access |
| high | Regulatory concern | Strict access + audit |
| critical | Maximum protection | Explicit approval required |

Quality Gates

Quality gates integrate with the governance model to enforce data quality at compile time and runtime.

Quality gate requirements are defined in manifest.yaml by the Platform Team:

```yaml
governance:
  quality_gates:
    # Minimum requirements for all models
    minimum_test_coverage: 80   # % of columns with tests
    required_tests:
      - not_null                # Primary keys must be not null
      - unique                  # Primary keys must be unique
      - freshness               # Source freshness checks
    # Enforcement behavior
    enforcement: strict         # off | warn | strict
    block_on_failure: true
    # Per-layer requirements
    layers:
      bronze:
        required_tests: [not_null_pk]
        minimum_coverage: 50
      silver:
        required_tests: [not_null_pk, unique_pk, freshness]
        minimum_coverage: 80
      gold:
        required_tests: [not_null_pk, unique_pk, freshness, documentation]
        minimum_coverage: 100
```
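
A hedged sketch of the corresponding compile-time coverage check (the helper name and the input shape, with per-model column counts derived from the dbt manifest, are assumptions):

```python
def check_layer_coverage(models: list[dict], layer_minimums: dict[str, int]) -> list[str]:
    """Compare per-layer test coverage against the configured minimums.

    Each model dict is assumed to carry: layer, total columns, and the number of
    columns with at least one test attached (derived elsewhere from the manifest).
    """
    violations = []
    for layer, minimum in layer_minimums.items():
        layer_models = [m for m in models if m["layer"] == layer]
        total = sum(m["columns"] for m in layer_models)
        tested = sum(m["tested_columns"] for m in layer_models)
        coverage = 100 * tested / total if total else 100
        if coverage < minimum:
            violations.append(f"{layer} layer: {coverage:.0f}% coverage (min: {minimum}%)")
    return violations


violations = check_layer_coverage(
    [{"layer": "gold", "columns": 9, "tested_columns": 7}],
    {"bronze": 50, "silver": 80, "gold": 100},
)
# -> ["gold layer: 78% coverage (min: 100%)"]
```
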
Quality checks are organized into three tiers:

| Tier | Scope | Implementation | Default |
|---|---|---|---|
| Tier 1 | dbt native | dbt tests + dbt-expectations | Always on |
| Tier 2 | External frameworks | Great Expectations, Soda | Optional |
| Tier 3 | Quality gates | Block/warn/notify enforcement | Configurable |

The corresponding project-level configuration lives in floe.yaml:

```yaml
# floe.yaml - Quality section
quality:
  # Tier 1: dbt native tests (always run)
  dbt_tests:
    enabled: true
    fail_on_warning: false
  # Tier 3: Quality gates (enforcement)
  gates:
    - name: staging_completeness
      scope:
        tags: [staging]
      checks:
        - type: row_count
          min: 1
        - type: null_percentage
          columns: ["*_id"]
          max: 0
      on_failure: block   # block | warn | notify
    - name: gold_freshness
      scope:
        layer: gold
      checks:
        - type: freshness
          max_age_hours: 24
      on_failure: warn
```
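
A minimal sketch of gate evaluation (the metrics shape and helper name are assumptions; a real implementation would collect metrics from test results and expand column wildcards):

```python
def evaluate_gate(gate: dict, metrics: dict) -> tuple[bool, str]:
    """Evaluate one quality gate against observed metrics.

    `metrics` is an assumed shape: {"row_count": int,
    "null_percentage": {column: float}, "age_hours": float}.
    """
    for check in gate["checks"]:
        kind = check["type"]
        if kind == "row_count" and metrics["row_count"] < check["min"]:
            return False, gate["on_failure"]
        if kind == "null_percentage":
            worst = max(metrics["null_percentage"].values(), default=0.0)
            if worst > check["max"]:
                return False, gate["on_failure"]
        if kind == "freshness" and metrics["age_hours"] > check["max_age_hours"]:
            return False, gate["on_failure"]
    return True, "pass"


ok, action = evaluate_gate(
    {"name": "staging_completeness",
     "on_failure": "block",
     "checks": [{"type": "row_count", "min": 1},
                {"type": "null_percentage", "columns": ["*_id"], "max": 0}]},
    {"row_count": 1_204, "null_percentage": {"customer_id": 0.0}, "age_hours": 2.0},
)
# ok is True here; a failing check would return (False, "block") and stop the pipeline.
```
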
Enforcement levels:

| Level | Behavior | Use Case |
|---|---|---|
| off | No enforcement | Development/experimentation |
| warn | Log warnings, continue | Soft rollout of new rules |
| strict | Block pipeline on violation | Production enforcement |

Quality requirements escalate based on data classification:

| Classification | Minimum Coverage | Required Tests | Additional |
|---|---|---|---|
| public | 50% | not_null | - |
| internal | 80% | not_null, unique | freshness |
| confidential | 100% | all | audit_log |
| pii | 100% | all | audit_log, masking_verified |
| phi | 100% | all | audit_log, encryption_verified |

Example: compile-time enforcement in the planned floe compile output:

```
$ floe compile   # planned root data-team command; not alpha-supported yet

[1/5] Loading platform artifacts
      Quality gates: 3 rules loaded
[2/5] Analyzing dbt project
      24 models, 156 tests
[3/5] Checking test coverage
      bronze layer: 62% coverage (min: 50%)
      silver layer: 85% coverage (min: 80%)
      ERROR: gold layer: 78% coverage (min: 100%)
             Missing tests for: gold_revenue.margin_pct
[4/5] Validating quality gates
      ERROR: Model 'gold_revenue' missing required tests
             Required: [not_null_pk, unique_pk, freshness, documentation]
             Missing: [documentation]
[5/5] Compilation FAILED
      Fix quality violations and re-run floe compile
```