
floe.yaml Schema Reference

This document provides the complete schema reference for floe.yaml configuration files.


floe.yaml is the configuration file for floe data products. It defines:

  • Platform reference (enforced configuration)
  • Transforms (dbt models)
  • Ingestion sources
  • Schedules
  • Environment overrides

A minimal floe.yaml:

apiVersion: floe.dev/v1
kind: DataProduct
metadata:
  name: customer-analytics
  version: "1.0.0"
  domain: sales
platform:
  ref: oci://registry.example.com/platform:v1.0.0
transforms:
  - type: dbt
    path: models/

FloeSpec
├── apiVersion: string (required)
├── kind: string (required)
├── metadata: MetadataSpec (required)
├── platform: PlatformRef (required)
├── transforms: TransformSpec[] (required)
│   ├── type: string (required)
│   ├── path: string (required)
│   ├── compute: string (optional) ← Select from platform's approved list
│   └── profiles_dir: string (optional)
├── ingestion: IngestionSpec[] (optional)
├── schedule: ScheduleSpec (optional)
├── environments: EnvironmentOverride[] (optional)
└── quality: QualitySpec (optional)

apiVersion

Type: string | Required: Yes | Pattern: floe.dev/v[0-9]+

The API version for the floe.yaml schema.

apiVersion: floe.dev/v1

kind

Type: string | Required: Yes | Enum: DataProduct

The resource kind. Currently only DataProduct is supported.

kind: DataProduct

metadata.name

Type: string | Required: Yes | Pattern: ^[a-z][a-z0-9-]*$ | Max Length: 63

The unique name of the data product within its domain.

metadata:
  name: customer-360

metadata.version

Type: string | Required: Yes | Pattern: ^[0-9]+\.[0-9]+\.[0-9]+$

Semantic version of the data product.

metadata:
  version: "1.2.3"

metadata.domain

Type: string | Required: Yes | Pattern: ^[a-z][a-z0-9-]*$

The domain that owns this data product. Used for namespace prefixing.

metadata:
  domain: sales

metadata.description

Type: string | Required: No | Max Length: 1000

Human-readable description of the data product.

metadata:
  description: "Unified customer view across all touchpoints"

metadata.owner

Type: string | Required: No | Format: Email

Team or person responsible for this data product.

metadata:
  owner: sales-analytics@acme.com

metadata.labels

Type: map[string]string | Required: No

Key-value labels for organization and filtering.

metadata:
  labels:
    team: analytics
    cost-center: sales
    environment: production

platform.ref

Type: string | Required: Yes | Format: OCI URI

Reference to the platform manifest OCI artifact.

platform:
  ref: oci://ghcr.io/acme/platform:v1.0.0

platform.cache

Type: boolean | Required: No | Default: true

Whether to cache the platform artifact locally.

platform:
  ref: oci://ghcr.io/acme/platform:v1.0.0
  cache: true

transforms[].type

Type: string | Required: Yes | Enum: dbt

The transform type. Currently only dbt is supported.

transforms:
  - type: dbt

transforms[].path

Type: string | Required: Yes

Path to the transform source files, relative to floe.yaml.

transforms:
  - type: dbt
    path: models/

transforms[].profiles_dir

Type: string | Required: No | Default: .floe/profiles

Path to the generated dbt profiles directory.

transforms:
  - type: dbt
    path: models/
    profiles_dir: .dbt/

transforms[].compute

Type: string | Required: No | Default: Platform’s default compute

Select the compute engine for this transform from the platform’s approved list. This enables multi-compute pipelines where different steps can use different compute engines.

Validation: Must be a compute name from manifest.yaml plugins.compute.approved[].

# manifest.yaml (Platform Team)
plugins:
  compute:
    approved:
      - name: duckdb
        config: { threads: 8 }
      - name: spark
        config: { cluster: "spark-thrift.svc" }
    default: duckdb

# floe.yaml (Data Engineers)
transforms:
  # Heavy processing on Spark cluster
  - type: dbt
    path: models/staging/
    compute: spark # Select from approved list

  # Analytical metrics on DuckDB
  - type: dbt
    path: models/marts/
    compute: duckdb

  # Simple transforms use default
  - type: dbt
    path: models/seeds/
    # compute: (uses platform default → duckdb)

Environment Parity: Each transform uses the SAME compute across all environments (dev/staging/prod). This is NOT for per-environment compute selection (which would cause environment drift).

Step 1: dev=Spark, staging=Spark, prod=Spark ✓ No drift
Step 2: dev=DuckDB, staging=DuckDB, prod=DuckDB ✓ No drift

ingestion[].name

Type: string | Required: Yes | Pattern: ^[a-z][a-z0-9_]*$

Unique name for the ingestion pipeline.

ingestion:
  - name: github_events

ingestion[].type

Type: string | Required: Yes | Enum: dlt, airbyte

The ingestion plugin type.

ingestion:
  - name: github_events
    type: dlt

ingestion[].destination

Type: string | Required: Yes | Format: {namespace}.{table}

Target Iceberg table for ingested data.

ingestion:
  - name: github_events
    type: dlt
    destination: bronze.github_events

ingestion[].dlt

Type: DltConfig | Required: When type: dlt

Configuration specific to dlt ingestion.

ingestion:
  - name: github_events
    type: dlt
    destination: bronze.github_events
    dlt:
      source: dlt.sources.github.github_reactions
      resource: issues
      write_disposition: merge
      incremental:
        cursor_column: updated_at

dlt.source

Type: string | Required: Yes

Python import path to the dlt source.

dlt.resource

Type: string | Required: No

Specific resource within the source.

dlt.write_disposition

Type: string | Enum: append, replace, merge | Default: append

How to write data to the destination.

dlt.incremental

Type: IncrementalConfig | Required: No

Configuration for incremental loading.

ingestion[].airbyte

Type: AirbyteConfig | Required: When type: airbyte

Configuration for external Airbyte connections.

ingestion:
  - name: salesforce_sync
    type: airbyte
    destination: bronze.salesforce
    airbyte:
      connection_id: "abc123-def456"

ingestion[].secret_refs

Type: map[string]string | Required: No

References to Kubernetes secrets for credentials.

ingestion:
  - name: github_events
    type: dlt
    secret_refs:
      github_token: github-api-token

schedule.cron

Type: string | Required: No | Format: Cron expression

Cron schedule for running the pipeline.

schedule:
  cron: "0 */6 * * *" # Every 6 hours

schedule.timezone

Type: string | Required: No | Default: UTC

Timezone for the schedule.

schedule:
  cron: "0 6 * * *"
  timezone: America/New_York

schedule.enabled

Type: boolean | Required: No | Default: true

Whether the schedule is active.

schedule:
  cron: "0 6 * * *"
  enabled: false # Disable scheduling

environments[].name

Type: string | Required: Yes | Enum: development, staging, production

Environment name to override.

environments:
  - name: development

environments[].transforms

Type: TransformOverride | Required: No

Transform-specific overrides for this environment. Note: Per-environment compute selection is NOT allowed (would cause environment drift). Use transforms[].compute instead for per-transform compute selection.

environments:
  - name: development
    transforms:
      # Per-environment overrides (e.g., reduced parallelism)
      threads: 4

# ❌ FORBIDDEN: Per-environment compute (causes drift)
# environments:
#   - name: development
#     transforms:
#       compute: duckdb # Different compute per env = drift
#   - name: production
#     transforms:
#       compute: snowflake # "Works in dev, fails in prod"

environments[].schedule

Type: ScheduleOverride | Required: No

Schedule overrides for this environment.

environments:
  - name: development
    schedule:
      enabled: false # No scheduling in dev

quality.minimum_coverage

Type: integer | Required: No | Default: From platform manifest | Range: 0-100

Minimum test coverage percentage.

quality:
  minimum_coverage: 80

quality.required_tests

Type: string[] | Required: No | Default: From platform manifest

Tests required for all models.

quality:
  required_tests:
    - not_null
    - unique
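
For dbt transforms, these map to dbt's generic tests. As a sketch, a model satisfies not_null and unique with a schema file along these lines (the model and column names here are illustrative and not part of the floe.yaml schema):

# models/marts/schema.yml (illustrative model and column names)
version: 2
models:
  - name: customer_360
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique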

Complete Example

apiVersion: floe.dev/v1
kind: DataProduct
metadata:
  name: customer-360
  version: "3.2.1"
  domain: sales
  description: "Unified customer view across all touchpoints"
  owner: sales-analytics@acme.com
  labels:
    team: analytics
    cost-center: sales
platform:
  ref: oci://ghcr.io/acme/platform:v1.0.0
transforms:
  # Heavy processing on Spark (large datasets)
  - type: dbt
    path: models/staging/
    compute: spark # Select from platform's approved list
  # Analytical metrics on DuckDB (smaller result set)
  - type: dbt
    path: models/marts/
    compute: duckdb
  # Seeds use platform default (no compute specified)
  - type: dbt
    path: models/seeds/
ingestion:
  - name: salesforce_accounts
    type: dlt
    destination: bronze.salesforce_accounts
    dlt:
      source: dlt.sources.salesforce.salesforce_source
      resource: accounts
      write_disposition: merge
      incremental:
        cursor_column: last_modified_date
    secret_refs:
      salesforce_token: salesforce-api-token
  - name: zendesk_tickets
    type: dlt
    destination: bronze.zendesk_tickets
    dlt:
      source: dlt.sources.zendesk.zendesk_support
      resource: tickets
      write_disposition: append
schedule:
  cron: "0 */6 * * *"
  timezone: UTC
environments:
  - name: development
    schedule:
      enabled: false
  - name: production
    quality:
      minimum_coverage: 100
quality:
  minimum_coverage: 80
  required_tests:
    - not_null
    - unique

The complete JSON Schema for floe.yaml is generated from Pydantic models. The public CLI does not currently expose a schema export command; use Python from the repository when you need to inspect the current schema during alpha:

# Dump the generated JSON Schema for floe.yaml to stdout
import json
from floe_core.schemas.floe_spec import FloeSpec

print(json.dumps(FloeSpec.model_json_schema(), indent=2))

Alpha status: the root floe validate command exists as a data-team stub and is not yet the supported schema-validation path for users. For the current alpha, inspect the checked-in Customer 360 floe.yaml and run the demo artifact validation path documented in Build Your First Data Product.
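
Until a supported validation command ships, a minimal local sketch is to parse the file and validate it against the Pydantic model directly (this assumes PyYAML is installed and floe_core is importable from the repository; the file path is illustrative):

# validate_floe_yaml.py — local schema check (sketch, not the supported path)
import sys

import yaml
from pydantic import ValidationError

from floe_core.schemas.floe_spec import FloeSpec

with open("floe.yaml") as f:
    raw = yaml.safe_load(f)

try:
    FloeSpec.model_validate(raw)
except ValidationError as exc:
    print(exc)  # field-by-field schema violations
    sys.exit(1)

print("floe.yaml passes schema validation")

Note that this only covers schema validation; it does not run the compile-time rules listed further below.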

packages/floe-core/src/floe_core/schemas/
├── floe_spec.py # Pydantic models
├── floe_yaml_schema.json # Generated JSON Schema
└── __init__.py

Configure your IDE to use the JSON Schema for validation:

VS Code (settings.json):

{
  "yaml.schemas": {
    "https://floe.dev/schemas/floe-yaml-v1.json": ["floe.yaml", "floe.yml"]
  }
}

JetBrains IDEs:

Settings > Languages & Frameworks > Schemas and DTDs > JSON Schema Mappings
Add: https://floe.dev/schemas/floe-yaml-v1.json → floe.yaml
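
In editors backed by yaml-language-server (including the VS Code YAML extension), a per-file modeline is an alternative to workspace settings; a sketch using the same schema URL:

# yaml-language-server: $schema=https://floe.dev/schemas/floe-yaml-v1.json
apiVersion: floe.dev/v1
kind: DataProduct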

Beyond schema validation, the following rules are enforced at compile time:

| Rule | Description | Error |
| --- | --- | --- |
| domain_namespace_match | Domain must match catalog namespace | DomainMismatchError |
| version_semver | Version must be valid semver | InvalidVersionError |
| transform_path_exists | Transform path must exist | PathNotFoundError |
| platform_ref_resolvable | Platform OCI ref must be pullable | PlatformNotFoundError |
| secret_refs_exist | Secret refs must exist in cluster | SecretNotFoundError |
| naming_convention | Model names must match platform pattern | NamingViolationError |
| compute_in_approved_list | Transform compute must be in platform’s approved list | InvalidComputeError |
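
As a rough illustration of the pattern-based checks only, the sketch below applies the regexes from the field reference above; the function, messages, and error wording are illustrative and not the floe-core implementation:

import re

# Patterns documented for metadata.name / metadata.domain and metadata.version
NAME_OR_DOMAIN = re.compile(r"^[a-z][a-z0-9-]*$")
SEMVER = re.compile(r"^[0-9]+\.[0-9]+\.[0-9]+$")

def pattern_violations(name: str, version: str, domain: str) -> list[str]:
    """Return schema-pattern violations for the metadata block (illustrative)."""
    errors = []
    if not NAME_OR_DOMAIN.match(name) or len(name) > 63:
        errors.append(f"metadata.name {name!r} must match ^[a-z][a-z0-9-]*$ (max 63 chars)")
    if not SEMVER.match(version):
        errors.append(f"metadata.version {version!r} is not valid semver (version_semver)")
    if not NAME_OR_DOMAIN.match(domain):
        errors.append(f"metadata.domain {domain!r} must match ^[a-z][a-z0-9-]*$")
    return errors

print(pattern_violations("customer-360", "3.2.1", "sales"))  # []
print(pattern_violations("Customer360", "v1.0", "Sales"))    # three violations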

The following defaults apply when a field is omitted:

| Field | Default Value | Source |
| --- | --- | --- |
| platform.cache | true | Built-in |
| transforms[].profiles_dir | .floe/profiles | Built-in |
| transforms[].compute | plugins.compute.default | Platform manifest |
| schedule.timezone | UTC | Built-in |
| schedule.enabled | true | Built-in |
| quality.* | Platform manifest | Inherited |
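
Putting the defaults together, the minimal example from the top of this page would resolve to roughly the following (a sketch; it assumes the platform manifest's plugins.compute.default is duckdb, as in the manifest example above):

apiVersion: floe.dev/v1
kind: DataProduct
metadata:
  name: customer-analytics
  version: "1.0.0"
  domain: sales
platform:
  ref: oci://registry.example.com/platform:v1.0.0
  cache: true                      # built-in default
transforms:
  - type: dbt
    path: models/
    profiles_dir: .floe/profiles   # built-in default
    compute: duckdb                # plugins.compute.default from the platform manifest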