Storage Integration Architecture

This document describes how floe integrates with object storage for Apache Iceberg tables.

floe enforces Apache Iceberg as the table format. Iceberg tables are stored on object storage, with metadata managed by the catalog (Polaris).

┌─────────────────────────────────────────────────────────────┐
│                       Object Storage                        │
│  ┌─────────────────────────────────────────────────────────┐│
│  │  s3://floe-warehouse/iceberg/                           ││
│  │  ├── bronze.db/                                         ││
│  │  │   └── customers/                                     ││
│  │  │       ├── metadata/                                  ││
│  │  │       │   ├── v1.metadata.json                       ││
│  │  │       │   └── snap-xxx.avro                          ││
│  │  │       └── data/                                      ││
│  │  │           └── part-00000.parquet                     ││
│  │  ├── silver.db/                                         ││
│  │  ├── gold.db/                                           ││
│  └─────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────┘
           ▲                                    ▲
           │ Metadata                           │ Data
           │                                    │
┌──────────┴──────────┐             ┌───────────┴───────────┐
│   Polaris Catalog   │             │   Compute (dbt/dlt)   │
│   (REST Catalog)    │             │   (via Iceberg SDK)   │
└─────────────────────┘             └───────────────────────┘

Iceberg table mutation is owned by floe-iceberg, not by orchestrator plugins.

Runtime orchestrators such as Dagster or Airflow coordinate execution, collect runtime outputs, and call the floe_iceberg.writer contract with Arrow tables and Iceberg identifiers. The writer owns the runtime write flow and Iceberg mutation semantics, including append/overwrite behavior and stale metadata repair. It coordinates namespace and table load/create operations through the catalog plugin rather than implementing catalog APIs directly.

Catalog and storage plugins remain injected dependencies. They provide catalog connections, FileIO support, endpoint configuration, and credential references, but they do not depend on Dagster, Airflow, or any orchestrator-specific API. The catalog plugin remains the owner and provider of catalog namespace and table APIs.
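
The ownership boundaries described above can be sketched as follows. This is a minimal illustration, not the floe API: `CatalogPlugin`, `IcebergWriter`, `WriteResult`, and every method name are hypothetical, and the Arrow table is typed as `Any` so the sketch has no third-party dependencies.

```python
from dataclasses import dataclass
from typing import Any, Protocol


class CatalogPlugin(Protocol):
    """Hypothetical catalog plugin surface: it owns namespace and table APIs."""

    def ensure_namespace(self, namespace: str) -> None: ...

    def load_or_create_table(self, identifier: str, schema: Any) -> Any: ...


@dataclass(frozen=True)
class WriteResult:
    """Hypothetical result type; it deliberately carries no credential material."""

    identifier: str
    mode: str
    rows_written: int


class IcebergWriter:
    """Hypothetical writer: owns write semantics and delegates catalog calls."""

    def __init__(self, catalog: CatalogPlugin) -> None:
        self._catalog = catalog  # injected dependency, no orchestrator imports

    def write(self, table: Any, identifier: str, mode: str = "append") -> WriteResult:
        if mode not in ("append", "overwrite"):
            raise ValueError(f"unsupported write mode: {mode!r}")
        namespace, _, _ = identifier.rpartition(".")
        self._catalog.ensure_namespace(namespace)
        target = self._catalog.load_or_create_table(
            identifier, getattr(table, "schema", None)
        )
        # The writer, not the orchestrator, decides how the mutation is applied.
        if mode == "append":
            target.append(table)
        else:
            target.overwrite(table)
        return WriteResult(identifier, mode, rows_written=len(table))
```

An orchestrator asset or task would only construct the writer with an injected catalog plugin and call `write(...)`; it never touches catalog or storage APIs itself.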

CompiledArtifacts remains secret-free. Runtime credential material flows through resolved deployment bindings and plugin-owned connection logic rather than through writer results or orchestrator logs.

The implemented alpha lane uses floe-storage-minio with the S3-compatible protocol. Provider-native AWS S3, GCS, and Azure storage plugins remain future extensions unless a deployment has explicitly validated them.

| Storage | Use Case | Authentication |
|---|---|---|
| MinIO | Local evaluation and self-hosted S3-compatible endpoint | Access Key / Secret Key |
| AWS S3 | Future/provider-native object-storage plugin | IRSA (recommended) or IAM User |
| Google Cloud Storage | Future/provider-native object-storage plugin | Workload Identity (recommended) or SA Key |
| Azure Blob / ADLS Gen2 | Future/provider-native object-storage plugin | Managed Identity (recommended) or SP |

MinIO is the local evaluation object store used by the floe Helm chart and demo paths:

  • S3-compatible API (works with Iceberg’s S3 file IO)
  • Included in the floe-platform Helm chart
  • Easy local setup in the Kind/Helm evaluation lane
  • Supports versioning for backup/recovery

# manifest.yaml
storage:
  type: minio
  warehouse_path: s3://floe-warehouse/iceberg
  config:
    endpoint: http://minio.floe-platform:9000
    access_key_ref: minio-credentials
    secret_key_ref: minio-credentials
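
As a rough sketch of how a storage plugin might turn this manifest block into S3 FileIO settings: the function name `minio_fileio_properties` is hypothetical, the property keys follow PyIceberg's documented S3 FileIO naming, and the secret values are assumed to be resolved elsewhere because the manifest only carries references.

```python
from typing import Any, Mapping


def minio_fileio_properties(
    storage: Mapping[str, Any],
    access_key: str,
    secret_key: str,
) -> dict[str, str]:
    """Translate a manifest `storage` block into S3 FileIO properties.

    `access_key` and `secret_key` are the already-resolved secret values;
    the `*_ref` entries in the manifest are never read here.
    """
    if storage.get("type") != "minio":
        raise ValueError("this sketch only handles type: minio")
    config = storage["config"]
    return {
        "s3.endpoint": str(config["endpoint"]),
        "s3.access-key-id": access_key,
        "s3.secret-access-key": secret_key,
    }
```

Keeping the resolved secrets as separate arguments mirrors the rule that compiled artifacts stay secret-free: only the endpoint comes from the manifest.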

A provider-native AWS S3 plugin should use IAM Roles for Service Accounts (IRSA). Until that plugin is implemented and validated, the alpha path remains MinIO through the S3-compatible protocol:

# conceptual future manifest.yaml; not accepted by the current alpha registry
storage:
  type: aws-s3
  warehouse_path: s3://my-company-data-lake/floe/iceberg
  config:
    region: us-east-1
    auth: irsa  # Uses pod's service account

IAM Policy Required:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-company-data-lake/floe/*",
        "arn:aws:s3:::my-company-data-lake"
      ]
    }
  ]
}
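
The policy above can also be generated per bucket and prefix. The sketch below is one possible tightening, not a floe feature: the helper name `iceberg_s3_policy` is hypothetical, object-level actions are scoped to object ARNs, and `s3:ListBucket` is scoped to the bucket ARN with an `s3:prefix` condition that the original policy does not include.

```python
import json


def iceberg_s3_policy(bucket: str, prefix: str) -> dict:
    """Build a least-privilege policy for one warehouse prefix in one bucket."""
    prefix = prefix.strip("/")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions only apply to object ARNs.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}/*"],
            },
            {
                # ListBucket is a bucket-level action; restrict it by prefix.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(iceberg_s3_policy("my-company-data-lake", "floe"), indent=2))
```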

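For production on GCP, a future provider-native GCS plugin should use Workload Identity. This manifest is conceptual until that plugin is implemented and validated:

# conceptual future manifest.yaml
storage:
  type: gcs
  warehouse_path: gs://my-company-data-lake/floe/iceberg
  config:
    project: my-gcp-project
    auth: workload_identity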

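For production on Azure, a future provider-native Azure plugin should use Managed Identity. This manifest is conceptual until that plugin is implemented and validated:

# conceptual future manifest.yaml
storage:
  type: azure
  warehouse_path: abfss://data@mystorageaccount.dfs.core.windows.net/floe/iceberg
  config:
    auth: managed_identity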

Iceberg tables follow a consistent directory structure:

{warehouse_path}/
├── {database}.db/
│   └── {table}/
│       ├── metadata/
│       │   ├── v1.metadata.json       # Table metadata (schema, partitions, snapshots)
│       │   ├── v2.metadata.json       # Updated metadata after writes
│       │   ├── snap-{id}.avro         # Snapshot manifests
│       │   └── {manifest-id}.avro     # Manifest files
│       └── data/
│           ├── {partition}/           # Partition directories (if partitioned)
│           │   └── {file-id}.parquet  # Data files (Parquet format)
│           └── {file-id}.parquet      # Data files (if unpartitioned)

The storage layout follows the data architecture pattern specified in the Manifest:

| Pattern | Database Names | Example Path |
|---|---|---|
| Medallion | bronze, silver, gold | s3://warehouse/iceberg/bronze.db/customers/ |
| Kimball | staging, dimensions, facts | s3://warehouse/iceberg/dimensions.db/dim_customer/ |
| Data Vault | raw_vault, business_vault | s3://warehouse/iceberg/raw_vault.db/hub_customer/ |
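
The path convention in the table can be expressed as a small helper. This is a sketch under assumptions: `table_location`, `PATTERN_DATABASES`, and the pattern keys (`medallion`, `kimball`, `data_vault`) are hypothetical names chosen for illustration; only the database names and the `{database}.db/{table}/` layout come from this document.

```python
# Database names per architecture pattern, taken from the table above.
PATTERN_DATABASES: dict[str, tuple[str, ...]] = {
    "medallion": ("bronze", "silver", "gold"),
    "kimball": ("staging", "dimensions", "facts"),
    "data_vault": ("raw_vault", "business_vault"),
}


def table_location(warehouse_path: str, database: str, table: str) -> str:
    """Return the storage prefix for a table under the warehouse root."""
    return f"{warehouse_path.rstrip('/')}/{database}.db/{table}/"


def check_database(pattern: str, database: str) -> None:
    """Raise if the database name is not part of the chosen pattern."""
    allowed = PATTERN_DATABASES[pattern]
    if database not in allowed:
        raise ValueError(f"{database!r} is not a {pattern} database; expected one of {allowed}")


if __name__ == "__main__":
    check_database("medallion", "bronze")
    print(table_location("s3://warehouse/iceberg", "bronze", "customers"))
```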

For enhanced security, Polaris can vend short-lived credentials for table access:

┌─────────────────┐     1. Request credentials     ┌─────────────────┐
│     Job Pod     │ ─────────────────────────────► │     Polaris     │
│    (dbt/dlt)    │                                │     Catalog     │
└─────────────────┘                                └────────┬────────┘
         │                                                  │
         │          2. Short-lived STS credentials          │
         │◄─────────────────────────────────────────────────┘
         │ 3. Access storage with temporary credentials
┌─────────────────┐
│ Object Storage  │
│ (S3/GCS/Azure)  │
└─────────────────┘

Benefits:

  • No long-lived credentials in job pods
  • Credentials scoped to specific tables
  • Automatic expiration (typically 1 hour)
  • Audit trail via Polaris

Polaris Configuration:

# In CatalogPlugin configuration
catalog:
  type: polaris
  config:
    credential_vending: true
    credential_ttl: 3600  # 1 hour
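
Because vended credentials expire, a long-running job needs to know when to request a fresh set. The following sketch shows one way to track that; `VendedCredentials`, `expiry_from_ttl`, and the five-minute safety margin are assumptions for illustration and are not part of Polaris or floe.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class VendedCredentials:
    """Hypothetical holder for short-lived credentials returned by the catalog."""

    access_key: str
    secret_key: str
    session_token: str
    expires_at: datetime

    def needs_refresh(
        self, now: datetime, safety_margin: timedelta = timedelta(minutes=5)
    ) -> bool:
        # Refresh slightly early so an in-flight write does not outlive the token.
        return now >= self.expires_at - safety_margin


def expiry_from_ttl(issued_at: datetime, credential_ttl: int) -> datetime:
    """Compute the expiry time from a TTL in seconds (3600 in the config above)."""
    return issued_at + timedelta(seconds=credential_ttl)


if __name__ == "__main__":
    issued = datetime(2024, 1, 15, 10, 0, tzinfo=timezone.utc)
    creds = VendedCredentials("ak", "sk", "token", expiry_from_ttl(issued, 3600))
    print(creds.needs_refresh(issued + timedelta(minutes=30)))
    print(creds.needs_refresh(issued + timedelta(minutes=56)))
```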

Each compute engine connects to the Iceberg catalog differently. All table operations go through the catalog to ensure consistent metadata management.

| Compute | Catalog Connection Method |
|---|---|
| DuckDB | ATTACH statement with Iceberg REST endpoint |
| Spark | SparkCatalog configuration in spark-defaults.conf |
| Snowflake | External volume + catalog integration (managed by Snowflake) |

When using DuckDB as the compute engine with Polaris as the catalog:

1. dbt pre-hook executes ATTACH to Polaris
2. DuckDB establishes REST connection to Polaris
3. Polaris vends short-lived credentials for object storage
4. dbt model SQL executes (CREATE TABLE AS SELECT)
5. DuckDB writes Parquet files to object storage
6. DuckDB updates table metadata via Polaris REST API
7. Polaris persists metadata to PostgreSQL

The floe-dbt package generates the appropriate hooks from the compute plugin’s get_catalog_attachment_sql() method and emits them as on-run-start entries in dbt_project.yml:

# Generated dbt_project.yml
on-run-start:
  - "LOAD iceberg;"
  - "CREATE SECRET IF NOT EXISTS polaris_secret (...)"
  - "ATTACH IF NOT EXISTS 'warehouse' AS ice (TYPE iceberg, ...)"
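
The generation step might look roughly like the sketch below. Only the method name `get_catalog_attachment_sql()` comes from this document; its return type (a list of SQL strings), the `ComputePlugin` protocol, `DuckDBPluginStub`, and `build_on_run_start` are assumptions. The elided parts of the SQL are kept elided exactly as in the generated file above.

```python
from typing import Protocol


class ComputePlugin(Protocol):
    """Assumed shape of a compute plugin; the return type is a guess."""

    def get_catalog_attachment_sql(self) -> list[str]: ...


class DuckDBPluginStub:
    """Stand-in plugin returning the statements shown above, elisions included."""

    def get_catalog_attachment_sql(self) -> list[str]:
        return [
            "LOAD iceberg;",
            "CREATE SECRET IF NOT EXISTS polaris_secret (...)",
            "ATTACH IF NOT EXISTS 'warehouse' AS ice (TYPE iceberg, ...)",
        ]


def build_on_run_start(plugin: ComputePlugin) -> dict[str, list[str]]:
    """Produce the `on-run-start` fragment for dbt_project.yml."""
    statements = [s.strip() for s in plugin.get_catalog_attachment_sql()]
    # Drop empty entries so the generated YAML never contains blank hooks.
    return {"on-run-start": [s for s in statements if s]}


if __name__ == "__main__":
    for hook in build_on_run_start(DuckDBPluginStub())["on-run-start"]:
        print(hook)
```

Keeping the SQL inside the compute plugin means floe-dbt does not need engine-specific knowledge; it only places whatever statements the plugin returns.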

Not all compute engines support all storage backends. The PolicyEnforcer validates compatibility at compile time.

| Compute | S3/MinIO | GCS | Azure ADLS |
|---|---|---|---|
| DuckDB | | | |
| Spark | | | |
| Snowflake | N/A (uses Snowflake storage) | N/A | N/A |

Alpha Scope: MinIO through the S3-compatible protocol.
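
A compile-time compatibility check of the kind attributed to the PolicyEnforcer could be sketched like this. Everything here is hypothetical: the function and exception names are invented, and `ALPHA_MATRIX` encodes only an assumed reading of the alpha scope (DuckDB with MinIO); it is not the project's actual support matrix.

```python
from typing import Mapping


class IncompatibleStorageError(ValueError):
    """Raised when a compute engine cannot use the configured storage type."""


# Illustrative only: an assumed reading of the alpha scope, not real floe data.
ALPHA_MATRIX: Mapping[str, frozenset[str]] = {
    "duckdb": frozenset({"minio"}),
}


def validate_compute_storage(
    compute: str,
    storage_type: str,
    matrix: Mapping[str, frozenset[str]] = ALPHA_MATRIX,
) -> None:
    """Fail fast, before deployment, if the pairing is not in the matrix."""
    supported = matrix.get(compute)
    if supported is None:
        raise IncompatibleStorageError(f"unknown compute engine: {compute!r}")
    if storage_type not in supported:
        raise IncompatibleStorageError(
            f"{compute!r} does not support storage {storage_type!r}; "
            f"supported: {sorted(supported)}"
        )


if __name__ == "__main__":
    validate_compute_storage("duckdb", "minio")  # passes silently
```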

For GCP/Azure evaluation before provider-native plugins are implemented, use MinIO as the storage layer. It provides S3-compatible access for DuckDB while running on cloud-native infrastructure:

# manifest.yaml (GCP deployment with MinIO)
storage:
  type: minio
  warehouse_path: s3://floe-warehouse/iceberg
  config:
    endpoint: http://minio.floe-platform:9000
    # MinIO deployed on GKE/AKS provides S3-compatible API

Native GCS, Azure ADLS, and AWS S3 plugin support should be added as future provider-specific plugins with their own identity and credential projection contracts.

Enable versioning on the warehouse bucket for point-in-time recovery:

# AWS S3
aws s3api put-bucket-versioning \
  --bucket my-company-data-lake \
  --versioning-configuration Status=Enabled

# MinIO
mc version enable minio/floe-warehouse

Iceberg maintains table history via snapshots. Configure retention in the Manifest:

# manifest.yaml
data_architecture:
  iceberg:
    snapshot_retention_days: 7
    min_snapshots_to_keep: 5
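
How the two retention settings interact is easiest to see in code. The sketch below assumes the usual Iceberg-style rule: a snapshot is expired only if it is older than the retention window and is not among the most recent `min_snapshots_to_keep`. The function name and the `(snapshot_id, committed_at)` tuple shape are hypothetical.

```python
from datetime import datetime, timedelta


def snapshots_to_expire(
    snapshots: list[tuple[int, datetime]],
    now: datetime,
    snapshot_retention_days: int = 7,
    min_snapshots_to_keep: int = 5,
) -> list[int]:
    """Return snapshot IDs that are safe to expire, newest first."""
    cutoff = now - timedelta(days=snapshot_retention_days)
    newest_first = sorted(snapshots, key=lambda s: s[1], reverse=True)
    # The N most recent snapshots are always kept, regardless of age.
    protected = {sid for sid, _ in newest_first[:min_snapshots_to_keep]}
    return [
        sid
        for sid, committed_at in newest_first
        if committed_at < cutoff and sid not in protected
    ]


if __name__ == "__main__":
    now = datetime(2024, 1, 20)
    ages_in_days = {1: 30, 2: 20, 3: 10, 4: 9, 5: 8, 6: 3, 7: 2, 8: 1}
    history = [(sid, now - timedelta(days=age)) for sid, age in ages_in_days.items()]
    print(snapshots_to_expire(history, now))
```

In this example snapshots 4 and 5 are older than seven days but survive because they are among the five most recent.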

Recovery commands (Spark SQL with the Iceberg extensions; syntax differs in other engines):

-- List available snapshots
SELECT * FROM iceberg.bronze.customers.snapshots;

-- Query historical data
SELECT * FROM iceberg.bronze.customers FOR TIMESTAMP AS OF '2024-01-15 10:00:00';

-- Roll back to the table state at a previous timestamp
CALL iceberg.system.rollback_to_timestamp('bronze.customers', TIMESTAMP '2024-01-15 10:00:00');

Polaris stores catalog metadata in PostgreSQL. Include this database in the backup strategy:

# Platform services backup
backups:
  polaris_postgres:
    schedule: "0 */6 * * *"  # Every 6 hours
    retention: 30d

Iceberg target file size affects query performance:

# manifest.yaml
data_architecture:
  iceberg:
    target_file_size_mb: 512  # 512 MB files (default)
    # Smaller for frequently updated tables
    # Larger for append-only tables

Configure automatic compaction to merge small files:

# manifest.yaml
data_architecture:
  iceberg:
    compaction:
      enabled: true
      min_input_files: 5
      target_file_size_mb: 512
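
To illustrate what these compaction settings control, here is a simplified planning sketch. It is an assumption-laden model, not the Iceberg or floe implementation: `plan_compaction` is a hypothetical name, `min_input_files` is interpreted as the minimum number of undersized files needed before any rewrite is planned, and files are packed greedily up to the target size.

```python
def plan_compaction(
    file_sizes_mb: list[float],
    min_input_files: int = 5,
    target_file_size_mb: float = 512,
) -> list[list[float]]:
    """Group undersized files into rewrite batches of roughly the target size."""
    small = sorted(s for s in file_sizes_mb if s < target_file_size_mb)
    if len(small) < min_input_files:
        return []  # not enough small files to justify a rewrite

    groups: list[list[float]] = []
    current: list[float] = []
    current_size = 0.0
    for size in small:
        # Close the current batch once adding this file would exceed the target.
        if current and current_size + size > target_file_size_mb:
            groups.append(current)
            current, current_size = [], 0.0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file on its own merges nothing, so skip such batches.
    return [g for g in groups if len(g) > 1]


if __name__ == "__main__":
    print(plan_compaction([600, 100, 100, 100, 100, 100, 200]))
```
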

# Full storage configuration schema
storage:
  type: minio
  warehouse_path: string        # URI to Iceberg warehouse root
  config:
    # MinIO via S3-compatible protocol
    endpoint: string            # S3-compatible endpoint
    region: string              # AWS region
    access_key_ref: string      # K8s Secret reference
    secret_key_ref: string      # K8s Secret reference
    auth: access_key            # Implemented alpha authentication method
    # Future provider-native plugins add provider-owned schemas for:
    # - AWS S3: IRSA or IAM access keys
    # - GCS: Workload Identity or service account references
    # - Azure: Managed Identity or service principal references