Storage Integration Architecture
This document describes how floe integrates with object storage for Apache Iceberg tables.
Overview
floe enforces Apache Iceberg as the table format. Iceberg tables are stored on object storage, with metadata managed by the catalog (Polaris).

    ┌─────────────────────────────────────────────────────────────┐
    │                       Object Storage                        │
    │  ┌─────────────────────────────────────────────────────────┐│
    │  │  s3://floe-warehouse/iceberg/                           ││
    │  │  ├── bronze.db/                                         ││
    │  │  │   └── customers/                                     ││
    │  │  │       ├── metadata/                                  ││
    │  │  │       │   ├── v1.metadata.json                       ││
    │  │  │       │   └── snap-xxx.avro                          ││
    │  │  │       └── data/                                      ││
    │  │  │           └── part-00000.parquet                     ││
    │  │  ├── silver.db/                                         ││
    │  │  └── gold.db/                                           ││
    │  └─────────────────────────────────────────────────────────┘│
    └─────────────────────────────────────────────────────────────┘
               ▲                              ▲
               │ Metadata                     │ Data
               │                              │
    ┌──────────┴──────────┐       ┌───────────┴───────────┐
    │   Polaris Catalog   │       │   Compute (dbt/dlt)   │
    │   (REST Catalog)    │       │   (via Iceberg SDK)   │
    └─────────────────────┘       └───────────────────────┘

Iceberg Writer Ownership
Iceberg table mutation is owned by floe-iceberg, not by orchestrator plugins.
Runtime orchestrators such as Dagster or Airflow coordinate execution, collect
runtime outputs, and call the floe_iceberg.writer contract with Arrow tables
and Iceberg identifiers. The writer owns the runtime write flow and Iceberg
mutation semantics, including append/overwrite behavior and stale metadata
repair. It coordinates namespace and table load/create operations through the
catalog plugin rather than implementing catalog APIs directly.
Catalog and storage plugins remain injected dependencies. They provide catalog connections, FileIO support, endpoint configuration, and credential references, but they do not depend on Dagster, Airflow, or any orchestrator-specific API. The catalog plugin remains the owner and provider of catalog namespace and table APIs.
CompiledArtifacts remains secret-free. Runtime credential material flows
through resolved deployment bindings and plugin-owned connection logic rather
than through writer results or orchestrator logs.
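
The ownership split described above can be sketched in a few lines. This is an illustrative sketch, not the floe-iceberg API: `CatalogPlugin`, `InMemoryCatalog`, `write_table`, and `WriteResult` are hypothetical names, and plain lists of dicts stand in for Arrow tables so the example runs with the standard library alone.

```python
from dataclasses import dataclass, field
from typing import Protocol


class CatalogPlugin(Protocol):
    """Hypothetical catalog contract: owns namespace and table APIs."""

    def ensure_namespace(self, namespace: str) -> None: ...

    def load_or_create_table(self, namespace: str, table: str) -> list: ...


@dataclass
class InMemoryCatalog:
    """Stand-in catalog plugin; a real one would talk to Polaris."""

    tables: dict = field(default_factory=dict)
    namespaces: set = field(default_factory=set)

    def ensure_namespace(self, namespace: str) -> None:
        self.namespaces.add(namespace)

    def load_or_create_table(self, namespace: str, table: str) -> list:
        return self.tables.setdefault((namespace, table), [])


@dataclass(frozen=True)
class WriteResult:
    """Secret-free result handed back to the orchestrator."""

    identifier: str
    mode: str
    rows_written: int


def write_table(catalog: CatalogPlugin, identifier: str, rows: list, mode: str = "append") -> WriteResult:
    """Writer-owned flow: resolve the table through the catalog, then mutate it."""
    if mode not in ("append", "overwrite"):
        raise ValueError(f"unsupported write mode: {mode}")
    namespace, table = identifier.split(".", 1)
    catalog.ensure_namespace(namespace)
    target = catalog.load_or_create_table(namespace, table)
    if mode == "overwrite":
        target.clear()
    target.extend(rows)
    return WriteResult(identifier=identifier, mode=mode, rows_written=len(rows))


# The orchestrator only coordinates: it passes rows and an identifier to the writer.
catalog = InMemoryCatalog()
first = write_table(catalog, "bronze.customers", [{"id": 1}, {"id": 2}])
second = write_table(catalog, "bronze.customers", [{"id": 3}], mode="overwrite")
```

The point of the shape is that the writer never imports an orchestrator API and the result object carries no credential material, so either Dagster or Airflow could call it unchanged.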
Object Storage Options
The implemented alpha lane uses floe-storage-minio with the S3-compatible
protocol. Provider-native AWS S3, GCS, and Azure storage plugins remain future
extensions unless a deployment has explicitly validated them.
| Storage | Use Case | Authentication |
|---|---|---|
| MinIO | Local evaluation and self-hosted S3-compatible endpoint | Access Key / Secret Key |
| AWS S3 | Future/provider-native object-storage plugin | IRSA (recommended) or IAM User |
| Google Cloud Storage | Future/provider-native object-storage plugin | Workload Identity (recommended) or SA Key |
| Azure Blob / ADLS Gen2 | Future/provider-native object-storage plugin | Managed Identity (recommended) or SP |
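
The difference between the implemented lane and the future plugins can be read as a registry lookup. A minimal sketch, assuming a hypothetical `resolve_storage_plugin` helper; only the plugin name `floe-storage-minio` and the `type` values (`minio`, `aws-s3`, `gcs`, `azure`) come from this document, and the real registry may behave differently.

```python
# Storage types with an implemented plugin in the alpha lane.
IMPLEMENTED_PLUGINS = {"minio": "floe-storage-minio"}

# Storage types described in this document as future provider-native plugins.
FUTURE_TYPES = {"aws-s3", "gcs", "azure"}


def resolve_storage_plugin(storage_type: str) -> str:
    """Return the plugin name for a storage type, or explain why none exists."""
    if storage_type in IMPLEMENTED_PLUGINS:
        return IMPLEMENTED_PLUGINS[storage_type]
    if storage_type in FUTURE_TYPES:
        raise NotImplementedError(
            f"storage type {storage_type!r} is a future provider-native plugin; "
            "the alpha lane uses 'minio' through the S3-compatible protocol"
        )
    raise ValueError(f"unknown storage type: {storage_type!r}")


plugin = resolve_storage_plugin("minio")
```

Separating "not implemented yet" from "unknown" lets a manifest author tell a typo apart from a plugin that is simply not available in the alpha scope.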
MinIO Local Evaluation
MinIO is the local evaluation object store used by the floe chart and demo paths:
- S3-compatible API (works with Iceberg’s S3 file IO)
- Included in the floe-platform Helm chart
- Easy local setup in the Kind/Helm evaluation lane
- Supports versioning for backup/recovery

    storage:
      type: minio
      warehouse_path: s3://floe-warehouse/iceberg
      config:
        endpoint: http://minio.floe-platform:9000
        access_key_ref: minio-credentials
        secret_key_ref: minio-credentials

Future Provider-Native AWS Storage Plugin
A provider-native AWS S3 plugin should use IAM Roles for Service Accounts (IRSA). Until that plugin is implemented and validated, the alpha path remains MinIO through the S3-compatible protocol:

    # conceptual future manifest.yaml; not accepted by the current alpha registry
    storage:
      type: aws-s3
      warehouse_path: s3://my-company-data-lake/floe/iceberg
      config:
        region: us-east-1
        auth: irsa  # Uses pod's service account

IAM Policy Required:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject",
            "s3:ListBucket"
          ],
          "Resource": [
            "arn:aws:s3:::my-company-data-lake/floe/*",
            "arn:aws:s3:::my-company-data-lake"
          ]
        }
      ]
    }

Google Cloud Storage
For production on GCP, use Workload Identity:

    storage:
      type: gcs
      warehouse_path: gs://my-company-data-lake/floe/iceberg
      config:
        project: my-gcp-project
        auth: workload_identity

Azure Blob Storage / ADLS Gen2
For production on Azure, use Managed Identity:

    storage:
      type: azure
      warehouse_path: abfss://data@mystorageaccount.dfs.core.windows.net/floe/iceberg
      config:
        auth: managed_identity

Storage Layout
Iceberg tables follow a consistent directory structure:

    {warehouse_path}/
    ├── {database}.db/
    │   └── {table}/
    │       ├── metadata/
    │       │   ├── v1.metadata.json       # Table metadata (schema, partitions, snapshots)
    │       │   ├── v2.metadata.json       # Updated metadata after writes
    │       │   ├── snap-{id}.avro         # Snapshot manifest lists
    │       │   └── {manifest-id}.avro     # Manifest files
    │       └── data/
    │           ├── {partition}/           # Partition directories (if partitioned)
    │           │   └── {file-id}.parquet  # Data files (Parquet format)
    │           └── {file-id}.parquet      # Data files (if unpartitioned)

Naming Convention Integration
The storage layout follows the data architecture pattern specified in the Manifest:
| Pattern | Database Names | Example Path |
|---|---|---|
| Medallion | bronze, silver, gold | s3://warehouse/iceberg/bronze.db/customers/ |
| Kimball | staging, dimensions, facts | s3://warehouse/iceberg/dimensions.db/dim_customer/ |
| Data Vault | raw_vault, business_vault | s3://warehouse/iceberg/raw_vault.db/hub_customer/ |
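
The mapping from pattern to path is mechanical, so it can be sketched as a small helper. This is a minimal illustration: `table_location`, `checked_table_location`, and the `PATTERN_DATABASES` keys are hypothetical names, while the database names and the `{warehouse_path}/{database}.db/{table}/` layout come from the table and the storage layout in this document.

```python
# Database names per pattern, taken from the table above.
PATTERN_DATABASES = {
    "medallion": ("bronze", "silver", "gold"),
    "kimball": ("staging", "dimensions", "facts"),
    "data_vault": ("raw_vault", "business_vault"),
}


def table_location(warehouse_path: str, database: str, table: str) -> str:
    """Return the table root: {warehouse_path}/{database}.db/{table}/."""
    return f"{warehouse_path.rstrip('/')}/{database}.db/{table}/"


def checked_table_location(pattern: str, warehouse_path: str, database: str, table: str) -> str:
    """Build the path only if the database belongs to the chosen pattern."""
    allowed = PATTERN_DATABASES[pattern]
    if database not in allowed:
        raise ValueError(f"{database!r} is not a {pattern} database; expected one of {allowed}")
    return table_location(warehouse_path, database, table)


print(checked_table_location("medallion", "s3://warehouse/iceberg", "bronze", "customers"))
# s3://warehouse/iceberg/bronze.db/customers/
```

Stripping a trailing slash from the warehouse path keeps the result stable whether or not the configured `warehouse_path` ends with `/`.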
Credential Vending
For enhanced security, Polaris can vend short-lived credentials for table access:

    ┌─────────────────┐     1. Request credentials     ┌─────────────────┐
    │     Job Pod     │ ─────────────────────────────► │     Polaris     │
    │    (dbt/dlt)    │                                │     Catalog     │
    └─────────────────┘                                └────────┬────────┘
             │                                                  │
             │   2. Short-lived STS credentials                 │
             │◄─────────────────────────────────────────────────┘
             │
             │   3. Access storage with temporary credentials
             ▼
    ┌─────────────────┐
    │ Object Storage  │
    │ (S3/GCS/Azure)  │
    └─────────────────┘

Benefits:
- No long-lived credentials in job pods
- Credentials scoped to specific tables
- Automatic expiration (typically 1 hour)
- Audit trail via Polaris
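
Because vended credentials expire, a long-running job has to request fresh ones before the old ones lapse. A minimal sketch of that client-side behaviour, assuming a hypothetical `VendedCredentials` record and `CredentialCache` helper; the `fetch` callable stands in for the request to Polaris, and the 3600-second lifetime mirrors the typical one-hour expiration listed above.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class VendedCredentials:
    """Hypothetical short-lived credential record."""

    token: str
    issued_at: float  # seconds since an arbitrary epoch
    ttl_seconds: int

    def expires_at(self) -> float:
        return self.issued_at + self.ttl_seconds


class CredentialCache:
    """Reuse credentials until they are close to expiry, then fetch new ones."""

    def __init__(self, fetch: Callable[[float], VendedCredentials], refresh_margin: int = 300):
        self._fetch = fetch
        self._margin = refresh_margin
        self._current: Optional[VendedCredentials] = None

    def get(self, now: float) -> VendedCredentials:
        current = self._current
        if current is None or now >= current.expires_at() - self._margin:
            current = self._fetch(now)
            self._current = current
        return current


calls = []


def fake_fetch(now: float) -> VendedCredentials:
    """Stand-in for the catalog request; records when it was called."""
    calls.append(now)
    return VendedCredentials(token=f"token-{len(calls)}", issued_at=now, ttl_seconds=3600)


cache = CredentialCache(fake_fetch)
a = cache.get(now=0.0)     # first request fetches
b = cache.get(now=1800.0)  # still valid, reused
c = cache.get(now=3400.0)  # inside the refresh margin, fetched again
```

Passing `now` explicitly keeps the sketch deterministic; a real client would read the clock and would take the expiry from the catalog response rather than computing it locally.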
Polaris Configuration:

    # In CatalogPlugin configuration
    catalog:
      type: polaris
      config:
        credential_vending: true
        credential_ttl: 3600  # 1 hour

Compute Engine Catalog Integration
Each compute engine connects to the Iceberg catalog differently. All table operations go through the catalog to ensure consistent metadata management.
| Compute | Catalog Connection Method |
|---|---|
| DuckDB | ATTACH statement with Iceberg REST endpoint |
| Spark | SparkCatalog configuration in spark-defaults.conf |
| Snowflake | External volume + catalog integration (managed by Snowflake) |
DuckDB + Polaris Data Flow
When using DuckDB as the compute engine with Polaris as the catalog:

    1. dbt pre-hook executes ATTACH to Polaris
       ↓
    2. DuckDB establishes REST connection to Polaris
       ↓
    3. Polaris vends short-lived credentials for object storage
       ↓
    4. dbt model SQL executes (CREATE TABLE AS SELECT)
       ↓
    5. DuckDB writes Parquet files to object storage
       ↓
    6. DuckDB updates table metadata via Polaris REST API
       ↓
    7. Polaris persists metadata to PostgreSQL

The floe-dbt package generates appropriate pre-hooks based on the compute plugin’s get_catalog_attachment_sql() method:

    # Generated dbt_project.yml
    on-run-start:
      - "LOAD iceberg;"
      - "CREATE SECRET IF NOT EXISTS polaris_secret (...)"
      - "ATTACH IF NOT EXISTS 'warehouse' AS ice (TYPE iceberg, ...)"

Compute-Storage Compatibility Matrix
Not all compute engines support all storage backends. The PolicyEnforcer validates compatibility at compile time.
| Compute | S3/MinIO | GCS | Azure ADLS |
|---|---|---|---|
| DuckDB | ✅ | ❌ | ❌ |
| Spark | ✅ | ✅ | ✅ |
| Snowflake | N/A (uses Snowflake storage) | N/A | N/A |
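
The matrix can be expressed as data and checked before any job runs. A minimal sketch, assuming a hypothetical `check_compatibility` function and lowercase keys; it illustrates the kind of compile-time check described above and is not the PolicyEnforcer implementation.

```python
from typing import Optional

# True = supported, False = unsupported, None = not applicable.
# The "s3" column covers both AWS S3 and MinIO, as in the table above.
COMPATIBILITY: dict = {
    "duckdb": {"s3": True, "gcs": False, "azure": False},
    "spark": {"s3": True, "gcs": True, "azure": True},
    "snowflake": {"s3": None, "gcs": None, "azure": None},
}


def check_compatibility(compute: str, storage: str) -> Optional[bool]:
    """Return the matrix entry, raising ValueError for unsupported pairs."""
    try:
        supported = COMPATIBILITY[compute][storage]
    except KeyError:
        raise ValueError(f"unknown compute/storage pair: {compute}/{storage}") from None
    if supported is False:
        raise ValueError(f"{compute} does not support {storage} storage")
    return supported


duckdb_s3 = check_compatibility("duckdb", "s3")
snowflake_s3 = check_compatibility("snowflake", "s3")
```

Keeping "not applicable" distinct from "unsupported" means a Snowflake configuration is not rejected merely because it does not use an external storage backend.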
Alpha Scope: MinIO through the S3-compatible protocol.
For GCP/Azure evaluation before provider-native plugins are implemented, use MinIO as the storage layer. It provides S3-compatible access for DuckDB while running on cloud-native infrastructure:

    # manifest.yaml (GCP deployment with MinIO)
    storage:
      type: minio
      warehouse_path: s3://floe-warehouse/iceberg
      config:
        endpoint: http://minio.floe-platform:9000
        # MinIO deployed on GKE/AKS provides S3-compatible API

Native GCS, Azure ADLS, and AWS S3 plugin support should be added as future provider-specific plugins with their own identity and credential projection contracts.
Backup Strategies
Object Storage Versioning
Enable versioning on the warehouse bucket for point-in-time recovery:

    # AWS S3
    aws s3api put-bucket-versioning \
      --bucket my-company-data-lake \
      --versioning-configuration Status=Enabled

    # MinIO
    mc version enable minio/floe-warehouse

Iceberg Time Travel
Iceberg maintains table history via snapshots. Configure retention in the Manifest:

    data_architecture:
      iceberg:
        snapshot_retention_days: 7
        min_snapshots_to_keep: 5

Recovery commands:

    -- List available snapshots
    SELECT * FROM iceberg.bronze.customers.snapshots;

    -- Query historical data
    SELECT * FROM iceberg.bronze.customers FOR TIMESTAMP AS OF '2024-01-15 10:00:00';

    -- Rollback to previous snapshot
    ALTER TABLE iceberg.bronze.customers EXECUTE rollback_to_timestamp('2024-01-15 10:00:00');

Metadata Backup
Polaris stores catalog metadata in PostgreSQL. Include that database in the backup strategy:

    # Platform services backup
    backups:
      polaris_postgres:
        schedule: "0 */6 * * *"  # Every 6 hours
        retention: 30d

Performance Tuning
Object Size Optimization
Iceberg target file size affects query performance:

    data_architecture:
      iceberg:
        target_file_size_mb: 512  # 512 MB files (default)
        # Smaller for frequently updated tables
        # Larger for append-only tables

Compaction
Configure automatic compaction to merge small files:

    data_architecture:
      iceberg:
        compaction:
          enabled: true
          min_input_files: 5
          target_file_size_mb: 512

Configuration Schema

    # Full storage configuration schema
    storage:
      type: minio
      warehouse_path: string      # URI to Iceberg warehouse root
      config:
        # MinIO via S3-compatible protocol
        endpoint: string          # S3-compatible endpoint
        region: string            # AWS region
        access_key_ref: string    # K8s Secret reference
        secret_key_ref: string    # K8s Secret reference
        auth: access_key          # Implemented alpha authentication method

    # Future provider-native plugins add provider-owned schemas for:
    # - AWS S3: IRSA or IAM access keys
    # - GCS: Workload Identity or service account references
    # - Azure: Managed Identity or service principal references

References
- Apache Iceberg Documentation
- Iceberg Table Spec
- Polaris Catalog
- MinIO Documentation
- ADR-0018: Opinionation Boundaries - Iceberg enforcement
- Platform Services - MinIO deployment