Storage Integration Architecture

This document describes how floe integrates with object storage for Apache Iceberg tables.

floe enforces Apache Iceberg as the table format. Iceberg tables are stored on object storage, with metadata managed by the catalog (Polaris).

┌───────────────────────────────────────────────────────────┐
│                      Object Storage                       │
│  ┌─────────────────────────────────────────────────────┐  │
│  │ s3://floe-warehouse/iceberg/                        │  │
│  │ ├── bronze.db/                                      │  │
│  │ │   └── customers/                                  │  │
│  │ │       ├── metadata/                               │  │
│  │ │       │   ├── v1.metadata.json                    │  │
│  │ │       │   └── snap-xxx.avro                       │  │
│  │ │       └── data/                                   │  │
│  │ │           └── part-00000.parquet                  │  │
│  │ ├── silver.db/                                      │  │
│  │ └── gold.db/                                        │  │
│  └─────────────────────────────────────────────────────┘  │
└───────────────────────────────────────────────────────────┘
          ▲                               ▲
          │ Metadata                      │ Data
          │                               │
┌─────────┴─────────┐         ┌───────────┴───────────┐
│  Polaris Catalog  │         │   Compute (dbt/dlt)   │
│   (REST Catalog)  │         │   (via Iceberg SDK)   │
└───────────────────┘         └───────────────────────┘
| Storage | Use Case | Authentication |
|---|---|---|
| MinIO | Local evaluation and self-hosted S3-compatible endpoint | Access Key / Secret Key |
| AWS S3 | Validated AWS object-storage backend | IRSA (recommended) or IAM User |
| Google Cloud Storage | Production on GCP | Workload Identity (recommended) or SA Key |
| Azure Blob / ADLS Gen2 | Production on Azure | Managed Identity (recommended) or SP |

MinIO is the object store used for local evaluation by the floe-platform Helm chart and the demo paths:

  • S3-compatible API (works with Iceberg’s S3 file IO)
  • Included in the floe-platform Helm chart
  • Easy local setup via Docker or Kubernetes
  • Supports versioning for backup/recovery
manifest.yaml
storage:
  type: minio
  warehouse_path: s3://floe-warehouse/iceberg
  config:
    endpoint: http://minio.floe-platform:9000
    access_key_ref: minio-credentials
    secret_key_ref: minio-credentials

For production on AWS, use IAM Roles for Service Accounts (IRSA):

manifest.yaml
storage:
  type: s3
  warehouse_path: s3://my-company-data-lake/floe/iceberg
  config:
    region: us-east-1
    auth: irsa # Uses pod's service account

IAM Policy Required:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-company-data-lake/floe/*",
        "arn:aws:s3:::my-company-data-lake"
      ]
    }
  ]
}
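For illustration, the same least-privilege policy can be rendered for any bucket and warehouse prefix; `warehouse_policy` is a hypothetical helper, not part of floe:

```python
import json

def warehouse_policy(bucket: str, prefix: str) -> dict:
    """Render the least-privilege IAM policy above for a given
    bucket and warehouse prefix (illustrative helper only)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                "Resource": [
                    # Object-level actions on the warehouse prefix,
                    # bucket-level ListBucket on the bucket itself.
                    f"arn:aws:s3:::{bucket}/{prefix}/*",
                    f"arn:aws:s3:::{bucket}",
                ],
            }
        ],
    }

policy = warehouse_policy("my-company-data-lake", "floe")
print(json.dumps(policy, indent=2))
```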

For production on GCP, use Workload Identity:

manifest.yaml
storage:
  type: gcs
  warehouse_path: gs://my-company-data-lake/floe/iceberg
  config:
    project: my-gcp-project
    auth: workload_identity

For production on Azure, use Managed Identity:

manifest.yaml
storage:
  type: azure
  warehouse_path: abfss://data@mystorageaccount.dfs.core.windows.net/floe/iceberg
  config:
    auth: managed_identity

Iceberg tables follow a consistent directory structure:

{warehouse_path}/
├── {database}.db/
│   └── {table}/
│       ├── metadata/
│       │   ├── v1.metadata.json   # Table metadata (schema, partitions, snapshots)
│       │   ├── v2.metadata.json   # Updated metadata after writes
│       │   ├── snap-{id}.avro     # Snapshot manifests
│       │   └── {manifest-id}.avro # Manifest files
│       └── data/
│           ├── {partition}/           # Partition directories (if partitioned)
│           │   └── {file-id}.parquet  # Data files (Parquet format)
│           └── {file-id}.parquet      # Data files (if unpartitioned)
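The layout convention above can be sketched as a small path helper; `table_location` is a hypothetical name for illustration, since floe resolves paths through the Iceberg catalog rather than user code:

```python
def table_location(warehouse_path: str, database: str, table: str) -> dict:
    """Derive a table's storage locations under the
    {warehouse}/{database}.db/{table}/ convention shown above."""
    root = f"{warehouse_path.rstrip('/')}/{database}.db/{table}"
    return {
        "root": root,
        "metadata": f"{root}/metadata",  # metadata JSON, snapshots, manifests
        "data": f"{root}/data",          # Parquet data files
    }

loc = table_location("s3://floe-warehouse/iceberg", "bronze", "customers")
print(loc["metadata"])  # s3://floe-warehouse/iceberg/bronze.db/customers/metadata
```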

The storage layout follows the data architecture pattern specified in the Manifest:

| Pattern | Database Names | Example Path |
|---|---|---|
| Medallion | bronze, silver, gold | s3://warehouse/iceberg/bronze.db/customers/ |
| Kimball | staging, dimensions, facts | s3://warehouse/iceberg/dimensions.db/dim_customer/ |
| Data Vault | raw_vault, business_vault | s3://warehouse/iceberg/raw_vault.db/hub_customer/ |

For enhanced security, Polaris can vend short-lived credentials for table access:

┌─────────────────┐   1. Request credentials   ┌─────────────────┐
│     Job Pod     │ ─────────────────────────► │     Polaris     │
│    (dbt/dlt)    │                            │     Catalog     │
└────────┬────────┘                            └────────┬────────┘
         │                                              │
         │        2. Short-lived STS credentials        │
         │◄─────────────────────────────────────────────┘
         │
         │ 3. Access storage with temporary credentials
         ▼
┌─────────────────┐
│ Object Storage  │
│ (S3/GCS/Azure)  │
└─────────────────┘

Benefits:

  • No long-lived credentials in job pods
  • Credentials scoped to specific tables
  • Automatic expiration (typically 1 hour)
  • Audit trail via Polaris

Polaris Configuration:

# In CatalogPlugin configuration
catalog:
  type: polaris
  config:
    credential_vending: true
    credential_ttl: 3600 # 1 hour

Each compute engine connects to the Iceberg catalog differently. All table operations go through the catalog to ensure consistent metadata management.

| Compute | Catalog Connection Method |
|---|---|
| DuckDB | ATTACH statement with Iceberg REST endpoint |
| Spark | SparkCatalog configuration in spark-defaults.conf |
| Snowflake | External volume + catalog integration (managed by Snowflake) |

When using DuckDB as the compute engine with Polaris as the catalog, a write proceeds as follows:

1. dbt pre-hook executes ATTACH to Polaris
2. DuckDB establishes REST connection to Polaris
3. Polaris vends short-lived credentials for object storage
4. dbt model SQL executes (CREATE TABLE AS SELECT)
5. DuckDB writes Parquet files to object storage
6. DuckDB updates table metadata via Polaris REST API
7. Polaris persists metadata to PostgreSQL

The floe-dbt package generates appropriate pre-hooks based on the compute plugin’s get_catalog_attachment_sql() method:

# Generated dbt_project.yml
on-run-start:
  - "LOAD iceberg;"
  - "CREATE SECRET IF NOT EXISTS polaris_secret (...)"
  - "ATTACH IF NOT EXISTS 'warehouse' AS ice (TYPE iceberg, ...)"
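The hook generation might look like the sketch below. The secret and attach parameters are elided ("...") exactly as in the generated output; the real floe-dbt implementation fills them in and may differ in shape:

```python
def get_catalog_attachment_sql(warehouse: str, alias: str = "ice") -> list[str]:
    """Sketch of a DuckDB compute plugin's pre-hook generation:
    load the extension, register the catalog secret, attach the warehouse."""
    return [
        "LOAD iceberg;",
        "CREATE SECRET IF NOT EXISTS polaris_secret (...)",
        f"ATTACH IF NOT EXISTS '{warehouse}' AS {alias} (TYPE iceberg, ...)",
    ]

hooks = get_catalog_attachment_sql("warehouse")
print(hooks[0])  # LOAD iceberg;
```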

Not all compute engines support all storage backends. The PolicyEnforcer validates compatibility at compile time.

| Compute | S3/MinIO | GCS | Azure ADLS |
|---|---|---|---|
| DuckDB | Supported | Planned | Planned |
| Spark | Supported | Supported | Supported |
| Snowflake | N/A (uses Snowflake storage) | N/A | N/A |
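An illustrative compile-time check in the spirit of the PolicyEnforcer is sketched below; the real validation table and logic are internal to floe, and Spark's full backend coverage here is an assumption based on its Iceberg integration:

```python
SUPPORTED_STORAGE = {
    "duckdb": {"s3", "minio"},                 # GCS/Azure pending upstream support
    "spark": {"s3", "minio", "gcs", "azure"},  # assumed full coverage
    "snowflake": set(),                        # uses Snowflake-managed storage
}

def validate_storage(compute: str, storage_type: str) -> None:
    """Fail fast at compile time if the compute/storage pairing is unsupported."""
    if storage_type not in SUPPORTED_STORAGE.get(compute, set()):
        raise ValueError(
            f"compute '{compute}' does not support storage type '{storage_type}'"
        )

validate_storage("duckdb", "minio")  # passes silently
try:
    validate_storage("duckdb", "gcs")
except ValueError as e:
    print(e)
```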

MVP Scope: S3-compatible storage only (AWS S3, MinIO).

For GCP/Azure deployments, use MinIO as the storage layer, which provides S3-compatible access for DuckDB while running on cloud-native infrastructure:

# manifest.yaml (GCP deployment with MinIO)
storage:
  type: minio
  warehouse_path: s3://floe-warehouse/iceberg
  config:
    endpoint: http://minio.floe-platform:9000
    # MinIO deployed on GKE/AKS provides S3-compatible API

Native GCS and Azure ADLS support for DuckDB is pending upstream DuckDB Iceberg extension updates and will be added in a future release.

Enable versioning on the warehouse bucket for point-in-time recovery:

# AWS S3
aws s3api put-bucket-versioning \
  --bucket my-company-data-lake \
  --versioning-configuration Status=Enabled

# MinIO
mc version enable minio/floe-warehouse

Iceberg maintains table history via snapshots. Configure retention in the Manifest:

manifest.yaml
data_architecture:
  iceberg:
    snapshot_retention_days: 7
    min_snapshots_to_keep: 5
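The interaction of the two settings can be sketched as follows: snapshots older than the retention window are expirable, but the keep floor always wins. This is an illustration of the policy, not floe's implementation:

```python
def snapshots_to_expire(
    snapshot_ages_days: list[float],
    retention_days: int = 7,
    min_snapshots_to_keep: int = 5,
) -> int:
    """Count how many snapshots would be expired under the Manifest
    settings above. `snapshot_ages_days` is sorted newest-first."""
    expirable = sum(1 for age in snapshot_ages_days if age > retention_days)
    # Never drop below the configured floor, even if snapshots are old.
    max_removable = max(0, len(snapshot_ages_days) - min_snapshots_to_keep)
    return min(expirable, max_removable)

# 8 snapshots, ages in days, newest first:
ages = [0.5, 1, 2, 6, 8, 9, 20, 30]
print(snapshots_to_expire(ages))  # 3: four exceed 7 days, but the floor of 5 caps removal
```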

Recovery commands:

-- List available snapshots
SELECT * FROM iceberg.bronze.customers.snapshots;
-- Query historical data
SELECT * FROM iceberg.bronze.customers FOR TIMESTAMP AS OF '2024-01-15 10:00:00';
-- Rollback to previous snapshot
ALTER TABLE iceberg.bronze.customers EXECUTE rollback_to_timestamp('2024-01-15 10:00:00');

Polaris stores catalog metadata in PostgreSQL. Include in backup strategy:

# Platform services backup
backups:
  polaris_postgres:
    schedule: "0 */6 * * *" # Every 6 hours
    retention: 30d

Iceberg target file size affects query performance:

manifest.yaml
data_architecture:
  iceberg:
    target_file_size_mb: 512 # 512 MB files (default)
    # Smaller for frequently updated tables
    # Larger for append-only tables

Configure automatic compaction to merge small files:

manifest.yaml
data_architecture:
  iceberg:
    compaction:
      enabled: true
      min_input_files: 5
      target_file_size_mb: 512
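The trigger condition can be sketched as: gather files smaller than the target size, and only compact once at least `min_input_files` have accumulated. This is illustrative only; the actual rewrite planning is done by Iceberg's compaction machinery:

```python
def pick_compaction_group(
    file_sizes_mb: list[int],
    min_input_files: int = 5,
    target_file_size_mb: int = 512,
) -> list[int]:
    """Return the small files to merge, or [] if the threshold isn't met."""
    small = [s for s in file_sizes_mb if s < target_file_size_mb]
    return small if len(small) >= min_input_files else []

print(pick_compaction_group([16, 32, 8, 700, 24, 40, 512]))  # [16, 32, 8, 24, 40]
print(pick_compaction_group([16, 32, 700]))                  # [] — too few small files
```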

# Full storage configuration schema
storage:
  type: minio | s3 | gcs | azure
  warehouse_path: string       # URI to Iceberg warehouse root
  config:
    # MinIO / S3
    endpoint: string           # S3-compatible endpoint (MinIO only)
    region: string             # AWS region
    access_key_ref: string     # K8s Secret reference
    secret_key_ref: string     # K8s Secret reference
    auth: irsa | access_key    # Authentication method
    # GCS
    project: string            # GCP project ID
    auth: workload_identity | service_account_key
    # Azure
    auth: managed_identity | service_principal