ADR-R0001: Use Cube for Semantic/Consumption Layer

Status

Accepted

Context

floe provides a complete data pipeline stack:

Transformation: dbt for SQL models
Orchestration: Dagster for scheduling and dependencies
Storage: Apache Iceberg for ACID tables
Catalog: Apache Polaris for metadata management

However, the stack lacks a consumption layer—an API through which downstream applications, BI tools, and AI agents can query the transformed data. Without this:

Users must connect directly to data warehouses, bypassing governance
No caching layer means expensive repeated queries
No unified API for diverse consumers (REST, GraphQL, SQL)
No semantic model to translate business concepts to physical tables
AI agents lack a structured interface for data queries

Key requirements for a consumption layer:

API-first architecture (REST, GraphQL, SQL wire protocol)
Semantic modeling on top of dbt models
Caching and pre-aggregations for performance
Namespace-based row-level security
Integration with OpenTelemetry and OpenLineage
Support for AI/agent interfaces (MCP)

Technologies considered:

Cube - Open-source semantic layer, API-first, dbt integration, multi-tenant
dbt Semantic Layer (MetricFlow) - dbt-native, but tightly coupled to dbt Cloud
LookML/Looker - Powerful but proprietary, heavy vendor lock-in
Custom API layer - Full control but significant development effort
Trino/Presto - Query engine, not semantic layer

Decision

Use Cube as the semantic/consumption layer for floe.

Cube will be implemented as the plugin plugins/floe-semantic-cube/ that:

Syncs dbt models to Cube cubes via the cube_dbt package
Provides REST, GraphQL, and SQL APIs for data consumption
Implements caching via Cube Store pre-aggregations
Enforces row-level security using namespace context
Exposes MCP server interface for AI agent queries
Emits OpenLineage events for query lineage

Consequences

Positive

Complete stack: floe becomes end-to-end (ingest → transform → store → serve)
Native dbt integration: cube_dbt package loads dbt manifest directly
Universal APIs: REST, GraphQL, and Postgres-compatible SQL serve any consumer
Performance: Cube Store pre-aggregations provide sub-second query response
Data isolation: Built-in row-level security with queryRewrite and security context
AI-ready: MCP server and AI API enable agent-based analytics
Open source: Cube Core is Apache 2.0 licensed, aligns with floe licensing
BI connectivity: 40+ native integrations (Tableau, Metabase, Superset, etc.)

Negative

Additional complexity: New component to deploy, configure, monitor
Resource requirements: Cube Store requires persistent storage for pre-aggregations
Learning curve: Team must learn Cube data modeling concepts
Versioning: Must coordinate Cube, dbt, and Dagster versions

Neutral

Cube Cloud available: Managed option exists if self-hosting becomes burdensome
Community Helm charts: No official Helm chart, but community options exist
Instrumentation needed: OpenTelemetry/OpenLineage integration requires custom work

Architecture Integration

Package Structure

floe/
├── floe-core/                       # Schema, validation, interfaces (ABCs)
├── floe-cli/                        # Developer CLI
├── floe-dbt/                        # Transformation (enforced)
├── floe-iceberg/                    # Storage (enforced)
│
└── plugins/                         # Pluggable components
    ├── floe-orchestrator-dagster/   # Orchestration
    ├── floe-catalog-polaris/        # Catalog
    └── floe-semantic-cube/          # Consumption (semantic layer)
        ├── src/
        │   ├── plugin.py            # Implements SemanticLayerPlugin ABC
        │   ├── model_sync.py        # Sync dbt models → Cube cubes
        │   ├── security.py          # Row-level security context
        │   └── lineage.py           # OpenLineage emission
        ├── chart/                   # Helm chart for Cube deployment
        └── pyproject.toml           # Entry point registration

Note: floe-cube has been renamed to plugins/floe-semantic-cube/ to follow the plugin pattern documented in Plugin System.

Data Flow

floe.yaml + manifest.yaml
    │
    ▼
floe-core (compile, enforce)
    │
    ▼
OrchestratorPlugin (e.g., Dagster) ──► floe-dbt (transform)
    │                                       │
    │                                       ▼
    │                                  dbt models
    │                                       │
    │                                       ▼
    │                             floe-iceberg (store)
    │                                       │
    │                                       ▼
    │                             CatalogPlugin (e.g., Polaris)
    │                                       │
    ▼                                       ▼
SemanticLayerPlugin (e.g., Cube) ◄── dbt manifest.json
    │
    ▼
REST / GraphQL / SQL APIs
    │
    ▼
BI Tools, AI Agents, Applications

Schema Extension

# floe.yaml - consumption section
consumption:
  enabled: true
  cube:
    port: 4000
    api_secret_ref: "cube-api-secret"
    pre_aggregations:
      refresh_schedule: "*/30 * * * *"
    security:
      row_level: true
      namespace_column: "namespace"

Deployment Components

Component	Purpose	Scaling
Cube API	Handle incoming queries	Horizontal (2+ replicas)
Cube Refresh Worker	Build pre-aggregations	Single replica
Cube Store Router	Route queries	Single replica
Cube Store Workers	Execute cached queries	Horizontal (2+ replicas)

References

Cube Documentation
Cube + dbt Integration
cube_dbt Package - Python package for dbt manifest sync
cube_dbt Documentation
Cube Multitenancy
Cube Row-Level Security
Cube Deployment
Community Helm Charts (no official chart):
- narioinc/cube-helm - Scalable cluster deployment
- gadsme/cube - Artifact Hub
- OpstimizeIcarus/cubejs-helm-charts
ADR-0009: dbt Owns SQL