Catalog Reconciliation Guide
This guide covers procedures for detecting and managing orphaned tables in Iceberg catalogs.
Overview
Orphaned tables occur when the catalog and storage state diverge:

| Scenario | Result |
|---|---|
| Table dropped from catalog but files remain in storage | Storage orphan |
| Compile/deploy fails after table creation | Catalog orphan |
| Metadata corruption or incomplete transactions | Drift |
| Manual interventions outside floe | Unknown state |
Impact
- Storage costs: Orphaned files consume storage indefinitely
- Governance risk: Untracked data may contain PII
- Query confusion: Stale tables appear in discovery
- Quota pressure: Namespace quotas count orphaned tables
Orphan Detection
Types of Orphans

| Type | Definition | Detection Method |
|---|---|---|
| Storage Orphan | Files in storage without catalog entry | Compare storage paths to catalog |
| Catalog Orphan | Catalog entry pointing to missing files | Validate storage location exists |
| Metadata Orphan | Table with corrupted/incomplete metadata | Metadata validation scan |
| Namespace Orphan | Empty namespace with no tables | List namespaces without tables |
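The first two detection methods in the table reduce to a set comparison between the storage locations the catalog knows about and the table paths that actually exist in storage. The sketch below illustrates that idea only; `OrphanFinding` and `find_orphans` are hypothetical names, and the inputs are assumed to be already-listed table root paths rather than anything floe provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrphanFinding:
    """Hypothetical record for one detected orphan (illustration only)."""

    kind: str      # "storage" or "catalog"
    location: str  # storage path involved


def find_orphans(
    catalog_locations: dict[str, str],
    storage_paths: set[str],
) -> list[OrphanFinding]:
    """Compare catalog entries against the paths present in storage.

    catalog_locations maps a table identifier to its storage location.
    storage_paths is the set of table root paths found in storage.
    """
    registered = set(catalog_locations.values())
    findings = []
    # Storage orphan: the path exists, but no catalog entry points at it.
    for path in sorted(storage_paths - registered):
        findings.append(OrphanFinding("storage", path))
    # Catalog orphan: the catalog points at a path that is not in storage.
    for path in sorted(registered - storage_paths):
        findings.append(OrphanFinding("catalog", path))
    return findings


catalog = {
    "sales.gold.customers": "s3://bucket/sales/gold/customers/",
    "sales.gold.orders": "s3://bucket/sales/gold/orders/",
}
storage = {
    "s3://bucket/sales/gold/customers/",
    "s3://bucket/sales/gold/old_table/",
}

for finding in find_orphans(catalog, storage):
    print(finding.kind, finding.location)
# storage s3://bucket/sales/gold/old_table/
# catalog s3://bucket/sales/gold/orders/
```

Metadata and namespace orphans need more than a path comparison (a metadata scan and a namespace listing, respectively), so they are left out of this sketch.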
Manual Detection (CLI)

    # List all tables in a namespace
    floe catalog tables --namespace sales.gold

    # Validate table accessibility (checks storage + metadata)
    floe catalog validate --namespace sales.gold

    # Output:
    TABLE                  STATUS  ISSUE
    sales.gold.customers   valid   -
    sales.gold.orders      orphan  Storage path not found: s3://bucket/orders/
    sales.gold.legacy_dim  orphan  Not in registered products

    # List orphaned tables
    floe catalog orphans --namespace sales.gold

    # Output:
    NAMESPACE   TABLE       TYPE             LAST_ACCESS
    sales.gold  orders      storage_missing  2024-06-15
    sales.gold  legacy_dim  unregistered     2024-01-03

Automated Detection (Scheduled Job)
Configure a reconciliation job in manifest.yaml:

    reconciliation:
      enabled: true
      schedule: "0 2 * * 0"  # Weekly at 2 AM Sunday
      namespaces:
        - "*"  # All namespaces, or list specific ones
      actions:
        - detect  # Required: always detect first
        # - remediate  # Optional: auto-cleanup (use with caution)
      notifications:
        slack: "#data-platform-alerts"
        email: platform-team@acme.com
      thresholds:
        orphan_count_warn: 10
        orphan_count_fail: 50
        orphan_size_gb_warn: 100

Reconciliation Procedures
Procedure 1: Storage Orphan Cleanup
Scenario: Files exist in object storage without a corresponding catalog entry.
Steps:

- Identify orphaned paths

      # List storage paths not in catalog
      floe catalog orphans --type storage --namespace sales.gold

      # Output:
      PATH                               SIZE_GB  LAST_MODIFIED
      s3://bucket/sales/gold/old_table/  12.5     2024-01-15
      s3://bucket/sales/gold/test_data/  0.3      2024-05-20

- Review and backup (optional)

      # Backup to quarantine location
      aws s3 cp --recursive s3://bucket/sales/gold/old_table/ \
        s3://bucket-quarantine/sales/gold/old_table/

- Delete orphaned files

      # Dry run first
      floe catalog cleanup --type storage --namespace sales.gold --dry-run

      # Execute cleanup
      floe catalog cleanup --type storage --namespace sales.gold --confirm

- Verify

      floe catalog orphans --type storage --namespace sales.gold
      # Expected: No orphans found
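The dry-run-then-confirm pattern used in the steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not floe's implementation: `cleanup_storage_orphans` is a hypothetical helper, and `delete_path` stands in for whatever caller-supplied function actually removes a storage prefix.

```python
from typing import Callable


def cleanup_storage_orphans(
    orphan_paths: list[str],
    delete_path: Callable[[str], None],
    dry_run: bool = True,
) -> list[str]:
    """Return the paths that were (or, in a dry run, would be) deleted.

    delete_path is a caller-supplied function (hypothetical) that removes
    one storage prefix. It is never called when dry_run is True.
    """
    affected = []
    for path in orphan_paths:
        if not dry_run:
            delete_path(path)
        affected.append(path)
    return affected


deleted: list[str] = []  # stand-in for real deletions, so the sketch is safe to run
paths = [
    "s3://bucket/sales/gold/old_table/",
    "s3://bucket/sales/gold/test_data/",
]

# Dry run (the default): reports both paths and deletes nothing.
planned = cleanup_storage_orphans(paths, deleted.append)

# Confirmed run: the same paths are now handed to the delete function.
removed = cleanup_storage_orphans(paths, deleted.append, dry_run=False)
```

Defaulting `dry_run` to `True` means a destructive run has to be requested explicitly, which mirrors the `--dry-run` before `--confirm` ordering in the procedure.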
Procedure 2: Catalog Orphan Cleanup
Scenario: A catalog entry points to a nonexistent storage location.
Steps:

- Identify orphaned entries

      floe catalog orphans --type catalog --namespace sales.gold

      # Output:
      TABLE              LOCATION                        ERROR
      sales.gold.orders  s3://bucket/sales/gold/orders/  StorageNotFound

- Attempt recovery (if data exists elsewhere)

      # Check if data was moved
      aws s3 ls s3://bucket/sales/gold/ | grep orders

      # If found at different path, update catalog
      floe catalog repair --table sales.gold.orders \
        --new-location s3://bucket/sales/gold/orders_v2/

- Drop orphaned entry (if data is truly lost)

      # Dry run
      floe catalog drop --table sales.gold.orders --dry-run

      # Execute
      floe catalog drop --table sales.gold.orders --confirm
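The repair-or-drop decision in this procedure can be written down as a small function. The sketch below is a hypothetical illustration: `plan_catalog_orphan_action` is not a floe API, and it assumes the caller has already listed which storage locations exist and which alternative locations are worth checking.

```python
from typing import Optional


def plan_catalog_orphan_action(
    expected_location: str,
    existing_locations: set[str],
    candidate_locations: list[str],
) -> tuple[str, Optional[str]]:
    """Decide what to do with a catalog entry whose storage may be missing.

    Returns ("none", location) if the expected location exists,
    ("repair", new_location) if the data is found at a candidate location,
    and ("drop", None) if the data cannot be found anywhere.
    """
    if expected_location in existing_locations:
        return ("none", expected_location)
    # Candidates are checked in order, so list the most likely location first.
    for candidate in candidate_locations:
        if candidate in existing_locations:
            return ("repair", candidate)
    return ("drop", None)


existing = {"s3://bucket/sales/gold/orders_v2/"}
action = plan_catalog_orphan_action(
    "s3://bucket/sales/gold/orders/",
    existing,
    ["s3://bucket/sales/gold/orders_v2/"],
)
print(action)
# ('repair', 's3://bucket/sales/gold/orders_v2/')
```

A "drop" result only means the data was not found among the locations supplied, so it is worth widening the candidate list before treating data as lost.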
Procedure 3: Full Namespace Reconciliation
Scenario: Complete reconciliation of a namespace after an incident.

    # 1. Generate reconciliation report
    floe catalog reconcile --namespace sales --report-only

    # Output:
    RECONCILIATION REPORT: sales
    ============================
    Namespaces scanned: 3 (sales.bronze, sales.silver, sales.gold)
    Tables validated: 47
    Valid tables: 42
    Orphaned tables: 5
      - Storage orphans: 2
      - Catalog orphans: 2
      - Metadata orphans: 1

    Recommended actions:
      floe catalog cleanup --namespace sales --action delete-storage --count 2
      floe catalog cleanup --namespace sales --action drop-catalog --count 2
      floe catalog repair --namespace sales --action fix-metadata --count 1

    Total storage to reclaim: 45.2 GB

    # 2. Preview reconciliation (dry run)
    floe catalog reconcile --namespace sales --dry-run

    # 3. Execute reconciliation
    floe catalog reconcile --namespace sales --confirm

Prevention Strategies
1. Transactional Table Creation
Always create tables within floe’s compile/deploy flow:

    # floe.yaml - Tables created through proper flow
    output_ports:
      - name: customers
        table: sales.gold.customers
        # Table creation is atomic with namespace registration

2. Soft Deletes Before Hard Deletes
Mark tables for deletion before removing:

    deprecation:
      tables:
        - name: sales.gold.old_customers
          sunset_date: 2025-03-01
          replacement: sales.gold.customers_v2

3. Compile-Time Validation
Enable orphan detection during compile:

    compile:
      validations:
        check_orphaned_tables: true  # Fail if orphans found
        orphan_threshold: 0          # Zero tolerance

4. Ownership Tracking
All tables must belong to a registered data product:

    # Check table ownership
    floe catalog ownership --table sales.gold.customers

    # Output:
    TABLE                 PRODUCT       DOMAIN  OWNER
    sales.gold.customers  customer-360  sales   sales-analytics@acme.com

    # Tables without ownership are flagged as potential orphans

CatalogPlugin Interface
The CatalogPlugin interface provides methods for orphan detection and reconciliation:

    from abc import ABC, abstractmethod
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Literal


    @dataclass
    class OrphanedTable:
        """An orphaned table detected during reconciliation."""

        namespace: str
        table_name: str
        orphan_type: Literal["storage", "catalog", "metadata", "unregistered"]
        location: str | None
        size_bytes: int | None
        last_modified: datetime | None
        error_message: str | None


    @dataclass
    class ReconciliationResult:
        """Result of a catalog reconciliation operation."""

        namespace: str
        tables_scanned: int
        orphans_found: list[OrphanedTable]
        orphans_remediated: list[str]
        storage_reclaimed_bytes: int
        errors: list[str]
        dry_run: bool


    class CatalogPlugin(ABC):
        # ... existing methods ...

        @abstractmethod
        def list_orphaned_tables(
            self,
            namespace: str,
            orphan_types: list[str] | None = None,
        ) -> list[OrphanedTable]:
            """Find tables that are orphaned in the given namespace.

            Orphan types:
                - storage: Files in storage without catalog entry
                - catalog: Catalog entry with missing storage
                - metadata: Table with corrupted metadata
                - unregistered: Table not owned by any data product

            Args:
                namespace: Namespace to scan
                orphan_types: Types to check (default: all)

            Returns:
                List of orphaned tables
            """
            pass

        @abstractmethod
        def reconcile_catalog(
            self,
            namespace: str,
            dry_run: bool = True,
            actions: list[str] | None = None,
        ) -> ReconciliationResult:
            """Reconcile catalog with storage state.

            Actions:
                - delete-storage: Remove orphaned storage files
                - drop-catalog: Drop orphaned catalog entries
                - fix-metadata: Repair corrupted metadata

            Args:
                namespace: Namespace to reconcile
                dry_run: If True, report only without making changes
                actions: Actions to take (default: report only)

            Returns:
                ReconciliationResult with summary and details
            """
            pass

        @abstractmethod
        def validate_table_health(
            self,
            table_identifier: str,
        ) -> tuple[bool, str]:
            """Validate that a table is healthy and accessible.

            Checks:
                - Catalog entry exists
                - Storage location accessible
                - Metadata is valid and readable
                - At least one snapshot exists

            Args:
                table_identifier: Full table identifier

            Returns:
                Tuple of (is_healthy, message)
            """
            pass

Monitoring
Prometheus Metrics

| Metric | Type | Labels | Description |
|---|---|---|---|
| `floe_catalog_orphaned_tables` | Gauge | namespace, type | Count of orphaned tables |
| `floe_catalog_orphaned_bytes` | Gauge | namespace | Storage bytes in orphans |
| `floe_catalog_reconciliation_duration_seconds` | Histogram | namespace | Reconciliation job duration |
| `floe_catalog_reconciliation_errors_total` | Counter | namespace, error_type | Reconciliation errors |
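As a rough illustration of how the two gauges relate to detection results, the sketch below aggregates a list of orphans into values keyed by the label sets in the table. `Orphan` and `orphan_gauge_values` are hypothetical names, the aggregation is an assumption about how such gauges could be derived, and no Prometheus client library is involved.

```python
from collections import Counter
from typing import NamedTuple, Optional


class Orphan(NamedTuple):
    """Hypothetical, trimmed-down stand-in for a detected orphan."""

    namespace: str
    orphan_type: str
    size_bytes: Optional[int]


def orphan_gauge_values(
    orphans: list[Orphan],
) -> tuple[dict[tuple[str, str], int], dict[str, int]]:
    """Aggregate orphans into per-label gauge values.

    Returns (table_counts, byte_totals): table_counts is keyed by
    (namespace, type) and byte_totals by namespace, matching the labels of
    floe_catalog_orphaned_tables and floe_catalog_orphaned_bytes.
    """
    table_counts: Counter = Counter()
    byte_totals: Counter = Counter()
    for orphan in orphans:
        table_counts[(orphan.namespace, orphan.orphan_type)] += 1
        # An unknown size (None) contributes nothing to the byte total.
        byte_totals[orphan.namespace] += orphan.size_bytes or 0
    return dict(table_counts), dict(byte_totals)


counts, total_bytes = orphan_gauge_values(
    [
        Orphan("sales.gold", "storage", 100),
        Orphan("sales.gold", "storage", 50),
        Orphan("sales.gold", "catalog", None),
    ]
)
print(counts)
# {('sales.gold', 'storage'): 2, ('sales.gold', 'catalog'): 1}
print(total_bytes)
# {'sales.gold': 150}
```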
Alert Rules

    groups:
      - name: catalog-health
        rules:
          - alert: CatalogOrphansDetected
            expr: floe_catalog_orphaned_tables > 10
            for: 24h
            labels:
              severity: warning
            annotations:
              summary: "{{ $value }} orphaned tables in {{ $labels.namespace }}"

          - alert: CatalogOrphanStorageHigh
            expr: floe_catalog_orphaned_bytes > 100 * 1024 * 1024 * 1024  # 100 GiB
            for: 1h
            labels:
              severity: warning
            annotations:
              summary: "{{ $value | humanize }} orphaned storage"

          - alert: CatalogReconciliationFailed
            expr: increase(floe_catalog_reconciliation_errors_total[1h]) > 0
            labels:
              severity: critical
            annotations:
              summary: "Catalog reconciliation failed in {{ $labels.namespace }}"

OpenLineage Events
Reconciliation jobs emit lineage events:

    {
      "eventType": "COMPLETE",
      "job": {
        "name": "catalog.reconciliation.sales"
      },
      "run": {
        "facets": {
          "catalogReconciliation": {
            "namespace": "sales",
            "tablesScanned": 47,
            "orphansFound": 5,
            "orphansRemediated": 3,
            "storageReclaimedBytes": 48576000000
          }
        }
      }
    }

Governance Policies
Data Retention
Orphaned data is subject to the following governance settings:

    governance:
      orphan_handling:
        quarantine_days: 30           # Days to keep in quarantine before deletion
        require_approval: true        # Require manual approval for deletion
        audit_deletions: true         # Log all deletions to audit trail
        pii_scan_before_delete: true  # Scan for PII before deletion

Access Control
Reconciliation operations require elevated permissions:

| Operation | Required Role |
|---|---|
| `list_orphaned_tables` | data_engineer |
| `reconcile_catalog` (dry_run) | data_engineer |
| `reconcile_catalog` (execute) | platform_admin |
| delete storage | platform_admin |
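The role table can be expressed as a small lookup, which makes the dry-run versus execute distinction explicit. This is a sketch under stated assumptions: `required_role` and `is_allowed` are hypothetical helpers, the `delete_storage` operation key is an illustrative name for the "delete storage" row, and roles are treated as flat (the guide does not say whether platform_admin implies data_engineer).

```python
def required_role(operation: str, dry_run: bool = True) -> str:
    """Return the role the access-control table requires for an operation."""
    if operation == "list_orphaned_tables":
        return "data_engineer"
    if operation == "reconcile_catalog":
        # A dry run only reports; executing changes needs the admin role.
        return "data_engineer" if dry_run else "platform_admin"
    if operation == "delete_storage":  # hypothetical key for "delete storage"
        return "platform_admin"
    raise ValueError(f"unknown operation: {operation}")


def is_allowed(user_roles: set[str], operation: str, dry_run: bool = True) -> bool:
    """Check whether a user holding user_roles may perform the operation."""
    return required_role(operation, dry_run) in user_roles


engineer = {"data_engineer"}
print(is_allowed(engineer, "reconcile_catalog"))                 # True
print(is_allowed(engineer, "reconcile_catalog", dry_run=False))  # False
```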