Reduce Recursive CTE Performance Issues

Problem Statement

The current identity resolution implementation uses a recursive CTE that has O(n²) to O(n³) complexity, causing:

  • Memory exhaustion with large datasets (42M+ identifiers)
  • Exponential intermediate result growth during recursive expansion
  • Poor scalability as data volume increases

Current Implementation Issues

The recursive CTE in nexus_resolved_person_identifiers.sql (pattern sketched after this list):

  1. Starts with every identifier as a potential root (42M starting points)
  2. Recursively joins to find connected identifiers through edges
  3. Creates a cartesian-product-like explosion in intermediate results
  4. Limited to 2 recursion levels but still causes memory issues
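
The model itself is not reproduced here; a minimal sketch of the pattern, assuming identifiers(identifier_id) and edges(identifier_id, connected_id) tables (the real column names may differ):

```sql
-- Hypothetical reconstruction of the problematic pattern, not the
-- actual model source
with recursive connected as (

    -- anchor: every identifier seeds its own traversal (42M rows)
    select identifier_id, identifier_id as root_id, 0 as depth
    from identifiers

    union all

    -- recursive step: each root re-walks the same edges, so the
    -- intermediate result multiplies instead of growing linearly
    select e.connected_id as identifier_id, c.root_id, c.depth + 1 as depth
    from connected c
    join edges e
      on e.identifier_id = c.identifier_id
    where c.depth < 2   -- the current 2-level cap

)
select identifier_id, min(root_id) as resolved_id
from connected
group by identifier_id
```

Because every identifier starts its own traversal, each edge is visited once per reachable root, which is where the cartesian-product-like growth comes from.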

Proposed Solutions

1. Union-Find Algorithm (Priority: High)

  • Complexity: O(n log n) with path compression
  • Approach: Iterative graph traversal instead of recursion (SQL analogue sketched below)
  • Benefits: Near-linear scaling, deterministic memory usage
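
Classic union-find with path compression is pointer-chasing, procedural code and does not translate directly to set-based SQL. A common stand-in with the same outcome is iterative min-label propagation: seed every identifier with its own label, then let each node adopt the smallest label in its neighborhood until nothing changes. A sketch, assuming the edges table from the sketch above plus a labels working table:

```sql
-- Seed once:
--   create table labels as
--   select identifier_id, identifier_id as label from identifiers;

-- One propagation pass; rerun (stored procedure or orchestration loop)
-- until a pass changes zero labels. Converges in O(graph diameter)
-- passes, each one a plain hash join rather than a recursive expansion.
create or replace table labels_next as
select
    l.identifier_id,
    least(l.label, coalesce(min(nl.label), l.label)) as label
from labels l
left join edges e
    on e.identifier_id = l.identifier_id
left join labels nl
    on nl.identifier_id = e.connected_id
group by l.identifier_id, l.label;
```

Each pass touches every edge exactly once, so memory stays proportional to the node count rather than to the number of root-to-node paths.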

2. Hash-Based Clustering (Priority: High)

  • Complexity: O(n) - truly linear
  • Approach: Pre-compute identifier signatures for grouping (sketched below)
  • Benefits: Fastest for exact matches, easily parallelizable
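
For exact matches the whole resolution collapses into one aggregation. A sketch, assuming a raw_identifiers(identifier_id, identifier_type, identifier_value) table:

```sql
-- One normalized signature per identifier; a single GROUP BY replaces
-- the recursive walk, and the aggregation parallelizes trivially.
select
    md5(identifier_type || ':' || lower(trim(identifier_value))) as signature,
    min(identifier_id)       as resolved_id,
    array_agg(identifier_id) as member_ids
from raw_identifiers
group by 1
```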

3. Incremental Processing (Priority: Medium)

  • Complexity: O(Δn), where Δn counts only new or changed records
  • Approach: Only process new/changed identifiers
  • Benefits: Daily runs process thousands vs millions of records

4. Approximate Clustering (Priority: Low)

  • Complexity: O(n), at the cost of approximate rather than exact results
  • Approach: Locality Sensitive Hashing (LSH) for fuzzy matching (banding sketch below)
  • Benefits: Linear scaling for fuzzy name/address matching
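
A full MinHash LSH pipeline hashes shingles into multiple bands; the core idea can be sketched in SQL as cheap banding keys, so fuzzy comparison only happens within a band instead of across all pairs. The full_name and postal_code columns are assumptions:

```sql
-- Each record emits a band key; candidate pairs are generated only
-- inside a band, keeping the comparison join near-linear.
select
    identifier_id,
    soundex(full_name) || ':' || substr(postal_code, 1, 3) as band_key
from person_identifiers
```

A production version would emit several independent band keys per record, so near-matches that miss one band can still collide in another.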

Implementation Strategy

Phase 1: Exact Matching (Linear)

Replace the recursive CTE with hash-based exact matching (sketch after the list):

  • Email normalization: lower(trim(email))
  • Phone normalization: regexp_replace(phone, '[^0-9]', '')
  • Fast hash joins instead of recursive traversal
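
A sketch of the Phase 1 logic, with person_identifiers and its email/phone columns as assumed names; each record contributes one row per normalized key, and a GROUP BY does the matching:

```sql
with match_keys as (

    -- one row per usable key, normalized exactly as listed above
    select identifier_id, 'email:' || lower(trim(email)) as match_key
    from person_identifiers
    where email is not null

    union all

    select identifier_id, 'phone:' || regexp_replace(phone, '[^0-9]', '') as match_key
    from person_identifiers
    where phone is not null

)

-- hash aggregation instead of recursive traversal
select match_key, min(identifier_id) as resolved_id
from match_keys
group by match_key
```

An identifier that carries both an email and a phone bridges two keys; a pass or two of the label propagation sketched earlier merges those bridged groups.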

Phase 2: Incremental Updates

Convert to incremental materialization (dbt sketch below):

  • Only process new identifiers daily
  • Merge with existing resolved identifiers
  • Dramatically reduce processing volume
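
A minimal dbt incremental sketch; stg_identifiers and the loaded_at watermark column are assumptions about the upstream model:

```sql
{{ config(
    materialized='incremental',
    unique_key='identifier_id'
) }}

select identifier_id, identifier_value, resolved_id, loaded_at
from {{ ref('stg_identifiers') }}

{% if is_incremental() %}
  -- daily runs only pick up rows newer than the table's watermark
  where loaded_at > (select max(loaded_at) from {{ this }})
{% endif %}
```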

Phase 3: Fuzzy Matching (Bounded)

Implement controlled fuzzy matching (batching sketch below):

  • Only on unmatched identifiers from Phase 1
  • Process in smaller batches (10K records)
  • Use LSH for approximate clustering
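
One way to bound each fuzzy pass, assuming a hypothetical unmatched_identifiers model and a dbt batch_id variable: bucket rows deterministically by hash so each invocation sees roughly 10K records.

```sql
-- hash() is Snowflake's 64-bit hash, so abs() + mod() yields a stable
-- bucket id; size the bucket count so each batch is ~10K rows
select *
from unmatched_identifiers
where mod(abs(hash(identifier_id)), 4200) = {{ var('batch_id') }}
```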

Expected Performance Gains

  • Current: O(n²) - 42M identifiers → memory errors
  • Linear approach: O(n) - 42M identifiers → ~42M operations
  • Incremental: O(Δn) - Only new identifiers → thousands daily

Technical Requirements

  1. New macros for linear clustering algorithms
  2. Incremental materialization strategy
  3. Snowflake clustering keys on identifier_value (DDL below)
  4. Batch processing for large datasets
  5. Monitoring for performance regression testing
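
For requirement 3, the Snowflake DDL is a one-liner (the table name here assumes it matches the model):

```sql
-- co-locate equal identifier_value rows so the exact-match hash joins
-- prune micro-partitions effectively
alter table nexus_resolved_person_identifiers cluster by (identifier_value);
```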

Success Criteria

  • No memory errors with full dataset (25M+ contracts)
  • Processing time reduced from hours to minutes
  • Incremental daily runs complete in <30 minutes
  • Maintains same identity resolution accuracy
  • Scales linearly with data growth