Overview
Overview¶
The Nexus identity resolution algorithm transforms source-specific identifiers
into unified entities through a multi-step process. This document explains how
the algorithm works using the v0.3.0 entity-centric architecture, where all
entity types (persons, groups, custom entities) flow through a unified pipeline
before being resolved separately by entity_type
for optimal performance.
This explanation covers the complete pipeline from raw source data to final entity tables, with real performance metrics and data examples from production usage.
Algorithm Steps¶
Step 0: Source Identifier Formatting (*_entity_identifiers
)¶
Purpose: Transform source-specific data into the standardized Nexus identifier format required by the algorithm. All you have to do to include a new source in identity resolution is format your identifier data correctly here.
Process: Each source system creates a unified *_entity_identifiers
table
that combines all entity types (persons, groups, etc.) with an entity_type
field. The unpivot_identifiers
macro simplifies this process by automatically
converting columnar identifier data into the required row-based format.
Required Schema:
entity_identifier_id -- Unique identifier (ent_idfr_ prefix)
event_id -- Links identifiers to specific events
edge_id -- Groups identifiers that should be connected (typically event_id)
entity_type -- Type of entity: 'person', 'group', etc.
identifier_type -- The type of identifier (e.g., 'email', 'domain', 'location_id')
identifier_value -- The actual identifier value
role -- The entity's role in the event (optional)
occurred_at -- Event timestamp (for metadata)
source -- Source system name
Example Implementation (Gmail persons and groups):
-- models/sources/gmail/gmail_entity_identifiers.sql
{{ dbt_utils.union_relations([
ref('gmail_message_person_identifiers'),
ref('gmail_message_group_identifiers')
]) }}
This unions intermediate models that extract identifiers:
-- gmail_message_person_identifiers.sql (excerpt)
SELECT
{{ nexus.create_nexus_id('entity_identifier', ['event_id', 'sender.email', "'person'", "'sender'"]) }} as entity_identifier_id,
event_id,
event_id as edge_id,
'person' as entity_type,
'email' as identifier_type,
sender.email as identifier_value,
'sender' as role,
occurred_at,
'gmail' as source
FROM gmail_message_events
WHERE sender.email IS NOT NULL
Output Data (unified entity identifiers):
entity_identifier_id | event_id | edge_id | entity_type | identifier_type | identifier_value | role | occurred_at | source
ent_idfr_a1b2c3d4... | evt_gmail_123... | evt_gmail_123... | person | email | john@company.com | sender | 2025-01-20 | gmail
ent_idfr_e5f6g7h8... | evt_gmail_123... | evt_gmail_123... | group | domain | company.com | sender_domain | 2025-01-20 | gmail
ent_idfr_i9j0k1l2... | evt_notion_456... | evt_notion_456... | person | email | john@company.com | contact | 2025-01-21 | notion
Key Parameters:
model_name
: Source table containing the raw identifier dataevent_id
: Used for generating participant tables after identity resolutioncolumns
: List of identifier columns to unpivotedge_id_field
: Field used to group identifiers (oftenevent_id
if only one person or groups for an event. If there are multiple participants in an event, usually event_id + identifiers does well.)role_column
: Specifies the entity's role in eventsadditional_columns
: Metadata to preserve (timestamps, source info)
Result: Standardized identifier format that can be automatically processed by the identity resolution algorithm.
Step 1: Source Identifier Collection (nexus_entity_identifiers
)¶
Purpose: Union all entity identifiers from different source systems into a single table containing all entity types.
Process: The process_entity_identifiers()
macro collects identifiers from
all enabled sources that have entity identifier models (*_entity_identifiers
).
Example Data (from Lobbie appointments):
-- Source: lobbie_group_identifiers
event_id | identifier_type | identifier_value | role
evt_lobbie_5794cc43e896d7b21673bf1f8e... | location_id | 1190 | location
evt_lobbie_5794cc43e896d7b21673bf1f8e... | location_lobbie_integration_uuid | 5b283546-34c3-4d80-9c56-3cb9a8636f19 | location
evt_lobbie_bbfc0469c015233a40475fc6... | location_id | 339 | location
evt_lobbie_bbfc0469c015233a40475fc6... | location_lobbie_integration_uuid | 2d08a33b-d1d6-4bf8-9ce0-d6d416ba5c79 | location
Result: 3.6M identifier records from 1.8M appointment events, representing 395 unique locations.
Step 2: Edge Creation (nexus_group_identifiers_edges
)¶
Purpose: Create connections between identifiers that should be resolved to the same entity.
Process: The create_identifier_edges('nexus_group_identifiers')
macro:
- Raw Edge Generation: Joins identifiers that share the same
edge_id
(typicallyevent_id
) - Edge Deduplication: Uses surrogate keys on identifier types and value to generate uniqueness. Use uniqueness to eliminate duplicate edges for performance
- Output: Unique identifier pairs that represent the same entity
Algorithm Details:
-- Step 2a: Generate raw edges
SELECT
a.identifier_type as identifier_type_a,
a.identifier_value as identifier_value_a,
b.identifier_type as identifier_type_b,
b.identifier_value as identifier_value_b,
generate_surrogate_key([
a.identifier_type, a.identifier_value,
b.identifier_type, b.identifier_value
]) as edge_uniqueness_hash
FROM nexus_group_identifiers a
JOIN nexus_group_identifiers b
ON a.edge_id = b.edge_id -- Same edge
AND (a.identifier_type != b.identifier_type OR a.identifier_value != b.identifier_value)
-- Step 2b: Deduplicate edges
SELECT DISTINCT identifier_type_a, identifier_value_a, identifier_type_b, identifier_value_b
FROM raw_edges
Edge Deduplication Impact:
- Before deduplication: ~1.8M duplicate edges (one per event mentioning each location)
- After deduplication: 790 unique edges (395 locations × 2 identifier types each)
- Performance improvement: 3.99s vs hours of processing
Example Edges:
identifier_type_a | identifier_value_a | identifier_type_b | identifier_value_b
location_lobbie_integration_uuid | 5b283546-34c3-4d80-9c56-3cb9a8636f19 | location_id | 1190
location_lobbie_integration_uuid | 2d08a33b-d1d6-4bf8-9ce0-d6d416ba5c79 | location_id | 339
location_lobbie_integration_uuid | f3fcd059-3572-469b-847b-653a7cc65239 | location_id | 630
Step 3: Recursive Resolution (nexus_resolved_group_identifiers
)¶
Purpose: Use connected components algorithm to group all related identifiers under a single entity ID.
Process: The
resolve_identifiers('group', 'nexus_group_identifiers', 'nexus_group_identifiers_edges', 3)
macro implements a recursive CTE with these phases:
Phase 3a: Recursive Component Discovery¶
WITH RECURSIVE recursive_components AS (
-- Base case: Every identifier starts as its own component
SELECT DISTINCT
identifier_type as component_identifier_type,
identifier_value as component_identifier_value,
identifier_type,
identifier_value,
identifier_type || ':' || identifier_value as path,
0 as recursion_level
FROM nexus_group_identifiers
UNION ALL
-- Recursive case: Walk to connected identifiers
SELECT
rc.component_identifier_type,
rc.component_identifier_value,
e.identifier_type_b as identifier_type,
e.identifier_value_b as identifier_value,
rc.path || '|' || e.identifier_type_b || ':' || e.identifier_value_b as path,
rc.recursion_level + 1
FROM recursive_components rc
JOIN nexus_group_identifiers_edges e
ON rc.identifier_type = e.identifier_type_a
AND rc.identifier_value = e.identifier_value_a
WHERE NOT CONTAINS(rc.path, e.identifier_type_b || ':' || e.identifier_value_b)
AND rc.recursion_level < 3 -- Max recursion limit
)
Phase 3b: Component Assignment¶
-- Assign each identifier to the lexicographically first identifier in its component
SELECT
identifier_type,
identifier_value,
generate_surrogate_key([
first_value(component_identifier_type) OVER(...),
first_value(component_identifier_value) OVER(...)
]) as group_id
FROM recursive_components
Example Resolution:
-- Before resolution (separate identifiers):
identifier_type | identifier_value | group_id
location_id | 1190 | NULL
location_lobbie_integration_uuid | 5b283546-34c3-4d80-9c56-3cb9a8636f19 | NULL
-- After resolution (unified entity):
identifier_type | identifier_value | group_id
location_id | 1190 | 1433a23518736e48ab9b2eff2af17544
location_lobbie_integration_uuid | 5b283546-34c3-4d80-9c56-3cb9a8636f19 | 1433a23518736e48ab9b2eff2af17544
Result: 790 resolved identifiers representing 395 unique groups (2 identifiers per group).
Step 4: Final Entity Table (nexus_groups
)¶
Purpose: Create the final groups table with consolidated entity information.
Process: The finalize_entity('group')
macro:
- Entity Consolidation: Groups all identifiers by
group_id
- Metadata Enrichment: Adds creation timestamps, processing flags
- Deduplication: Ensures one record per unique entity
Example Groups:
group_id | created_at | existing_group
1433a23518736e48ab9b2eff2af17544 | 2025-01-20 12:45:20 | true
02124e742dda060ddbf750f8c385c3bf | 2025-01-20 12:45:20 | true
8341287f3c3f47ff97dc105df9d16b9b | 2025-01-20 12:45:20 | true
Step 5: Event Participation (nexus_group_participants
)¶
Purpose: Link resolved entities back to the original events for participation tracking.
Process: The finalize_participants('group')
macro:
- Event Mapping: Connects each event to its resolved group entity
- Role Preservation: Maintains the original role information from Step 0
- Participation Records: Creates event-to-entity links with role context
Algorithm:
SELECT
generate_surrogate_key([group_id, event_id]) as group_participant_id,
event_id,
group_id
FROM nexus_group_identifiers gi
JOIN nexus_resolved_group_identifiers rgi
ON gi.identifier_type = rgi.identifier_type
AND gi.identifier_value = rgi.identifier_value
Example Participations:
event_id | group_id | events_for_this_group
evt_lobbie_5794cc43e896d7b21673bf1f8e... | 1433a23518736e48ab9b2eff2af17544 | 10,180
evt_lobbie_bbfc0469c015233a40475fc6... | 02124e742dda060ddbf750f8c385c3bf | 24,974
evt_lobbie_39065a39e20208106fa5151d... | 8341287f3c3f47ff97dc105df9d16b9b | 21,719
Result: 1,806,682 participation records (one per event) linking events to 395 resolved location groups.
Role Information: Each participation record preserves the role specified in
Step 0 ('location'
in this example), enabling queries like:
- "Find all events where entity X participated as a location"
- "Get all appointment events at specific locations"
- "Analyze entity participation patterns by role type"
Role Examples:
- Groups:
'location'
,'venue'
,'department'
,'organization'
- Persons:
'patient'
,'provider'
,'contact'
,'assignee'
- Memberships:
'member_of'
,'employed_by'
,'enrolled_in'
Algorithm Performance¶
Scalability Characteristics¶
Linear Components:
- Source identifier collection: O(n) where n = number of events
- Final entity creation: O(g) where g = number of unique entities
- Event participation: O(n) where n = number of events
Optimized Components:
- Edge creation: O(u) where u = number of unique edges (after deduplication)
- Recursive resolution: O(u × d) where d = maximum recursion depth (3)
Performance Metrics (Lobbie Groups Example)¶
Step | Records In | Records Out | Processing Time | Optimization |
---|---|---|---|---|
Source Collection | 3.6M events | 3.6M identifiers | ~15s | Table materialization |
Edge Creation | 3.6M identifiers | 790 edges | 3.99s | Deduplication |
Recursive Resolution | 790 edges | 790 resolved | 4.91s | Limited recursion depth |
Final Entities | 790 resolved | 395 groups | ~3s | Component consolidation |
Event Participation | 3.6M + 790 | 1.8M participations | 3.32s | Efficient joins |
Edge Deduplication Benefits¶
Problem: Without deduplication, high-frequency entities create massive edge explosion:
- Location with 26,000 events → 26,000² = 676M duplicate edges
- Total system edges: Billions of redundant connections
Solution: Surrogate key-based deduplication:
- Unique edge identification using
generate_surrogate_key([type_a, value_a, type_b, value_b])
- Elimination of duplicate edges while preserving all unique relationships
- Performance improvement: Hours → Seconds
Impact:
- Edge reduction: 1.8M → 790 unique edges (99.96% reduction)
- Memory efficiency: Linear memory usage vs quadratic explosion
- Processing speed: Recursive algorithm operates on minimal edge set
Algorithm Guarantees¶
Correctness Properties¶
- Completeness: All source identifiers are processed and resolved
- Consistency: Identical entities receive identical group IDs across runs
- Transitivity: If A connects to B and B connects to C, then A, B, C share the same group ID
- Event Preservation: Every original event maintains its connection to resolved entities
Data Integrity Constraints¶
- Unique Entity IDs: Each resolved entity has exactly one group ID
- Bidirectional Resolution: All identifier types within a group resolve to the same entity
- Event Mapping: Every event connects to exactly one entity per identifier type
- Temporal Consistency: Entity resolution is deterministic and reproducible
Extension to Other Entity Types¶
This same algorithm applies to:
- Persons:
*_person_identifiers
→nexus_persons
+nexus_person_participants
- Memberships:
*_membership_identifiers
→nexus_memberships
The only differences are:
- Source table patterns (
*_person_identifiers
vs*_group_identifiers
) - Entity-specific finalization logic
- Role and relationship semantics
The core edge creation, recursive resolution, and deduplication logic remains identical across all entity types.
Related Documentation¶
- How-to: Add New Source - Step-by-step guide for implementing Step 0
- Reference: Identity Resolution Models - API details for all models
- Reference: Entity Resolution Macros - Macro documentation and parameters
- How-to: Configure Identity Resolution - Configuration options and tuning
- Explanation: Performance Considerations - Detailed performance analysis and optimization strategies
- Reference: Database Schema - Complete schema documentation for all tables