Troubleshooting Duplicate IDs¶

This guide helps you identify and resolve duplicate ID issues in nexus models. Duplicate IDs typically occur when the ID generation doesn't include enough unique components or when there's duplicate data in source systems.

1. Identify Which Model Has Duplicates¶

When you see a uniqueness test failure, first identify which model and data source is causing the issue:

# Run dbt test to see which models are failing
dbt test --models nexus_*

Look for error messages like:

Failure in test unique_nexus_person_identifiers_person_identifier_id
Got 2455 results, configured to fail if != 0

2. Find the Data Source Causing Duplicates¶

Use this query template to identify which data source has duplicates:

-- For person identifiers
SELECT source, COUNT(*) as duplicate_count
FROM (
    SELECT source, person_identifier_id, COUNT(*) as count
    FROM `your_project`.`your_dataset`.`nexus_person_identifiers`
    GROUP BY source, person_identifier_id
    HAVING COUNT(*) > 1
)
GROUP BY source
ORDER BY duplicate_count DESC;

-- For group identifiers
SELECT source, COUNT(*) as duplicate_count
FROM (
    SELECT source, group_identifier_id, COUNT(*) as count
    FROM `your_project`.`your_dataset`.`nexus_group_identifiers`
    GROUP BY source, group_identifier_id
    HAVING COUNT(*) > 1
)
GROUP BY source
ORDER BY duplicate_count DESC;

-- For membership identifiers
SELECT source, COUNT(*) as duplicate_count
FROM (
    SELECT source, membership_identifier_id, COUNT(*) as count
    FROM `your_project`.`your_dataset`.`nexus_membership_identifiers`
    GROUP BY source, membership_identifier_id
    HAVING COUNT(*) > 1
)
GROUP BY source
ORDER BY duplicate_count DESC;

3. Examine Specific Duplicate Records¶

Once you know the source, examine specific duplicates:

-- Find specific duplicate records
SELECT person_identifier_id, event_id, identifier_type, identifier_value, role, source, occurred_at
FROM `your_project`.`your_dataset`.`nexus_person_identifiers`
WHERE source = 'google_calendar'
AND person_identifier_id IN (
    SELECT person_identifier_id
    FROM `your_project`.`your_dataset`.`nexus_person_identifiers`
    WHERE source = 'google_calendar'
    GROUP BY person_identifier_id
    HAVING COUNT(*) > 1
)
ORDER BY person_identifier_id, occurred_at
LIMIT 10;

4. Common Duplicate Scenarios and Solutions¶

Scenario 1: String "null" Values in Source Data¶

Problem: Source data contains string "null" values instead of actual NULL values, causing duplicate person_identifier_ids when the same event has multiple identifier types with "null" values.

Example:

event_id: evt_123
email: "null" (string)
phone: "null" (string)
-- Both generate same person_identifier_id because they hash to same value

Solution: Use safe_cast_with_null_strings macro to handle null string variations:

-- In unpivot macros, replace:
cast({{ col }} as string) as identifier_value

-- With:
{{ nexus.safe_cast_with_null_strings(col, api.Column.translate_type("string")) }} as identifier_value

Helper Macro:

{% macro safe_cast_with_null_strings(column_name, target_type) %}
  case
    when {{ column_name }} is null then null
    when {{ column_name }} = 'null' then null
    when {{ column_name }} = 'NULL' then null
    when {{ column_name }} = 'None' then null
    when {{ column_name }} = 'none' then null
    when {{ column_name }} = '' then null
    else {{ dbt.safe_cast(column_name, api.Column.translate_type(target_type)) }}
  end
{% endmacro %}

Scenario 2: Cross-Contamination Between Identifier Types¶

Problem: Same value used for different identifier types (e.g., phone number used as email), creating duplicate person_identifier_ids.

Example:

entity_identifier_id: ent_idfr_abc123
entity_type: person
identifier_type: email
identifier_value: "6307776986" (phone number - wrong type!)

entity_identifier_id: ent_idfr_abc123 (same ID!)
entity_type: person
identifier_type: phone
identifier_value: "6307776986" (same value)

Solution: Use validation macros to prevent cross-contamination:

-- Email validation
{{ nexus.validate_and_normalize_email(column.name) }}

-- Phone validation (filters out emails)
{{ nexus.validate_and_normalize_phone(column.name) }}

Scenario 3: Missing Role in ID Generation¶

Problem: Same person/group appears multiple times with different roles but same ID.

Example:

entity_identifier_id: ent_idfr_abc123
entity_type: person
role: organizer
role: attendee
role: creator

Solution: Add role to ID generation in source models:

-- Before
{{ create_nexus_id('person_identifier', ['event_id', 'email', 'occurred_at']) }}

-- After
{{ create_nexus_id('person_identifier', ['event_id', 'email', 'role', 'occurred_at']) }}

Scenario 4: Multiple People from Same Domain¶

Problem: Multiple employees from same company attend same event, creating duplicate group identifiers.

Solution: Add deduplication to group identifier models:

-- Add GROUP BY to attendee domains CTE
attendee_domains AS (
    SELECT
        -- ... other fields ...
    FROM {{ ref('google_calendar_events_base') }} base,
    UNNEST(base.attendees) as attendee
    WHERE {{ filter_non_generic_domains('attendee.domain') }}
    GROUP BY base.nexus_event_id, attendee.domain, attendee.is_optional, base.start_time
)

Scenario 5: Duplicate Source Data¶

Problem: Same person appears multiple times in source data for same event.

Example:

-- Check source data for duplicates
SELECT nexus_event_id, attendee.email, attendee.domain, attendee.is_optional
FROM `your_project`.`development`.`google_calendar_events_base` base,
UNNEST(base.attendees) as attendee
WHERE base.nexus_event_id = 'evt_12345'
AND attendee.email = 'john@company.com'
ORDER BY attendee.email;

Solution: Add deduplication to source models:

-- Add GROUP BY to remove source duplicates
GROUP BY base.nexus_event_id, attendee.email, attendee.is_optional, base.start_time

Scenario 6: Missing Timestamp in ID Generation¶

Problem: Same identifier appears at different times but gets same ID.

Solution: Include timestamp in ID generation:

{{ create_nexus_id('person_identifier', ['event_id', 'email', 'role', 'occurred_at']) }}

Scenario 7: Participant Role Duplicates¶

Problem: Same entity participates in event with multiple roles but gets same participant ID.

Solution: Update finalize_participants macro to include role:

-- In finalize_participants macro
{{ create_nexus_id(entity_type ~ '_participant', ['event_id', entity_type ~ '_id', 'role']) }}

5. Testing Your Fixes¶

After implementing fixes:

Rebuild source models first:

dbt run --models gmail_person_identifiers google_calendar_person_identifiers

Rebuild nexus models:

dbt run --models nexus_person_identifiers

Run tests:

dbt test --models nexus_person_identifiers

Verify duplicate count reduction:

SELECT COUNT(*) as total_duplicates
FROM (
    SELECT person_identifier_id, COUNT(*) as count
    FROM `your_project`.`your_dataset`.`nexus_person_identifiers`
    GROUP BY person_identifier_id
    HAVING COUNT(*) > 1
);

6. Prevention Best Practices¶

ID Generation Guidelines¶

Always include role when entities can have multiple roles
Include timestamps when the same identifier can appear at different times
Use deduplication in source models when raw data has duplicates
Test incrementally - fix one source at a time

Source Model Patterns¶

-- Good: Includes role and deduplication
attendee_identifiers AS (
    SELECT
        {{ create_nexus_id('person_identifier', ['event_id', 'attendee.email', 'role', 'occurred_at']) }} as person_identifier_id,
        -- other fields --
    FROM {{ ref('source_base') }} base,
    UNNEST(base.attendees) as attendee
    WHERE attendee.email IS NOT NULL
    GROUP BY base.event_id, attendee.email, attendee.is_optional, base.occurred_at
)

7. Edge Case Scenarios¶

Composite Key Tests¶

If you see failures in edge tests like nexus_group_identifiers_edges, check the test definition:

# Wrong - array syntax doesn't work
tests:
  - unique:
      column_name: [edge_id, identifier_type_a, identifier_value_a]

# Correct - concatenated string syntax
tests:
  - unique:
      column_name: "edge_id || '|' || identifier_type_a || '|' || identifier_value_a || '|' || identifier_type_b || '|' || identifier_value_b"

Performance Considerations¶

For large datasets with many duplicates:

Fix highest-impact sources first (those with most duplicates)
Use LIMIT in diagnostic queries to avoid timeouts
Consider incremental rebuilds for large models

8. Common Error Messages and Solutions¶

Error Pattern	Likely Cause	Solution
`Got 2000+ results`	String "null" values in source	Use `safe_cast_with_null_strings` macro
`Got 1000+ results`	Missing role in ID generation	Add role to `create_nexus_id` call
`Got 1-10 results`	Duplicate source data	Add deduplication with `GROUP BY`
`Cross-contamination`	Same value for different types	Use validation macros for identifiers
`Edge test failing`	Wrong test syntax	Use concatenated string syntax
`Participant duplicates`	Missing role in participant macro	Update `finalize_participants` macro

For additional help, see the Testing Reference for detailed information about all available tests.