Skip to content

Package Architecture

dbt-nexus follows a layered architecture designed for scalability, maintainability, and source-agnostic data processing. This document explains the core architectural concepts and data flow.

Architectural Overview

The package is organized into five primary layers that transform raw data into resolved, production-ready entities:

graph TD
    A[Raw Data Layer] --> B[Source Event Log Layer]
    B --> C[Core Event Log Layer]
    C --> D[Identity Resolution Layer]
    D --> E[Final Tables Layer]

    style A fill:#dae8fc
    style B fill:#ffe6cc
    style C fill:#f8cecc
    style D fill:#e1d5e7
    style E fill:#d5e8d4

Layer 1: Raw Data

  • Purpose: Direct ingestion from source systems with minimal transformation
  • Characteristics:
  • Preserves original data structure
  • Applies incremental filtering based on synced_at
  • Adds _ingested_at tracking
  • Isolates downstream models from source changes

Layer 2: Source Event Log

  • Purpose: Normalize source-specific events and identifiers
  • Characteristics:
  • Translates raw data into standardized format
  • Preserves source-specific details
  • Follows consistent naming patterns
  • Maintains source attribution

Layer 3: Core Event Log

  • Purpose: Unified events with standardized schema across sources
  • Characteristics:
  • Unions data from multiple sources
  • Standardized core fields only
  • Source-agnostic processing
  • Enables cross-source analysis

Layer 4: Identity Resolution

  • Purpose: Resolve identities and deduplicate entities across sources
  • Characteristics:
  • Graph-based identity resolution
  • Recursive CTE algorithms
  • Cross-source entity matching
  • Maintains source provenance

Layer 5: Final Tables

  • Purpose: Production-ready tables for application use
  • Characteristics:
  • Clean, resolved entities
  • Latest attribute values
  • Optimized for queries
  • Ready for operational use

Database Schema

The following diagram illustrates the complete database schema and relationships between entities:

Database Schema

This diagram shows the data flow from raw sources through identity resolution to final tables, including all entity relationships and key fields.

Core Entities

Events

The central entity that captures "what happened" across all sources:

-- Core event structure
{
  id: UUID,
  occurred_at: TIMESTAMP,
  event_name: STRING,
  event_description: STRING,
  event_type: STRING,
  source: STRING,
  value: NUMERIC (optional),
  value_unit: STRING (optional)
}

Persons

Individual entities with identifiers and traits:

-- Entity structure (v0.3.0)
{
  entity_id: UUID,
  entity_type: STRING,  -- 'person', 'group', or custom types
  identifiers: {
    email: STRING,
    phone: STRING,
    user_id: STRING
  },
  traits: {
    name: STRING,
    timezone: STRING,
    title: STRING
  }
}

Groups

Organizational entities (companies, accounts):

-- Group structure
{
  group_id: UUID,
  identifiers: {
    domain: STRING,
    company_id: STRING,
    shopify_id: STRING
  },
  traits: {
    name: STRING,
    industry: STRING,
    size: STRING
  }
}

Memberships

Relationships between persons and groups:

-- Relationship structure (v0.3.0)
{
  relationship_id: UUID,
  entity_a_id: UUID,
  entity_a_type: STRING,  -- 'person', 'group', etc.
  entity_b_id: UUID,
  entity_b_type: STRING,  -- 'person', 'group', etc.
  relationship_type: STRING,  -- 'membership', 'collaboration', etc.
  entity_a_role: STRING,
  entity_b_role: STRING,
  occurred_at: TIMESTAMP,
  source: STRING
}

Source-Agnostic Design

Adapter Pattern

dbt-nexus uses an adapter pattern that allows any data source to be integrated by implementing a standard interface:

# Source configuration
vars:
  nexus:
    sources:
      shopify_partner:
        enabled: true
        events: true
        entities: ["group"]
        relationships: false
      gmail:
        enabled: true
        events: true
        entities: ["person"]
        relationships: false

Interface Requirements

Each source must provide models following these specifications:

Events Model Interface

-- Required fields for source events
SELECT
  id,                    -- Unique event identifier
  occurred_at,           -- Event timestamp
  event_name,            -- Specific event name
  event_description,     -- Human-readable description
  event_type,            -- Event category
  source,                -- Source system name
  value,                 -- Optional numeric value
  value_unit,            -- Optional value unit
  _ingested_at           -- Processing timestamp
FROM source_data

Identifiers Model Interface

-- Required fields for identifiers
SELECT
  id,                    -- Unique identifier record ID
  event_id,              -- Reference to source event
  edge_id,                -- Groups related identifiers
  identifier_type,       -- Type (email, phone, domain)
  identifier_value,      -- Actual identifier value
  source,                -- Source system
  occurred_at,           -- When collected
  _ingested_at           -- When processed
FROM source_data

Traits Model Interface

-- Required fields for traits
SELECT
  id,                    -- Unique trait record ID
  event_id,              -- Reference to source event
  edge_id,                -- Groups related traits
  trait_name,            -- Trait name
  trait_value,           -- Trait value
  source,                -- Source system
  occurred_at,           -- When collected
  _ingested_at           -- When processed
FROM source_data

Identity Resolution Process

Graph-Based Resolution

The identity resolution system uses graph algorithms to connect related identifiers:

  1. Edge Generation: Create connections between identifiers
  2. Edge Normalization: Eliminate directional duplicates (A→B = B→A)
  3. Connected Components: Find groups of connected identifiers
  4. Entity Assignment: Assign unique IDs to each connected component

Recursive CTE Implementation

WITH RECURSIVE identity_graph AS (
  -- Base case: direct identifier relationships
  SELECT
    identifier_a,
    identifier_b,
    1 as depth
  FROM edges

  UNION ALL

  -- Recursive case: transitive relationships
  SELECT
    ig.identifier_a,
    e.identifier_b,
    ig.depth + 1
  FROM identity_graph ig
  JOIN edges e ON ig.identifier_b = e.identifier_a
  WHERE ig.depth < {{ var('nexus', {}).get('max_recursion', 5) }}
)

Deduplication Strategy

The system handles various deduplication challenges:

  • Order Independence: A→B treated same as B→A
  • Temporal Awareness: Preserves earliest occurrence
  • Source Attribution: Maintains provenance
  • Incremental Processing: Only processes new data

Performance Optimization

Incremental Processing

All models support incremental materialization:

{{ config(materialized='incremental', unique_key='id') }}

{% if is_incremental() %}
  WHERE _ingested_at > (SELECT MAX(_ingested_at) FROM {{ this }})
{% endif %}

Per-Source Watermarking

When unioning multiple sources, the system applies per-source filtering:

-- Per-source watermark filtering prevents missing late-arriving data
{{ union_with_watermarks(sources_var='event_sources') }}

Materialization Strategy

  • Development: Views for fast iteration
  • Production: Tables for performance
  • Large Tables: Incremental for efficiency

Cross-Database Compatibility

Database-Agnostic SQL

The package uses dbt's adapter pattern for cross-database compatibility:

{% macro generate_surrogate_key(field_list) %}
  {{ return(adapter.dispatch('generate_surrogate_key', 'nexus')(field_list)) }}
{% endmacro %}

{% macro default__generate_surrogate_key(field_list) %}
  -- Default implementation
  {{ dbt_utils.surrogate_key(field_list) }}
{% endmacro %}

{% macro snowflake__generate_surrogate_key(field_list) %}
  -- Snowflake-specific implementation
  {{ dbt_utils.surrogate_key(field_list) }}
{% endmacro %}

Supported Warehouses

  • Snowflake
  • BigQuery
  • Postgres
  • Redshift
  • Databricks

Configuration Management

Variable-Driven Architecture

The package uses dbt variables for flexible configuration:

vars:
  nexus:
    max_recursion: 5 # Identity resolution depth limit
    entity_types: ["person", "group"]
    sources:
      gmail:
        enabled: true
        events: true
        entities: ["person", "group"]
        relationships: true
  override_incremental: false # Force full refresh in development

Schema Organization

Models are organized into logical schemas:

models:
  nexus:
    event_log:
      +schema: event_log
    identity_resolution:
      +schema: identity_resolution
    final_tables:
      +schema: marts

Extensibility

Adding New Sources

  1. Create source adapter models following interface requirements
  2. Add source configuration to sources variable
  3. Run dbt deps and dbt run
  4. Core models automatically include new source

Custom Entity Types

The architecture supports custom entity types beyond persons/groups:

-- Example: Location entities
{% macro process_location_identifiers(source_model) %}
  {{ process_identifiers('location', source_model) }}
{% endmacro %}

Macro Extensibility

Custom transformations can be added through the macro system:

-- Custom normalization logic
{% macro normalize_phone_number(phone_field) %}
  REGEXP_REPLACE({{ phone_field }}, r'[^\d]', '')
{% endmacro %}

Testing Strategy

Data Quality Tests

models:
  - name: nexus_entities
    tests:
      - unique:
          column_name: entity_id
      - not_null:
          column_name: entity_id
      - not_null:
          column_name: entity_type
    columns:
      - name: entity_type
        tests:
          - accepted_values:
              values: ["person", "group"]
      - name: email
        tests:
          - not_null

Integration Tests

The package includes comprehensive integration tests with sample data to validate:

  • Identity resolution accuracy
  • Cross-source data integration
  • Performance characteristics
  • Edge case handling