Database Schema Reference¶
This document provides a comprehensive reference for the dbt-nexus database schema, including all entities, relationships, and data flow patterns.
Schema Overview¶
The dbt-nexus database schema is organized into five distinct layers, each serving a specific purpose in the data transformation pipeline:
- Raw Data Layer - Direct source ingestion
- Source Event Log Layer - Source-specific normalization
- Core Event Log Layer - Unified cross-source schema
- Identity Resolution Layer - Entity deduplication and resolution
- Final Tables Layer - Production-ready entities
Visual Schema Diagram¶
The complete database schema and data flow can be visualized using the interactive diagram below. This diagram shows all entities, their relationships, and the transformation flow from raw data to resolved entities.
Opening the Schema Diagram¶
The database schema diagram is available as an interactive draw.io file:
- File Location:
database-diagram.xml
in the package root - Opening Instructions:
- Go to draw.io
- Click "Open Existing Diagram"
- Upload the
database-diagram.xml
file - The complete schema will be displayed with interactive navigation
Diagram Sections¶
The diagram is organized into color-coded sections:
- 🔵 Raw Data (Blue) - Source tables with minimal processing
- 🟠 Source Event Log (Orange) - Source-specific normalized events
- 🔴 Core Event Log (Red) - Unified event schema
- 🟣 Identity Resolution (Purple) - Resolved identifiers and traits
- 🟢 Final Tables (Green) - Production entity tables
Entity Reference¶
Core Entities¶
Events¶
Purpose: Central log of all activities across data sources
Column | Type | Description | Required | Notes |
---|---|---|---|---|
event_id |
STRING | Unique event identifier | ✅ | Primary key, UUID format |
occurred_at |
TIMESTAMP | When the event occurred | ✅ | Business timestamp |
event_name |
STRING | Specific event name | ✅ | e.g., 'email_sent', 'app_installed' |
event_description |
STRING | Human-readable description | ❌ | e.g., 'Email sent to john@example.com' |
event_type |
STRING | Event category | ✅ | e.g., 'communication', 'product_usage' |
source |
STRING | Source system name | ✅ | e.g., 'gmail', 'shopify_partner' |
value |
NUMERIC | Numeric value (if applicable) | ❌ | e.g., transaction amount |
value_unit |
STRING | Unit of the value field | ❌ | e.g., 'USD', 'count' |
_ingested_at |
TIMESTAMP | When dbt processed the record | ✅ | System timestamp |
Column Naming Strategy¶
The events table uses a specific naming strategy to avoid conflicts in unions and joins:
Prefixed Columns (require event_
prefix):
event_id
- Generic "id" would conflict across sourcesevent_name
- Generic "name" would conflict across sourcesevent_description
- Generic "description" would conflict across sourcesevent_type
- Generic "type" would conflict across sources
Non-Prefixed Columns (standard event tracking fields):
value
- Standard field in event tracking systems (GA4, etc.)significance
- Specific to event scoring, not genericoccurred_at
- Standard timestamp fieldsource
- Standard source attribution field
This strategy ensures compatibility with industry-standard event tracking while preventing column name conflicts during data unioning.
Entities (nexus_entities
)¶
Purpose: Unified table for all resolved entities (persons, groups, and custom entity types)
Column | Type | Description | Required | Notes |
---|---|---|---|---|
entity_id |
STRING | Unique entity identifier | ✅ | Primary key, varies by entity_type |
entity_type |
STRING | Type of entity | ✅ | 'person', 'group', etc. |
event_id |
STRING | Reference to source event | ✅ | Foreign key to events.id |
edge_id |
STRING | Groups related identifiers | ✅ | Surrogate key for grouping |
identifier_type |
STRING | Type of identifier | ✅ | e.g., 'email', 'phone', 'user_id' |
identifier_value |
STRING | Actual identifier value | ✅ | e.g., 'john@example.com' |
source |
STRING | Source system | ✅ | Source attribution |
occurred_at |
TIMESTAMP | When identifier was collected | ✅ | Business timestamp |
_ingested_at |
TIMESTAMP | When processed by dbt | ✅ | System timestamp |
Note: The nexus_entities
table is populated from the union of
entity_type-specific columns. For persons, columns like email
, name
, phone
are populated; for groups, columns like domain
, company_name
are populated.
Other entity types follow their own schema.
Entity Traits (nexus_entity_traits
)¶
Purpose: Attributes and characteristics for all entity types
Column | Type | Description | Required | Notes |
---|---|---|---|---|
entity_trait_id |
STRING | Unique trait record | ✅ | Primary key (ent*tr* prefix) |
event_id |
STRING | Reference to source event | ✅ | Foreign key to events.event_id |
entity_type |
STRING | Type of entity | ✅ | 'person', 'group', etc. |
identifier_type |
STRING | Type used to identify the entity | ✅ | e.g., 'email', 'domain' |
identifier_value |
STRING | Value used to identify entity | ✅ | e.g., 'john@example.com', 'example.com' |
trait_name |
STRING | Name of the trait | ✅ | e.g., 'name', 'title', 'company_name' |
trait_value |
STRING | Value of the trait | ✅ | e.g., 'John Smith', 'Acme Corp' |
role |
STRING | Entity's role in event | ❌ | e.g., 'sender', 'organizer' |
source |
STRING | Source system | ✅ | Source attribution |
occurred_at |
TIMESTAMP | When trait was collected | ✅ | Business timestamp |
_ingested_at |
TIMESTAMP | When processed by dbt | ✅ | System timestamp |
Relationship Declarations (nexus_relationship_declarations
)¶
Purpose: Declared relationships between any two entities
Column | Type | Description | Required | Notes |
---|---|---|---|---|
relationship_declaration_id |
STRING | Unique declaration record | ✅ | Primary key (rel*decl* prefix) |
event_id |
STRING | Reference to source event | ✅ | Foreign key to events.event_id |
occurred_at |
TIMESTAMP | When relationship was declared | ✅ | Business timestamp |
entity_a_identifier |
STRING | Entity A identifier value | ✅ | e.g., email address |
entity_a_identifier_type |
STRING | Entity A identifier type | ✅ | e.g., 'email' |
entity_a_type |
STRING | Entity A type | ✅ | e.g., 'person' |
entity_a_role |
STRING | Entity A's role in relationship | ❌ | e.g., 'member', 'owner' |
entity_b_identifier |
STRING | Entity B identifier value | ✅ | e.g., domain |
entity_b_identifier_type |
STRING | Entity B identifier type | ✅ | e.g., 'domain' |
entity_b_type |
STRING | Entity B type | ✅ | e.g., 'group' |
entity_b_role |
STRING | Entity B's role in relationship | ❌ | e.g., 'organization' |
relationship_type |
STRING | Type of relationship | ✅ | e.g., 'membership', 'ownership' |
relationship_direction |
STRING | Relationship directionality | ✅ | 'a_to_b', 'b_to_a', 'bidirectional' |
is_active |
BOOLEAN | Whether relationship is active | ✅ | true/false |
source |
STRING | Source system | ✅ | Source attribution |
Resolved Entities¶
Resolved Person Identifiers (nexus_resolved_person_identifiers
)¶
Purpose: Deduplicated person identifiers after identity resolution
Column | Type | Description | Required | Notes |
---|---|---|---|---|
person_id |
STRING | Resolved person identifier | ✅ | Generated by identity resolution |
identifier_type |
STRING | Type of identifier | ✅ | e.g., 'email', 'phone' |
identifier_value |
STRING | Actual identifier value | ✅ | e.g., 'john@example.com' |
source |
STRING | Source of earliest occurrence | ✅ | Source attribution |
occurred_at |
TIMESTAMP | When first collected | ✅ | Earliest business timestamp |
Resolved Group Identifiers (nexus_resolved_group_identifiers
)¶
Purpose: Deduplicated group identifiers after identity resolution
Column | Type | Description | Required | Notes |
---|---|---|---|---|
group_id |
STRING | Resolved group identifier | ✅ | Generated by identity resolution |
identifier_type |
STRING | Type of identifier | ✅ | e.g., 'domain', 'company_id' |
identifier_value |
STRING | Actual identifier value | ✅ | e.g., 'example.com' |
source |
STRING | Source of earliest occurrence | ✅ | Source attribution |
occurred_at |
TIMESTAMP | When first collected | ✅ | Earliest business timestamp |
Resolved Group Traits¶
Purpose: Consolidated group attributes with latest values
Column | Type | Description | Required | Notes |
---|---|---|---|---|
group_id |
STRING | Resolved group identifier | ✅ | Links to resolved identifiers |
trait_name |
STRING | Name of the trait | ✅ | e.g., 'company_name', 'industry' |
trait_value |
STRING | Most recent trait value | ✅ | Latest value from any source |
source |
STRING | Source of latest value | ✅ | Source attribution |
occurred_at |
TIMESTAMP | When latest value was collected | ✅ | Business timestamp |
Resolved Membership Identifiers¶
Purpose: Resolved relationships between persons and groups
Column | Type | Description | Required | Notes |
---|---|---|---|---|
id |
STRING | Unique membership identifier | ✅ | Primary key |
person_id |
STRING | Resolved person identifier | ✅ | Foreign key to persons |
group_id |
STRING | Resolved group identifier | ✅ | Foreign key to groups |
role |
STRING | Current role in the group | ❌ | Most recent role |
source |
STRING | Source of latest information | ✅ | Source attribution |
occurred_at |
TIMESTAMP | When membership was last updated | ✅ | Business timestamp |
Final Production Tables¶
Entities (nexus_entities
)¶
Purpose: Production table with complete entity profiles for all entity types
Column | Type | Description | Required | Notes |
---|---|---|---|---|
entity_id |
STRING | Unique entity identifier | ✅ | Primary key (person_id or group_id) |
entity_type |
STRING | Type of entity | ✅ | 'person', 'group', custom types |
email |
STRING | Primary email (persons) | ❌ | Most recent, person entities only |
name |
STRING | Full name (persons/groups) | ❌ | Most recent |
phone |
STRING | Phone number (persons) | ❌ | Most recent, person entities only |
title |
STRING | Job title (persons) | ❌ | Most recent, person entities only |
domain |
STRING | Primary domain (groups) | ❌ | Most recent, group entities only |
company_name |
STRING | Organization name (groups) | ❌ | Most recent, group entities only |
industry |
STRING | Industry (groups) | ❌ | Most recent, group entities only |
primary_source |
STRING | Source with earliest identifier | ✅ | Source attribution |
last_updated_at |
TIMESTAMP | When entity was last updated | ✅ | Most recent trait timestamp |
Note: Columns are populated based on entity_type
. Person entities have
email/name/phone populated; group entities have domain/company_name populated.
Relationships (nexus_relationships
)¶
Purpose: Production table with resolved relationships between any two entities
Column | Type | Description | Required | Notes |
---|---|---|---|---|
relationship_id |
STRING | Unique relationship identifier | ✅ | Primary key (rel_ prefix) |
entity_a_id |
STRING | Entity A identifier | ✅ | Foreign key to nexus_entities |
entity_b_id |
STRING | Entity B identifier | ✅ | Foreign key to nexus_entities |
relationship_type |
STRING | Type of relationship | ✅ | e.g., 'membership', 'ownership' |
is_active |
BOOLEAN | Whether relationship is currently active | ✅ | true/false |
established_at |
TIMESTAMP | When relationship was first declared | ✅ | Earliest declaration timestamp |
last_updated_at |
TIMESTAMP | When relationship was last updated | ✅ | Most recent declaration timestamp |
primary_source |
STRING | Source of earliest declaration | ✅ | Source attribution |
Backward Compatibility: Legacy views available in client projects:
persons
- Filtersnexus_entities WHERE entity_type = 'person'
groups
- Filtersnexus_entities WHERE entity_type = 'group'
memberships
- Filtersnexus_relationships WHERE relationship_type = 'membership'
Relationships and Constraints¶
Primary Relationships¶
erDiagram
EVENTS ||--o{ ENTITY_IDENTIFIERS : generates
EVENTS ||--o{ ENTITY_TRAITS : generates
EVENTS ||--o{ RELATIONSHIP_DECLARATIONS : generates
ENTITY_IDENTIFIERS ||--o{ ENTITY_IDENTIFIERS_EDGES : creates
ENTITY_IDENTIFIERS_EDGES ||--o{ RESOLVED_PERSON_IDENTIFIERS : resolves_to
ENTITY_IDENTIFIERS_EDGES ||--o{ RESOLVED_GROUP_IDENTIFIERS : resolves_to
ENTITY_TRAITS ||--o{ RESOLVED_ENTITY_TRAITS : resolves_to
RELATIONSHIP_DECLARATIONS ||--o{ RESOLVED_RELATIONSHIP_DECLARATIONS : resolves_to
RESOLVED_PERSON_IDENTIFIERS ||--|| ENTITIES : materializes_as
RESOLVED_GROUP_IDENTIFIERS ||--|| ENTITIES : materializes_as
RESOLVED_ENTITY_TRAITS ||--|| ENTITIES : enriches
RESOLVED_RELATIONSHIP_DECLARATIONS ||--|| RELATIONSHIPS : materializes_as
ENTITIES ||--o{ RELATIONSHIPS : entity_a
ENTITIES ||--o{ RELATIONSHIPS : entity_b
Key Constraints¶
Uniqueness Constraints¶
events.id
- Unique across all eventsperson_identifiers.id
- Unique across all person identifier recordspersons.person_id
- Unique resolved person identifiersgroups.group_id
- Unique resolved group identifiers
Referential Integrity¶
- All
event_id
fields referenceevents.id
memberships.person_id
referencespersons.person_id
memberships.group_id
referencesgroups.group_id
Business Rules¶
- Each
edge_id
groups related identifiers/traits from the same source record - Identity resolution preserves earliest occurrence timestamps
- Trait resolution preserves latest values with source attribution
- Membership relationships maintain current role information
Data Flow Patterns¶
Source Data Ingestion¶
- Raw data arrives via ETL processes with
synced_at
timestamps - Source adapters normalize to standard interfaces
- Incremental processing based on
_ingested_at
watermarks
Identity Resolution Flow¶
- Edge Generation: Create connections between related identifiers
- Edge Normalization: Eliminate directional duplicates
- Connected Components: Find groups of connected identifiers
- Entity Assignment: Generate unique IDs for each component
Trait Resolution Flow¶
- Collection: Gather traits from all sources
- Association: Link traits to resolved entities
- Deduplication: Select most recent values per entity
- Materialization: Create final entity profiles
Performance Considerations¶
Indexing Strategy¶
-- Recommended indexes for optimal performance
CREATE INDEX idx_events_occurred_at ON events(occurred_at);
CREATE INDEX idx_events_source ON events(source);
CREATE INDEX idx_person_identifiers_type_value ON person_identifiers(identifier_type, identifier_value);
CREATE INDEX idx_group_identifiers_type_value ON group_identifiers(identifier_type, identifier_value);
Partitioning Strategy¶
-- Recommended partitioning for large tables
PARTITION BY DATE(occurred_at) -- For events and time-series data
CLUSTER BY person_id, group_id -- For resolved entities
Related Documentation¶
- Architecture Overview - High-level architectural concepts
- Identity Resolution Logic - Detailed resolution algorithms
- Data Contracts - Interface specifications
- Model Reference - Individual model documentation