Entity Resolutionaggregate · returns varchar

DEDUPE

Deduplicates texts by semantic similarity (embeddings + threshold graph)

Per-group — reads the whole group in one call.

entity-resolutionembedding-modelspecialist-zooscales-largejson

Syntax

DEDUPE({{ texts }})

DEDUPE({{ texts }}, '{{ criteria }}')

Arguments

name	type	description
texts	JSON	—
criteria	VARCHAR	—
threshold(optional)	DOUBLE	—

About

Semantic deduplication — groups items by similarity and returns one representative per group. Useful for cleaning up fuzzy duplicates ("John Smith" / "Jon Smith" / "Johnny Smith" → pick one). Backend: specialist zoo bge-m3 embeddings + similarity-threshold graph. Algorithm: 1. Embed every value with bge-m3 (L2-normalized). 2. Build an N×N cosine similarity matrix. 3. Build an undirected graph where each pair with similarity >= threshold is connected. 4. Find connected components (union-find) — each component is a "same entity" group. 5. For each component, pick the longest (most complete) form as the canonical representative, ties broken by lexicographic order. 6. Return the representatives as a JSON array. Threshold was calibrated against "John Smith" / "Jon Smith" / "Johnny Smith" / "Jane Doe" / "J. Doe" / "Michael Johnson": - within-cluster similarity: 0.79-0.87 - cross-cluster similarity: 0.44-0.51 Default threshold 0.70 catches all within-cluster pairs without any cross-cluster false positives. Pass a different `threshold` arg to tune per use case. Complexity is O(N) embeddings + O(N²) pairwise cosines + O(N²) for the union-find. Scales to 10K+ values in seconds; for 100K+ values you'd want a blocking strategy (e.g., LSH) but Phase 0 keeps it simple. For LLM-style deduplication with nuanced judgment (e.g., handling abbreviations, variant spellings that need world knowledge), use DEDUPE_LLM — see dedupe_llm.cascade.yaml.

Examples

Deduplicates similar names

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('John Smith'),
          ('Jon Smith'),
          ('Johnny Smith'),
          ('Jane Doe'),
          ('J. Doe'),
          ('Michael Johnson')
      ) AS t (name)
  )
SELECT
  DEDUPE (name)
FROM
  test_data

Nearby rabbit holes

same domain

aggregate

DEDUPE_LLM

LLM-backed deduplication (escape hatch for DEDUPE)

entity-resolutionllmllm-escape-hatchscales-large+1

aggregate

GOLDEN_RECORD

Merge duplicate records into a composite golden record

entity-resolutionllmtext

aggregate

MERGE_RECORDS

Merge records with an explicit conflict-resolution strategy

entity-resolutionllmtext

scalar

SAME_AS

Check if two values refer to the same entity (fuzzy equality)

entity-resolutionllmtext

Climb back to The Looking Glass

surface↓entity resolution↓dedupe

Entity Resolutionaggregate · returns varchar

DEDUPE

Deduplicates texts by semantic similarity (embeddings + threshold graph)

Per-group — reads the whole group in one call.

entity-resolutionembedding-modelspecialist-zooscales-largejson

Syntax

DEDUPE({{ texts }})

DEDUPE({{ texts }}, '{{ criteria }}')

Arguments

name	type	description
texts	JSON	—
criteria	VARCHAR	—
threshold(optional)	DOUBLE	—

About

Examples

Deduplicates similar names

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('John Smith'),
          ('Jon Smith'),
          ('Johnny Smith'),
          ('Jane Doe'),
          ('J. Doe'),
          ('Michael Johnson')
      ) AS t (name)
  )
SELECT
  DEDUPE (name)
FROM
  test_data

Nearby rabbit holes

same domain

aggregate

DEDUPE_LLM

LLM-backed deduplication (escape hatch for DEDUPE)

entity-resolutionllmllm-escape-hatchscales-large+1

aggregate

GOLDEN_RECORD

Merge duplicate records into a composite golden record

entity-resolutionllmtext

aggregate

MERGE_RECORDS

Merge records with an explicit conflict-resolution strategy

entity-resolutionllmtext

scalar

SAME_AS

Check if two values refer to the same entity (fuzzy equality)

entity-resolutionllmtext

Climb back to The Looking Glass

surface↓entity resolution↓dedupe

Entity Resolutionaggregate · returns varchar

DEDUPE

Deduplicates texts by semantic similarity (embeddings + threshold graph)

Per-group — reads the whole group in one call.

entity-resolutionembedding-modelspecialist-zooscales-largejson

Syntax

DEDUPE({{ texts }})

DEDUPE({{ texts }}, '{{ criteria }}')

Arguments

name	type	description
texts	JSON	—
criteria	VARCHAR	—
threshold(optional)	DOUBLE	—

About

Examples

Deduplicates similar names

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('John Smith'),
          ('Jon Smith'),
          ('Johnny Smith'),
          ('Jane Doe'),
          ('J. Doe'),
          ('Michael Johnson')
      ) AS t (name)
  )
SELECT
  DEDUPE (name)
FROM
  test_data

Nearby rabbit holes

same domain

aggregate

DEDUPE_LLM

LLM-backed deduplication (escape hatch for DEDUPE)

entity-resolutionllmllm-escape-hatchscales-large+1

aggregate

GOLDEN_RECORD

Merge duplicate records into a composite golden record

entity-resolutionllmtext

aggregate

MERGE_RECORDS

Merge records with an explicit conflict-resolution strategy

entity-resolutionllmtext

scalar

SAME_AS

Check if two values refer to the same entity (fuzzy equality)

entity-resolutionllmtext

Climb back to The Looking Glass