surfaceentity resolutiondedupe
Entity Resolutionaggregate · returns varchar

DEDUPE

Deduplicates texts by semantic similarity (embeddings + threshold graph)

Per-group — reads the whole group in one call.

entity-resolutionembedding-modelspecialist-zooscales-largejson

Syntax

DEDUPE({{ texts }})
DEDUPE({{ texts }}, '{{ criteria }}')

Arguments

nametypedescription
textsJSON
criteriaVARCHAR
threshold(optional)DOUBLE

About

Semantic deduplication — groups items by similarity and returns one representative per group. Useful for cleaning up fuzzy duplicates ("John Smith" / "Jon Smith" / "Johnny Smith" → pick one). Backend: specialist zoo bge-m3 embeddings + similarity-threshold graph. Algorithm: 1. Embed every value with bge-m3 (L2-normalized). 2. Build an N×N cosine similarity matrix. 3. Build an undirected graph where each pair with similarity >= threshold is connected. 4. Find connected components (union-find) — each component is a "same entity" group. 5. For each component, pick the longest (most complete) form as the canonical representative, ties broken by lexicographic order. 6. Return the representatives as a JSON array. Threshold was calibrated against "John Smith" / "Jon Smith" / "Johnny Smith" / "Jane Doe" / "J. Doe" / "Michael Johnson": - within-cluster similarity: 0.79-0.87 - cross-cluster similarity: 0.44-0.51 Default threshold 0.70 catches all within-cluster pairs without any cross-cluster false positives. Pass a different `threshold` arg to tune per use case. Complexity is O(N) embeddings + O(N²) pairwise cosines + O(N²) for the union-find. Scales to 10K+ values in seconds; for 100K+ values you'd want a blocking strategy (e.g., LSH) but Phase 0 keeps it simple. For LLM-style deduplication with nuanced judgment (e.g., handling abbreviations, variant spellings that need world knowledge), use DEDUPE_LLM — see dedupe_llm.cascade.yaml.

Examples

Deduplicates similar names

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('John Smith'),
          ('Jon Smith'),
          ('Johnny Smith'),
          ('Jane Doe'),
          ('J. Doe'),
          ('Michael Johnson')
      ) AS t (name)
  )
SELECT
  DEDUPE (name)
FROM
  test_data

Nearby rabbit holes

same domain
Climb back to The Looking Glass