surfacesimilaritymatch_pair
Similarityscalar · returns boolean

MATCH_PAIR

Checks whether two values match under a relationship

Per-row — runs once for each row.

similarityembedding-modelspecialist-zootext

Arguments

nametypedescription
leftVARCHAR
rightVARCHAR
relationship(optional)VARCHAR

About

Pairwise semantic relationship check for fuzzy JOINs and deduplication. Returns `TRUE` when two values refer to the same underlying entity under the given `relationship` (default: `"same entity"`). Handles name abbreviations, typos, formatting differences, company-name variants, and other surface-level drift that a `=` or `LIKE` would miss. ## Uses **1. As a scalar (direct call):** SELECT match_pair('John Smith', 'J. Smith') -- → TRUE SELECT match_pair('Acme Corp', 'ACME Corporation') -- → TRUE **2. As the body of a `SEMANTIC JOIN`** (sugared): SELECT c.company, s.vendor FROM customers c SEMANTIC JOIN suppliers s ON c.company ~ s.vendor; The `SEMANTIC JOIN ... ON a ~ b` syntax is rewritten at parse time to: FROM customers c CROSS JOIN suppliers s WHERE match_pair(c.company, s.vendor, 'same entity') …so the join runs as an LLM-filtered cross product. With an additional `WHERE` clause the rewriter correctly merges predicates with `AND`: FROM customers c SEMANTIC JOIN suppliers s ON c.company ~ s.vendor WHERE s.region = 'US' -- rewrites to: -- CROSS JOIN suppliers s WHERE match_pair(...) AND s.region = 'US' **3. For deduplication** via a self-join + `match_pair`: SELECT a.id AS keep, b.id AS dup FROM entities a, entities b WHERE a.id < b.id AND match_pair(a.name, b.name); ## Overriding the relationship The third argument lets you check any pairwise relationship, not just identity: SELECT match_pair('Cat', 'Kitten', 'parent/child animal') SELECT match_pair('Paris', 'France', 'city in country') ## Performance Every pair is an LLM call, so keep the cross product bounded: - Pre-filter aggressively on cheap columns before the `SEMANTIC JOIN` - Use `-- @ takes.factor: 1` to disable retries for throughput-bound joins - For very large cohorts prefer `semantic_match_rows(a, b)` which uses bge-m3 embeddings to bulk-score candidate pairs first ## Related operators - `semantic_match_rows(array, array, k)` — embedding-based bulk matcher - `semantic_dedupe(array)` — collapse near-duplicates inside one column - `SEMANTIC DISTINCT col` — deduplicated SELECT sugar

Examples

Name variations match

SELECT
  semantic_match_pair ('John Smith', 'J. Smith')

Company name variants match

SELECT
  semantic_match_pair ('Acme Corp', 'ACME Corporation')

Clearly distinct companies do not match

SELECT
  semantic_match_pair ('Umbrella Inc', 'Initech')

SEMANTIC JOIN: sugar over match_pair

WITH
  l AS (
    SELECT
      'Acme Corp' AS name
  ),
  r AS (
    SELECT
      'ACME Corporation' AS vendor
  )
SELECT
  *
FROM
  l SEMANTIC
  JOIN r ON l.name ~ r.vendor

Nearby rabbit holes

same domain
Climb back to The Looking Glass