surface · similarity · match
Similarity · scalar · returns json

MATCH

Fuzzy cross-match two string arrays via bge-m3 embeddings

Per-row — runs once for each row.

similarity · llm · json

Arguments

name                  type     description
left                  JSON     JSON array of strings (left side)
right                 JSON     JSON array of strings (right side)
threshold (optional)  DOUBLE
top_k (optional)      INTEGER

About

Fuzzy / semantic cross-match between two collections of text. Takes two JSON arrays of strings (the "left" and "right" sides), embeds them both with bge-m3 on the zoo GPU, computes a cosine similarity across the cross-product, and returns the (left, right) pairs whose similarity exceeds the supplied threshold.

Use it for:

• Entity resolution across two CRM snapshots
• Dedup / linkage of customer records
• Fuzzy product catalog merging
• Retrieval-augmented generation candidate lookup

The table-macro form is auto-generated by the registry: any cascade with `returns_columns` gets a `<name>_rows(...)` variant that can be used with FROM / LATERAL.

Typical SQL shape:

SELECT
  *
FROM
  semantic_match_rows(
    (SELECT ARRAY_AGG(company_name) FROM salesforce),
    (SELECT ARRAY_AGG(name) FROM hubspot),
    threshold => 0.7,
    top_k => 5
  )
WHERE
  score > 0.8
ORDER BY
  score DESC;

Each row returned is (left_idx, right_idx, left_value, right_value, score). The caller can join back to the original tables by index to recover any columns beyond the matched text.
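The matching logic described above (embed both sides, score the full cross-product by cosine similarity, keep pairs above the threshold) can be sketched in plain Python. This is an illustrative sketch, not the registry's implementation. `cross_match` and `toy_embed` are hypothetical names, and `toy_embed` is a character-trigram stand-in for the bge-m3 embeddings so the example runs without a GPU. It also assumes that `top_k` limits the matches kept per left item and that indices are 0-based; the page does not confirm either.

```python
import math
from collections import Counter


def toy_embed(text):
    """Toy stand-in for a bge-m3 embedding: lowercase character-trigram counts."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cross_match(left, right, threshold=0.5, top_k=None, embed=toy_embed):
    """Return (left_idx, right_idx, left_value, right_value, score) rows.

    Assumptions: indices are 0-based, and top_k caps matches per left item.
    """
    left_vecs = [embed(s) for s in left]
    right_vecs = [embed(s) for s in right]
    rows = []
    for i, lv in enumerate(left_vecs):
        # Score this left item against every right item (the cross-product).
        scored = [(cosine(lv, rv), j) for j, rv in enumerate(right_vecs)]
        # Keep only pairs whose similarity exceeds the threshold.
        kept = sorted((p for p in scored if p[0] > threshold),
                      key=lambda p: (-p[0], p[1]))
        if top_k is not None:
            kept = kept[:top_k]
        rows.extend((i, j, left[i], right[j], score) for score, j in kept)
    # Highest-scoring pairs first, like ORDER BY score DESC.
    rows.sort(key=lambda r: -r[4])
    return rows


rows = cross_match(
    ["ACME Corporation", "Globex Industries"],
    ["acme corp", "Globex Inc"],
    threshold=0.5,
    top_k=1,
)
for row in rows:
    print(row)
```

With the toy embedding, each left-side company pairs with its obvious right-side counterpart. Real bge-m3 scores would differ, so a threshold tuned against this sketch does not carry over to the SQL function.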

Examples

Fuzzy match picks the best pair from two company lists

SELECT
  left_value
FROM
  semantic_match_rows (
    JSON('["ACME Corporation","Globex Industries"]'),
    JSON('["acme corp","Globex Inc"]'),
    0.5,
    1
  )
ORDER BY
  score DESC
LIMIT
  1;
