Ranking · aggregate · returns varchar

OUTLIERS

Find unusual or atypical items via embeddings (+ optional criteria)

Per-group — reads the whole group in one call.

ranking · reranker · specialist-zoo · json

Syntax

OUTLIERS({{ texts }})
OUTLIERS({{ texts }}, {{ num_outliers }})
OUTLIERS({{ texts }}, {{ num_outliers }}, '{{ criteria }}')

Arguments

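| name | type | description |
|---|---|---|
| texts | JSON | Collection of items to search for outliers |
| num_outliers | INTEGER | Number of outliers (N) to return |
| criteria (optional) | VARCHAR | Phrase describing what the collection is "supposed to be about" |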

About

Find outliers (items that don't fit the pattern) in a collection. Returns a JSON array of the N most unusual items. There are two modes, depending on whether a `criteria` argument is provided:

1. **Criteria mode (recommended).** When you know what the collection is "supposed to be about", pass a criterion phrase and the cascade uses the cross-encoder reranker (bge-reranker-v2-m3) to score each item against the criterion. The N items with the LOWEST relevance scores are returned as outliers. This matches LLM-style categorical outlier detection cleanly: `OUTLIERS(item, 1, 'a type of fruit')` finds 'chicken' in a list of fruits.

2. **Unsupervised mode (no criteria).** When no criterion is given, the cascade falls back to nearest-neighbor distance in the embedding space: items whose closest neighbor is far away are "isolated" and ranked as outliers. This works well when items are naturally clustered, but it can be misled when the collection has no clear topical grouping. The embedding model may treat lexically distinctive items (e.g., "strawberry" among common fruits) as more outlier-like than taxonomically distinct items (e.g., "chicken"), because taxonomy isn't a geometric property.

If your semantics depend on category membership, pass criteria.

For LLM-style categorical outlier detection WITHOUT needing to spell out criteria (the old "just use world knowledge"), use OUTLIERS_LLM; see outliers_llm.cascade.yaml.
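To make the nearest-neighbor idea behind unsupervised mode concrete, here is a minimal Python sketch. It is not the cascade's implementation: the function name `nearest_neighbor_outliers`, the toy 2-D vectors, and the use of Euclidean distance are all assumptions for illustration, since the page names neither the embedding model nor the distance metric.

```python
import json
import math


def nearest_neighbor_outliers(items, vectors, num_outliers):
    """Hypothetical helper: rank items by distance to their closest neighbor.

    Items whose nearest neighbor is far away are treated as isolated and
    returned first. Euclidean distance is an assumption of this sketch.
    """
    scored = []
    for i, vec in enumerate(vectors):
        # Distance from this item to its closest *other* item.
        nearest = min(
            math.dist(vec, other)
            for j, other in enumerate(vectors)
            if j != i
        )
        scored.append((nearest, items[i]))
    # Largest nearest-neighbor distance first: the most isolated items.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return json.dumps([item for _, item in scored[:num_outliers]])


# Toy stand-ins for embeddings: three points in a tight cluster, one far away.
items = ["a", "b", "c", "d"]
vectors = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
result = nearest_neighbor_outliers(items, vectors, 1)
print(result)
```

The sketch also shows why this mode is purely geometric: it only sees distances between vectors, so an item is flagged when it sits far from everything else, regardless of what category it belongs to.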

Examples

Criteria-guided — correctly identifies chicken as non-fruit outlier

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('apple'),
          ('banana'),
          ('orange'),
          ('grape'),
          ('strawberry'),
          ('watermelon'),
          ('chicken'),
          ('mango'),
          ('pineapple')
      ) AS t (item)
  )
SELECT
  OUTLIERS (item, 1, 'a type of fruit')
FROM
  test_data
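
The selection step in criteria mode (keep the N items with the lowest relevance scores) can be sketched in Python as follows. The scores are invented for illustration and stand in for whatever the reranker would return for the criterion 'a type of fruit'; `lowest_scoring` is a hypothetical helper, not part of the cascade.

```python
import json


def lowest_scoring(items, scores, num_outliers):
    """Hypothetical helper: return the N lowest-scoring items as a JSON array."""
    # Sort ascending by relevance score so the worst matches come first.
    ranked = sorted(zip(scores, items), key=lambda pair: pair[0])
    return json.dumps([item for _, item in ranked[:num_outliers]])


items = ["apple", "banana", "chicken", "mango"]
# Made-up relevance scores against the criterion; not real reranker output.
scores = [0.97, 0.95, 0.03, 0.96]
outliers = lowest_scoring(items, scores, 1)
print(outliers)
```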
