Classification · aggregate · returns varchar

CLASSIFY_COLLECTION

Classify a text collection into one category (embedding majority vote)

Per-group — reads the whole group in one call.

classification · llm · scales-large · json

Syntax

CLASSIFY({{ texts }}, '{{ categories }}')
CLASSIFY({{ texts }}, '{{ categories }}', '{{ prompt }}')

Arguments

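name               type     description
texts              JSON     the collection of texts to classify
categories         VARCHAR  comma-separated candidate labels
prompt (optional)  VARCHAR  prefix applied to each candidate label when embedding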

About

Classify an entire collection of texts into ONE of the provided categories based on overall content. Used by CLASSIFY(col, categories).

Backend: specialist zoo bge-m3 embeddings + per-text argmax majority vote. For each text in the collection, the cascade picks that text's best-fitting label via cosine similarity against the candidate label embeddings, then takes a majority vote across the whole collection. Ties are broken by total summed similarity.

Why majority vote instead of averaging embeddings? Averaging the text embeddings first and then comparing to labels drags the centroid toward outliers. For example, in a collection of three tech texts and one cooking text, the centroid lands between tech and cooking and often picks cooking. Per-text argmax is outlier-robust: each text gets one vote, so the dominant theme wins cleanly.

Complexity, for N texts and K candidate labels, is O(N+K) embeddings + O(N*K) dot products, all on the GPU, which scales to 100K+ texts in seconds. The LLM version puts every text into a single prompt and hits context-window limits beyond a few hundred rows.

Optional `prompt` argument: prefixes the candidate labels when embedding, letting you steer the classification criterion. For example, prompt="musical genre" + labels="rock,jazz,classical" embeds as "musical genre: rock", "musical genre: jazz", etc. For LLM-style classification with custom criteria, use CLASSIFY_LLM; see classify_llm.cascade.yaml.
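
The per-text argmax vote and its contrast with centroid averaging can be sketched in a few lines of plain Python. This is only an illustration, not the cascade's actual implementation: the 2-D vectors are hand-made stand-ins for bge-m3 embeddings, the function names (`prefix_labels`, `majority_vote`, `centroid_pick`) are hypothetical, and the tie-break assumes "total summed similarity" means each label's similarity summed over all texts.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity of two non-zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def prefix_labels(labels, prompt=None):
    # With a prompt, each label would be embedded as "<prompt>: <label>".
    return [f"{prompt}: {label}" if prompt else label for label in labels]

def majority_vote(text_vecs, label_vecs):
    # Each text votes for its most similar label; returns the winning label index.
    votes = Counter()
    totals = [0.0] * len(label_vecs)
    for vec in text_vecs:
        sims = [cosine(vec, lab) for lab in label_vecs]
        for k, s in enumerate(sims):
            totals[k] += s
        votes[max(range(len(sims)), key=sims.__getitem__)] += 1
    # Most votes wins; ties fall back to summed similarity (assumed interpretation).
    return max(range(len(label_vecs)), key=lambda k: (votes[k], totals[k]))

def centroid_pick(text_vecs, label_vecs):
    # Alternative shown for comparison: average the texts first, then pick one label.
    n, dim = len(text_vecs), len(text_vecs[0])
    centroid = [sum(v[i] for v in text_vecs) / n for i in range(dim)]
    sims = [cosine(centroid, lab) for lab in label_vecs]
    return max(range(len(sims)), key=sims.__getitem__)

# Toy data: three mostly-tech texts and one strongly-cooking outlier.
labels = ["tech", "cooking"]
label_vecs = [[1.0, 0.0], [0.0, 1.0]]
text_vecs = [[0.8, 0.6], [0.8, 0.6], [0.8, 0.6], [0.0, 1.0]]

print("majority vote:", labels[majority_vote(text_vecs, label_vecs)])
print("centroid pick:", labels[centroid_pick(text_vecs, label_vecs)])
print(prefix_labels(["rock", "jazz", "classical"], "musical genre"))
```

On this toy data the vote goes 3 to 1 for "tech", while the averaged vector sits slightly closer to "cooking", which is the outlier effect the paragraph above describes.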

Examples

Classifies pet-related texts correctly

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('The cat sat on the mat'),
          ('Dogs love to play fetch'),
          ('My parrot talks all day'),
          ('Fish swim in the tank')
      ) AS t (text)
  )
SELECT
  CLASSIFY (text, 'animal stories, food recipes, sports news')
FROM
  test_data
