THEMES
Summarization · aggregate · returns JSON


Extract N main topics from texts (embed centrality + LLM naming)

Per-group — reads the whole group in one call.

summarization · llm · scales-large · json

Syntax

TOPICS({{ texts }})
TOPICS({{ texts }}, {{ num_topics }})
THEMES({{ texts }})
THEMES({{ texts }}, {{ num_topics }})

Arguments

name        type     description
texts       JSON     the text collection to extract topics from
num_topics  INTEGER  number of topics to return (optional)

About

Topic extraction — extracts N main topics from a text collection. Returns a JSON array of topic name strings. Backend: hybrid.

For collections up to 30 texts, the LLM reads all of them directly. For larger collections, the cascade embeds every text with bge-m3, finds the 30 most-central texts (those closest to the collection centroid), and passes only those to the LLM. This caps the LLM prompt at O(30) texts regardless of collection size, so THEMES scales to 100K+ rows without hitting context window limits.

Pure clustering (e.g. BERTopic-style HDBSCAN + c-TF-IDF) was considered and rejected for Phase 0 because topic *naming* is where LLMs genuinely add value — a small, focused LLM call over 30 representatives is cheaper and more readable than running HDBSCAN + keyword extraction and getting mechanical labels like "ml_neural_learning". For small collections where no scaling is needed, this is effectively identical to the old LLM version; the specialist refactor is about unbounding the collection sizes the function can handle.
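The centroid-based selection step above can be sketched as follows. This is a minimal illustration, not the actual implementation: the `select_representatives` name, the `MAX_LLM_TEXTS` constant, and the assumption that embeddings arrive as plain lists of floats are all hypothetical; in the real cascade the vectors would come from bge-m3.

```python
MAX_LLM_TEXTS = 30  # illustrative cap; mirrors the O(30) prompt bound above

def select_representatives(vectors, k=MAX_LLM_TEXTS):
    """Return indices of the k vectors closest to the collection centroid.

    Small collections (len <= k) pass through untouched, matching the
    "LLM reads all of them directly" branch described in the docs.
    """
    if len(vectors) <= k:
        return list(range(len(vectors)))

    # Mean of all embeddings = collection centroid.
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    # Squared Euclidean distance to the centroid (monotonic in distance,
    # so the ranking is the same without the sqrt).
    def dist(v):
        return sum((a - b) ** 2 for a, b in zip(v, centroid))

    ranked = sorted(range(len(vectors)), key=lambda i: dist(vectors[i]))
    # Keep the k most-central texts, in their original order,
    # before handing them to the LLM for naming.
    return sorted(ranked[:k])
```

Only the selected indices' texts would then be placed in the LLM prompt, which is what keeps the prompt size constant as the collection grows.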

Examples

Extracts themes from AI/healthcare articles

WITH
  test_data AS (
    SELECT
      *
    FROM
      (
        VALUES
          ('Machine learning is transforming healthcare'),
          ('AI models can detect cancer early'),
          ('Deep learning improves medical imaging'),
          ('Neural networks assist in diagnosis'),
          ('Healthcare AI reduces costs')
      ) AS t (article)
  )
SELECT
  THEMES (article, 3)
FROM
  test_data
