surfacecleaningclean_year
Cleaningscalar · returns integer

CLEAN_YEAR

Extracts 4-digit year from messy text, returns -1 if undetermined

Per-row — runs once for each row.

cleaningllmdate

Syntax

CLEAN_YEAR({{ text }})

Arguments

nametypedescription
textVARCHARRaw text to extract year from

About

Scalar function to extract a 4-digit year from messy string data. Returns -1 when no determinable year can be extracted. Backend: deterministic Python regex + heuristics. Handles: - Standard: "2023", "1995" - Two-digit: "95" → 1995, "05" → 2005 (00-30 → 2000-2030, 31-99 → 1931-1999) - Decades: "the 90s", "early 2000s", "'80s" → decade start - Approximate: "circa 1995", "around 2010", "~1990" - Dates: "2023-01-15", "Jan 2023", "15/03/1995" - Fiscal year: "FY2023", "FY 23" - Ranges: "1990-1995" → first year (1990) - Spelled-out years ("nineteen eighty-four") are NOT handled here — use CLEAN_YEAR_LLM for that. For LLM-style extraction (spelled-out years, unusual phrasings, language-specific formats), use CLEAN_YEAR_LLM — see clean_year_llm.cascade.yaml.

Examples

Fiscal year cleaned

SELECT
  clean_year ('FY2023')

Approximate year cleaned

SELECT
  clean_year ('circa 1990')

Decade start

SELECT
  clean_year ('the 90s')

ISO date component

SELECT
  clean_year ('2024-03-15')

Unparseable returns -1

SELECT
  clean_year ('garbage')

Nearby rabbit holes

same domain
Climb back to The Looking Glass