Helper Functions & Constants
The Search module provides standalone helper functions for query preprocessing, document content preprocessing, and embedding model alias resolution, along with constants for model configuration.
These functions require the search dependencies. Install them with `pip install signalwire[search]`.
Functions
preprocess_query
preprocess_query(query, language="en", pos_to_expand=None, max_synonyms=5, debug=False, vector=False, query_nlp_backend="nltk", model_name=None, preserve_original=True) -> dict[str, Any]
Preprocess a search query with language detection, tokenization, stop word removal,
POS tagging, synonym expansion, stemming, and optional vectorization. This function
is used internally by SearchService
and SearchEngine but can
also be called directly for custom search pipelines.
Parameters
query
Input query string.
language
Language code (e.g., "en", "es", "fr") or "auto" for automatic detection.
pos_to_expand
POS tags to expand with synonyms. Defaults to ["NOUN", "VERB", "ADJ"].
max_synonyms
Maximum number of synonyms to add per word.
debug
Enable debug logging output.
vector
Include a vector embedding of the query in the output. Set to True when
passing the result to SearchEngine.search().
query_nlp_backend
NLP backend for query processing. Valid values:
- `"nltk"` — fast, lightweight (default)
- `"spacy"` — better quality, requires spaCy models
model_name
Sentence transformer model name for vectorization. Must match the model used to build the index being searched. If not specified, uses the default model.
preserve_original
Keep the original query terms in the enhanced text alongside expanded synonyms and stems.
Returns
dict[str, Any] — A dictionary containing:
- `input` (str) — the original query string as passed in
- `enhanced_text` (str) — the preprocessed query text with synonyms and stems
- `language` (str) — detected or specified language code
- `POS` (dict) — POS tag analysis results
- `vector` (list[float]) — embedding vector (only when `vector=True`)
Example
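The real function needs the NLP backends installed, so here is a self-contained sketch of the return shape only. The tokenizer, stop-word list, and suffix-stripping "stemmer" below are simplified stand-ins, not the library's implementation, and synonym expansion and vectorization are omitted:

```python
def preprocess_query_sketch(query: str, language: str = "en",
                            preserve_original: bool = True) -> dict:
    """Simplified stand-in for preprocess_query (no real NLP backend)."""
    stop_words = {"the", "a", "an", "to", "of", "in", "how", "do", "i"}
    tokens = [t.lower() for t in query.split() if t.lower() not in stop_words]
    # Naive suffix stripping in place of real stemming.
    stems = [t[:-3] if t.endswith("ing") else t for t in tokens]
    enhanced = tokens + [s for s in stems if s not in tokens]
    if preserve_original:
        original = query.lower().split()
        enhanced = original + [w for w in enhanced if w not in original]
    return {
        "input": query,                       # original query string
        "enhanced_text": " ".join(enhanced),  # stems appended to original terms
        "language": language,                 # no auto-detection in this sketch
        "POS": {},                            # real function returns POS analysis
    }

result = preprocess_query_sketch("how do I configure routing")
print(result["enhanced_text"])  # "how do i configure routing rout"
```

When calling the real `preprocess_query` with `vector=True`, the returned dict also carries the `vector` key, which is what `SearchEngine.search()` consumes.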
preprocess_document_content
preprocess_document_content(content, language="en", index_nlp_backend="nltk") -> dict[str, Any]
Preprocess document content for indexing. Uses less aggressive synonym expansion than query preprocessing to keep document representations focused.
This function is called internally by
IndexBuilder during
index construction.
Parameters
content
Document text content to preprocess.
language
Language code for processing.
index_nlp_backend
NLP backend for processing. "nltk" or "spacy".
Returns
dict[str, Any] — A dictionary containing:
- `enhanced_text` (str) — the preprocessed document text
- `keywords` (list[str]) — up to 20 extracted keywords (stop words removed)
- `language` (str) — the language used for processing
- `pos_analysis` (dict) — POS tag analysis
Example
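As with the query helper, a self-contained sketch of the documented return shape. The stop-word list and tokenization here are illustrative stand-ins; only the dict keys and the 20-keyword cap come from the documentation above:

```python
def preprocess_document_content_sketch(content: str, language: str = "en") -> dict:
    """Simplified stand-in: lowercase, drop stop words, cap keywords at 20."""
    stop_words = {"the", "a", "an", "and", "or", "is", "to", "of", "in"}
    tokens = [t.strip(".,").lower() for t in content.split()]
    keywords = []
    for t in tokens:
        if t and t not in stop_words and t not in keywords:
            keywords.append(t)
    return {
        "enhanced_text": " ".join(tokens),
        "keywords": keywords[:20],   # documented cap of 20 keywords
        "language": language,
        "pos_analysis": {},          # real function returns POS analysis
    }

doc = "The agent routes calls to the correct department."
result = preprocess_document_content_sketch(doc)
print(result["keywords"])  # ['agent', 'routes', 'calls', 'correct', 'department']
```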
resolve_model_alias
resolve_model_alias(model_name) -> str
Resolve a short model alias to its full model name. If the input is not a known alias, it is returned unchanged.
Parameters
model_name
A model alias or full model name. Known aliases:
- `"mini"` — `sentence-transformers/all-MiniLM-L6-v2` (384 dims, fastest)
- `"base"` — `sentence-transformers/all-mpnet-base-v2` (768 dims, balanced)
- `"large"` — `sentence-transformers/all-mpnet-base-v2` (768 dims, same as base)
Returns
str — The full sentence transformer model name.
Example
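The documented behavior can be sketched directly from the alias table above. This restates the mapping for illustration; the library's own `MODEL_ALIASES` constant is the authoritative source:

```python
# Alias table as documented above.
MODEL_ALIASES = {
    "mini": "sentence-transformers/all-MiniLM-L6-v2",
    "base": "sentence-transformers/all-mpnet-base-v2",
    "large": "sentence-transformers/all-mpnet-base-v2",
}

def resolve_model_alias(model_name: str) -> str:
    """Return the full model name for a known alias; pass through otherwise."""
    return MODEL_ALIASES.get(model_name, model_name)

print(resolve_model_alias("mini"))
# sentence-transformers/all-MiniLM-L6-v2
print(resolve_model_alias("my-org/custom-model"))
# my-org/custom-model (unknown names pass through unchanged)
```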
Constants
MODEL_ALIASES
Dictionary mapping short model aliases to full sentence transformer model names.
DEFAULT_MODEL
The default embedding model used for new indexes. This is the "mini" model,
chosen for its smaller size and faster inference. Use the "base" alias or
specify a full model name when higher embedding quality is needed.