Helper Functions & Constants
The Search module provides standalone helper functions for query preprocessing, document content preprocessing, and embedding model alias resolution, along with constants for model configuration.
These functions require the search dependencies. Install them with `pip install signalwire[search]`.
Functions
preprocess_query
preprocess_query(query, language="en", pos_to_expand=None, max_synonyms=5, debug=False, vector=False, query_nlp_backend="nltk", model_name=None, preserve_original=True) -> dict[str, Any]
Preprocess a search query with language detection, tokenization, stop word removal,
POS tagging, synonym expansion, stemming, and optional vectorization. This function
is used internally by SearchService
and SearchEngine but can
also be called directly for custom search pipelines.
Parameters
query
Input query string.
language
Language code (e.g., "en", "es", "fr") or "auto" for automatic detection.
pos_to_expand
POS tags to expand with synonyms. Defaults to ["NOUN", "VERB", "ADJ"].
max_synonyms
Maximum number of synonyms to add per word.
debug
Enable debug logging output.
vector
Include a vector embedding of the query in the output. Set to True when
passing the result to SearchEngine.search().
query_nlp_backend
NLP backend for query processing. Valid values:
- `"nltk"` — fast, lightweight (default)
- `"spacy"` — better quality, requires spaCy models
model_name
Sentence transformer model name for vectorization. Must match the model used to build the index being searched. If not specified, uses the default model.
preserve_original
Keep the original query terms in the enhanced text alongside expanded synonyms and stems.
Returns
dict[str, Any] — A dictionary containing:
- `input` (str) — the original query string as passed in
- `enhanced_text` (str) — the preprocessed query text with synonyms and stems
- `language` (str) — detected or specified language code
- `POS` (dict) — POS tag analysis results
- `vector` (list[float]) — embedding vector (only when `vector=True`)
Example
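The real function needs the NLP backends installed, so here is a self-contained sketch of the return shape only. The tokenizer, stop-word list, and suffix-stripping "stemmer" below are simplified stand-ins, not the library's implementation, and synonym expansion and vectorization are omitted:

```python
def preprocess_query_sketch(query: str, language: str = "en",
                            preserve_original: bool = True) -> dict:
    """Simplified stand-in for preprocess_query (no real NLP backend)."""
    stop_words = {"the", "a", "an", "to", "of", "in", "how", "do", "i"}
    tokens = [t.lower() for t in query.split() if t.lower() not in stop_words]
    # Naive suffix stripping in place of real stemming.
    stems = [t[:-3] if t.endswith("ing") else t for t in tokens]
    enhanced = tokens + [s for s in stems if s not in tokens]
    if preserve_original:
        original = query.lower().split()
        enhanced = original + [w for w in enhanced if w not in original]
    return {
        "input": query,                       # original query string
        "enhanced_text": " ".join(enhanced),  # stems appended to original terms
        "language": language,                 # no auto-detection in this sketch
        "POS": {},                            # real function returns POS analysis
    }

result = preprocess_query_sketch("how do I configure routing")
print(result["enhanced_text"])  # "how do i configure routing rout"
```

When calling the real `preprocess_query` with `vector=True`, the returned dict also carries the `vector` key, which is what `SearchEngine.search()` consumes.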
preprocess_document_content
preprocess_document_content(content, language="en", index_nlp_backend="nltk") -> dict[str, Any]
Preprocess document content for indexing. Uses less aggressive synonym expansion than query preprocessing to keep document representations focused.
This function is called internally by
IndexBuilder during
index construction.
Parameters
content
Document text content to preprocess.
language
Language code for processing.
index_nlp_backend
NLP backend for processing. "nltk" or "spacy".
Returns
dict[str, Any] — A dictionary containing:
- `enhanced_text` (str) — the preprocessed document text
- `keywords` (list[str]) — up to 20 extracted keywords (stop words removed)
- `language` (str) — the language used for processing
- `pos_analysis` (dict) — POS tag analysis
Example
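As with the query helper, a self-contained sketch of the documented return shape. The stop-word list and tokenization here are illustrative stand-ins; only the dict keys and the 20-keyword cap come from the documentation above:

```python
def preprocess_document_content_sketch(content: str, language: str = "en") -> dict:
    """Simplified stand-in: lowercase, drop stop words, cap keywords at 20."""
    stop_words = {"the", "a", "an", "and", "or", "is", "to", "of", "in"}
    tokens = [t.strip(".,").lower() for t in content.split()]
    keywords = []
    for t in tokens:
        if t and t not in stop_words and t not in keywords:
            keywords.append(t)
    return {
        "enhanced_text": " ".join(tokens),
        "keywords": keywords[:20],   # documented cap of 20 keywords
        "language": language,
        "pos_analysis": {},          # real function returns POS analysis
    }

doc = "The agent routes calls to the correct department."
result = preprocess_document_content_sketch(doc)
print(result["keywords"])  # ['agent', 'routes', 'calls', 'correct', 'department']
```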
resolve_model_alias
resolve_model_alias(model_name) -> str
Resolve a short model alias to its full model name. If the input is not a known alias, it is returned unchanged.
Parameters
model_name
A model alias or full model name. Known aliases:
- `"mini"` — `sentence-transformers/all-MiniLM-L6-v2` (384 dims, fastest)
- `"base"` — `sentence-transformers/all-mpnet-base-v2` (768 dims, balanced)
- `"large"` — `sentence-transformers/all-mpnet-base-v2` (768 dims, same as base)
Returns
str — The full sentence transformer model name.
Example
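The documented behavior can be sketched directly from the alias table above. This restates the mapping for illustration; the library's own `MODEL_ALIASES` constant is the authoritative source:

```python
# Alias table as documented above.
MODEL_ALIASES = {
    "mini": "sentence-transformers/all-MiniLM-L6-v2",
    "base": "sentence-transformers/all-mpnet-base-v2",
    "large": "sentence-transformers/all-mpnet-base-v2",
}

def resolve_model_alias(model_name: str) -> str:
    """Return the full model name for a known alias; pass through otherwise."""
    return MODEL_ALIASES.get(model_name, model_name)

print(resolve_model_alias("mini"))
# sentence-transformers/all-MiniLM-L6-v2
print(resolve_model_alias("my-org/custom-model"))
# my-org/custom-model (unknown names pass through unchanged)
```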
Constants
MODEL_ALIASES
Dictionary mapping short model aliases to full sentence transformer model names.
DEFAULT_MODEL
The default embedding model used for new indexes. This is the "mini" model,
chosen for its smaller size and faster inference. Use the "base" alias or
specify a full model name when higher embedding quality is needed.