DocumentProcessor

DocumentProcessor handles document text extraction and chunking for search index construction. It supports multiple file formats (PDF, DOCX, HTML, Markdown, Excel, PowerPoint, RTF) and provides several chunking strategies optimized for different content types and search use cases.

from signalwire.search import DocumentProcessor

Full document processing requires additional dependencies. Install them with pip install "signalwire[search-full]" to add PDF, DOCX, and other format support. The quotes keep shells such as zsh from interpreting the square brackets.

Properties

chunking_strategy
str

The active chunking strategy.

max_sentences_per_chunk
int

Maximum sentences per chunk when using the sentence strategy.

chunk_size
int

Word count per chunk when using the sliding strategy.

chunk_overlap
int

Word overlap between chunks when using the sliding strategy.

split_newlines
int | None

Number of consecutive newlines that trigger a split before sentence tokenization in the sentence strategy. None when not explicitly set.

semantic_threshold
float

Similarity threshold for the semantic chunking strategy.

topic_threshold
float

Similarity threshold for the topic chunking strategy.
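
To make chunk_size, chunk_overlap, and split_newlines concrete, here is a minimal standalone sketch of word-window chunking and newline-based splitting. It does not use or reproduce DocumentProcessor's internals: the function names sliding_chunks and split_on_newlines, the default values, and the "N or more newlines" rule are assumptions chosen for illustration.

```python
import re


def sliding_chunks(text, chunk_size=50, chunk_overlap=10):
    """Split text into windows of chunk_size words.

    Each window repeats the last chunk_overlap words of the previous one.
    Hypothetical helper; defaults are illustrative, not the library's.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text,
        # so no trailing chunk made only of overlap words is emitted.
        if start + chunk_size >= len(words):
            break
    return chunks


def split_on_newlines(text, split_newlines=2):
    """Split text at runs of split_newlines or more consecutive newlines.

    Assumption: a longer run of newlines also triggers a split.
    """
    if split_newlines < 1:
        raise ValueError("split_newlines must be at least 1")
    pattern = r"\n{%d,}" % split_newlines
    return [part.strip() for part in re.split(pattern, text) if part.strip()]


if __name__ == "__main__":
    print(sliding_chunks("a b c d e f g h i j", chunk_size=4, chunk_overlap=1))
    print(split_on_newlines("one\ntwo\n\nthree\n\n\nfour", split_newlines=2))
```

In this sketch a larger chunk_overlap repeats more context across neighbouring chunks at the cost of producing more chunks, while split_newlines only decides where text is pre-split before any sentence tokenization would run.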

Methods