DocumentProcessor
DocumentProcessor handles document text extraction and chunking for search index
construction. It supports multiple file formats (PDF, DOCX, HTML, Markdown, Excel,
PowerPoint, RTF) and provides several chunking strategies optimized for different
content types and search use cases.
Full document processing requires additional dependencies. Install them with
pip install signalwire[search-full] to enable PDF, DOCX, and other format support.
Properties
chunking_strategy
The active chunking strategy.
max_sentences_per_chunk
Maximum sentences per chunk when using the sentence strategy.
chunk_size
Word count per chunk when using the sliding strategy.
chunk_overlap
Number of words shared by consecutive chunks when using the sliding strategy.
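To make the relationship between chunk_size and chunk_overlap concrete, here is a minimal sketch of word-based sliding-window chunking. It is an illustration only, not the library's implementation: the function name sliding_window_chunks is hypothetical, and the whitespace-based word splitting is an assumption.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Hypothetical helper: split text into word-based chunks that overlap.

    Each chunk holds up to chunk_size words, and consecutive chunks share
    chunk_overlap words. Words are assumed to be whitespace-separated.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    for chunk in sliding_window_chunks(sample, chunk_size=4, chunk_overlap=1):
        print(chunk)
```

With these values the window advances three words at a time, so the last word of each chunk reappears as the first word of the next one.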
split_newlines
Number of consecutive newlines that trigger a split before sentence tokenization
in the sentence strategy. The value is None when not explicitly set.
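The interaction between split_newlines and max_sentences_per_chunk can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: the function name sentence_chunks is hypothetical, the regex-based sentence tokenizer is a naive stand-in, and treating split_newlines as "at least this many newlines" is an assumption.

```python
import re


def sentence_chunks(text, max_sentences_per_chunk, split_newlines=None):
    """Hypothetical helper: group sentences into chunks of a bounded size.

    If split_newlines is set, the text is first split on runs of at least
    that many newlines (an assumption), so no chunk crosses such a break.
    """
    if max_sentences_per_chunk <= 0:
        raise ValueError("max_sentences_per_chunk must be positive")

    if split_newlines:
        blocks = re.split(r"\n{%d,}" % split_newlines, text)
    else:
        blocks = [text]

    chunks = []
    for block in blocks:
        # Naive tokenizer: break after '.', '!' or '?' followed by whitespace.
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", block) if s.strip()]
        for i in range(0, len(sentences), max_sentences_per_chunk):
            chunks.append(" ".join(sentences[i:i + max_sentences_per_chunk]))
    return chunks


if __name__ == "__main__":
    sample = "One. Two. Three.\n\nFour. Five."
    print(sentence_chunks(sample, max_sentences_per_chunk=2, split_newlines=2))
    print(sentence_chunks(sample, max_sentences_per_chunk=2))
```

In this sketch, setting split_newlines=2 keeps "Three." and "Four." in separate chunks because a blank line separates them. Leaving it as None lets sentences from both paragraphs share a chunk.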
semantic_threshold
Similarity threshold for the semantic chunking strategy.
topic_threshold
Similarity threshold for the topic chunking strategy.
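A rough sketch of how a similarity threshold can drive chunk boundaries is shown below. It rests on several assumptions: a new chunk starts when the similarity between adjacent sentences falls below the threshold, and similarity is computed as bag-of-words cosine similarity. The library's semantic and topic strategies may measure similarity differently (for example with embeddings) and may apply the threshold differently. The names cosine_similarity and threshold_chunks are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def threshold_chunks(sentences, threshold):
    """Hypothetical helper: start a new chunk when adjacent sentences
    are less similar than threshold (assumed direction of the comparison)."""
    chunks = []
    current = []
    for sentence in sentences:
        if current:
            prev_vec = Counter(current[-1].lower().split())
            curr_vec = Counter(sentence.lower().split())
            if cosine_similarity(prev_vec, curr_vec) < threshold:
                chunks.append(" ".join(current))
                current = []
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    sample = ["cats chase mice", "cats chase birds", "stocks fell sharply"]
    print(threshold_chunks(sample, threshold=0.5))
```

Under these assumptions, a higher threshold produces more, smaller chunks, because adjacent sentences must be more alike to stay together.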