Cli Sw Search | SignalWire

sw-search CLI

Command-line tool for building, searching, and managing vector search indexes for AI agent knowledge bases.

Overview

The sw-search tool builds vector search indexes from documents for use with the native_vector_search skill.

Capabilities:

Build indexes from documents (MD, TXT, PDF, DOCX, RST, PY)
Multiple chunking strategies for different content types
SQLite and PostgreSQL/pgvector storage backends
Interactive search shell for index exploration
Export chunks to JSON for review or external processing
Migrate indexes between backends
Search via remote API endpoints

Architecture

The system provides:

Offline Search: No external API calls or internet required
Hybrid Search: Combines vector similarity and keyword search
Smart Chunking: Intelligent document segmentation with context preservation
Advanced Query Processing: NLP-enhanced query understanding
Flexible Deployment: Local embedded mode or remote server mode
SQLite Storage: Portable .swsearch index files

Command Modes

sw-search operates in five modes:

Mode	Syntax	Purpose
build	`sw-search ./docs`	Build search index
search	`sw-search search FILE QUERY`	Search existing index
validate	`sw-search validate FILE`	Validate index integrity
migrate	`sw-search migrate FILE`	Migrate between backends
remote	`sw-search remote URL QUERY`	Search via remote API

Quick Start

$ ## Build index from documentation
$ sw-search ./docs --output knowledge.swsearch
$ 
$ ## Search the index
$ sw-search search knowledge.swsearch "how to create an agent"
$ 
$ ## Interactive search shell
$ sw-search search knowledge.swsearch --shell
$ 
$ ## Validate index
$ sw-search validate knowledge.swsearch

Building Indexes

Index Structure

Each .swsearch file is a SQLite database containing:

Document chunks with embeddings and metadata
Full-text search index (SQLite FTS5) for keyword search
Configuration and model information
Synonym cache for query expansion

This portable format allows you to build indexes once and distribute them with your agents.

Basic Usage

$ ## Build from single directory
$ sw-search ./docs
$ 
$ ## Build from multiple directories
$ sw-search ./docs ./examples --file-types md,txt,py
$ 
$ ## Build from individual files
$ sw-search README.md ./docs/guide.md ./src/main.py
$ 
$ ## Mixed sources (directories and files)
$ sw-search ./docs README.md ./examples specific_file.txt
$ 
$ ## Specify output file
$ sw-search ./docs --output ./knowledge.swsearch

Build Options

Option	Default	Description
`--output FILE`	sources.swsearch	Output file or collection
`--output-dir DIR`	(none)	Output directory
`--output-format`	index	Output: index or json
`--backend`	sqlite	Storage: sqlite or pgvector
`--file-types`	md,txt,rst	Comma-separated extensions
`--exclude`	(none)	Glob patterns to exclude
`--languages`	en	Language codes
`--tags`	(none)	Tags for all chunks
`--validate`	false	Validate after building
`--verbose`	false	Detailed output

Chunking Strategies

Choose the right strategy for your content:

Strategy	Best For	Key Options
sentence	General prose, articles	`--max-sentences-per-chunk`
sliding	Code, technical documentation	`--chunk-size`, `--overlap-size`
paragraph	Structured documents	(none)
page	PDFs with distinct pages	(none)
semantic	Coherent topic grouping	`--semantic-threshold`
topic	Long documents by subject	`--topic-threshold`
qa	Question-answering apps	(none)
markdown	Documentation with code blocks	(preserves structure)
json	Pre-chunked content	(none)

Sentence Chunking (Default)

Groups sentences together:

$ ## Default: 5 sentences per chunk
$ sw-search ./docs --chunking-strategy sentence
$ 
$ ## Custom sentence count
$ sw-search ./docs \
>   --chunking-strategy sentence \
>   --max-sentences-per-chunk 10
$ 
$ ## Split on multiple newlines
$ sw-search ./docs \
>   --chunking-strategy sentence \
>   --max-sentences-per-chunk 8 \
>   --split-newlines 2

Sliding Window Chunking

Fixed-size chunks with overlap:

$ sw-search ./docs \
>   --chunking-strategy sliding \
>   --chunk-size 100 \
>   --overlap-size 20

Paragraph Chunking

Splits on double newlines:

$ sw-search ./docs \
>   --chunking-strategy paragraph \
>   --file-types md,txt,rst

Page Chunking

Best for PDFs:

$ sw-search ./docs \
>   --chunking-strategy page \
>   --file-types pdf

Semantic Chunking

Groups semantically similar sentences:

$ sw-search ./docs \
>   --chunking-strategy semantic \
>   --semantic-threshold 0.6

Topic Chunking

Detects topic changes:

$ sw-search ./docs \
>   --chunking-strategy topic \
>   --topic-threshold 0.2

QA Chunking

Optimized for question-answering:

$ sw-search ./docs --chunking-strategy qa

Markdown Chunking

The markdown strategy is specifically designed for documentation that contains code examples. It understands markdown structure and adds rich metadata for better search results.

$ sw-search ./docs \
>   --chunking-strategy markdown \
>   --file-types md

Features:

Header-based chunking: Splits at markdown headers (h1, h2, h3…) for natural boundaries
Code block detection: Identifies fenced code blocks and extracts language (python, bash, etc.)
Smart tagging: Adds "code" tags to chunks with code, plus language-specific tags
Section hierarchy: Preserves full path (e.g., “API Reference > AgentBase > Methods”)
Code protection: Never splits inside code blocks
Metadata enrichment: Header levels stored as searchable metadata

Example Metadata:

1 {
2   "chunk_type": "markdown",
3   "h1": "API Reference",
4   "h2": "AgentBase",
5   "h3": "add_skill Method",
6   "has_code": true,
7   "code_languages": ["python", "bash"],
8   "tags": ["code", "code:python", "code:bash", "depth:3"]
9 }

Search Benefits:

When users search for “example code Python”:

Chunks with code blocks get automatic 20% boost
Python-specific code gets language match bonus
Vector similarity provides primary semantic ranking
Metadata tags provide confirmation signals
Results blend semantic + structural relevance

Best Used With:

API documentation with code examples
Tutorial content with inline code
Technical guides with multiple languages
README files with usage examples

Usage with pgvector:

$ sw-search ./docs \
>   --backend pgvector \
>   --connection-string "postgresql://user:pass@localhost:5432/db" \
>   --output docs_collection \
>   --chunking-strategy markdown

JSON Chunking

The json strategy allows you to provide pre-chunked content in a structured format. This is useful when you need custom control over how documents are split and indexed.

Expected JSON Format:

1 {
2   "chunks": [
3     {
4       "chunk_id": "unique_id",
5       "type": "content",
6       "content": "The actual text content",
7       "metadata": {
8         "section": "Introduction",
9         "url": "https://example.com/docs/intro",
10         "custom_field": "any_value"
11       },
12       "tags": ["intro", "getting-started"]
13     }
14   ]
15 }

Usage:

$ ## First preprocess your documents into JSON chunks
$ python your_preprocessor.py input.txt -o chunks.json
$ 
$ ## Then build the index using JSON strategy
$ sw-search chunks.json --chunking-strategy json --file-types json

Best Used For:

API documentation with complex structure
Documents that need custom parsing logic
Preserving specific metadata relationships
Integration with external preprocessing tools

Model Selection

Choose embedding model based on speed vs quality:

Alias	Model	Dims	Speed	Quality
mini	all-MiniLM-L6-v2	384	~5x	Good
base	all-mpnet-base-v2	768	1x	High
large	all-mpnet-base-v2	768	1x	Highest

$ ## Fast model (default, recommended for most cases)
$ sw-search ./docs --model mini
$ 
$ ## Balanced model
$ sw-search ./docs --model base
$ 
$ ## Best quality
$ sw-search ./docs --model large
$ 
$ ## Full model name
$ sw-search ./docs --model sentence-transformers/all-mpnet-base-v2

File Filtering

$ ## Specific file types
$ sw-search ./docs --file-types md,txt,rst,py
$ 
$ ## Exclude patterns
$ sw-search ./docs --exclude "**/test/**,**/__pycache__/**,**/.git/**"
$ 
$ ## Language filtering
$ sw-search ./docs --languages en,es,fr

Tags and Metadata

Add tags during build for filtered searching:

$ ## Add tags to all chunks
$ sw-search ./docs --tags documentation,api,v2
$ 
$ ## Filter by tags when searching
$ sw-search search index.swsearch "query" --tags documentation

Searching Indexes

Basic Search

$ ## Search with query
$ sw-search search knowledge.swsearch "how to create an agent"
$ 
$ ## Limit results
$ sw-search search knowledge.swsearch "API reference" --count 3
$ 
$ ## Verbose output with scores
$ sw-search search knowledge.swsearch "configuration" --verbose

Search Options

Option	Default	Description
`--count`	5	Number of results
`--distance-threshold`	0.0	Minimum similarity score
`--tags`	(none)	Filter by tags
`--query-nlp-backend`	nltk	NLP backend: nltk or spacy
`--keyword-weight`	(auto)	Manual keyword weight (0.0-1.0)
`--model`	(index)	Override embedding model
`--json`	false	Output as JSON
`--no-content`	false	Hide content, show metadata only
`--verbose`	false	Detailed output

Output Formats

$ ## Human-readable (default)
$ sw-search search knowledge.swsearch "query"
$ 
$ ## JSON output
$ sw-search search knowledge.swsearch "query" --json
$ 
$ ## Metadata only
$ sw-search search knowledge.swsearch "query" --no-content
$ 
$ ## Full verbose output
$ sw-search search knowledge.swsearch "query" --verbose

Filter by Tags

$ ## Single tag
$ sw-search search knowledge.swsearch "functions" --tags documentation
$ 
$ ## Multiple tags
$ sw-search search knowledge.swsearch "API" --tags api,reference

Interactive Search Shell

Load index once and search multiple times:

$ sw-search search knowledge.swsearch --shell

Shell commands:

Command	Description
`help`	Show help
`exit`/`quit`/`q`	Exit shell
`count=N`	Set result count
`tags=tag1,tag2`	Set tag filter
`verbose`	Toggle verbose output
`<query>`	Search for query

Example session:

$ sw-search search knowledge.swsearch --shell
Search Shell - Index: knowledge.swsearch
Backend: sqlite
Index contains 1523 chunks from 47 files
Model: sentence-transformers/all-MiniLM-L6-v2
Type 'exit' or 'quit' to leave, 'help' for options
------------------------------------------------------------
search> how to create an agent
Found 5 result(s) for 'how to create an agent' (0.034s):
...
search> count=3
Result count set to: 3
search> SWAIG functions
Found 3 result(s) for 'SWAIG functions' (0.028s):
...
search> exit
Goodbye!

PostgreSQL/pgvector Backend

The search system supports multiple storage backends. Choose based on your deployment needs:

Backend Comparison

Feature	SQLite	pgvector
Setup complexity	None	Requires PostgreSQL
Scalability	Limited	Excellent
Concurrent access	Poor	Excellent
Update capability	Rebuild required	Real-time
Performance (small datasets)	Excellent	Good
Performance (large datasets)	Poor	Excellent
Deployment	File copy	Database connection
Multi-agent support	Separate copies	Shared knowledge base

SQLite Backend (Default):

File-based .swsearch indexes
Portable single-file format
No external dependencies
Best for: Single-agent deployments, development, small to medium datasets

pgvector Backend:

Server-based PostgreSQL storage
Efficient similarity search with IVFFlat/HNSW indexes
Multiple agents can share the same knowledge base
Real-time updates without rebuilding
Best for: Production deployments, multi-agent systems, large datasets

Building with pgvector

$ ## Build to pgvector
$ sw-search ./docs \
>   --backend pgvector \
>   --connection-string "postgresql://user:pass@localhost:5432/knowledge" \
>   --output docs_collection
$ 
$ ## With markdown strategy
$ sw-search ./docs \
>   --backend pgvector \
>   --connection-string "postgresql://user:pass@localhost:5432/knowledge" \
>   --output docs_collection \
>   --chunking-strategy markdown
$ 
$ ## Overwrite existing collection
$ sw-search ./docs \
>   --backend pgvector \
>   --connection-string "postgresql://user:pass@localhost:5432/knowledge" \
>   --output docs_collection \
>   --overwrite

Search pgvector Collection

$ sw-search search docs_collection "how to create an agent" \
>   --backend pgvector \
>   --connection-string "postgresql://user:pass@localhost/knowledge"

Migration

Migrate indexes between backends:

$ ## Get index information
$ sw-search migrate --info ./docs.swsearch
$ 
$ ## Migrate SQLite to pgvector
$ sw-search migrate ./docs.swsearch --to-pgvector \
>   --connection-string "postgresql://user:pass@localhost/db" \
>   --collection-name docs_collection
$ 
$ ## Migrate with overwrite
$ sw-search migrate ./docs.swsearch --to-pgvector \
>   --connection-string "postgresql://user:pass@localhost/db" \
>   --collection-name docs_collection \
>   --overwrite

Migration Options

Option	Description
`--info`	Show index information
`--to-pgvector`	Migrate SQLite to pgvector
`--to-sqlite`	Migrate pgvector to SQLite (planned)
`--connection-string`	PostgreSQL connection string
`--collection-name`	Target collection name
`--overwrite`	Overwrite existing collection
`--batch-size`	Chunks per batch (default: 100)

Local vs Remote Modes

The search skill supports both local and remote operation modes.

Local Mode (Default)

Searches are performed directly in the agent process using the embedded search engine.

Pros:

Faster (no network latency)
Works offline
Simple deployment
Lower operational complexity

Cons:

Higher memory usage per agent
Index files must be distributed with each agent
Updates require redeploying agents

Configuration in Agent:

1 self.add_skill("native_vector_search", {
2     "tool_name": "search_docs",
3     "index_file": "docs.swsearch",  # Local file
4     "nlp_backend": "nltk"
5 })

Remote Mode

Searches are performed via HTTP API to a centralized search server.

Pros:

Lower memory usage per agent
Centralized index management
Easy updates without redeploying agents
Better scalability for multiple agents
Shared resources

Cons:

Network dependency
Additional infrastructure complexity
Potential latency

Configuration in Agent:

1 self.add_skill("native_vector_search", {
2     "tool_name": "search_docs",
3     "remote_url": "http://localhost:8001",  # Search server
4     "index_name": "docs",
5     "nlp_backend": "nltk"
6 })

Automatic Mode Detection

The skill automatically detects which mode to use:

If remote_url is provided → Remote mode
If index_file is provided → Local mode
Remote mode takes priority if both are specified

Running a Remote Search Server

Start the search server:

$ python examples/search_server_standalone.py

The server provides HTTP API:
- POST /search - Search the indexes
- GET /health - Health check and available indexes
- POST /reload_index - Add or reload an index
Test the API:

$ curl -X POST "http://localhost:8001/search" \
>      -H "Content-Type: application/json" \
>      -d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'

Remote Search CLI

Search via remote API endpoint from the command line:

$ ## Basic remote search
$ sw-search remote http://localhost:8001 "how to create an agent" \
>   --index-name docs
$ 
$ ## With options
$ sw-search remote localhost:8001 "API reference" \
>   --index-name docs \
>   --count 3 \
>   --verbose
$ 
$ ## JSON output
$ sw-search remote localhost:8001 "query" \
>   --index-name docs \
>   --json

Remote Options

Option	Default	Description
`--index-name`	(required)	Name of the index to search
`--count`	5	Number of results
`--distance-threshold`	0.0	Minimum similarity score
`--tags`	(none)	Filter by tags
`--timeout`	30	Request timeout in seconds
`--json`	false	Output as JSON
`--no-content`	false	Hide content
`--verbose`	false	Detailed output

Validation

Verify index integrity:

$ ## Validate index
$ sw-search validate ./docs.swsearch
$ 
$ ## Verbose validation
$ sw-search validate ./docs.swsearch --verbose

Output:

✓ Index is valid: ./docs.swsearch
  Chunks: 1523
  Files: 47
Configuration:
  embedding_model: sentence-transformers/all-MiniLM-L6-v2
  embedding_dimensions: 384
  chunking_strategy: markdown
  created_at: 2025-01-15T10:30:00

JSON Export

Export chunks for review or external processing:

$ ## Export to single JSON file
$ sw-search ./docs \
>   --output-format json \
>   --output all_chunks.json
$ 
$ ## Export to directory (one file per source)
$ sw-search ./docs \
>   --output-format json \
>   --output-dir ./chunks/
$ 
$ ## Build index from exported JSON
$ sw-search ./chunks/ \
>   --chunking-strategy json \
>   --file-types json \
>   --output final.swsearch

NLP Backend Selection

Choose NLP backend for processing:

Backend	Speed	Quality	Install Size
nltk	Fast	Good	Included
spacy	Slower	Better	Requires: `pip install signalwire-agents[search-nlp]`

$ ## Index with NLTK (default)
$ sw-search ./docs --index-nlp-backend nltk
$ 
$ ## Index with spaCy (better quality)
$ sw-search ./docs --index-nlp-backend spacy
$ 
$ ## Query with NLTK
$ sw-search search index.swsearch "query" --query-nlp-backend nltk
$ 
$ ## Query with spaCy
$ sw-search search index.swsearch "query" --query-nlp-backend spacy

Complete Configuration Example

$ sw-search ./docs ./examples README.md \
>   --output ./knowledge.swsearch \
>   --chunking-strategy sentence \
>   --max-sentences-per-chunk 8 \
>   --file-types md,txt,rst,py \
>   --exclude "**/test/**,**/__pycache__/**" \
>   --languages en,es,fr \
>   --model sentence-transformers/all-mpnet-base-v2 \
>   --tags documentation,api \
>   --index-nlp-backend nltk \
>   --validate \
>   --verbose

Using with Skills

After building an index, use it with the native_vector_search skill:

1 from signalwire_agents import AgentBase
2 
3 agent = AgentBase(name="search-agent")
4 
5 ## Add search skill with built index
6 agent.add_skill("native_vector_search", {
7     "index_path": "./knowledge.swsearch",
8     "tool_name": "search_docs",
9     "tool_description": "Search the documentation"
10 })

Output Formats

Format	Extension	Description
swsearch	.swsearch	SQLite-based portable index (default)
json	.json	JSON export of chunks
pgvector	(database)	PostgreSQL with pgvector extension

Installation Requirements

The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:

Basic Search (~500MB)

$ pip install "signalwire-agents[search]"

Includes:

Core search functionality
Sentence transformers for embeddings
SQLite FTS5 for keyword search
Basic document processing (text, markdown)

Full Document Processing (~600MB)

$ pip install "signalwire-agents[search-full]"

Adds:

PDF processing (PyPDF2)
DOCX processing (python-docx)
HTML processing (BeautifulSoup4)
Additional file format support

Advanced NLP Features (~700MB)

$ pip install "signalwire-agents[search-nlp]"

Adds:

spaCy for advanced text processing
NLTK for linguistic analysis
Enhanced query preprocessing
Language detection

Additional Setup Required:

$ python -m spacy download en_core_web_sm

Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Only recommended if you have sufficient CPU power and can tolerate longer response times.

All Search Features (~700MB)

$ pip install "signalwire-agents[search-all]"

Includes everything above.

Additional Setup Required:

$ python -m spacy download en_core_web_sm

Query-Only Mode (~400MB)

$ pip install "signalwire-agents[search-queryonly]"

For agents that only need to query pre-built indexes without building new ones.

PostgreSQL Vector Support

$ pip install "signalwire-agents[pgvector]"

Adds PostgreSQL with pgvector extension support for production deployments.

NLP Backend Selection

You can choose which NLP backend to use for query processing:

Backend	Speed	Quality	Notes
nltk	Fast (~50-100ms)	Good	Default, good for most use cases
spacy	Slower (~150-300ms)	Better	Better POS tagging and entity recognition

Configure via --index-nlp-backend (build) or --query-nlp-backend (search) flags.

API Reference

For programmatic access to the search system, use the Python API directly.

SearchEngine Class

1 from signalwire_agents.search import SearchEngine
2 
3 ## Load an index
4 engine = SearchEngine("docs.swsearch")
5 
6 ## Perform search
7 results = engine.search(
8     query_vector=[...],  # Optional: pre-computed query vector
9     enhanced_text="search query",  # Enhanced query text
10     count=5,  # Number of results
11     similarity_threshold=0.0,  # Minimum similarity score
12     tags=["documentation"]  # Filter by tags
13 )
14 
15 ## Get index statistics
16 stats = engine.get_stats()
17 print(f"Total chunks: {stats['total_chunks']}")
18 print(f"Total files: {stats['total_files']}")

IndexBuilder Class

1 from signalwire_agents.search import IndexBuilder
2 
3 ## Create index builder
4 builder = IndexBuilder(
5     model_name="sentence-transformers/all-mpnet-base-v2",
6     chunk_size=500,
7     chunk_overlap=50,
8     verbose=True
9 )
10 
11 ## Build index
12 builder.build_index(
13     source_dir="./docs",
14     output_file="docs.swsearch",
15     file_types=["md", "txt"],
16     exclude_patterns=["**/test/**"],
17     tags=["documentation"]
18 )

Troubleshooting

Issue	Solution
Search not available	`pip install signalwire-agents[search]`
pgvector errors	`pip install signalwire-agents[pgvector]`
PDF processing fails	`pip install signalwire-agents[search-full]`
spaCy not found	`pip install signalwire-agents[search-nlp]`
No results found	Try different chunking strategy
Poor search quality	Use `--model base` or larger chunks
Index too large	Use `--model mini`, reduce file types
Connection refused (remote)	Check search server is running

native_vector_search Skill - Using search indexes in agents
Skills Overview - Adding skills to agents
DataSphere Integration - Cloud-based search alternative

sw-search CLI

Overview

Architecture

Command Modes

Quick Start

Building Indexes

Index Structure

Basic Usage

Build Options

Chunking Strategies

Sentence Chunking (Default)

Sliding Window Chunking

Paragraph Chunking

Page Chunking

Semantic Chunking

Topic Chunking

QA Chunking

Markdown Chunking

JSON Chunking

Model Selection

File Filtering

Tags and Metadata

Searching Indexes

Basic Search

Search Options

Output Formats

Filter by Tags

Interactive Search Shell

PostgreSQL/pgvector Backend

Backend Comparison

Building with pgvector

Search pgvector Collection

Migration

Migration Options

Local vs Remote Modes

Local Mode (Default)

Remote Mode

Automatic Mode Detection

Running a Remote Search Server

Remote Search CLI

Remote Options

Validation

JSON Export

NLP Backend Selection

Complete Configuration Example

Using with Skills

Output Formats

Installation Requirements

Basic Search (~500MB)

Full Document Processing (~600MB)

Advanced NLP Features (~700MB)

All Search Features (~700MB)

Query-Only Mode (~400MB)

PostgreSQL Vector Support

NLP Backend Selection

API Reference

SearchEngine Class

IndexBuilder Class

Troubleshooting

Related Documentation