sw-search CLI

Command-line tool for building, searching, and managing vector search indexes for AI agent knowledge bases.

Overview

The sw-search tool builds vector search indexes from documents for use with the native_vector_search skill.

Capabilities:

  • Build indexes from documents (MD, TXT, PDF, DOCX, RST, PY)
  • Multiple chunking strategies for different content types
  • SQLite and PostgreSQL/pgvector storage backends
  • Interactive search shell for index exploration
  • Export chunks to JSON for review or external processing
  • Migrate indexes between backends
  • Search via remote API endpoints

Architecture

[Figure: Search architecture diagram]

The system provides:

  • Offline Search: No external API calls or internet required
  • Hybrid Search: Combines vector similarity and keyword search
  • Smart Chunking: Intelligent document segmentation with context preservation
  • Advanced Query Processing: NLP-enhanced query understanding
  • Flexible Deployment: Local embedded mode or remote server mode
  • SQLite Storage: Portable .swsearch index files
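Conceptually, hybrid search blends the two signals into a single ranking score. The sketch below is illustrative only, not the engine's actual internals: the linear blend and the assumption that both scores are normalized to [0, 1] are simplifications, and `keyword_weight` here mirrors the CLI's `--keyword-weight` search option.

```python
def hybrid_score(vector_sim, keyword_score, keyword_weight=0.3):
    """Blend vector similarity with a keyword (full-text) score.

    Illustrative sketch: both inputs are assumed normalized to [0, 1].
    keyword_weight plays the role of the --keyword-weight option.
    """
    return (1.0 - keyword_weight) * vector_sim + keyword_weight * keyword_score

# A chunk that matches both semantically and lexically outranks one
# that matches on a single signal.
print(hybrid_score(0.9, 0.8))  # strong on both signals
print(hybrid_score(0.9, 0.0))  # semantic match only
```

Raising the keyword weight favors exact-term matches; lowering it favors semantic similarity.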

Command Modes

sw-search operates in five modes:

Mode      Syntax                       Purpose
build     sw-search ./docs             Build search index
search    sw-search search FILE QUERY  Search existing index
validate  sw-search validate FILE      Validate index integrity
migrate   sw-search migrate FILE       Migrate between backends
remote    sw-search remote URL QUERY   Search via remote API

Quick Start

# Build index from documentation
sw-search ./docs --output knowledge.swsearch

# Search the index
sw-search search knowledge.swsearch "how to create an agent"

# Interactive search shell
sw-search search knowledge.swsearch --shell

# Validate index
sw-search validate knowledge.swsearch

Building Indexes

Index Structure

Each .swsearch file is a SQLite database containing:

  • Document chunks with embeddings and metadata
  • Full-text search index (SQLite FTS5) for keyword search
  • Configuration and model information
  • Synonym cache for query expansion

This portable format allows you to build indexes once and distribute them with your agents.

Basic Usage

# Build from single directory
sw-search ./docs

# Build from multiple directories
sw-search ./docs ./examples --file-types md,txt,py

# Build from individual files
sw-search README.md ./docs/guide.md ./src/main.py

# Mixed sources (directories and files)
sw-search ./docs README.md ./examples specific_file.txt

# Specify output file
sw-search ./docs --output ./knowledge.swsearch

Build Options

Option            Default           Description
--output FILE     sources.swsearch  Output file or collection
--output-dir DIR  (none)            Output directory
--output-format   index             Output: index or json
--backend         sqlite            Storage: sqlite or pgvector
--file-types      md,txt,rst        Comma-separated extensions
--exclude         (none)            Glob patterns to exclude
--languages       en                Language codes
--tags            (none)            Tags for all chunks
--validate        false             Validate after building
--verbose         false             Detailed output

Chunking Strategies

Choose the right strategy for your content:

Strategy   Best For                        Key Options
sentence   General prose, articles         --max-sentences-per-chunk
sliding    Code, technical documentation   --chunk-size, --overlap-size
paragraph  Structured documents            (none)
page       PDFs with distinct pages        (none)
semantic   Coherent topic grouping         --semantic-threshold
topic      Long documents by subject       --topic-threshold
qa         Question-answering apps         (none)
markdown   Documentation with code blocks  (preserves structure)
json       Pre-chunked content             (none)

Sentence Chunking (Default)

Groups sentences together:

# Default: 5 sentences per chunk
sw-search ./docs --chunking-strategy sentence

# Custom sentence count
sw-search ./docs \
  --chunking-strategy sentence \
  --max-sentences-per-chunk 10

# Split on multiple newlines
sw-search ./docs \
  --chunking-strategy sentence \
  --max-sentences-per-chunk 8 \
  --split-newlines 2

Sliding Window Chunking

Fixed-size chunks with overlap:

sw-search ./docs \
  --chunking-strategy sliding \
  --chunk-size 100 \
  --overlap-size 20
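Under this strategy, each chunk shares its trailing words with the start of the next one, so content near a boundary appears in both chunks. A simplified word-level sketch (illustrative only; the real chunker also carries source metadata and embeddings):

```python
def sliding_chunks(text, chunk_size=100, overlap_size=20):
    """Split text into word windows of chunk_size, each overlapping
    the previous window by overlap_size words."""
    assert chunk_size > overlap_size, "overlap must be smaller than chunk size"
    words = text.split()
    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # last window already covers the end of the text
    return chunks

chunks = sliding_chunks("word " * 250, chunk_size=100, overlap_size=20)
print(len(chunks))  # 3 overlapping windows over 250 words
```

The overlap keeps a sentence that straddles a boundary retrievable from either chunk, at the cost of some index redundancy.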

Paragraph Chunking

Splits on double newlines:

sw-search ./docs \
  --chunking-strategy paragraph \
  --file-types md,txt,rst

Page Chunking

Best for PDFs:

sw-search ./docs \
  --chunking-strategy page \
  --file-types pdf

Semantic Chunking

Groups semantically similar sentences:

sw-search ./docs \
  --chunking-strategy semantic \
  --semantic-threshold 0.6

Topic Chunking

Detects topic changes:

sw-search ./docs \
  --chunking-strategy topic \
  --topic-threshold 0.2

QA Chunking

Optimized for question-answering:

sw-search ./docs --chunking-strategy qa

Markdown Chunking

The markdown strategy is specifically designed for documentation that contains code examples. It understands markdown structure and adds rich metadata for better search results.

sw-search ./docs \
  --chunking-strategy markdown \
  --file-types md

Features:

  • Header-based chunking: Splits at markdown headers (h1, h2, h3…) for natural boundaries
  • Code block detection: Identifies fenced code blocks and extracts language (python, bash, etc.)
  • Smart tagging: Adds "code" tags to chunks with code, plus language-specific tags
  • Section hierarchy: Preserves full path (e.g., “API Reference > AgentBase > Methods”)
  • Code protection: Never splits inside code blocks
  • Metadata enrichment: Header levels stored as searchable metadata

Example Metadata:

{
  "chunk_type": "markdown",
  "h1": "API Reference",
  "h2": "AgentBase",
  "h3": "add_skill Method",
  "has_code": true,
  "code_languages": ["python", "bash"],
  "tags": ["code", "code:python", "code:bash", "depth:3"]
}

Search Benefits:

When users search for “example code Python”:

  • Chunks with code blocks get automatic 20% boost
  • Python-specific code gets language match bonus
  • Vector similarity provides primary semantic ranking
  • Metadata tags provide confirmation signals
  • Results blend semantic + structural relevance
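The boosting described above can be pictured as a small post-scoring step applied on top of the semantic score. In this sketch the 20% code boost follows the documented behavior, while the language-bonus multiplier is an assumption for illustration:

```python
def boost_score(base_score, tags, query_language=None):
    """Apply structural boosts on top of a semantic similarity score.

    The 20% boost for code chunks matches the documented behavior;
    the 10% language-match bonus is a hypothetical value.
    """
    score = base_score
    if "code" in tags:
        score *= 1.20  # documented 20% boost for chunks containing code
    if query_language and f"code:{query_language}" in tags:
        score *= 1.10  # hypothetical bonus when the query names a language
    return score

# A Python code chunk outranks a prose chunk with the same base score
print(boost_score(0.5, ["code", "code:python", "depth:3"], "python"))
print(boost_score(0.5, ["depth:3"], "python"))
```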

Best Used With:

  • API documentation with code examples
  • Tutorial content with inline code
  • Technical guides with multiple languages
  • README files with usage examples

Usage with pgvector:

sw-search ./docs \
  --backend pgvector \
  --connection-string "postgresql://user:pass@localhost:5432/db" \
  --output docs_collection \
  --chunking-strategy markdown

JSON Chunking

The json strategy allows you to provide pre-chunked content in a structured format. This is useful when you need custom control over how documents are split and indexed.

Expected JSON Format:

{
  "chunks": [
    {
      "chunk_id": "unique_id",
      "type": "content",
      "content": "The actual text content",
      "metadata": {
        "section": "Introduction",
        "url": "https://example.com/docs/intro",
        "custom_field": "any_value"
      },
      "tags": ["intro", "getting-started"]
    }
  ]
}

Usage:

# First, preprocess your documents into JSON chunks
python your_preprocessor.py input.txt -o chunks.json

# Then build the index using the JSON strategy
sw-search chunks.json --chunking-strategy json --file-types json
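A minimal preprocessor that emits the expected format could look like the following sketch. The blank-line splitting rule, section names, and URL are placeholders for your own parsing logic:

```python
import json

def to_chunks(text, source_url="https://example.com/doc"):
    """Split text on blank lines and wrap each piece in the
    chunk schema expected by --chunking-strategy json."""
    pieces = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {
        "chunks": [
            {
                "chunk_id": f"chunk_{i}",
                "type": "content",
                "content": piece,
                "metadata": {"section": f"Section {i}", "url": source_url},
                "tags": ["preprocessed"],
            }
            for i, piece in enumerate(pieces)
        ]
    }

# Write the file that sw-search will ingest
with open("chunks.json", "w") as f:
    json.dump(to_chunks("Intro paragraph.\n\nSecond paragraph."), f, indent=2)
```

The resulting chunks.json can then be indexed with `sw-search chunks.json --chunking-strategy json --file-types json`.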

Best Used For:

  • API documentation with complex structure
  • Documents that need custom parsing logic
  • Preserving specific metadata relationships
  • Integration with external preprocessing tools

Model Selection

Choose an embedding model based on the speed vs. quality trade-off:

Alias  Model              Dims  Speed  Quality
mini   all-MiniLM-L6-v2   384   ~5x    Good
base   all-mpnet-base-v2  768   1x     High
large  all-mpnet-base-v2  768   1x     Highest

# Fast model (default, recommended for most cases)
sw-search ./docs --model mini

# Balanced model
sw-search ./docs --model base

# Best quality
sw-search ./docs --model large

# Full model name
sw-search ./docs --model sentence-transformers/all-mpnet-base-v2

File Filtering

# Specific file types
sw-search ./docs --file-types md,txt,rst,py

# Exclude patterns
sw-search ./docs --exclude "**/test/**,**/__pycache__/**,**/.git/**"

# Language filtering
sw-search ./docs --languages en,es,fr

Tags and Metadata

Add tags during build for filtered searching:

# Add tags to all chunks
sw-search ./docs --tags documentation,api,v2

# Filter by tags when searching
sw-search search index.swsearch "query" --tags documentation

Searching Indexes

# Search with query
sw-search search knowledge.swsearch "how to create an agent"

# Limit results
sw-search search knowledge.swsearch "API reference" --count 3

# Verbose output with scores
sw-search search knowledge.swsearch "configuration" --verbose

Search Options

Option                Default  Description
--count               5        Number of results
--distance-threshold  0.0      Minimum similarity score
--tags                (none)   Filter by tags
--query-nlp-backend   nltk     NLP backend: nltk or spacy
--keyword-weight      (auto)   Manual keyword weight (0.0-1.0)
--model               (index)  Override embedding model
--json                false    Output as JSON
--no-content          false    Hide content, show metadata only
--verbose             false    Detailed output

Output Formats

# Human-readable (default)
sw-search search knowledge.swsearch "query"

# JSON output
sw-search search knowledge.swsearch "query" --json

# Metadata only
sw-search search knowledge.swsearch "query" --no-content

# Full verbose output
sw-search search knowledge.swsearch "query" --verbose

Filter by Tags

# Single tag
sw-search search knowledge.swsearch "functions" --tags documentation

# Multiple tags
sw-search search knowledge.swsearch "API" --tags api,reference

Interactive Search Shell

Load index once and search multiple times:

sw-search search knowledge.swsearch --shell

Shell commands:

Command         Description
help            Show help
exit/quit/q     Exit shell
count=N         Set result count
tags=tag1,tag2  Set tag filter
verbose         Toggle verbose output
<query>         Search for query

Example session:

$ sw-search search knowledge.swsearch --shell
Search Shell - Index: knowledge.swsearch
Backend: sqlite
Index contains 1523 chunks from 47 files
Model: sentence-transformers/all-MiniLM-L6-v2
Type 'exit' or 'quit' to leave, 'help' for options
------------------------------------------------------------
search> how to create an agent
Found 5 result(s) for 'how to create an agent' (0.034s):
...
search> count=3
Result count set to: 3
search> SWAIG functions
Found 3 result(s) for 'SWAIG functions' (0.028s):
...
search> exit
Goodbye!

PostgreSQL/pgvector Backend

The search system supports multiple storage backends. Choose based on your deployment needs:

Backend Comparison

Feature                       SQLite            pgvector
Setup complexity              None              Requires PostgreSQL
Scalability                   Limited           Excellent
Concurrent access             Poor              Excellent
Update capability             Rebuild required  Real-time
Performance (small datasets)  Excellent         Good
Performance (large datasets)  Poor              Excellent
Deployment                    File copy         Database connection
Multi-agent support           Separate copies   Shared knowledge base

SQLite Backend (Default):

  • File-based .swsearch indexes
  • Portable single-file format
  • No external dependencies
  • Best for: Single-agent deployments, development, small to medium datasets

pgvector Backend:

  • Server-based PostgreSQL storage
  • Efficient similarity search with IVFFlat/HNSW indexes
  • Multiple agents can share the same knowledge base
  • Real-time updates without rebuilding
  • Best for: Production deployments, multi-agent systems, large datasets

Building with pgvector

# Build to pgvector
sw-search ./docs \
  --backend pgvector \
  --connection-string "postgresql://user:pass@localhost:5432/knowledge" \
  --output docs_collection

# With markdown strategy
sw-search ./docs \
  --backend pgvector \
  --connection-string "postgresql://user:pass@localhost:5432/knowledge" \
  --output docs_collection \
  --chunking-strategy markdown

# Overwrite existing collection
sw-search ./docs \
  --backend pgvector \
  --connection-string "postgresql://user:pass@localhost:5432/knowledge" \
  --output docs_collection \
  --overwrite

Search pgvector Collection

sw-search search docs_collection "how to create an agent" \
  --backend pgvector \
  --connection-string "postgresql://user:pass@localhost/knowledge"

Migration

Migrate indexes between backends:

# Get index information
sw-search migrate --info ./docs.swsearch

# Migrate SQLite to pgvector
sw-search migrate ./docs.swsearch --to-pgvector \
  --connection-string "postgresql://user:pass@localhost/db" \
  --collection-name docs_collection

# Migrate with overwrite
sw-search migrate ./docs.swsearch --to-pgvector \
  --connection-string "postgresql://user:pass@localhost/db" \
  --collection-name docs_collection \
  --overwrite

Migration Options

Option               Description
--info               Show index information
--to-pgvector        Migrate SQLite to pgvector
--to-sqlite          Migrate pgvector to SQLite (planned)
--connection-string  PostgreSQL connection string
--collection-name    Target collection name
--overwrite          Overwrite existing collection
--batch-size         Chunks per batch (default: 100)

Local vs Remote Modes

The search skill supports both local and remote operation modes.

Local Mode (Default)

Searches are performed directly in the agent process using the embedded search engine.

Pros:

  • Faster (no network latency)
  • Works offline
  • Simple deployment
  • Lower operational complexity

Cons:

  • Higher memory usage per agent
  • Index files must be distributed with each agent
  • Updates require redeploying agents

Configuration in Agent:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",  # Local file
    "nlp_backend": "nltk"
})

Remote Mode

Searches are performed via HTTP API to a centralized search server.

Pros:

  • Lower memory usage per agent
  • Centralized index management
  • Easy updates without redeploying agents
  • Better scalability for multiple agents
  • Shared resources

Cons:

  • Network dependency
  • Additional infrastructure complexity
  • Potential latency

Configuration in Agent:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "remote_url": "http://localhost:8001",  # Search server
    "index_name": "docs",
    "nlp_backend": "nltk"
})

Automatic Mode Detection

The skill automatically detects which mode to use:

  • If remote_url is provided → Remote mode
  • If index_file is provided → Local mode
  • Remote mode takes priority if both are specified
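The precedence rule amounts to a two-branch check; an illustrative sketch, not the skill's actual implementation:

```python
def detect_mode(config):
    """Pick the search mode from a skill config dict.
    Remote wins when both remote_url and index_file are present."""
    if config.get("remote_url"):
        return "remote"
    if config.get("index_file"):
        return "local"
    raise ValueError("native_vector_search needs remote_url or index_file")

print(detect_mode({"remote_url": "http://localhost:8001",
                   "index_file": "docs.swsearch"}))  # remote takes priority
```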

Running a Remote Search Server

  1. Start the search server:

     python examples/search_server_standalone.py

  2. The server provides an HTTP API:

     • POST /search - Search the indexes
     • GET /health - Health check and available indexes
     • POST /reload_index - Add or reload an index

  3. Test the API:

     curl -X POST "http://localhost:8001/search" \
       -H "Content-Type: application/json" \
       -d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'
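The same request can be made from Python using only the standard library. The payload fields mirror the curl example above; a server must already be running on the given URL for the commented call at the end to succeed:

```python
import json
import urllib.request

def build_payload(query, index_name, count=3):
    """Assemble the /search request body used by the search server."""
    return {"query": query, "index_name": index_name, "count": count}

def search_remote(base_url, query, index_name, count=3, timeout=30):
    """POST a search request to a running search server and return the JSON reply."""
    req = urllib.request.Request(
        f"{base_url}/search",
        data=json.dumps(build_payload(query, index_name, count)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)

# Requires a running server, e.g. search_server_standalone.py:
# results = search_remote("http://localhost:8001", "how to create an agent", "docs")
```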

Remote Search CLI

Search via remote API endpoint from the command line:

# Basic remote search
sw-search remote http://localhost:8001 "how to create an agent" \
  --index-name docs

# With options
sw-search remote localhost:8001 "API reference" \
  --index-name docs \
  --count 3 \
  --verbose

# JSON output
sw-search remote localhost:8001 "query" \
  --index-name docs \
  --json

Remote Options

Option                Default     Description
--index-name          (required)  Name of the index to search
--count               5           Number of results
--distance-threshold  0.0         Minimum similarity score
--tags                (none)      Filter by tags
--timeout             30          Request timeout in seconds
--json                false       Output as JSON
--no-content          false       Hide content
--verbose             false       Detailed output

Validation

Verify index integrity:

# Validate index
sw-search validate ./docs.swsearch

# Verbose validation
sw-search validate ./docs.swsearch --verbose

Output:

✓ Index is valid: ./docs.swsearch
Chunks: 1523
Files: 47
Configuration:
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_dimensions: 384
chunking_strategy: markdown
created_at: 2025-01-15T10:30:00

JSON Export

Export chunks for review or external processing:

# Export to single JSON file
sw-search ./docs \
  --output-format json \
  --output all_chunks.json

# Export to directory (one file per source)
sw-search ./docs \
  --output-format json \
  --output-dir ./chunks/

# Build index from exported JSON
sw-search ./chunks/ \
  --chunking-strategy json \
  --file-types json \
  --output final.swsearch

NLP Backend Selection

Choose NLP backend for processing:

Backend  Speed   Quality  Install Size
nltk     Fast    Good     Included
spacy    Slower  Better   Requires: pip install signalwire-agents[search-nlp]

# Index with NLTK (default)
sw-search ./docs --index-nlp-backend nltk

# Index with spaCy (better quality)
sw-search ./docs --index-nlp-backend spacy

# Query with NLTK
sw-search search index.swsearch "query" --query-nlp-backend nltk

# Query with spaCy
sw-search search index.swsearch "query" --query-nlp-backend spacy

Complete Configuration Example

sw-search ./docs ./examples README.md \
  --output ./knowledge.swsearch \
  --chunking-strategy sentence \
  --max-sentences-per-chunk 8 \
  --file-types md,txt,rst,py \
  --exclude "**/test/**,**/__pycache__/**" \
  --languages en,es,fr \
  --model sentence-transformers/all-mpnet-base-v2 \
  --tags documentation,api \
  --index-nlp-backend nltk \
  --validate \
  --verbose

Using with Skills

After building an index, use it with the native_vector_search skill:

from signalwire_agents import AgentBase

agent = AgentBase(name="search-agent")

# Add search skill with built index
agent.add_skill("native_vector_search", {
    "index_path": "./knowledge.swsearch",
    "tool_name": "search_docs",
    "tool_description": "Search the documentation"
})

Output Formats

Format    Extension   Description
swsearch  .swsearch   SQLite-based portable index (default)
json      .json       JSON export of chunks
pgvector  (database)  PostgreSQL with pgvector extension

Installation Requirements

The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:

Basic Search (~500MB)

pip install "signalwire-agents[search]"

Includes:

  • Core search functionality
  • Sentence transformers for embeddings
  • SQLite FTS5 for keyword search
  • Basic document processing (text, markdown)

Full Document Processing (~600MB)

pip install "signalwire-agents[search-full]"

Adds:

  • PDF processing (PyPDF2)
  • DOCX processing (python-docx)
  • HTML processing (BeautifulSoup4)
  • Additional file format support

Advanced NLP Features (~700MB)

pip install "signalwire-agents[search-nlp]"

Adds:

  • spaCy for advanced text processing
  • NLTK for linguistic analysis
  • Enhanced query preprocessing
  • Language detection

Additional Setup Required:

python -m spacy download en_core_web_sm

Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Only recommended if you have sufficient CPU power and can tolerate longer response times.

All Search Features (~700MB)

pip install "signalwire-agents[search-all]"

Includes everything above.

Additional Setup Required:

python -m spacy download en_core_web_sm

Query-Only Mode (~400MB)

pip install "signalwire-agents[search-queryonly]"

For agents that only need to query pre-built indexes without building new ones.

PostgreSQL Vector Support

pip install "signalwire-agents[pgvector]"

Adds PostgreSQL with pgvector extension support for production deployments.

NLP Backend Selection

You can choose which NLP backend to use for query processing:

Backend  Speed                Quality  Notes
nltk     Fast (~50-100ms)     Good     Default, good for most use cases
spacy    Slower (~150-300ms)  Better   Better POS tagging and entity recognition

Configure via --index-nlp-backend (build) or --query-nlp-backend (search) flags.

API Reference

For programmatic access to the search system, use the Python API directly.

SearchEngine Class

from signalwire_agents.search import SearchEngine

# Load an index
engine = SearchEngine("docs.swsearch")

# Perform search
results = engine.search(
    query_vector=[...],            # Optional: pre-computed query vector
    enhanced_text="search query",  # Enhanced query text
    count=5,                       # Number of results
    similarity_threshold=0.0,      # Minimum similarity score
    tags=["documentation"]         # Filter by tags
)

# Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")

IndexBuilder Class

from signalwire_agents.search import IndexBuilder

# Create index builder
builder = IndexBuilder(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_size=500,
    chunk_overlap=50,
    verbose=True
)

# Build index
builder.build_index(
    source_dir="./docs",
    output_file="docs.swsearch",
    file_types=["md", "txt"],
    exclude_patterns=["**/test/**"],
    tags=["documentation"]
)

Troubleshooting

Issue                        Solution
Search not available         pip install signalwire-agents[search]
pgvector errors              pip install signalwire-agents[pgvector]
PDF processing fails         pip install signalwire-agents[search-full]
spaCy not found              pip install signalwire-agents[search-nlp]
No results found             Try different chunking strategy
Poor search quality          Use --model base or larger chunks
Index too large              Use --model mini, reduce file types
Connection refused (remote)  Check search server is running