---
id: fe965e69-bac0-4bfe-8531-ed87e2f0487f
title: Cli Sw Search
sidebar-title: Cli Sw Search
slug: /python/reference/cli-sw-search
max-toc-depth: 3
---
## sw-search CLI
Command-line tool for building, searching, and managing vector search indexes for AI agent knowledge bases.
### Overview
The `sw-search` tool builds vector search indexes from documents for use with the `native_vector_search` skill.
**Capabilities:**
* Build indexes from documents (MD, TXT, PDF, DOCX, RST, PY)
* Multiple chunking strategies for different content types
* SQLite and PostgreSQL/pgvector storage backends
* Interactive search shell for index exploration
* Export chunks to JSON for review or external processing
* Migrate indexes between backends
* Search via remote API endpoints
### Architecture
The system provides:
* **Offline Search**: No external API calls or internet required
* **Hybrid Search**: Combines vector similarity and keyword search
* **Smart Chunking**: Intelligent document segmentation with context preservation
* **Advanced Query Processing**: NLP-enhanced query understanding
* **Flexible Deployment**: Local embedded mode or remote server mode
* **SQLite Storage**: Portable `.swsearch` index files
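As a rough illustration of how hybrid search can blend its two signals, the sketch below combines a cosine similarity with a simple keyword-overlap score. The function names, the overlap metric, and the 0.3 default weight are assumptions made for this example; they are not the tool's internal scoring.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def keyword_score(query, text):
    """Fraction of query terms found in the text (hypothetical metric)."""
    terms = set(query.lower().split())
    words = set(text.lower().split())
    if not terms:
        return 0.0
    return len(terms & words) / len(terms)

def hybrid_score(vector_sim, kw_score, keyword_weight=0.3):
    """Weighted blend of both signals; keyword_weight is in the range 0.0-1.0."""
    return (1.0 - keyword_weight) * vector_sim + keyword_weight * kw_score

sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])  # vectors point the same way
kw = keyword_score("create agent", "how to create an agent")
score = hybrid_score(sim, kw)
```

Raising `keyword_weight` favors exact term matches, while lowering it favors semantic similarity.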
### Command Modes
`sw-search` operates in five modes:
| Mode | Syntax | Purpose |
| -------- | ----------------------------- | ------------------------ |
| build | `sw-search ./docs` | Build search index |
| search | `sw-search search FILE QUERY` | Search existing index |
| validate | `sw-search validate FILE` | Validate index integrity |
| migrate | `sw-search migrate FILE` | Migrate between backends |
| remote | `sw-search remote URL QUERY` | Search via remote API |
### Quick Start
```bash
## Build index from documentation
sw-search ./docs --output knowledge.swsearch
## Search the index
sw-search search knowledge.swsearch "how to create an agent"
## Interactive search shell
sw-search search knowledge.swsearch --shell
## Validate index
sw-search validate knowledge.swsearch
```
### Building Indexes
#### Index Structure
Each `.swsearch` file is a SQLite database containing:
* **Document chunks** with embeddings and metadata
* **Full-text search index** (SQLite FTS5) for keyword search
* **Configuration** and model information
* **Synonym cache** for query expansion
This portable format allows you to build indexes once and distribute them with your agents.
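Because a `.swsearch` file is a SQLite database, any SQLite client can open it. The sketch below lists table names with Python's standard `sqlite3` module. It makes no assumption about the real schema: the throwaway database and its `example_chunks` table are hypothetical stand-ins for an actual index file.

```python
import os
import sqlite3
import tempfile

def list_tables(db_path):
    """Return the names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]

# Stand-in database for demonstration; the table name is hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.swsearch")
conn = sqlite3.connect(demo_path)
conn.execute("CREATE TABLE example_chunks (id INTEGER PRIMARY KEY, content TEXT)")
conn.commit()
conn.close()

tables = list_tables(demo_path)
print(tables)
```

To inspect a real index, pass its path instead, for example `list_tables("knowledge.swsearch")`.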
#### Basic Usage
```bash
## Build from single directory
sw-search ./docs
## Build from multiple directories
sw-search ./docs ./examples --file-types md,txt,py
## Build from individual files
sw-search README.md ./docs/guide.md ./src/main.py
## Mixed sources (directories and files)
sw-search ./docs README.md ./examples specific_file.txt
## Specify output file
sw-search ./docs --output ./knowledge.swsearch
```
#### Build Options
| Option | Default | Description |
| ------------------ | ---------------- | --------------------------- |
| `--output FILE` | sources.swsearch | Output file or collection |
| `--output-dir DIR` | (none) | Output directory |
| `--output-format` | index | Output: index or json |
| `--backend` | sqlite | Storage: sqlite or pgvector |
| `--file-types` | md,txt,rst | Comma-separated extensions |
| `--exclude` | (none) | Glob patterns to exclude |
| `--languages` | en | Language codes |
| `--tags` | (none) | Tags for all chunks |
| `--validate` | false | Validate after building |
| `--verbose` | false | Detailed output |
### Chunking Strategies
Choose the right strategy for your content:
| Strategy | Best For | Key Options |
| --------- | ------------------------------ | -------------------------------- |
| sentence | General prose, articles | `--max-sentences-per-chunk` |
| sliding | Code, technical documentation | `--chunk-size`, `--overlap-size` |
| paragraph | Structured documents | (none) |
| page | PDFs with distinct pages | (none) |
| semantic | Coherent topic grouping | `--semantic-threshold` |
| topic | Long documents by subject | `--topic-threshold` |
| qa | Question-answering apps | (none) |
| markdown | Documentation with code blocks | (preserves structure) |
| json | Pre-chunked content | (none) |
#### Sentence Chunking (Default)
Groups sentences together:
```bash
## Default: 5 sentences per chunk
sw-search ./docs --chunking-strategy sentence
## Custom sentence count
sw-search ./docs \
--chunking-strategy sentence \
--max-sentences-per-chunk 10
## Split on multiple newlines
sw-search ./docs \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--split-newlines 2
```
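To make the grouping concrete, here is a minimal sketch of sentence chunking. It assumes a naive regex sentence splitter, whereas the tool itself may rely on its NLP backend for sentence detection; `sentence_chunks` is a hypothetical helper, not part of the CLI.

```python
import re

def sentence_chunks(text, max_sentences_per_chunk=5):
    """Group sentences into chunks of at most max_sentences_per_chunk."""
    # Naive splitter: break after ., !, or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + max_sentences_per_chunk])
        for i in range(0, len(sentences), max_sentences_per_chunk)
    ]

text = "One. Two! Three? Four. Five. Six. Seven."
chunks = sentence_chunks(text, max_sentences_per_chunk=3)
```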
#### Sliding Window Chunking
Fixed-size chunks with overlap:
```bash
sw-search ./docs \
--chunking-strategy sliding \
--chunk-size 100 \
--overlap-size 20
```
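The following sketch shows the general sliding-window idea. It assumes that `--chunk-size` and `--overlap-size` are measured in words, which the options above do not state explicitly; `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text, chunk_size=100, overlap_size=20):
    """Split text into word windows of chunk_size that overlap by overlap_size."""
    if overlap_size >= chunk_size:
        raise ValueError("overlap_size must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap_size  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reaches the end of the text
    return chunks

numbers = " ".join(str(n) for n in range(10))
windows = sliding_window_chunks(numbers, chunk_size=4, overlap_size=2)
```

Each window repeats the last `overlap_size` words of the previous one, so content that straddles a boundary appears whole in at least one chunk.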
#### Paragraph Chunking
Splits on double newlines:
```bash
sw-search ./docs \
--chunking-strategy paragraph \
--file-types md,txt,rst
```
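A minimal sketch of splitting on double newlines follows. Treating whitespace-only lines as paragraph breaks is an assumption of this example, and `paragraph_chunks` is a hypothetical helper.

```python
import re

def paragraph_chunks(text):
    """Split text on blank lines (two or more consecutive newlines)."""
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]

doc = "First paragraph.\nStill first.\n\nSecond paragraph.\n\n\nThird."
paragraphs = paragraph_chunks(doc)
```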
#### Page Chunking
Best for PDFs:
```bash
sw-search ./docs \
--chunking-strategy page \
--file-types pdf
```
#### Semantic Chunking
Groups semantically similar sentences:
```bash
sw-search ./docs \
--chunking-strategy semantic \
--semantic-threshold 0.6
```
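One way to picture the threshold is the sketch below, which starts a new chunk whenever the similarity between adjacent sentences drops below it. The hand-written two-dimensional vectors stand in for real sentence embeddings, and comparing only adjacent sentences is an assumption; the tool's actual grouping logic may differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embeddings, threshold=0.6):
    """Start a new chunk when adjacent-sentence similarity falls below threshold."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) >= threshold:
            chunks[-1].append(sentences[i])
        else:
            chunks.append([sentences[i]])
    return [" ".join(chunk) for chunk in chunks]

sentences = ["Agents answer calls.", "Agents use skills.", "Indexes store chunks."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # illustrative values only
grouped = semantic_chunks(sentences, toy_embeddings, threshold=0.6)
```

In this toy setup, a higher threshold produces more, smaller chunks because fewer neighboring sentences count as similar.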
#### Topic Chunking
Detects topic changes:
```bash
sw-search ./docs \
--chunking-strategy topic \
--topic-threshold 0.2
```
#### QA Chunking
Optimized for question-answering:
```bash
sw-search ./docs --chunking-strategy qa
```
#### Markdown Chunking
The `markdown` strategy is specifically designed for documentation that contains code examples. It understands markdown structure and adds rich metadata for better search results.
```bash
sw-search ./docs \
--chunking-strategy markdown \
--file-types md
```
**Features:**
* **Header-based chunking**: Splits at markdown headers (h1, h2, h3...) for natural boundaries
* **Code block detection**: Identifies fenced code blocks and extracts the language (`python`, `bash`, etc.)
* **Smart tagging**: Adds `"code"` tags to chunks with code, plus language-specific tags
* **Section hierarchy**: Preserves full path (e.g., "API Reference > AgentBase > Methods")
* **Code protection**: Never splits inside code blocks
* **Metadata enrichment**: Header levels stored as searchable metadata
**Example Metadata:**
```json
{
"chunk_type": "markdown",
"h1": "API Reference",
"h2": "AgentBase",
"h3": "add_skill Method",
"has_code": true,
"code_languages": ["python", "bash"],
"tags": ["code", "code:python", "code:bash", "depth:3"]
}
```
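The sketch below illustrates header-based splitting with code-block protection and produces metadata shaped like the example above. It is a simplified illustration, not the tool's implementation: `markdown_chunks` is a hypothetical helper, and the `depth:N` tag is omitted.

```python
import re

FENCE = "`" * 3  # code fence marker, built this way so the example nests cleanly

def markdown_chunks(text):
    """Split markdown at headers, never inside fenced code blocks."""
    chunks, current, path, in_code = [], None, {}, False

    def new_chunk():
        return {"chunk_type": "markdown", **path, "lines": [],
                "has_code": False, "code_languages": []}

    for line in text.splitlines():
        if line.startswith(FENCE):
            if current is None:
                current = new_chunk()
            if not in_code:  # opening fence: record the language, if any
                lang = line[len(FENCE):].strip()
                current["has_code"] = True
                if lang and lang not in current["code_languages"]:
                    current["code_languages"].append(lang)
            in_code = not in_code
            current["lines"].append(line)
            continue
        header = None if in_code else re.match(r"^(#{1,6})\s+(.*)$", line)
        if header:
            level = len(header.group(1))
            # Drop headers at this level or deeper, then record the new one.
            path = {k: v for k, v in path.items() if int(k[1:]) < level}
            path[f"h{level}"] = header.group(2).strip()
            if current is not None:
                chunks.append(current)
            current = new_chunk()
            continue
        if current is None:
            current = new_chunk()
        current["lines"].append(line)
    if current is not None:
        chunks.append(current)

    finished = []
    for chunk in chunks:
        chunk["content"] = "\n".join(chunk.pop("lines")).strip()
        if not chunk["content"]:
            continue  # skip headers that had no body text
        languages = chunk["code_languages"]
        chunk["tags"] = (["code"] + [f"code:{name}" for name in languages]
                         if chunk["has_code"] else [])
        finished.append(chunk)
    return finished

sample = "\n".join([
    "# API Reference", "## AgentBase", "Intro text.",
    FENCE + "python", "# not a header, just a comment", "agent.add_skill('x')", FENCE,
    "## Other", "More text.",
])
result = markdown_chunks(sample)
```

The `#` comment inside the fenced block stays in the `AgentBase` chunk because header detection is switched off while a fence is open.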
**Search Benefits:**
When users search for "example code Python":
* Chunks with code blocks get an automatic 20% boost
* Python-specific code gets a language-match bonus
* Vector similarity provides primary semantic ranking
* Metadata tags provide confirmation signals
* Results blend semantic + structural relevance
**Best Used With:**
* API documentation with code examples
* Tutorial content with inline code
* Technical guides with multiple languages
* README files with usage examples
**Usage with pgvector:**
```bash
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/db" \
--output docs_collection \
--chunking-strategy markdown
```
#### JSON Chunking
The `json` strategy allows you to provide pre-chunked content in a structured format. This is useful when you need custom control over how documents are split and indexed.
**Expected JSON Format:**
```json
{
"chunks": [
{
"chunk_id": "unique_id",
"type": "content",
"content": "The actual text content",
"metadata": {
"section": "Introduction",
"url": "https://example.com/docs/intro",
"custom_field": "any_value"
},
"tags": ["intro", "getting-started"]
}
]
}
```
**Usage:**
```bash
## First preprocess your documents into JSON chunks
python your_preprocessor.py input.txt -o chunks.json
## Then build the index using JSON strategy
sw-search chunks.json --chunking-strategy json --file-types json
```
**Best Used For:**
* API documentation with complex structure
* Documents that need custom parsing logic
* Preserving specific metadata relationships
* Integration with external preprocessing tools
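As an illustration of the preprocessing step, the sketch below produces a payload in the expected JSON format from plain text. The `build_chunks` helper, its paragraph-based splitting, and the ID scheme are hypothetical choices; a real preprocessor would apply whatever parsing logic your documents need.

```python
import json

def build_chunks(text, section, url):
    """Turn blank-line-separated paragraphs into the expected chunk structure."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return {
        "chunks": [
            {
                "chunk_id": f"{section.lower()}-{i}",  # hypothetical ID scheme
                "type": "content",
                "content": paragraph,
                "metadata": {"section": section, "url": url},
                "tags": [section.lower()],
            }
            for i, paragraph in enumerate(paragraphs)
        ]
    }

payload = build_chunks(
    "Welcome.\n\nInstall the SDK.",
    "Introduction",
    "https://example.com/docs/intro",
)
serialized = json.dumps(payload, indent=2)
# To write the file consumed by `--chunking-strategy json`:
# with open("chunks.json", "w", encoding="utf-8") as f:
#     f.write(serialized)
```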
### Model Selection
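Choose an embedding model based on the trade-off between speed and quality. Per the table, the `base` and `large` aliases both resolve to `all-mpnet-base-v2`: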
| Alias | Model | Dims | Speed | Quality |
| ----- | ----------------- | ---- | ----- | ------- |
| mini | all-MiniLM-L6-v2 | 384 | \~5x | Good |
| base | all-mpnet-base-v2 | 768 | 1x | High |
| large | all-mpnet-base-v2 | 768 | 1x | Highest |
```bash
## Fast model (default, recommended for most cases)
sw-search ./docs --model mini
## Balanced model
sw-search ./docs --model base
## Best quality
sw-search ./docs --model large
## Full model name
sw-search ./docs --model sentence-transformers/all-mpnet-base-v2
```
### File Filtering
```bash
## Specific file types
sw-search ./docs --file-types md,txt,rst,py
## Exclude patterns
sw-search ./docs --exclude "**/test/**,**/__pycache__/**,**/.git/**"
## Language filtering
sw-search ./docs --languages en,es,fr
```
### Tags and Metadata
Add tags during build for filtered searching:
```bash
## Add tags to all chunks
sw-search ./docs --tags documentation,api,v2
## Filter by tags when searching
sw-search search index.swsearch "query" --tags documentation
```
### Searching Indexes
#### Basic Search
```bash
## Search with query
sw-search search knowledge.swsearch "how to create an agent"
## Limit results
sw-search search knowledge.swsearch "API reference" --count 3
## Verbose output with scores
sw-search search knowledge.swsearch "configuration" --verbose
```
#### Search Options
| Option | Default | Description |
| ---------------------- | ------- | -------------------------------- |
| `--count` | 5 | Number of results |
| `--distance-threshold` | 0.0 | Minimum similarity score |
| `--tags` | (none) | Filter by tags |
| `--query-nlp-backend` | nltk | NLP backend: nltk or spacy |
| `--keyword-weight` | (auto) | Manual keyword weight (0.0-1.0) |
| `--model` | (index) | Override embedding model |
| `--json` | false | Output as JSON |
| `--no-content` | false | Hide content, show metadata only |
| `--verbose` | false | Detailed output |
#### Output Formats
```bash
## Human-readable (default)
sw-search search knowledge.swsearch "query"
## JSON output
sw-search search knowledge.swsearch "query" --json
## Metadata only
sw-search search knowledge.swsearch "query" --no-content
## Full verbose output
sw-search search knowledge.swsearch "query" --verbose
```
#### Filter by Tags
```bash
## Single tag
sw-search search knowledge.swsearch "functions" --tags documentation
## Multiple tags
sw-search search knowledge.swsearch "API" --tags api,reference
```
### Interactive Search Shell
Load index once and search multiple times:
```bash
sw-search search knowledge.swsearch --shell
```
Shell commands:
| Command | Description |
| ----------------- | --------------------- |
| `help` | Show help |
| `exit`/`quit`/`q` | Exit shell |
| `count=N` | Set result count |
| `tags=tag1,tag2` | Set tag filter |
| `verbose` | Toggle verbose output |
| (any other text) | Search for query |
Example session:
```
$ sw-search search knowledge.swsearch --shell
Search Shell - Index: knowledge.swsearch
Backend: sqlite
Index contains 1523 chunks from 47 files
Model: sentence-transformers/all-MiniLM-L6-v2
Type 'exit' or 'quit' to leave, 'help' for options
------------------------------------------------------------
search> how to create an agent
Found 5 result(s) for 'how to create an agent' (0.034s):
...
search> count=3
Result count set to: 3
search> SWAIG functions
Found 3 result(s) for 'SWAIG functions' (0.028s):
...
search> exit
Goodbye!
```
### PostgreSQL/pgvector Backend
The search system supports multiple storage backends. Choose based on your deployment needs:
#### Backend Comparison
| Feature | SQLite | pgvector |
| ---------------------------- | ---------------- | --------------------- |
| Setup complexity | None | Requires PostgreSQL |
| Scalability | Limited | Excellent |
| Concurrent access | Poor | Excellent |
| Update capability | Rebuild required | Real-time |
| Performance (small datasets) | Excellent | Good |
| Performance (large datasets) | Poor | Excellent |
| Deployment | File copy | Database connection |
| Multi-agent support | Separate copies | Shared knowledge base |
**SQLite Backend (Default):**
* File-based `.swsearch` indexes
* Portable single-file format
* No external dependencies
* Best for: Single-agent deployments, development, small to medium datasets
**pgvector Backend:**
* Server-based PostgreSQL storage
* Efficient similarity search with IVFFlat/HNSW indexes
* Multiple agents can share the same knowledge base
* Real-time updates without rebuilding
* Best for: Production deployments, multi-agent systems, large datasets
#### Building with pgvector
```bash
## Build to pgvector
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection
## With markdown strategy
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection \
--chunking-strategy markdown
## Overwrite existing collection
sw-search ./docs \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost:5432/knowledge" \
--output docs_collection \
--overwrite
```
#### Search pgvector Collection
```bash
sw-search search docs_collection "how to create an agent" \
--backend pgvector \
--connection-string "postgresql://user:pass@localhost/knowledge"
```
### Migration
Migrate indexes between backends:
```bash
## Get index information
sw-search migrate --info ./docs.swsearch
## Migrate SQLite to pgvector
sw-search migrate ./docs.swsearch --to-pgvector \
--connection-string "postgresql://user:pass@localhost/db" \
--collection-name docs_collection
## Migrate with overwrite
sw-search migrate ./docs.swsearch --to-pgvector \
--connection-string "postgresql://user:pass@localhost/db" \
--collection-name docs_collection \
--overwrite
```
#### Migration Options
| Option | Description |
| --------------------- | ------------------------------------ |
| `--info` | Show index information |
| `--to-pgvector` | Migrate SQLite to pgvector |
| `--to-sqlite` | Migrate pgvector to SQLite (planned) |
| `--connection-string` | PostgreSQL connection string |
| `--collection-name` | Target collection name |
| `--overwrite` | Overwrite existing collection |
| `--batch-size` | Chunks per batch (default: 100) |
### Local vs Remote Modes
The search skill supports both local and remote operation modes.
#### Local Mode (Default)
Searches are performed directly in the agent process using the embedded search engine.
**Pros:**
* Faster (no network latency)
* Works offline
* Simple deployment
* Lower operational complexity
**Cons:**
* Higher memory usage per agent
* Index files must be distributed with each agent
* Updates require redeploying agents
**Configuration in Agent:**
```python
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"index_file": "docs.swsearch", # Local file
"nlp_backend": "nltk"
})
```
#### Remote Mode
Searches are performed via HTTP API to a centralized search server.
**Pros:**
* Lower memory usage per agent
* Centralized index management
* Easy updates without redeploying agents
* Better scalability for multiple agents
* Shared resources
**Cons:**
* Network dependency
* Additional infrastructure complexity
* Potential latency
**Configuration in Agent:**
```python
self.add_skill("native_vector_search", {
"tool_name": "search_docs",
"remote_url": "http://localhost:8001", # Search server
"index_name": "docs",
"nlp_backend": "nltk"
})
```
#### Automatic Mode Detection
The skill automatically detects which mode to use:
* If `remote_url` is provided → Remote mode
* If `index_file` is provided → Local mode
* Remote mode takes priority if both are specified
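The precedence rules above can be sketched as a small function. `detect_mode` is a hypothetical helper for illustration, and raising an error when neither key is present is an assumption; the skill's actual behavior in that case is not documented here.

```python
def detect_mode(config):
    """Return 'remote' or 'local' following the precedence rules above."""
    if config.get("remote_url"):
        return "remote"  # remote wins even if index_file is also set
    if config.get("index_file"):
        return "local"
    raise ValueError("either remote_url or index_file must be provided")

local_mode = detect_mode({"index_file": "docs.swsearch"})
remote_mode = detect_mode({"remote_url": "http://localhost:8001", "index_name": "docs"})
both_mode = detect_mode({"remote_url": "http://localhost:8001",
                         "index_file": "docs.swsearch"})
```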
#### Running a Remote Search Server
1. **Start the search server:**
```bash
python examples/search_server_standalone.py
```
2. **The server provides HTTP API:**
* `POST /search` - Search the indexes
* `GET /health` - Health check and available indexes
* `POST /reload_index` - Add or reload an index
3. **Test the API:**
```bash
curl -X POST "http://localhost:8001/search" \
-H "Content-Type: application/json" \
-d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'
```
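The same request can be built from Python with the standard library. This sketch mirrors the curl call above; `build_search_request` and `run_search` are hypothetical helpers, and the shape of the JSON response is not assumed.

```python
import json
import urllib.request

def build_search_request(base_url, query, index_name, count=3):
    """Build a POST /search request with the same JSON body as the curl call."""
    body = json.dumps({"query": query, "index_name": index_name, "count": count})
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/search",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def run_search(request, timeout=30):
    """Send the request and decode the JSON response (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_search_request(
    "http://localhost:8001", "how to create an agent", "docs"
)
# results = run_search(request)  # uncomment when the search server is running
```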
### Remote Search CLI
Search via remote API endpoint from the command line:
```bash
## Basic remote search
sw-search remote http://localhost:8001 "how to create an agent" \
--index-name docs
## With options
sw-search remote localhost:8001 "API reference" \
--index-name docs \
--count 3 \
--verbose
## JSON output
sw-search remote localhost:8001 "query" \
--index-name docs \
--json
```
#### Remote Options
| Option | Default | Description |
| ---------------------- | ---------- | --------------------------- |
| `--index-name` | (required) | Name of the index to search |
| `--count` | 5 | Number of results |
| `--distance-threshold` | 0.0 | Minimum similarity score |
| `--tags` | (none) | Filter by tags |
| `--timeout` | 30 | Request timeout in seconds |
| `--json` | false | Output as JSON |
| `--no-content` | false | Hide content |
| `--verbose` | false | Detailed output |
### Validation
Verify index integrity:
```bash
## Validate index
sw-search validate ./docs.swsearch
## Verbose validation
sw-search validate ./docs.swsearch --verbose
```
Output:
```
✓ Index is valid: ./docs.swsearch
Chunks: 1523
Files: 47
Configuration:
embedding_model: sentence-transformers/all-MiniLM-L6-v2
embedding_dimensions: 384
chunking_strategy: markdown
created_at: 2025-01-15T10:30:00
```
### JSON Export
Export chunks for review or external processing:
```bash
## Export to single JSON file
sw-search ./docs \
--output-format json \
--output all_chunks.json
## Export to directory (one file per source)
sw-search ./docs \
--output-format json \
--output-dir ./chunks/
## Build index from exported JSON
sw-search ./chunks/ \
--chunking-strategy json \
--file-types json \
--output final.swsearch
```
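For a quick review of exported chunks, a summary like the one below can help spot empty or oversized chunks before building the final index. It assumes the export uses the same top-level `chunks` layout as the JSON chunking format, which is not confirmed here; `summarize_chunks` is a hypothetical helper.

```python
import json

def summarize_chunks(data):
    """Report chunk count and content-length extremes for a chunk payload."""
    chunks = data.get("chunks", [])
    lengths = [len(chunk.get("content", "")) for chunk in chunks]
    return {
        "count": len(chunks),
        "min_chars": min(lengths, default=0),
        "max_chars": max(lengths, default=0),
        "empty": sum(1 for n in lengths if n == 0),
    }

# Inline sample standing in for json.load(open("all_chunks.json")).
sample_export = json.loads(
    '{"chunks": [{"content": "Short."}, {"content": ""}, {"content": "A longer chunk."}]}'
)
summary = summarize_chunks(sample_export)
```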
### NLP Backend Selection
Choose NLP backend for processing:
| Backend | Speed  | Quality | Installation                                            |
| ------- | ------ | ------- | ------------------------------------------------------- |
| nltk    | Fast   | Good    | Included                                                |
| spacy   | Slower | Better  | Requires: `pip install "signalwire-agents[search-nlp]"` |
```bash
## Index with NLTK (default)
sw-search ./docs --index-nlp-backend nltk
## Index with spaCy (better quality)
sw-search ./docs --index-nlp-backend spacy
## Query with NLTK
sw-search search index.swsearch "query" --query-nlp-backend nltk
## Query with spaCy
sw-search search index.swsearch "query" --query-nlp-backend spacy
```
### Complete Configuration Example
```bash
sw-search ./docs ./examples README.md \
--output ./knowledge.swsearch \
--chunking-strategy sentence \
--max-sentences-per-chunk 8 \
--file-types md,txt,rst,py \
--exclude "**/test/**,**/__pycache__/**" \
--languages en,es,fr \
--model sentence-transformers/all-mpnet-base-v2 \
--tags documentation,api \
--index-nlp-backend nltk \
--validate \
--verbose
```
### Using with Skills
After building an index, use it with the `native_vector_search` skill:
```python
from signalwire_agents import AgentBase
agent = AgentBase(name="search-agent")
## Add search skill with built index
agent.add_skill("native_vector_search", {
"index_file": "./knowledge.swsearch",
"tool_name": "search_docs",
"tool_description": "Search the documentation"
})
```
### Output Formats
| Format | Extension | Description |
| -------- | ---------- | ------------------------------------- |
| swsearch | .swsearch | SQLite-based portable index (default) |
| json | .json | JSON export of chunks |
| pgvector | (database) | PostgreSQL with pgvector extension |
### Installation Requirements
The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:
#### Basic Search (\~500MB)
```bash
pip install "signalwire-agents[search]"
```
**Includes:**
* Core search functionality
* Sentence transformers for embeddings
* SQLite FTS5 for keyword search
* Basic document processing (text, markdown)
#### Full Document Processing (\~600MB)
```bash
pip install "signalwire-agents[search-full]"
```
**Adds:**
* PDF processing (PyPDF2)
* DOCX processing (python-docx)
* HTML processing (BeautifulSoup4)
* Additional file format support
#### Advanced NLP Features (\~700MB)
```bash
pip install "signalwire-agents[search-nlp]"
```
**Adds:**
* spaCy for advanced text processing
* NLTK for linguistic analysis
* Enhanced query preprocessing
* Language detection
**Additional Setup Required:**
```bash
python -m spacy download en_core_web_sm
```
**Performance Note:** Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. Enable them only if you have spare CPU capacity and can tolerate longer response times.
#### All Search Features (\~700MB)
```bash
pip install "signalwire-agents[search-all]"
```
**Includes everything above.**
**Additional Setup Required:**
```bash
python -m spacy download en_core_web_sm
```
#### Query-Only Mode (\~400MB)
```bash
pip install "signalwire-agents[search-queryonly]"
```
For agents that only need to query pre-built indexes without building new ones.
#### PostgreSQL Vector Support
```bash
pip install "signalwire-agents[pgvector]"
```
Adds PostgreSQL with pgvector extension support for production deployments.
#### NLP Backend Selection
You can choose which NLP backend to use for query processing:
| Backend | Speed | Quality | Notes |
| ------- | -------------------- | ------- | ----------------------------------------- |
| nltk | Fast (\~50-100ms) | Good | Default, good for most use cases |
| spacy | Slower (\~150-300ms) | Better | Better POS tagging and entity recognition |
Configure via `--index-nlp-backend` (build) or `--query-nlp-backend` (search) flags.
### API Reference
For programmatic access to the search system, use the Python API directly.
#### SearchEngine Class
```python
from signalwire_agents.search import SearchEngine
## Load an index
engine = SearchEngine("docs.swsearch")
## Perform search
results = engine.search(
query_vector=[...], # Optional: pre-computed query vector
enhanced_text="search query", # Enhanced query text
count=5, # Number of results
similarity_threshold=0.0, # Minimum similarity score
tags=["documentation"] # Filter by tags
)
## Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")
```
#### IndexBuilder Class
```python
from signalwire_agents.search import IndexBuilder
## Create index builder
builder = IndexBuilder(
model_name="sentence-transformers/all-mpnet-base-v2",
chunk_size=500,
chunk_overlap=50,
verbose=True
)
## Build index
builder.build_index(
source_dir="./docs",
output_file="docs.swsearch",
file_types=["md", "txt"],
exclude_patterns=["**/test/**"],
tags=["documentation"]
)
```
### Troubleshooting
| Issue                       | Solution                                       |
| --------------------------- | ---------------------------------------------- |
| Search not available        | `pip install "signalwire-agents[search]"`      |
| pgvector errors             | `pip install "signalwire-agents[pgvector]"`    |
| PDF processing fails        | `pip install "signalwire-agents[search-full]"` |
| spaCy not found             | `pip install "signalwire-agents[search-nlp]"`  |
| No results found            | Try a different chunking strategy              |
| Poor search quality         | Use `--model base` or larger chunks            |
| Index too large             | Use `--model mini`, reduce file types          |
| Connection refused (remote) | Check that the search server is running        |
### Related Documentation
* [native\_vector\_search Skill](/docs/agents-sdk/python/guides/builtin-skills#native_vector_search) - Using search indexes in agents
* [Skills Overview](/docs/agents-sdk/python/guides/understanding-skills) - Adding skills to agents
* [DataSphere Integration](/docs/agents-sdk/python/guides/builtin-skills#datasphere) - Cloud-based search alternative