SpiderSkill


Fast web scraping and crawling. Extracts text, markdown, or structured data from any public URL, optionally following links up to a bounded depth. Uses cheerio for parsing and enforces an SSRF guard on crawl hops.

Class: SpiderSkill

Tools: scrape_url, crawl_site, extract_structured_data (each is prefixed with <tool_name>_ when tool_name is set).

Required packages: cheerio

Env vars: SWML_ALLOW_PRIVATE_URLS=true relaxes the SSRF guard for local testing.
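A minimal sketch of enabling the relaxed guard for local testing; whether the variable is read at startup or per request is an assumption, so prefer setting it in the environment before the process starts.

// Local testing only: allow crawl hops to private addresses such as
// http://localhost:3000. Never enable this in production.
// Preferably set in the shell before launching the agent:
//   SWML_ALLOW_PRIVATE_URLS=true node agent.js
// Setting it from code before the skill is added may also work (assumption):
process.env.SWML_ALLOW_PRIVATE_URLS = 'true';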

Multi-instance: yes — set a distinct tool_name per instance.

tool_name
string

Prefix prepended to each emitted tool name (e.g., tool_name="news" gives news_scrape_url, news_crawl_site, news_extract_structured_data). Required when registering multiple instances on the same agent.
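A sketch of two instances on one agent, following the constructor and addSkill usage from the Example section below; option names match the parameters documented on this page.

import { AgentBase, SpiderSkill } from '@signalwire/sdk';

const agent = new AgentBase({ name: 'researcher', route: '/researcher' });

// First instance: emits news_scrape_url, news_crawl_site, news_extract_structured_data.
await agent.addSkill(new SpiderSkill({
  tool_name: 'news',
  extract_type: 'markdown',
}));

// Second instance: emits docs_scrape_url, docs_crawl_site, docs_extract_structured_data.
await agent.addSkill(new SpiderSkill({
  tool_name: 'docs',
  max_pages: 10,
  max_depth: 1,
}));

agent.run();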

delay
number (defaults to 0.1)

Delay between requests in seconds (minimum 0).

concurrent_requests
integer (defaults to 5)

Number of concurrent requests allowed (range 1-20).

timeout
integer (defaults to 5)

Per-request timeout in seconds (range 1-60).

max_pages
integer (defaults to 1)

Maximum number of pages to scrape (range 1-100).

max_depth
integer (defaults to 0)

Maximum crawl depth. 0 restricts to a single page; range 0-5.

extract_type
string (defaults to fast_text)

Content extraction method. One of "fast_text", "clean_text", "full_text", "html", "markdown", "structured", "custom". Only fast_text, markdown, and structured are wired through the handlers in the TypeScript port; the others fall back to fast_text.

max_text_length
integer (defaults to 3000)

Maximum extracted text length in characters (range 100-100000).

clean_text
boolean (defaults to true)

Whether to clean extracted text (trim whitespace, collapse runs, etc.).

selectors
Record<string, string> (defaults to {})

Map of name → CSS selector used for structured extraction.
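A sketch of structured extraction; the assumption here is that each selector's text content is returned under its map key when extract_type is "structured".

import { AgentBase, SpiderSkill } from '@signalwire/sdk';

const agent = new AgentBase({ name: 'scraper', route: '/scraper' });

// Structured extraction: each named CSS selector is resolved against the page,
// and the matched text is reported under the corresponding name.
await agent.addSkill(new SpiderSkill({
  extract_type: 'structured',
  selectors: {
    title: 'h1',
    author: '.byline a',
    published: 'time[datetime]',
  },
}));

agent.run();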

follow_patterns
string[] (defaults to [])

URL patterns to follow when crawling.

user_agent
string

User-Agent header for outbound requests. Defaults to a Chrome-compatible UA string.

headers
Record<string, string> (defaults to {})

Additional HTTP headers sent with each request.
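A sketch of request tuning that combines the identity and politeness parameters above; the User-Agent string and header values are placeholders.

import { AgentBase, SpiderSkill } from '@signalwire/sdk';

const agent = new AgentBase({ name: 'crawler', route: '/crawler' });

// Identify the crawler, add a custom header, and keep the request rate modest.
await agent.addSkill(new SpiderSkill({
  user_agent: 'ExampleBot/1.0 (+https://example.com/bot)',
  headers: { 'Accept-Language': 'en-US' },
  delay: 0.5,              // seconds between requests
  concurrent_requests: 2,  // range 1-20
  timeout: 10,             // per-request timeout in seconds
}));

agent.run();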

follow_robots_txt
boolean (defaults to false)

Whether to respect robots.txt. Defaults to false to match Python’s runtime behavior.

cache_enabled
boolean (defaults to true)

Whether to cache scraped pages in memory.

Example

import { AgentBase, SpiderSkill } from '@signalwire/sdk';

const agent = new AgentBase({ name: 'assistant', route: '/assistant' });
agent.setPromptText('You are a research assistant.');

await agent.addSkill(new SpiderSkill({
  extract_type: 'markdown',
  max_pages: 5,
  max_depth: 1,
  follow_patterns: ['/docs/'],
}));

agent.run();