SpiderSkill
Fast web scraping and crawling. Extracts text, markdown, or structured data
from any public URL, optionally following links up to a bounded depth. Uses
cheerio for parsing and enforces an SSRF guard on crawl hops.
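For orientation, a minimal sketch of a single scrape hop, assuming Node 18+
for the global fetch: load the page, parse it with cheerio, and return the
visible text. The skill's real pipeline layers limits, cleaning, and the
SSRF guard on top of this.

```typescript
import * as cheerio from "cheerio";

// Minimal single-page scrape: fetch, parse with cheerio, return the
// visible text. Illustrative only; limits and guards are omitted.
async function scrapeText(url: string): Promise<string> {
  const res = await fetch(url); // global fetch, Node 18+
  const $ = cheerio.load(await res.text());
  $("script, style, noscript").remove(); // drop non-visible nodes
  return $("body").text().replace(/\s+/g, " ").trim();
}
```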
Class: SpiderSkill
Tools: scrape_url, crawl_site, extract_structured_data (each is
prefixed with <tool_name>_ when tool_name is set).
Required packages: cheerio
Env vars: SWML_ALLOW_PRIVATE_URLS=true relaxes the SSRF guard for local
testing.
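As a rough illustration of what the guard checks (the skill's exact rules
may differ), a gate of this shape rejects loopback and private-range hosts
unless the override is set:

```typescript
import { isIP } from "node:net";

// Sketch of an SSRF gate: reject loopback/private hosts unless
// SWML_ALLOW_PRIVATE_URLS=true. The real rules may differ.
function isAllowedUrl(raw: string): boolean {
  if (process.env.SWML_ALLOW_PRIVATE_URLS === "true") return true;
  const host = new URL(raw).hostname.replace(/^\[|\]$/g, ""); // unbracket IPv6
  if (host === "localhost" || host === "::1") return false;
  if (isIP(host) === 4) {
    return !(
      /^(0|10|127)\./.test(host) ||
      /^169\.254\./.test(host) ||
      /^172\.(1[6-9]|2\d|3[01])\./.test(host) ||
      /^192\.168\./.test(host)
    );
  }
  return true; // public hostname; DNS resolution and IPv6 ranges omitted
}
```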
Multi-instance: yes — set a distinct tool_name per instance.
tool_name
String prepended, with a trailing underscore, to each emitted tool name
(e.g., tool_name="news" yields news_scrape_url, news_crawl_site,
news_extract_structured_data).
Required when registering multiple instances on the same agent.
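A sketch of two registrations, assuming an addSkill-style method on the
agent (the call shape is an assumption; only the parameter names come from
this reference):

```typescript
// Hypothetical agent surface; the real SDK method may differ.
declare const agent: { addSkill(name: string, params: object): void };

agent.addSkill("spider", { tool_name: "news", max_depth: 1 });
agent.addSkill("spider", { tool_name: "docs", max_depth: 2 });
// The first instance emits news_scrape_url, news_crawl_site, and
// news_extract_structured_data; the second emits the docs_ set.
```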
delay
Delay between requests in seconds (minimum 0).
concurrent_requests
Number of concurrent requests allowed (range 1-20).
timeout
Per-request timeout in seconds (range 1-60).
max_pages
Maximum number of pages to scrape (range 1-100).
max_depth
Maximum crawl depth. 0 restricts to a single page; range 0-5.
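Together these five parameters bound a crawl's footprint. An illustrative,
conservative combination:

```typescript
// Example values only; each stays within its documented range.
const crawlLimits = {
  delay: 1,               // seconds between requests (minimum 0)
  concurrent_requests: 4, // range 1-20
  timeout: 15,            // seconds per request, range 1-60
  max_pages: 25,          // range 1-100
  max_depth: 2,           // 0 = starting page only, range 0-5
};
```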
extract_type
Content extraction method. One of "fast_text", "clean_text",
"full_text", "html", "markdown", "structured", "custom". Only
fast_text, markdown, and structured are wired through the handlers
in the TypeScript port; the others fall back to fast_text.
max_text_length
Maximum extracted text length in characters (range 100-100000).
clean_text
Whether to clean extracted text (trim surrounding whitespace, collapse
whitespace runs, etc.).
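A sketch of the post-processing these two options imply: clean when
requested, then truncate to max_text_length. The exact cleaning rules here
are an assumption:

```typescript
// Assumed post-processing: trim and collapse whitespace runs when
// cleaning, then truncate. The skill's exact rules may differ.
function postProcess(text: string, clean: boolean, maxLen: number): string {
  const cleaned = clean
    ? text.replace(/[ \t]+/g, " ").replace(/\n{3,}/g, "\n\n").trim()
    : text;
  return cleaned.slice(0, maxLen);
}
```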
selectors
Map of name → CSS selector used for structured extraction.
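Structured extraction applies each selector with cheerio. A sketch of that
step, with a made-up selectors map (the cheerio calls are real API;
everything else is illustrative):

```typescript
import * as cheerio from "cheerio";

// Hypothetical map, shaped like the selectors parameter.
const selectors = { title: "h1", byline: ".author", body: "article p" };

function extractStructured(html: string): Record<string, string> {
  const $ = cheerio.load(html);
  const result: Record<string, string> = {};
  for (const [name, selector] of Object.entries(selectors)) {
    result[name] = $(selector).text().trim(); // concatenates all matches
  }
  return result;
}
```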
follow_patterns
URL patterns used to filter which discovered links are followed when
crawling.
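An illustrative value; whether patterns match as substrings or as regular
expressions is not specified here, so the semantics in this sketch are an
assumption:

```typescript
// Illustrative patterns; matching semantics (substring vs. regex)
// depend on the implementation.
const crawlParams = {
  follow_patterns: ["/blog/", "/docs/"],
  max_depth: 2,
};
```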
user_agent
User-Agent header for outbound requests. Defaults to a Chrome-compatible UA string.
headers
Additional HTTP headers sent with each request.
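An illustrative combination of the two request-identity options above; the
header names and values are made up:

```typescript
// Example values; the custom header is hypothetical.
const requestParams = {
  user_agent: "Mozilla/5.0 (compatible; NewsBot/1.0)",
  headers: {
    "Accept-Language": "en-US,en;q=0.9",
    "X-Scrape-Job": "nightly-news", // hypothetical tracing header
  },
};
```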
follow_robots_txt
Whether to respect robots.txt. Defaults to false, matching the runtime
behavior of the Python implementation.
cache_enabled
Whether to cache scraped pages in memory.
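Putting the options together, an illustrative complete parameter set (every
value is an example, not a recommended default):

```typescript
const spiderParams = {
  tool_name: "news",
  delay: 0.5,
  concurrent_requests: 5,
  timeout: 20,
  max_pages: 50,
  max_depth: 1,
  extract_type: "markdown",
  max_text_length: 20000,
  clean_text: true,
  user_agent: "Mozilla/5.0 (compatible; NewsBot/1.0)",
  follow_robots_txt: true, // opt in; the default is false
  cache_enabled: true,     // keep scraped pages in memory
};
```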