spider
Fast web scraping and crawling. Fetches web pages and extracts content optimized for token efficiency.
Tools: scrape_url, crawl_site, extract_structured_data
Requirements: lxml
Multi-instance: Yes
delay
Delay between requests in seconds.
concurrent_requests
Number of concurrent requests allowed (1–20).
timeout
Request timeout in seconds (1–60).
max_pages
Maximum number of pages to scrape (1–100).
max_depth
Maximum crawl depth (0–5). A depth of 0 scrapes a single page only.
extract_type
Content extraction method: "fast_text", "clean_text", "full_text", "html", or "custom".
max_text_length
Maximum text length to return (100–100000).
clean_text
Whether to clean extracted text by collapsing whitespace.
selectors
Custom CSS or XPath selectors for structured data extraction. Keys are field names; values are selector strings.
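The shape of a selectors mapping can be sketched as follows. The field names, HTML fragment, and XPath strings here are made up for illustration; the tool itself relies on lxml, but this sketch uses the standard library's ElementTree (which supports a subset of XPath) so it runs anywhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical selectors mapping: field name -> selector string (XPath here).
selectors = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

# A small well-formed page fragment for illustration.
html = """
<html><body>
  <h1>Widget</h1>
  <span class='price'>$9.99</span>
</body></html>
"""

root = ET.fromstring(html)
# Evaluate each selector and collect the matched element's text.
record = {field: root.find(path).text for field, path in selectors.items()}
# record -> {"title": "Widget", "price": "$9.99"}
```

Each key in the mapping becomes a field in the extracted record, which is why descriptive field names ("title", "price") are preferable to positional ones.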
follow_patterns
URL patterns (regex strings) to follow when crawling. Only links matching at least one pattern are followed.
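The "matches at least one pattern" rule can be illustrated with Python's re module; the pattern strings and link list below are hypothetical.

```python
import re

# Hypothetical crawl configuration: only follow docs and dated blog URLs.
follow_patterns = [r"/docs/", r"/blog/\d{4}/"]

links = [
    "https://example.com/docs/intro",
    "https://example.com/blog/2024/launch",
    "https://example.com/pricing",
]

# A link is followed if it matches at least one pattern.
followed = [
    url for url in links
    if any(re.search(p, url) for p in follow_patterns)
]
# followed -> the /docs/ and /blog/2024/ links; /pricing is skipped
```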
user_agent
User agent string sent with each request.
headers
Additional HTTP headers to include with each request.
follow_robots_txt
Whether to respect robots.txt rules when crawling.
cache_enabled
Whether to cache scraped pages in memory to avoid re-fetching.
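Putting the options together, a configuration might look like the sketch below. The dict shape and key names are an assumption (the tool's actual config format may differ), but the numeric ranges come from the option descriptions above.

```python
# Hypothetical spider configuration; keys follow the options documented above.
config = {
    "delay": 1.0,               # seconds between requests
    "concurrent_requests": 5,   # 1-20
    "timeout": 30,              # seconds, 1-60
    "max_pages": 50,            # 1-100
    "max_depth": 2,             # 0-5; 0 = single page only
    "extract_type": "clean_text",
    "max_text_length": 20000,   # 100-100000
    "clean_text": True,
    "follow_robots_txt": True,
    "cache_enabled": True,
}

# Documented (min, max) ranges for the numeric options.
RANGES = {
    "concurrent_requests": (1, 20),
    "timeout": (1, 60),
    "max_pages": (1, 100),
    "max_depth": (0, 5),
    "max_text_length": (100, 100000),
}

def out_of_range(cfg):
    """Return the options whose values fall outside their documented range."""
    return [
        key for key, (lo, hi) in RANGES.items()
        if key in cfg and not (lo <= cfg[key] <= hi)
    ]

errors = out_of_range(config)  # -> [] for the config above
```

A pre-flight check like `out_of_range` catches range mistakes (e.g. `timeout: 90`) before a crawl starts rather than mid-run.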