
spider


Fast web scraping and crawling. Fetches web pages and extracts content optimized for token efficiency.

Tools: scrape_url, crawl_site, extract_structured_data

Requirements: lxml

Multi-instance: Yes

delay
float · Defaults to 0.1

Delay between requests in seconds.

concurrent_requests
int · Defaults to 5

Number of concurrent requests allowed (1–20).

timeout
int · Defaults to 5

Request timeout in seconds (1–60).

max_pages
int · Defaults to 1

Maximum number of pages to scrape (1–100).

max_depth
int · Defaults to 0

Maximum crawl depth. 0 means single page only (0–5).
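
Taken together, max_pages and max_depth bound a crawl. As a sketch, a configuration that fetches up to ten pages, following links two levels deep from the start URL (the dict is passed as the second argument to add_skill):

```python
# Sketch: crawl settings for the spider skill.
crawl_config = {
    "max_pages": 10,  # stop after 10 pages have been fetched (1-100)
    "max_depth": 2,   # follow links up to two levels from the start URL (0-5)
}
```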

extract_type
str · Defaults to fast_text

Content extraction method: "fast_text", "clean_text", "full_text", "html", or "custom".

max_text_length
int · Defaults to 10000

Maximum text length to return (100–100000).

clean_text
bool · Defaults to True

Whether to clean extracted text by collapsing whitespace.

selectors
dict · Defaults to {}

Custom CSS or XPath selectors for structured data extraction. Keys are field names, values are selector strings.
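
To illustrate how a selectors mapping drives structured extraction, here is a rough stdlib-only analogue using xml.etree.ElementTree and its limited XPath support. The skill itself uses lxml and also accepts CSS selectors; the field names and markup below are invented for the example:

```python
import xml.etree.ElementTree as ET

# Hypothetical selectors mapping: field name -> XPath selector string.
selectors = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

# A tiny, well-formed page standing in for a fetched document.
page = ET.fromstring(
    "<html><body><h1>Widget</h1><span class='price'>$9.99</span></body></html>"
)

# Apply each selector and take the first match's text,
# mimicking structured data extraction.
data = {name: page.find(path).text for name, path in selectors.items()}
# data == {"title": "Widget", "price": "$9.99"}
```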

follow_patterns
list · Defaults to []

URL patterns (regex strings) to follow when crawling. Only links matching at least one pattern are followed.
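
Because follow_patterns entries are ordinary regular expressions, their filtering behavior can be previewed with Python's re module. The patterns and URLs here are invented for illustration:

```python
import re

# Hypothetical patterns: only follow dated blog posts and docs pages.
follow_patterns = [r"/blog/\d{4}/", r"/docs/"]

links = [
    "https://example.com/blog/2024/hello-world",
    "https://example.com/docs/getting-started",
    "https://example.com/about",
]

# A link is followed if it matches at least one pattern.
followed = [
    url for url in links
    if any(re.search(p, url) for p in follow_patterns)
]
# followed keeps the blog and docs URLs, but not /about
```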

user_agent
str

User agent string sent with each request.

headers
dict · Defaults to {}

Additional HTTP headers to include with each request.

follow_robots_txt
bool · Defaults to True

Whether to respect robots.txt rules when crawling.

cache_enabled
bool · Defaults to True

Whether to cache scraped pages in memory to avoid re-fetching.

from signalwire_agents import AgentBase

class MyAgent(AgentBase):
    def __init__(self):
        super().__init__(name="assistant", route="/assistant")
        self.set_prompt_text("You are a helpful assistant.")
        # Enable the spider skill with custom scraping limits.
        self.add_skill("spider", {
            "timeout": 10,
            "concurrent_requests": 3,
            "max_pages": 5
        })

agent = MyAgent()
agent.serve()