SpiderSkill
Fast web scraping and crawling. Extracts text, markdown, or structured data
from any public URL, optionally following links up to a bounded depth. Uses
cheerio for parsing and enforces an SSRF guard on crawl hops.
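For orientation, a minimal sketch of a single scrape hop, assuming Node 18+
for the global fetch: load the page, parse it with cheerio, and return the
visible text. The skill's real pipeline layers limits, cleaning, and the
SSRF guard on top of this.

```typescript
import * as cheerio from "cheerio";

// Minimal single-page scrape: fetch, parse with cheerio, return the
// visible text. Illustrative only; limits and guards are omitted.
async function scrapeText(url: string): Promise<string> {
  const res = await fetch(url); // global fetch, Node 18+
  const $ = cheerio.load(await res.text());
  $("script, style, noscript").remove(); // drop non-visible nodes
  return $("body").text().replace(/\s+/g, " ").trim();
}
```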
Class: SpiderSkill
Tools: scrape_url, crawl_site, extract_structured_data (each is
prefixed with <tool_name>_ when tool_name is set).
Required packages: cheerio
Env vars: SWML_ALLOW_PRIVATE_URLS=true relaxes the SSRF guard for local
testing.
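As a rough illustration of what the guard checks (the skill's exact rules
may differ), a gate of this shape rejects loopback and private-range hosts
unless the override is set:

```typescript
import { isIP } from "node:net";

// Sketch of an SSRF gate: reject loopback/private hosts unless
// SWML_ALLOW_PRIVATE_URLS=true. The real rules may differ.
function isAllowedUrl(raw: string): boolean {
  if (process.env.SWML_ALLOW_PRIVATE_URLS === "true") return true;
  const host = new URL(raw).hostname.replace(/^\[|\]$/g, ""); // unbracket IPv6
  if (host === "localhost" || host === "::1") return false;
  if (isIP(host) === 4) {
    return !(
      /^(0|10|127)\./.test(host) ||
      /^169\.254\./.test(host) ||
      /^172\.(1[6-9]|2\d|3[01])\./.test(host) ||
      /^192\.168\./.test(host)
    );
  }
  return true; // public hostname; DNS resolution and IPv6 ranges omitted
}
```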
Multi-instance: yes — set a distinct tool_name per instance.
tool_name
String prepended, with a trailing underscore, to each emitted tool name
(e.g., tool_name="news" yields news_scrape_url, news_crawl_site,
news_extract_structured_data).
Required when registering multiple instances on the same agent.
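A sketch of two registrations, assuming an addSkill-style method on the
agent (the call shape is an assumption; only the parameter names come from
this reference):

```typescript
// Hypothetical agent surface; the real SDK method may differ.
declare const agent: { addSkill(name: string, params: object): void };

agent.addSkill("spider", { tool_name: "news", max_depth: 1 });
agent.addSkill("spider", { tool_name: "docs", max_depth: 2 });
// The first instance emits news_scrape_url, news_crawl_site, and
// news_extract_structured_data; the second emits the docs_ set.
```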
delay
Delay between requests in seconds (minimum 0).
concurrent_requests
Number of concurrent requests allowed (range 1-20).
timeout
Per-request timeout in seconds (range 1-60).
max_pages
Maximum number of pages to scrape (range 1-100).
max_depth
Maximum crawl depth. 0 restricts to a single page; range 0-5.
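Together these five parameters bound a crawl's footprint. An illustrative,
conservative combination:

```typescript
// Example values only; each stays within its documented range.
const crawlLimits = {
  delay: 1,               // seconds between requests (minimum 0)
  concurrent_requests: 4, // range 1-20
  timeout: 15,            // seconds per request, range 1-60
  max_pages: 25,          // range 1-100
  max_depth: 2,           // 0 = starting page only, range 0-5
};
```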
extract_type
Content extraction method. One of "fast_text", "clean_text",
"full_text", "html", "markdown", "structured", "custom". Only
fast_text, markdown, and structured are wired through the handlers
in the TypeScript port; the others fall back to fast_text.
max_text_length
Maximum extracted text length in characters (range 100-100000).
clean_text
Whether to clean extracted text (trim surrounding whitespace, collapse
whitespace runs, etc.).
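A sketch of the post-processing these two options imply: clean when
requested, then truncate to max_text_length. The exact cleaning rules here
are an assumption:

```typescript
// Assumed post-processing: trim and collapse whitespace runs when
// cleaning, then truncate. The skill's exact rules may differ.
function postProcess(text: string, clean: boolean, maxLen: number): string {
  const cleaned = clean
    ? text.replace(/[ \t]+/g, " ").replace(/\n{3,}/g, "\n\n").trim()
    : text;
  return cleaned.slice(0, maxLen);
}
```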
selectors
Map of name → CSS selector used for structured extraction.
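Structured extraction applies each selector with cheerio. A sketch of that
step, with a made-up selectors map (the cheerio calls are real API;
everything else is illustrative):

```typescript
import * as cheerio from "cheerio";

// Hypothetical map, shaped like the selectors parameter.
const selectors = { title: "h1", byline: ".author", body: "article p" };

function extractStructured(html: string): Record<string, string> {
  const $ = cheerio.load(html);
  const result: Record<string, string> = {};
  for (const [name, selector] of Object.entries(selectors)) {
    result[name] = $(selector).text().trim(); // concatenates all matches
  }
  return result;
}
```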
follow_patterns
URL patterns used to filter which discovered links are followed when
crawling.
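An illustrative value; whether patterns match as substrings or as regular
expressions is not specified here, so the semantics in this sketch are an
assumption:

```typescript
// Illustrative patterns; matching semantics (substring vs. regex)
// depend on the implementation.
const crawlParams = {
  follow_patterns: ["/blog/", "/docs/"],
  max_depth: 2,
};
```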
user_agent
User-Agent header for outbound requests. Defaults to a Chrome-compatible UA string.
headers
Additional HTTP headers sent with each request.
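An illustrative combination of the two request-identity options above; the
header names and values are made up:

```typescript
// Example values; the custom header is hypothetical.
const requestParams = {
  user_agent: "Mozilla/5.0 (compatible; NewsBot/1.0)",
  headers: {
    "Accept-Language": "en-US,en;q=0.9",
    "X-Scrape-Job": "nightly-news", // hypothetical tracing header
  },
};
```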
follow_robots_txt
Whether to respect robots.txt. Defaults to false, matching the runtime
behavior of the Python implementation.
cache_enabled
Whether to cache scraped pages in memory.
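Putting the options together, an illustrative complete parameter set (every
value is an example, not a recommended default):

```typescript
const spiderParams = {
  tool_name: "news",
  delay: 0.5,
  concurrent_requests: 5,
  timeout: 20,
  max_pages: 50,
  max_depth: 1,
  extract_type: "markdown",
  max_text_length: 20000,
  clean_text: true,
  user_agent: "Mozilla/5.0 (compatible; NewsBot/1.0)",
  follow_robots_txt: true, // opt in; the default is false
  cache_enabled: true,     // keep scraped pages in memory
};
```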