spider
Fast web scraping and crawling. Fetches web pages and extracts content optimized for token efficiency.
Tools: scrape_url, crawl_site, extract_structured_data
Requirements: lxml
Multi-instance: Yes
delay
Delay between requests in seconds.
concurrent_requests
Number of concurrent requests allowed (1–20).
timeout
Request timeout in seconds (1–60).
max_pages
Maximum number of pages to scrape (1–100).
max_depth
Maximum crawl depth (0–5). A depth of 0 scrapes a single page only.
extract_type
Content extraction method: "fast_text", "clean_text", "full_text", "html", or "custom".
max_text_length
Maximum text length to return (100–100000).
clean_text
Whether to clean extracted text by collapsing whitespace.
selectors
Custom CSS or XPath selectors for structured data extraction. Keys are field names; values are selector strings.
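The shape of a selectors mapping can be sketched as follows. The field names, HTML fragment, and XPath strings here are made up for illustration; the tool itself relies on lxml, but this sketch uses the standard library's ElementTree (which supports a subset of XPath) so it runs anywhere.

```python
import xml.etree.ElementTree as ET

# Hypothetical selectors mapping: field name -> selector string (XPath here).
selectors = {
    "title": ".//h1",
    "price": ".//span[@class='price']",
}

# A small well-formed page fragment for illustration.
html = """
<html><body>
  <h1>Widget</h1>
  <span class='price'>$9.99</span>
</body></html>
"""

root = ET.fromstring(html)
# Evaluate each selector and collect the matched element's text.
record = {field: root.find(path).text for field, path in selectors.items()}
# record -> {"title": "Widget", "price": "$9.99"}
```

Each key in the mapping becomes a field in the extracted record, which is why descriptive field names ("title", "price") are preferable to positional ones.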
follow_patterns
URL patterns (regex strings) to follow when crawling. Only links matching at least one pattern are followed.
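The "matches at least one pattern" rule can be illustrated with Python's re module; the pattern strings and link list below are hypothetical.

```python
import re

# Hypothetical crawl configuration: only follow docs and dated blog URLs.
follow_patterns = [r"/docs/", r"/blog/\d{4}/"]

links = [
    "https://example.com/docs/intro",
    "https://example.com/blog/2024/launch",
    "https://example.com/pricing",
]

# A link is followed if it matches at least one pattern.
followed = [
    url for url in links
    if any(re.search(p, url) for p in follow_patterns)
]
# followed -> the /docs/ and /blog/2024/ links; /pricing is skipped
```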
user_agent
User agent string sent with each request.
headers
Additional HTTP headers to include with each request.
follow_robots_txt
Whether to respect robots.txt rules when crawling.
cache_enabled
Whether to cache scraped pages in memory to avoid re-fetching.
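Putting the options together, a configuration might look like the sketch below. The dict shape and key names are an assumption (the tool's actual config format may differ), but the numeric ranges come from the option descriptions above.

```python
# Hypothetical spider configuration; keys follow the options documented above.
config = {
    "delay": 1.0,               # seconds between requests
    "concurrent_requests": 5,   # 1-20
    "timeout": 30,              # seconds, 1-60
    "max_pages": 50,            # 1-100
    "max_depth": 2,             # 0-5; 0 = single page only
    "extract_type": "clean_text",
    "max_text_length": 20000,   # 100-100000
    "clean_text": True,
    "follow_robots_txt": True,
    "cache_enabled": True,
}

# Documented (min, max) ranges for the numeric options.
RANGES = {
    "concurrent_requests": (1, 20),
    "timeout": (1, 60),
    "max_pages": (1, 100),
    "max_depth": (0, 5),
    "max_text_length": (100, 100000),
}

def out_of_range(cfg):
    """Return the options whose values fall outside their documented range."""
    return [
        key for key, (lo, hi) in RANGES.items()
        if key in cfg and not (lo <= cfg[key] <= hi)
    ]

errors = out_of_range(config)  # -> [] for the config above
```

A pre-flight check like `out_of_range` catches range mistakes (e.g. `timeout: 90`) before a crawl starts rather than mid-run.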