---

title: spider
slug: /reference/python/agents/skills/spider
description: Fast web scraping and crawling.
---

Fast web scraping and crawling. Fetches web pages and extracts content optimized
for token efficiency.

**Tools:** `scrape_url`, `crawl_site`, `extract_structured_data`

**Requirements:** `lxml`

**Multi-instance:** Yes

<ParamField path="delay" type="float" default="0.1" toc={true}>
  Delay between requests in seconds.
</ParamField>

<ParamField path="concurrent_requests" type="int" default="5" toc={true}>
  Number of concurrent requests allowed (1–20).
</ParamField>

<ParamField path="timeout" type="int" default="5" toc={true}>
  Request timeout in seconds (1–60).
</ParamField>

<ParamField path="max_pages" type="int" default="1" toc={true}>
  Maximum number of pages to scrape (1–100).
</ParamField>

<ParamField path="max_depth" type="int" default="0" toc={true}>
  Maximum crawl depth. `0` means single page only (0–5).
</ParamField>

<ParamField path="extract_type" type="str" default="fast_text" toc={true}>
  Content extraction method: `"fast_text"`, `"clean_text"`, `"full_text"`, `"html"`, or `"custom"`.
</ParamField>

<ParamField path="max_text_length" type="int" default="10000" toc={true}>
  Maximum text length to return (100–100000).
</ParamField>

<ParamField path="clean_text" type="bool" default="True" toc={true}>
  Whether to clean extracted text by collapsing whitespace.
</ParamField>

<ParamField path="selectors" type="dict" default="{}" toc={true}>
  Custom CSS or XPath selectors for structured data extraction. Keys are field names, values are selector strings.
</ParamField>
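To illustrate what a `selectors` mapping does, here is a minimal sketch of XPath-based field extraction using the standard library's `ElementTree` (the skill itself uses `lxml` and also accepts CSS selectors; the HTML snippet, field names, and selector strings below are made up for this example):

```python
import xml.etree.ElementTree as ET

# Hypothetical page content and selectors, for illustration only.
page = ET.fromstring(
    "<html><body>"
    "<h1 class='title'>Spider Docs</h1>"
    "<span class='price'>$9.99</span>"
    "</body></html>"
)

# Keys are field names, values are selector strings (XPath here).
selectors = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}

# Each field resolves to the text of the first matching element.
extracted = {name: page.find(xpath).text for name, xpath in selectors.items()}
```

The skill's `extract_structured_data` tool applies the configured selectors in a similar way, returning one value per field name.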

<ParamField path="follow_patterns" type="list" default="[]" toc={true}>
  URL patterns (regex strings) to follow when crawling. Only links matching at least one pattern are followed.
</ParamField>
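As a sketch of how `follow_patterns` filters links during a crawl (the URLs and patterns here are hypothetical), a link is followed only if at least one regex matches it:

```python
import re

# Hypothetical patterns restricting a crawl to docs and reference pages.
follow_patterns = [r"^https://example\.com/docs/", r"/reference/"]

links = [
    "https://example.com/docs/intro",
    "https://example.com/blog/news",
    "https://example.com/reference/python/",
]

# Keep only links matching at least one pattern; others are skipped.
followed = [
    url for url in links
    if any(re.search(pattern, url) for pattern in follow_patterns)
]
```

With an empty list (the default), no pattern filtering is applied.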

<ParamField path="user_agent" type="str" toc={true}>
  User agent string sent with each request.
</ParamField>

<ParamField path="headers" type="dict" default="{}" toc={true}>
  Additional HTTP headers to include with each request.
</ParamField>

<ParamField path="follow_robots_txt" type="bool" default="True" toc={true}>
  Whether to respect `robots.txt` rules when crawling.
</ParamField>

<ParamField path="cache_enabled" type="bool" default="True" toc={true}>
  Whether to cache scraped pages in memory to avoid re-fetching.
</ParamField>

```python
from signalwire_agents import AgentBase

class MyAgent(AgentBase):
    def __init__(self):
        super().__init__(name="assistant", route="/assistant")
        self.set_prompt_text("You are a helpful assistant.")
        self.add_skill("spider", {
            "timeout": 10,
            "concurrent_requests": 3,
            "max_pages": 5
        })

agent = MyAgent()
agent.serve()
```