URL: /configuration/crawler

---
title: "Crawler Settings"
description: "Configure crawl behavior, limits, delays, and URL patterns"
---

The `[crawler]` section controls how squirrelscan discovers and fetches pages.

## Configuration

```toml
[crawler]
max_pages = 100
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 2
per_host_delay_ms = 200
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
breadth_first = true
max_prefix_budget = 0.25
user_agent = ""
follow_redirects = true
```

## Crawl Limits

### `max_pages`

**Type:** `number`
**Default:** `100` (surface mode)
**Range:** `1` to `5000` (capped by CLI)

Maximum number of pages to crawl per audit. When `max_pages` isn't set, the default depends on the [coverage mode](/crawl): `quick` = 25, `surface` = 100 (the default mode), `full` = 500.

**Examples:**

Small site:
```toml
[crawler]
max_pages = 50
```

Large site:
```toml
[crawler]
max_pages = 2000
```

**CLI override:**
```bash
squirrel audit https://example.com -m 100
```

**Note:** The CLI enforces a hard cap (currently 5,000 pages) regardless of config.

---

### `timeout_ms`

**Type:** `number`
**Default:** `30000` (30 seconds)
**Range:** `1000` to `60000` recommended

Timeout for each page request in milliseconds.

**Examples:**

Fast timeout for quick sites:
```toml
[crawler]
timeout_ms = 10000  # 10 seconds
```

Slow sites or APIs:
```toml
[crawler]
timeout_ms = 45000  # 45 seconds
```

**When request exceeds timeout:**
- Page marked as failed
- Crawl continues with next URL
- Logged in error output

---

## Rate Limiting

### `delay_ms`

**Type:** `number`
**Default:** `100` (100ms)

Base delay between requests in milliseconds.

**Examples:**

Fast crawl (be careful):
```toml
[crawler]
delay_ms = 50
```

Polite crawl:
```toml
[crawler]
delay_ms = 500
```

No delay (local development only):
```toml
[crawler]
delay_ms = 0
```

**Note:** Actual delays depend on `per_host_delay_ms` and concurrency settings.

---

### `per_host_delay_ms`

**Type:** `number`
**Default:** `200` (200ms)

Minimum delay between requests to the same host.

Ensures politeness even with high concurrency.

**Examples:**

Aggressive (use cautiously):
```toml
[crawler]
per_host_delay_ms = 100
```

Very polite:
```toml
[crawler]
per_host_delay_ms = 1000  # 1 second
```

**How it works:**

With `per_host_concurrency = 2` and `per_host_delay_ms = 200`:
- At most 2 concurrent requests to same host
- At least 200ms between requests to same host
- Other hosts can be fetched simultaneously

---

### `concurrency`

**Type:** `number`
**Default:** `5`
**Range:** `1` to `20` recommended

Maximum number of concurrent requests globally.

**Examples:**

Sequential (single request at a time):
```toml
[crawler]
concurrency = 1
```

Moderate parallelism:
```toml
[crawler]
concurrency = 10
```

High parallelism (use cautiously):
```toml
[crawler]
concurrency = 20
```

**Impact:**
- Higher = faster crawls
- Higher = more server load
- Bounded by `per_host_concurrency`

---

### `per_host_concurrency`

**Type:** `number`
**Default:** `2`
**Range:** `1` to `5` recommended

Maximum number of concurrent requests per host.

Prevents overwhelming a single server even with high global concurrency.

**Examples:**

One request per host at a time:
```toml
[crawler]
per_host_concurrency = 1
```

Allow more parallel requests:
```toml
[crawler]
per_host_concurrency = 4
```

**How it interacts with `concurrency`:**

```toml
[crawler]
concurrency = 10
per_host_concurrency = 2
```

- Up to 10 total concurrent requests
- At most 2 concurrent requests to any single host
- Can fetch from up to 5 different hosts simultaneously

---

## URL Filtering

### `include`

**Type:** `string[]`
**Default:** `[]` (empty = include all URLs from seed domain)

URL patterns to include. If set, **only** matching URLs are crawled.

**Pattern Syntax:**

Uses glob syntax:
- `*` - Match anything except `/`
- `**` - Match anything including `/`
- `?` - Match single character
- `[abc]` - Match character set

**Examples:**

Only crawl blog:
```toml
[crawler]
include = ["/blog/**"]
```

Multiple sections:
```toml
[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]
```

Specific file types:
```toml
[crawler]
include = ["*.html", "*.htm"]
```

**Important:** When `include` is set, it **overrides** the `domains` setting in `[project]`.

---

### `exclude`

**Type:** `string[]`
**Default:** `[]` (empty = exclude nothing)

URL patterns to exclude from crawling.

Takes precedence over `include` - if a URL matches both, it's excluded.

**Examples:**

Exclude admin areas:
```toml
[crawler]
exclude = ["/admin/**", "/wp-admin/**"]
```

Exclude file types:
```toml
[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]
```

Exclude API endpoints:
```toml
[crawler]
exclude = ["/api/**", "/v1/**"]
```

Exclude query parameters:
```toml
[crawler]
exclude = ["*?preview=*", "*?draft=*"]
```

**Common exclusions:**

```toml
[crawler]
exclude = [
  "/admin/**",
  "/wp-admin/**",
  "/wp-content/**",
  "/api/**",
  "*.pdf",
  "*.zip",
  "*.jpg",
  "*.png",
  "*?preview=*",
  "*?print=*"
]
```

---

### Pattern Matching Examples

| Pattern | Matches | Doesn't Match |
|---------|---------|---------------|
| `/blog/*` | `/blog/post` | `/blog/post/comment` |
| `/blog/**` | `/blog/post`, `/blog/post/comment` | `/about` |
| `*.pdf` | `/file.pdf`, `/docs/guide.pdf` | `/file.html` |
| `*?preview=*` | `/page?preview=true` | `/page` |
| `/api/*/users` | `/api/v1/users` | `/api/v1/v2/users` |

---

## Query Parameters

### `allow_query_params`

**Type:** `string[]`
**Default:** `[]` (empty = drop all query params for deduplication)

Query parameters to preserve during URL deduplication.

**Why this matters:**

URLs are deduplicated before crawling:
- `/page?id=1&utm_source=google` → `/page?id=1` (utm dropped)

Without configuration, all query params are dropped except those in `allow_query_params`.

**Examples:**

Preserve pagination:
```toml
[crawler]
allow_query_params = ["page"]
```

Preserve filters:
```toml
[crawler]
allow_query_params = ["category", "sort", "filter", "q"]
```

Preserve all query params:
```toml
[crawler]
allow_query_params = ["*"]
```

**Use case:**

E-commerce site with filters:
```toml
[crawler]
allow_query_params = ["category", "price", "brand", "page"]
```

This preserves:
- `/products?category=shoes` ✓
- `/products?category=shoes&page=2` ✓

This drops:
- `/products?utm_source=google` ✗ (becomes `/products`)
- `/products?gclid=abc123` ✗ (becomes `/products`)

---

### `drop_query_prefixes`

**Type:** `string[]`
**Default:** `["utm_", "gclid", "fbclid"]`

Query parameter prefixes to always drop, even if in `allow_query_params`.

**Default tracking params dropped:**
- `utm_*` - Google Analytics (utm_source, utm_medium, etc.)
- `gclid` - Google Ads
- `fbclid` - Facebook Ads

**Examples:**

Drop more tracking params:
```toml
[crawler]
drop_query_prefixes = [
  "utm_",
  "gclid",
  "fbclid",
  "mc_",       # Mailchimp
  "_ga",       # Google Analytics
  "ref",       # Referrer
  "source"     # Generic source tracking
]
```

Drop nothing:
```toml
[crawler]
drop_query_prefixes = []
```

---

## Crawl Strategy

### `breadth_first`

**Type:** `boolean`
**Default:** `true`

Use breadth-first crawling for better site coverage.

**Breadth-first (default):**
- Crawls level-by-level
- Discovers homepage, then all links from homepage, then all links from those pages
- Better site coverage
- Avoids getting stuck in deep paths

**Depth-first (`false`):**
- Crawls as deep as possible before backtracking
- Can get stuck in deep sections
- Less even coverage

**Example:**

Disable breadth-first:
```toml
[crawler]
breadth_first = false
```

**Recommendation:** Keep `true` (default) for most sites.

---

### `max_prefix_budget`

**Type:** `number`
**Default:** `0.25` (25%)
**Range:** `0.1` to `1.0`

Maximum percentage of crawl budget for any single path prefix.

Prevents the crawler from spending all pages on one section (e.g., `/blog/` with 1000+ posts).

**How it works:**

With `max_pages = 500` and `max_prefix_budget = 0.25`:
- At most 125 pages (25%) from any single path prefix
- Ensures diverse coverage across site sections

**Examples:**

More strict (better coverage):
```toml
[crawler]
max_prefix_budget = 0.15  # Max 15% per prefix
```

More lenient (deeper coverage):
```toml
[crawler]
max_prefix_budget = 0.5   # Max 50% per prefix
```

**Disable budget (not recommended):**
```toml
[crawler]
max_prefix_budget = 1.0   # No limit
```

**Use case:**

Site with large blog:
```
/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact
```

With `max_prefix_budget = 0.25` and `max_pages = 500`:
- At most 125 blog posts crawled
- Remaining budget for other sections

---

## Request Configuration

### `user_agent`

**Type:** `string`
**Default:** `""` (empty = random browser user agent per crawl)

Custom user agent string.

**Default behavior (empty string):**

Random modern browser user agent, refreshed per crawl:
```
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
```

**Examples:**

Custom user agent:
```toml
[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"
```

Mobile user agent:
```toml
[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"
```

**Recommendation:** Leave empty (default) for best results with bot protection.

---

### `follow_redirects`

**Type:** `boolean`
**Default:** `true`

Follow HTTP 3xx redirects.

**When `true` (default):**
- Follows redirects automatically
- Crawls final destination URL
- Redirect chains tracked for analysis

**When `false`:**
- Stops at redirect
- Does not fetch redirect destination
- Useful for debugging redirect issues

**Example:**

Disable redirect following:
```toml
[crawler]
follow_redirects = false
```

**Recommendation:** Keep `true` (default) for normal audits.

---

## Robots.txt

### `respect_robots`

**Type:** `boolean`
**Default:** `true`

Obey robots.txt rules and crawl-delay directives.

**When `true` (default):**
- Fetches and parses robots.txt
- Respects `Disallow:` rules
- Honors `Crawl-delay:` directive
- Polite and ethical

**When `false`:**
- Ignores robots.txt
- Crawls all URLs (including disallowed)
- Use only for your own sites

**Example:**

Ignore robots.txt (testing only):
```toml
[crawler]
respect_robots = false
```

**Recommendation:** Always keep `true` (default) when crawling third-party sites.

---

## Complete Examples

### Fast Local Development

```toml
[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
respect_robots = false
```

### Polite Production Crawl

```toml
[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true
```

### High-Volume Crawl

```toml
[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2
```

### Focused Blog Crawl

```toml
[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]
```

### E-commerce Site

```toml
[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]
```

## Related

- [Project Settings](/configuration/project) - Domains configuration
- [Rules Configuration](/configuration/rules) - Which rules to run
- [Examples](/configuration/examples) - More configuration examples
