Crawler Settings

Configure crawl behavior, limits, delays, and URL patterns

The [crawler] section controls how squirrelscan discovers and fetches pages.

Configuration

toml

[crawler]
max_pages = 100
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 2
per_host_delay_ms = 200
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
breadth_first = true
max_prefix_budget = 0.25
user_agent = ""
follow_redirects = true

Crawl Limits

`max_pages`

Type: number Default: 100 (surface mode) Range: 1 to 5000 (capped by CLI)

Maximum number of pages to crawl per audit. When max_pages isn’t set, the default depends on the coverage mode: quick = 25, surface = 100 (the default mode), full = 500.

Examples:

Small site:

toml

[crawler]
max_pages = 50

Large site:

toml

[crawler]
max_pages = 2000

CLI override:

bash

squirrel audit https://example.com -m 100

Note: The CLI enforces a hard cap (currently 5,000 pages) regardless of config.

`timeout_ms`

Type: number Default: 30000 (30 seconds) Range: 1000 to 60000 recommended

Timeout for each page request in milliseconds.

Examples:

Fast timeout for quick sites:

toml

[crawler]
timeout_ms = 10000  # 10 seconds

Slow sites or APIs:

toml

[crawler]
timeout_ms = 45000  # 45 seconds

When request exceeds timeout:

Page marked as failed
Crawl continues with next URL
Logged in error output

Rate Limiting

`delay_ms`

Type: number Default: 100 (100ms)

Base delay between requests in milliseconds.

Examples:

Fast crawl (be careful):

toml

[crawler]
delay_ms = 50

Polite crawl:

toml

[crawler]
delay_ms = 500

No delay (local development only):

toml

[crawler]
delay_ms = 0

Note: Actual delays depend on per_host_delay_ms and concurrency settings.

`per_host_delay_ms`

Type: number Default: 200 (200ms)

Minimum delay between requests to the same host.

Ensures politeness even with high concurrency.

Examples:

Aggressive (use cautiously):

toml

[crawler]
per_host_delay_ms = 100

Very polite:

toml

[crawler]
per_host_delay_ms = 1000  # 1 second

How it works:

With per_host_concurrency = 2 and per_host_delay_ms = 200:

At most 2 concurrent requests to same host
At least 200ms between requests to same host
Other hosts can be fetched simultaneously

`concurrency`

Type: number Default: 5 Range: 1 to 20 recommended

Maximum number of concurrent requests globally.

Examples:

Sequential (single request at a time):

toml

[crawler]
concurrency = 1

Moderate parallelism:

toml

[crawler]
concurrency = 10

High parallelism (use cautiously):

toml

[crawler]
concurrency = 20

Impact:

Higher = faster crawls
Higher = more server load
Bounded by per_host_concurrency

`per_host_concurrency`

Type: number Default: 2 Range: 1 to 5 recommended

Maximum number of concurrent requests per host.

Prevents overwhelming a single server even with high global concurrency.

Examples:

One request per host at a time:

toml

[crawler]
per_host_concurrency = 1

Allow more parallel requests:

toml

[crawler]
per_host_concurrency = 4

How it interacts with concurrency:

toml

[crawler]
concurrency = 10
per_host_concurrency = 2

Up to 10 total concurrent requests
At most 2 concurrent requests to any single host
Can fetch from up to 5 different hosts simultaneously

URL Filtering

`include`

Type: string[] Default: [] (empty = include all URLs from seed domain)

URL patterns to include. If set, only matching URLs are crawled.

Pattern Syntax:

Uses glob syntax:

* - Match anything except /
** - Match anything including /
? - Match single character
[abc] - Match character set

Examples:

Only crawl blog:

toml

[crawler]
include = ["/blog/**"]

Multiple sections:

toml

[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]

Specific file types:

toml

[crawler]
include = ["*.html", "*.htm"]

Important: When include is set, it overrides the domains setting in [project].

`exclude`

Type: string[] Default: [] (empty = exclude nothing)

URL patterns to exclude from crawling.

Takes precedence over include - if a URL matches both, it’s excluded.

Examples:

Exclude admin areas:

toml

[crawler]
exclude = ["/admin/**", "/wp-admin/**"]

Exclude file types:

toml

[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]

Exclude API endpoints:

toml

[crawler]
exclude = ["/api/**", "/v1/**"]

Exclude query parameters:

toml

[crawler]
exclude = ["*?preview=*", "*?draft=*"]

Common exclusions:

toml

[crawler]
exclude = [
  "/admin/**",
  "/wp-admin/**",
  "/wp-content/**",
  "/api/**",
  "*.pdf",
  "*.zip",
  "*.jpg",
  "*.png",
  "*?preview=*",
  "*?print=*"
]

Pattern Matching Examples

Pattern	Matches	Doesn’t Match
`/blog/*`	`/blog/post`	`/blog/post/comment`
`/blog/**`	`/blog/post`, `/blog/post/comment`	`/about`
`*.pdf`	`/file.pdf`, `/docs/guide.pdf`	`/file.html`
`?preview=`	`/page?preview=true`	`/page`
`/api/*/users`	`/api/v1/users`	`/api/v1/v2/users`

Query Parameters

`allow_query_params`

Type: string[] Default: [] (empty = drop all query params for deduplication)

Query parameters to preserve during URL deduplication.

Why this matters:

URLs are deduplicated before crawling:

/page?id=1&utm_source=google → /page?id=1 (utm dropped)

Without configuration, all query params are dropped except those in allow_query_params.

Examples:

Preserve pagination:

toml

[crawler]
allow_query_params = ["page"]

Preserve filters:

toml

[crawler]
allow_query_params = ["category", "sort", "filter", "q"]

Preserve all query params:

toml

[crawler]
allow_query_params = ["*"]

Use case:

E-commerce site with filters:

toml

[crawler]
allow_query_params = ["category", "price", "brand", "page"]

This preserves:

/products?category=shoes ✓
/products?category=shoes&page=2 ✓

This drops:

/products?utm_source=google ✗ (becomes /products)
/products?gclid=abc123 ✗ (becomes /products)

`drop_query_prefixes`

Type: string[] Default: ["utm_", "gclid", "fbclid"]

Query parameter prefixes to always drop, even if in allow_query_params.

Default tracking params dropped:

utm_* - Google Analytics (utm_source, utm_medium, etc.)
gclid - Google Ads
fbclid - Facebook Ads

Examples:

Drop more tracking params:

toml

[crawler]
drop_query_prefixes = [
  "utm_",
  "gclid",
  "fbclid",
  "mc_",       # Mailchimp
  "_ga",       # Google Analytics
  "ref",       # Referrer
  "source"     # Generic source tracking
]

Drop nothing:

toml

[crawler]
drop_query_prefixes = []

Crawl Strategy

`breadth_first`

Type: boolean Default: true

Use breadth-first crawling for better site coverage.

Breadth-first (default):

Crawls level-by-level
Discovers homepage, then all links from homepage, then all links from those pages
Better site coverage
Avoids getting stuck in deep paths

Depth-first (false):

Crawls as deep as possible before backtracking
Can get stuck in deep sections
Less even coverage

Example:

Disable breadth-first:

toml

[crawler]
breadth_first = false

Recommendation: Keep true (default) for most sites.

`max_prefix_budget`

Type: number Default: 0.25 (25%) Range: 0.1 to 1.0

Maximum percentage of crawl budget for any single path prefix.

Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).

How it works:

With max_pages = 500 and max_prefix_budget = 0.25:

At most 125 pages (25%) from any single path prefix
Ensures diverse coverage across site sections

Examples:

More strict (better coverage):

toml

[crawler]
max_prefix_budget = 0.15  # Max 15% per prefix

More lenient (deeper coverage):

toml

[crawler]
max_prefix_budget = 0.5   # Max 50% per prefix

Disable budget (not recommended):

toml

[crawler]
max_prefix_budget = 1.0   # No limit

Use case:

Site with large blog:

/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact

With max_prefix_budget = 0.25 and max_pages = 500:

At most 125 blog posts crawled
Remaining budget for other sections

Request Configuration

`user_agent`

Type: string Default: "" (empty = random browser user agent per crawl)

Custom user agent string.

Default behavior (empty string):

Random modern browser user agent, refreshed per crawl:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36

Examples:

Custom user agent:

toml

[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"

Mobile user agent:

toml

[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"

Recommendation: Leave empty (default) for best results with bot protection.

`follow_redirects`

Type: boolean Default: true

Follow HTTP 3xx redirects.

When true (default):

Follows redirects automatically
Crawls final destination URL
Redirect chains tracked for analysis

When false:

Stops at redirect
Does not fetch redirect destination
Useful for debugging redirect issues

Example:

Disable redirect following:

toml

[crawler]
follow_redirects = false

Recommendation: Keep true (default) for normal audits.

Robots.txt

`respect_robots`

Type: boolean Default: true

Obey robots.txt rules and crawl-delay directives.

When true (default):

Fetches and parses robots.txt
Respects Disallow: rules
Honors Crawl-delay: directive
Polite and ethical

When false:

Ignores robots.txt
Crawls all URLs (including disallowed)
Use only for your own sites

Example:

Ignore robots.txt (testing only):

toml

[crawler]
respect_robots = false

Recommendation: Always keep true (default) when crawling third-party sites.

Complete Examples

Fast Local Development

toml

[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
respect_robots = false

Polite Production Crawl

toml

[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true

High-Volume Crawl

toml

[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2

Focused Blog Crawl

toml

[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]

E-commerce Site

toml

[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]

Project Settings - Domains configuration
Rules Configuration - Which rules to run
Examples - More configuration examples

Edit this page

Configuration

Crawl Limits

max_pages

timeout_ms

Rate Limiting

delay_ms

per_host_delay_ms

concurrency

per_host_concurrency

URL Filtering

include

exclude

Pattern Matching Examples

Query Parameters

allow_query_params

drop_query_prefixes

Crawl Strategy

breadth_first

max_prefix_budget

Request Configuration

user_agent

follow_redirects

Robots.txt

respect_robots

Complete Examples

Fast Local Development

Polite Production Crawl

High-Volume Crawl

Focused Blog Crawl

E-commerce Site

Related

`max_pages`

`timeout_ms`

`delay_ms`

`per_host_delay_ms`

`concurrency`

`per_host_concurrency`

`include`

`exclude`

`allow_query_params`

`drop_query_prefixes`

`breadth_first`

`max_prefix_budget`

`user_agent`

`follow_redirects`

`respect_robots`