GitHub

Crawler Settings

Configure crawl behavior, limits, delays, and URL patterns

The [crawler] section controls how squirrelscan discovers and fetches pages.

Configuration

toml
[crawler]
max_pages = 100
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 2
per_host_delay_ms = 200
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
breadth_first = true
max_prefix_budget = 0.25
user_agent = ""
follow_redirects = true

Crawl Limits

max_pages

Type: number Default: 100 (surface mode) Range: 1 to 5000 (capped by CLI)

Maximum number of pages to crawl per audit. When max_pages isn’t set, the default depends on the coverage mode: quick = 25, surface = 100 (the default mode), full = 500.

Examples:

Small site:

toml
[crawler]
max_pages = 50

Large site:

toml
[crawler]
max_pages = 2000

CLI override:

bash
squirrel audit https://example.com -m 100

Note: The CLI enforces a hard cap (currently 5,000 pages) regardless of config.


timeout_ms

Type: number Default: 30000 (30 seconds) Range: 1000 to 60000 recommended

Timeout for each page request in milliseconds.

Examples:

Fast timeout for quick sites:

toml
[crawler]
timeout_ms = 10000  # 10 seconds

Slow sites or APIs:

toml
[crawler]
timeout_ms = 45000  # 45 seconds

When request exceeds timeout:

  • Page marked as failed
  • Crawl continues with next URL
  • Logged in error output

Rate Limiting

delay_ms

Type: number Default: 100 (100ms)

Base delay between requests in milliseconds.

Examples:

Fast crawl (be careful):

toml
[crawler]
delay_ms = 50

Polite crawl:

toml
[crawler]
delay_ms = 500

No delay (local development only):

toml
[crawler]
delay_ms = 0

Note: Actual delays depend on per_host_delay_ms and concurrency settings.


per_host_delay_ms

Type: number Default: 200 (200ms)

Minimum delay between requests to the same host.

Ensures politeness even with high concurrency.

Examples:

Aggressive (use cautiously):

toml
[crawler]
per_host_delay_ms = 100

Very polite:

toml
[crawler]
per_host_delay_ms = 1000  # 1 second

How it works:

With per_host_concurrency = 2 and per_host_delay_ms = 200:

  • At most 2 concurrent requests to same host
  • At least 200ms between requests to same host
  • Other hosts can be fetched simultaneously

concurrency

Type: number Default: 5 Range: 1 to 20 recommended

Maximum number of concurrent requests globally.

Examples:

Sequential (single request at a time):

toml
[crawler]
concurrency = 1

Moderate parallelism:

toml
[crawler]
concurrency = 10

High parallelism (use cautiously):

toml
[crawler]
concurrency = 20

Impact:

  • Higher = faster crawls
  • Higher = more server load
  • Bounded by per_host_concurrency

per_host_concurrency

Type: number Default: 2 Range: 1 to 5 recommended

Maximum number of concurrent requests per host.

Prevents overwhelming a single server even with high global concurrency.

Examples:

One request per host at a time:

toml
[crawler]
per_host_concurrency = 1

Allow more parallel requests:

toml
[crawler]
per_host_concurrency = 4

How it interacts with concurrency:

toml
[crawler]
concurrency = 10
per_host_concurrency = 2
  • Up to 10 total concurrent requests
  • At most 2 concurrent requests to any single host
  • Can fetch from up to 5 different hosts simultaneously

URL Filtering

include

Type: string[] Default: [] (empty = include all URLs from seed domain)

URL patterns to include. If set, only matching URLs are crawled.

Pattern Syntax:

Uses glob syntax:

  • * - Match anything except /
  • ** - Match anything including /
  • ? - Match single character
  • [abc] - Match character set

Examples:

Only crawl blog:

toml
[crawler]
include = ["/blog/**"]

Multiple sections:

toml
[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]

Specific file types:

toml
[crawler]
include = ["*.html", "*.htm"]

Important: When include is set, it overrides the domains setting in [project].


exclude

Type: string[] Default: [] (empty = exclude nothing)

URL patterns to exclude from crawling.

Takes precedence over include - if a URL matches both, it’s excluded.

Examples:

Exclude admin areas:

toml
[crawler]
exclude = ["/admin/**", "/wp-admin/**"]

Exclude file types:

toml
[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]

Exclude API endpoints:

toml
[crawler]
exclude = ["/api/**", "/v1/**"]

Exclude query parameters:

toml
[crawler]
exclude = ["*?preview=*", "*?draft=*"]

Common exclusions:

toml
[crawler]
exclude = [
  "/admin/**",
  "/wp-admin/**",
  "/wp-content/**",
  "/api/**",
  "*.pdf",
  "*.zip",
  "*.jpg",
  "*.png",
  "*?preview=*",
  "*?print=*"
]

Pattern Matching Examples

PatternMatchesDoesn’t Match
/blog/*/blog/post/blog/post/comment
/blog/**/blog/post, /blog/post/comment/about
*.pdf/file.pdf, /docs/guide.pdf/file.html
*?preview=*/page?preview=true/page
/api/*/users/api/v1/users/api/v1/v2/users

Query Parameters

allow_query_params

Type: string[] Default: [] (empty = drop all query params for deduplication)

Query parameters to preserve during URL deduplication.

Why this matters:

URLs are deduplicated before crawling:

  • /page?id=1&utm_source=google/page?id=1 (utm dropped)

Without configuration, all query params are dropped except those in allow_query_params.

Examples:

Preserve pagination:

toml
[crawler]
allow_query_params = ["page"]

Preserve filters:

toml
[crawler]
allow_query_params = ["category", "sort", "filter", "q"]

Preserve all query params:

toml
[crawler]
allow_query_params = ["*"]

Use case:

E-commerce site with filters:

toml
[crawler]
allow_query_params = ["category", "price", "brand", "page"]

This preserves:

  • /products?category=shoes
  • /products?category=shoes&page=2

This drops:

  • /products?utm_source=google ✗ (becomes /products)
  • /products?gclid=abc123 ✗ (becomes /products)

drop_query_prefixes

Type: string[] Default: ["utm_", "gclid", "fbclid"]

Query parameter prefixes to always drop, even if in allow_query_params.

Default tracking params dropped:

  • utm_* - Google Analytics (utm_source, utm_medium, etc.)
  • gclid - Google Ads
  • fbclid - Facebook Ads

Examples:

Drop more tracking params:

toml
[crawler]
drop_query_prefixes = [
  "utm_",
  "gclid",
  "fbclid",
  "mc_",       # Mailchimp
  "_ga",       # Google Analytics
  "ref",       # Referrer
  "source"     # Generic source tracking
]

Drop nothing:

toml
[crawler]
drop_query_prefixes = []

Crawl Strategy

breadth_first

Type: boolean Default: true

Use breadth-first crawling for better site coverage.

Breadth-first (default):

  • Crawls level-by-level
  • Discovers homepage, then all links from homepage, then all links from those pages
  • Better site coverage
  • Avoids getting stuck in deep paths

Depth-first (false):

  • Crawls as deep as possible before backtracking
  • Can get stuck in deep sections
  • Less even coverage

Example:

Disable breadth-first:

toml
[crawler]
breadth_first = false

Recommendation: Keep true (default) for most sites.


max_prefix_budget

Type: number Default: 0.25 (25%) Range: 0.1 to 1.0

Maximum percentage of crawl budget for any single path prefix.

Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).

How it works:

With max_pages = 500 and max_prefix_budget = 0.25:

  • At most 125 pages (25%) from any single path prefix
  • Ensures diverse coverage across site sections

Examples:

More strict (better coverage):

toml
[crawler]
max_prefix_budget = 0.15  # Max 15% per prefix

More lenient (deeper coverage):

toml
[crawler]
max_prefix_budget = 0.5   # Max 50% per prefix

Disable budget (not recommended):

toml
[crawler]
max_prefix_budget = 1.0   # No limit

Use case:

Site with large blog:

/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact

With max_prefix_budget = 0.25 and max_pages = 500:

  • At most 125 blog posts crawled
  • Remaining budget for other sections

Request Configuration

user_agent

Type: string Default: "" (empty = random browser user agent per crawl)

Custom user agent string.

Default behavior (empty string):

Random modern browser user agent, refreshed per crawl:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36

Examples:

Custom user agent:

toml
[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"

Mobile user agent:

toml
[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"

Recommendation: Leave empty (default) for best results with bot protection.


follow_redirects

Type: boolean Default: true

Follow HTTP 3xx redirects.

When true (default):

  • Follows redirects automatically
  • Crawls final destination URL
  • Redirect chains tracked for analysis

When false:

  • Stops at redirect
  • Does not fetch redirect destination
  • Useful for debugging redirect issues

Example:

Disable redirect following:

toml
[crawler]
follow_redirects = false

Recommendation: Keep true (default) for normal audits.


Robots.txt

respect_robots

Type: boolean Default: true

Obey robots.txt rules and crawl-delay directives.

When true (default):

  • Fetches and parses robots.txt
  • Respects Disallow: rules
  • Honors Crawl-delay: directive
  • Polite and ethical

When false:

  • Ignores robots.txt
  • Crawls all URLs (including disallowed)
  • Use only for your own sites

Example:

Ignore robots.txt (testing only):

toml
[crawler]
respect_robots = false

Recommendation: Always keep true (default) when crawling third-party sites.


Complete Examples

Fast Local Development

toml
[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
respect_robots = false

Polite Production Crawl

toml
[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true

High-Volume Crawl

toml
[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2

Focused Blog Crawl

toml
[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]

E-commerce Site

toml
[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]

Type to search…

↑↓ navigate open esc close