Crawler Settings
Configure crawl behavior, limits, delays, and URL patterns
The [crawler] section controls how squirrelscan discovers and fetches pages.
Configuration
[crawler]
max_pages = 100
delay_ms = 100
timeout_ms = 30000
concurrency = 5
per_host_concurrency = 2
per_host_delay_ms = 200
include = []
exclude = []
allow_query_params = []
drop_query_prefixes = ["utm_", "gclid", "fbclid"]
respect_robots = true
breadth_first = true
max_prefix_budget = 0.25
user_agent = ""
follow_redirects = true
Crawl Limits
max_pages
Type: number
Default: 100 (surface mode)
Range: 1 to 5000 (capped by CLI)
Maximum number of pages to crawl per audit. When max_pages isn’t set, the default depends on the coverage mode: quick = 25, surface = 100 (the default mode), full = 500.
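The way the mode defaults and the CLI cap combine can be sketched as follows. This is a minimal illustration, not squirrelscan's code: `effective_max_pages` is a hypothetical helper, and the numbers are the ones quoted in this section.

```python
# Defaults per coverage mode and the CLI cap, as stated in this section.
MODE_DEFAULTS = {"quick": 25, "surface": 100, "full": 500}
CLI_HARD_CAP = 5000


def effective_max_pages(configured=None, mode="surface"):
    """Return the page limit a crawl would use (hypothetical helper)."""
    # An explicit max_pages wins; otherwise fall back to the mode default.
    limit = configured if configured is not None else MODE_DEFAULTS[mode]
    if limit < 1:
        raise ValueError("max_pages must be at least 1")
    # The cap applies regardless of what the config asks for.
    return min(limit, CLI_HARD_CAP)
```

For example, `effective_max_pages(mode="quick")` falls back to the quick-mode default, while a configured value above the cap is reduced to the cap.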
Examples:
Small site:
[crawler]
max_pages = 50
Large site:
[crawler]
max_pages = 2000
CLI override:
squirrel audit https://example.com -m 100
Note: The CLI enforces a hard cap (currently 5,000 pages) regardless of config.
timeout_ms
Type: number
Default: 30000 (30 seconds)
Range: 1000 to 60000 recommended
Timeout for each page request in milliseconds.
Examples:
Fast timeout for quick sites:
[crawler]
timeout_ms = 10000 # 10 seconds
Slow sites or APIs:
[crawler]
timeout_ms = 45000 # 45 seconds
When a request exceeds the timeout:
- Page marked as failed
- Crawl continues with next URL
- Logged in error output
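The timeout behavior listed above can be sketched with `asyncio.wait_for`. This is an illustrative sketch only: `fetch_with_timeout`, `crawl_with_timeouts`, and `fake_fetch` are hypothetical names, and the fake fetcher stands in for a real HTTP request.

```python
import asyncio


async def fetch_with_timeout(fetch, url, timeout_ms):
    """Run fetch(url); return (result, error) instead of raising on timeout."""
    try:
        result = await asyncio.wait_for(fetch(url), timeout=timeout_ms / 1000)
        return result, None
    except asyncio.TimeoutError:
        return None, f"timed out after {timeout_ms} ms"


async def crawl_with_timeouts(urls, fetch, timeout_ms=30000):
    pages, failed = {}, {}
    for url in urls:
        result, error = await fetch_with_timeout(fetch, url, timeout_ms)
        if error is None:
            pages[url] = result
        else:
            # Mark the page as failed, record the error, and keep going.
            failed[url] = error
    return pages, failed


async def fake_fetch(url):
    # Stand-in for a real request: "/slow" takes longer than the timeout.
    await asyncio.sleep(0.5 if url.endswith("/slow") else 0)
    return f"<html>{url}</html>"


pages, failed = asyncio.run(
    crawl_with_timeouts(["/a", "/slow", "/b"], fake_fetch, timeout_ms=50)
)
```

Here the slow URL ends up in `failed` while the URLs before and after it are still fetched.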
Rate Limiting
delay_ms
Type: number
Default: 100 (100ms)
Base delay between requests in milliseconds.
Examples:
Fast crawl (be careful):
[crawler]
delay_ms = 50
Polite crawl:
[crawler]
delay_ms = 500
No delay (local development only):
[crawler]
delay_ms = 0
Note: Actual delays depend on per_host_delay_ms and concurrency settings.
per_host_delay_ms
Type: number
Default: 200 (200ms)
Minimum delay between requests to the same host.
Ensures politeness even with high concurrency.
Examples:
Aggressive (use cautiously):
[crawler]
per_host_delay_ms = 100
Very polite:
[crawler]
per_host_delay_ms = 1000 # 1 second
How it works:
With per_host_concurrency = 2 and per_host_delay_ms = 200:
- At most 2 concurrent requests to same host
- At least 200ms between requests to same host
- Other hosts can be fetched simultaneously
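To make the per-host spacing concrete, the sketch below computes the earliest start time for each request in a queue. It is a simplified model, not the crawler's scheduler: `earliest_start_times` is a hypothetical helper, and it ignores request duration and the concurrency limits.

```python
from urllib.parse import urlsplit


def earliest_start_times(urls, per_host_delay_ms=200):
    """Return (url, start_ms) pairs, spacing same-host requests by the delay."""
    next_free = {}  # host -> earliest time the next request may start
    schedule = []
    for url in urls:
        host = urlsplit(url).netloc
        start = next_free.get(host, 0)
        schedule.append((url, start))
        # The next request to this host must wait at least the delay.
        next_free[host] = start + per_host_delay_ms
    return schedule


schedule = earliest_start_times([
    "https://a.example/1",
    "https://a.example/2",
    "https://b.example/1",
    "https://a.example/3",
])
```

Requests to `a.example` are spaced 200 ms apart, while the request to `b.example` is not held back by them.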
concurrency
Type: number
Default: 5
Range: 1 to 20 recommended
Maximum number of concurrent requests globally.
Examples:
Sequential (single request at a time):
[crawler]
concurrency = 1
Moderate parallelism:
[crawler]
concurrency = 10
High parallelism (use cautiously):
[crawler]
concurrency = 20
Impact:
- Higher = faster crawls
- Higher = more server load
- Bounded by per_host_concurrency
per_host_concurrency
Type: number
Default: 2
Range: 1 to 5 recommended
Maximum number of concurrent requests per host.
Prevents overwhelming a single server even with high global concurrency.
Examples:
One request per host at a time:
[crawler]
per_host_concurrency = 1
Allow more parallel requests:
[crawler]
per_host_concurrency = 4
How it interacts with concurrency:
[crawler]
concurrency = 10
per_host_concurrency = 2
- Up to 10 total concurrent requests
- At most 2 concurrent requests to any single host
- Reaching all 10 concurrent requests requires at least 5 different hosts
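One common way to enforce a global limit together with a per-host limit is two layers of semaphores. The sketch below assumes that approach for illustration; it is not taken from squirrelscan's source, `crawl_with_limits` is a hypothetical name, and the sleep stands in for the HTTP request.

```python
import asyncio
from collections import defaultdict
from urllib.parse import urlsplit


async def crawl_with_limits(urls, concurrency=10, per_host_concurrency=2):
    global_slots = asyncio.Semaphore(concurrency)
    host_slots = defaultdict(lambda: asyncio.Semaphore(per_host_concurrency))
    active = defaultdict(int)  # host -> requests currently in flight
    peak = defaultdict(int)    # host -> highest in-flight count observed
    peak_total = 0

    async def fetch(url):
        nonlocal peak_total
        host = urlsplit(url).netloc
        # Take the per-host slot first so waiting on a busy host
        # does not hold a global slot.
        async with host_slots[host], global_slots:
            active[host] += 1
            peak[host] = max(peak[host], active[host])
            peak_total = max(peak_total, sum(active.values()))
            await asyncio.sleep(0.01)  # stand-in for the HTTP request
            active[host] -= 1

    await asyncio.gather(*(fetch(u) for u in urls))
    return peak_total, dict(peak)


# Six hypothetical hosts with four pages each.
urls = [f"https://site{h}.example/page{p}" for h in range(6) for p in range(4)]
peak_total, peak_per_host = asyncio.run(crawl_with_limits(urls))
```

The returned peaks let you confirm that neither limit was exceeded during the run.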
URL Filtering
include
Type: string[]
Default: [] (empty = include all URLs from seed domain)
URL patterns to include. If set, only matching URLs are crawled.
Pattern Syntax:
Uses glob syntax:
- * - Match anything except /
- ** - Match anything including /
- ? - Match single character
- [abc] - Match character set
Examples:
Only crawl blog:
[crawler]
include = ["/blog/**"]
Multiple sections:
[crawler]
include = ["/blog/**", "/docs/**", "/products/**"]
Specific file types:
[crawler]
include = ["*.html", "*.htm"]
Important: When include is set, it overrides the domains setting in [project].
exclude
Type: string[]
Default: [] (empty = exclude nothing)
URL patterns to exclude from crawling.
Takes precedence over include - if a URL matches both, it’s excluded.
Examples:
Exclude admin areas:
[crawler]
exclude = ["/admin/**", "/wp-admin/**"]
Exclude file types:
[crawler]
exclude = ["*.pdf", "*.zip", "*.tar.gz"]
Exclude API endpoints:
[crawler]
exclude = ["/api/**", "/v1/**"]
Exclude query parameters:
[crawler]
exclude = ["*?preview=*", "*?draft=*"]
Common exclusions:
[crawler]
exclude = [
"/admin/**",
"/wp-admin/**",
"/wp-content/**",
"/api/**",
"*.pdf",
"*.zip",
"*.jpg",
"*.png",
"*?preview=*",
"*?print=*"
]
Pattern Matching Examples
| Pattern | Matches | Doesn’t Match |
|---|---|---|
| /blog/* | /blog/post | /blog/post/comment |
| /blog/** | /blog/post, /blog/post/comment | /about |
| *.pdf | /file.pdf, /docs/guide.pdf | /file.html |
| *?preview=* | /page?preview=true | /page |
| /api/*/users | /api/v1/users | /api/v1/v2/users |
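The precedence between include and exclude can be sketched as a small filter. This is an approximation under a stated assumption: it uses Python's `fnmatch`, whose `*` also crosses `/`, so it does not reproduce the `*` versus `**` distinction described above. `should_crawl` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def should_crawl(path, include=(), exclude=()):
    """Exclude wins over include; an empty include list allows everything."""
    # A URL matching any exclude pattern is dropped, even if it is included.
    if any(fnmatchcase(path, pattern) for pattern in exclude):
        return False
    # With include patterns set, only matching URLs are crawled.
    if include:
        return any(fnmatchcase(path, pattern) for pattern in include)
    return True


include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
decisions = {
    path: should_crawl(path, include, exclude)
    for path in ["/blog/post", "/blog/drafts/wip", "/blog/guide.pdf", "/about"]
}
```

Only `/blog/post` passes: the drafts page and the PDF match an exclude pattern, and `/about` does not match the include list.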
Query Parameters
allow_query_params
Type: string[]
Default: [] (empty = drop all query params for deduplication)
Query parameters to preserve during URL deduplication.
Why this matters:
URLs are deduplicated before crawling:
/page?id=1&utm_source=google → /page?id=1 (utm_source dropped; id kept only if listed in allow_query_params)
By default, all query params are dropped. Only those listed in allow_query_params are preserved.
Examples:
Preserve pagination:
[crawler]
allow_query_params = ["page"]
Preserve filters:
[crawler]
allow_query_params = ["category", "sort", "filter", "q"]
Preserve all query params:
[crawler]
allow_query_params = ["*"]
Use case:
E-commerce site with filters:
[crawler]
allow_query_params = ["category", "price", "brand", "page"]
This preserves:
- /products?category=shoes ✓
- /products?category=shoes&page=2 ✓
This drops:
- /products?utm_source=google ✗ (becomes /products)
- /products?gclid=abc123 ✗ (becomes /products)
drop_query_prefixes
Type: string[]
Default: ["utm_", "gclid", "fbclid"]
Query parameter prefixes to always drop, even if in allow_query_params.
Default tracking params dropped:
- utm_* - Google Analytics (utm_source, utm_medium, etc.)
- gclid - Google Ads
- fbclid - Facebook Ads
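How allow_query_params and drop_query_prefixes combine during deduplication can be sketched with `urllib.parse`. This is a minimal illustration of the rules described in this section, not the crawler's actual normalizer; `normalize_url` is a hypothetical helper.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_DROP_PREFIXES = ("utm_", "gclid", "fbclid")


def normalize_url(url, allow_query_params=(),
                  drop_query_prefixes=DEFAULT_DROP_PREFIXES):
    """Drop query params that are not allowed or that match a drop prefix."""
    parts = urlsplit(url)
    allow_all = "*" in allow_query_params
    prefixes = tuple(drop_query_prefixes)
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        if name.startswith(prefixes):
            continue  # drop prefixes win, even over the allow list
        if allow_all or name in allow_query_params:
            kept.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(kept)))


allowed = ["category", "price", "brand", "page"]
clean = normalize_url(
    "/products?category=shoes&page=2&utm_source=google", allowed
)
```

With the e-commerce allow list, `category` and `page` survive while `utm_source` is removed, so both tracked and untracked links to the same listing deduplicate to one URL.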
Examples:
Drop more tracking params:
[crawler]
drop_query_prefixes = [
"utm_",
"gclid",
"fbclid",
"mc_", # Mailchimp
"_ga", # Google Analytics
"ref", # Referrer
"source" # Generic source tracking
]
Drop nothing:
[crawler]
drop_query_prefixes = []
Crawl Strategy
breadth_first
Type: boolean
Default: true
Use breadth-first crawling for better site coverage.
Breadth-first (default):
- Crawls level-by-level
- Discovers homepage, then all links from homepage, then all links from those pages
- Better site coverage
- Avoids getting stuck in deep paths
Depth-first (false):
- Crawls as deep as possible before backtracking
- Can get stuck in deep sections
- Less even coverage
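The difference between the two strategies comes down to whether the frontier is treated as a queue or a stack. The sketch below shows this on a made-up link graph; `LINKS` and `crawl_order` are hypothetical and only illustrate the general technique.

```python
from collections import deque

# Hypothetical site structure: page -> links found on that page.
LINKS = {
    "/": ["/blog", "/about"],
    "/blog": ["/blog/a", "/blog/b"],
    "/blog/a": ["/blog/a/comments"],
    "/blog/b": [],
    "/blog/a/comments": [],
    "/about": ["/about/team"],
    "/about/team": [],
}


def crawl_order(start, breadth_first=True, max_pages=100):
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier and len(order) < max_pages:
        # FIFO queue gives breadth-first; LIFO stack gives depth-first.
        url = frontier.popleft() if breadth_first else frontier.pop()
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return order


bfs = crawl_order("/", breadth_first=True, max_pages=4)
dfs = crawl_order("/", breadth_first=False, max_pages=4)
```

With a budget of four pages, the breadth-first run visits both top-level sections before going deeper, while the depth-first run exhausts one branch before it reaches the other.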
Example:
Disable breadth-first:
[crawler]
breadth_first = false
Recommendation: Keep true (default) for most sites.
max_prefix_budget
Type: number
Default: 0.25 (25%)
Range: 0.1 to 1.0
Maximum share of the crawl budget, expressed as a fraction of max_pages, that any single path prefix may use.
Prevents the crawler from spending all pages on one section (e.g., /blog/ with 1000+ posts).
How it works:
With max_pages = 500 and max_prefix_budget = 0.25:
- At most 125 pages (25%) from any single path prefix
- Ensures diverse coverage across site sections
Examples:
More strict (better coverage):
[crawler]
max_prefix_budget = 0.15 # Max 15% per prefix
More lenient (deeper coverage):
[crawler]
max_prefix_budget = 0.5 # Max 50% per prefix
Disable budget (not recommended):
[crawler]
max_prefix_budget = 1.0 # No limit
Use case:
Site with large blog:
/blog/post-1
/blog/post-2
...
/blog/post-5000
/about
/contact
With max_prefix_budget = 0.25 and max_pages = 500:
- At most 125 blog posts crawled
- Remaining budget for other sections
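The budget arithmetic in this use case can be sketched as follows. Two things here are assumptions rather than documented behavior: the prefix is taken to be the first path segment, and the cap is rounded down. `select_pages` is a hypothetical helper.

```python
import math
from collections import Counter


def select_pages(urls, max_pages=500, max_prefix_budget=0.25):
    """Pick URLs in order, capping how many may share one path prefix."""
    per_prefix_cap = math.floor(max_pages * max_prefix_budget)
    used = Counter()
    selected = []
    for url in urls:
        if len(selected) >= max_pages:
            break
        # Assumption: the prefix is the first path segment, e.g. "/blog".
        prefix = "/" + url.strip("/").split("/")[0]
        if used[prefix] >= per_prefix_cap:
            continue  # this section has used up its share
        used[prefix] += 1
        selected.append(url)
    return selected


urls = [f"/blog/post-{i}" for i in range(1, 5001)] + ["/about", "/contact"]
selected = select_pages(urls)
blog_count = sum(u.startswith("/blog/") for u in selected)
```

Even though the blog posts come first in the queue, the cap leaves room for `/about` and `/contact`.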
Request Configuration
user_agent
Type: string
Default: "" (empty = random browser user agent per crawl)
Custom user agent string.
Default behavior (empty string):
Random modern browser user agent, refreshed per crawl:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
Examples:
Custom user agent:
[crawler]
user_agent = "SquirrelScan Bot (https://squirrelscan.com)"
Mobile user agent:
[crawler]
user_agent = "Mozilla/5.0 (iPhone; CPU iPhone OS 18_0 like Mac OS X) AppleWebKit/605.1.15"
Recommendation: Leave empty (default) for best results with bot protection.
follow_redirects
Type: boolean
Default: true
Follow HTTP 3xx redirects.
When true (default):
- Follows redirects automatically
- Crawls final destination URL
- Redirect chains tracked for analysis
When false:
- Stops at redirect
- Does not fetch redirect destination
- Useful for debugging redirect issues
Example:
Disable redirect following:
[crawler]
follow_redirects = false
Recommendation: Keep true (default) for normal audits.
Robots.txt
respect_robots
Type: boolean
Default: true
Obey robots.txt rules and crawl-delay directives.
When true (default):
- Fetches and parses robots.txt
- Respects Disallow: rules
- Honors Crawl-delay: directive
- Polite and ethical
When false:
- Ignores robots.txt
- Crawls all URLs (including disallowed)
- Use only for your own sites
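The effect of this switch can be illustrated with Python's standard `urllib.robotparser`. This is a sketch of the general idea, not squirrelscan's parser: `build_robots_check` and the sample robots.txt are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""


def build_robots_check(robots_txt, user_agent="*", respect_robots=True):
    """Return (allowed, crawl_delay) for the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url):
        # With respect_robots disabled, every URL is treated as crawlable.
        if not respect_robots:
            return True
        return parser.can_fetch(user_agent, url)

    delay = parser.crawl_delay(user_agent) if respect_robots else None
    return allowed, delay


allowed, delay = build_robots_check(ROBOTS_TXT)
ignore_allowed, ignore_delay = build_robots_check(ROBOTS_TXT, respect_robots=False)
```

With the rules respected, `/admin/` URLs are rejected and the crawl delay is read from the file; with the rules ignored, both checks are skipped.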
Example:
Ignore robots.txt (testing only):
[crawler]
respect_robots = false
Recommendation: Always keep true (default) when crawling third-party sites.
Complete Examples
Fast Local Development
[crawler]
max_pages = 50
delay_ms = 0
per_host_delay_ms = 0
concurrency = 10
respect_robots = false
Polite Production Crawl
[crawler]
max_pages = 500
delay_ms = 200
per_host_delay_ms = 500
concurrency = 5
per_host_concurrency = 2
respect_robots = true
High-Volume Crawl
[crawler]
max_pages = 2000
delay_ms = 100
per_host_delay_ms = 200
concurrency = 10
per_host_concurrency = 3
breadth_first = true
max_prefix_budget = 0.2
Focused Blog Crawl
[crawler]
max_pages = 200
include = ["/blog/**"]
exclude = ["*.pdf", "/blog/drafts/**"]
allow_query_params = ["page"]
E-commerce Site
[crawler]
max_pages = 1000
include = ["/products/**", "/categories/**"]
exclude = ["/cart/**", "/checkout/**", "/account/**"]
allow_query_params = ["category", "sort", "page", "filter"]
drop_query_prefixes = ["utm_", "gclid", "fbclid", "ref"]
Related
- Project Settings - Domains configuration
- Rules Configuration - Which rules to run
- Examples - More configuration examples