Website & URL Crawling
Crawl websites and documentation sites to automatically index their content. Perfect for help centers, docs sites, and public knowledge bases.
How It Works
The web crawler starts from a URL you provide and follows links to discover and index pages. It automatically:
- Extracts clean text content from HTML pages
- Follows links within the same domain
- Respects robots.txt directives
- Removes navigation, headers, footers, and scripts
- Handles pagination and multi-page content
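As a rough illustration of the link-following step, the sketch below pulls anchor links out of a page and keeps only those on the same host as the page itself. It uses only the Python standard library. The function and class names are hypothetical, and matching on an exact hostname is an assumption: how the product treats subdomains is not stated here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from standard <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def same_domain_links(page_url, html):
    """Return absolute, de-duplicated http(s) links on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).hostname
    seen, links = set(), []
    for href in collector.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parsed = urlparse(absolute)
        # Assumption: "same domain" means an identical hostname.
        if parsed.scheme in ("http", "https") and parsed.hostname == host:
            if absolute not in seen:
                seen.add(absolute)
                links.append(absolute)
    return links


sample = """
<a href="/guides/setup">Setup</a>
<a href="faq#billing">FAQ</a>
<a href="https://other.example.org/">External</a>
<a href="mailto:help@example.com">Email</a>
"""
print(same_domain_links("https://docs.example.com/guides/", sample))
```

Only the first two links survive: the external host and the `mailto:` link are dropped. Links that exist only after JavaScript runs never appear in the HTML, so a parser like this cannot see them.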
Setting Up Web Crawling
Add Web Source
From Knowledge Base, click + Add Source → Websites & URLs.
Enter Starting URL
Provide the URL where crawling should begin. This is typically a homepage, docs landing page, or sitemap.
Best Starting Points
- Documentation index: https://docs.example.com
- Help center: https://help.example.com
- Specific section: https://example.com/guides/
Configure Crawl Settings
Set depth and page limits based on your needs:
- Max depth: How many link levels to follow (1-5)
- Max pages: Total pages to crawl (up to 500)
- URL patterns: Include or exclude specific paths
Start Crawl
Click Connect to begin crawling. Progress is shown in real time as pages are discovered and indexed.
Crawl Settings
| Setting | Type | Default | Description |
|---|---|---|---|
| Start URL | string | required | The URL where crawling begins |
| Max depth | integer | 2 | How many link levels deep to crawl (1 = start page only, 5 = maximum) |
| Max pages | integer | 50 | Maximum number of pages to crawl (up to 500) |
| Include patterns | array | none | Regex patterns for URLs to include (optional) |
| Exclude patterns | array | none | Regex patterns for URLs to exclude (optional) |
| Respect robots.txt | boolean | true | Honor website crawling rules |
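To make the defaults and limits above concrete, here is a minimal sketch that models the settings as a Python dataclass with validation. The crawler is configured through the Knowledge Base UI, so the class and field names here are hypothetical and do not represent a product API. The lower bound of 1 on max pages and the http(s) check on the start URL are also assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlSettings:
    """Hypothetical model of the crawl settings, not a product API."""

    start_url: str                      # required, no default
    max_depth: int = 2                  # 1 = start page only, 5 = maximum
    max_pages: int = 50                 # up to 500
    include_patterns: list = field(default_factory=list)
    exclude_patterns: list = field(default_factory=list)
    respect_robots_txt: bool = True

    def __post_init__(self):
        # Assumption: only http(s) start URLs are accepted.
        if not self.start_url.startswith(("http://", "https://")):
            raise ValueError("start_url must be an http(s) URL")
        if not 1 <= self.max_depth <= 5:
            raise ValueError("max_depth must be between 1 and 5")
        # Assumption: at least one page must be allowed.
        if not 1 <= self.max_pages <= 500:
            raise ValueError("max_pages must be between 1 and 500")


settings = CrawlSettings(
    start_url="https://docs.example.com",
    max_depth=3,
    exclude_patterns=[r"/changelog.*"],
)
print(settings)
```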
URL Patterns
Use regex patterns to control which pages are crawled. This is useful for large sites where you only need specific sections.
Include Patterns
When set, only URLs matching at least one pattern are crawled:
    # Only crawl docs pages
    /docs/.*
    # Only crawl specific versions
    /v2\.0/.*
    # Only English content
    /en/.*
Exclude Patterns
URLs matching any exclude pattern are skipped:
    # Skip changelog and release notes
    /changelog.*
    /releases.*
    # Skip API reference (too large)
    /api-reference/.*
    # Skip login/auth pages
    /login.*
    /signup.*
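The include and exclude rules can be sketched as a small filter function. This is an illustration only: it assumes patterns are searched anywhere in the full URL (`re.search`) and that an exclude match wins over an include match. Neither detail is confirmed here, and the function name is hypothetical.

```python
import re


def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Decide whether a URL passes the include/exclude patterns."""
    # Assumption: exclude patterns take precedence over include patterns.
    if any(re.search(pattern, url) for pattern in exclude_patterns):
        return False
    # When include patterns are set, at least one must match.
    if include_patterns:
        return any(re.search(pattern, url) for pattern in include_patterns)
    # No include patterns: everything not excluded is crawled.
    return True


include = [r"/docs/.*"]
exclude = [r"/changelog.*", r"/login.*"]

for url in [
    "https://example.com/docs/setup",
    "https://example.com/docs/changelog/2024",
    "https://example.com/blog/post",
]:
    print(url, should_crawl(url, include, exclude))
```

Under these assumptions only the first URL is crawled. Note that an unanchored pattern such as `/login.*` also matches paths like `/login-help`, so testing patterns against a few real URLs before a large crawl is worthwhile.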
Understanding Crawl Depth
| Depth | What Gets Crawled | Best For |
|---|---|---|
| 1 | Start URL only | Single page content |
| 2 | Start URL + directly linked pages | Small docs sections |
| 3 | Two levels of links from start | Most documentation sites |
| 4-5 | Deep crawl of entire site sections | Large knowledge bases |
Depth vs Pages
Higher depth doesn't always mean more pages. A depth of 3 on a small site might find 20 pages, while a depth of 2 on a large site could hit the 500-page limit.
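The interaction between depth and page limits is easier to see on a toy link graph. The sketch below walks an in-memory site breadth-first, treating depth 1 as the start page only, as in the table above. Breadth-first order is an assumption, since the crawler's actual traversal order is not documented here, and the function name and sample site are made up for illustration.

```python
from collections import deque


def crawl_order(links, start, max_depth=2, max_pages=50):
    """Return pages in the order visited; depth 1 means the start page only."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 1)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the deepest level are indexed but not expanded
        for target in links.get(page, []):
            if target in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page limit reached before depth was exhausted
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# Toy site: each key links to the pages in its list.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}

print(crawl_order(site, "/", max_depth=1))               # ['/']
print(crawl_order(site, "/", max_depth=2))               # ['/', '/a', '/b']
print(crawl_order(site, "/", max_depth=3, max_pages=4))  # ['/', '/a', '/b', '/a/1']
```

The last call shows the page limit cutting off a crawl whose depth setting would otherwise have allowed six pages.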
Content Extraction
The crawler automatically extracts meaningful content by:
- Removing boilerplate: Navigation menus, footers, sidebars, and headers are stripped
- Cleaning scripts: JavaScript, CSS, and tracking code are removed
- Preserving structure: Headings and paragraphs are maintained for context
- Extracting titles: Page titles become document titles in search results
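The extraction steps above can be approximated with Python's standard `html.parser` module. The sketch below is not the crawler's actual extractor: it assumes boilerplate can be identified by tag name alone (`nav`, `header`, `footer`, `aside`, `script`, `style`), whereas real extractors usually rely on more heuristics. The class name is hypothetical.

```python
from html.parser import HTMLParser

# Assumption: boilerplate is recognised purely by these tag names.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class ContentExtractor(HTMLParser):
    """Collects the page title and text blocks outside skipped elements."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.blocks = []       # text blocks in document order
        self._skip_depth = 0   # greater than 0 while inside a skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse whitespace
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.blocks.append(text)


page = """
<html><head><title>Install Guide</title><style>p { color: red; }</style></head>
<body>
<nav><a href="/">Home</a></nav>
<h1>Installing</h1>
<p>Run the installer.</p>
<script>track();</script>
<footer>Copyright Example</footer>
</body></html>
"""

extractor = ContentExtractor()
extractor.feed(page)
print(extractor.title)   # Install Guide
print(extractor.blocks)  # ['Installing', 'Run the installer.']
```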
Robots.txt Compliance
By default, the crawler respects robots.txt rules. This means:
- Pages disallowed by robots.txt are skipped
- Crawl delays are honored
- The crawler identifies itself as a bot
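To check ahead of time how a site's robots.txt would affect a crawl, Python's standard `urllib.robotparser` can evaluate the rules locally. The sketch below parses an in-memory example file. The user agent string is a placeholder, since the crawler's real identifier is not given here.

```python
from urllib.robotparser import RobotFileParser

# Example rules; a real check would use the target site's own robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # placeholder, not the crawler's real user agent

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

allowed = parser.can_fetch(USER_AGENT, "https://docs.example.com/guides/setup")
blocked = parser.can_fetch(USER_AGENT, "https://docs.example.com/private/notes")
delay = parser.crawl_delay(USER_AGENT)

print(allowed)  # True
print(blocked)  # False
print(delay)    # 2
```

If the pages you need come back as disallowed, the crawl will skip them while "Respect robots.txt" is enabled.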
Private Content
The crawler can only access publicly available pages. For authenticated content, use direct integrations like Notion or Google Drive.
Best Practices
- Start specific: Begin with a focused section rather than the entire site
- Use patterns wisely: Exclude large sections you don't need (changelogs, API specs)
- Monitor page count: Start with lower limits and increase as needed
- Re-crawl periodically: Trigger manual syncs when source content updates
- Check robots.txt: Ensure the content you need isn't blocked
Troubleshooting
No pages found
- Verify the start URL is accessible in a browser
- Check if robots.txt blocks crawling
- Ensure the site doesn't require authentication
- Try disabling "Respect robots.txt" temporarily
Too few pages crawled
- Increase max depth setting
- Check if include patterns are too restrictive
- Verify links use standard HTML anchor tags
- Check whether the site relies on JavaScript navigation, which is not supported
Content not extracting properly
- The site may use heavy JavaScript rendering
- Content might be in iframes (not crawled)
- PDF or image-only pages won't have text
Crawl taking too long
- Reduce max pages limit
- Lower crawl depth
- Add exclude patterns for large sections
Limitations
- JavaScript rendering: Single-page apps (SPAs) that require JavaScript to render content may not crawl properly
- Authentication: Login-protected content cannot be accessed
- Rate limiting: The crawler pauses between requests to be polite; very large sites take time
- Same domain only: Links to external domains are not followed
Removing the Integration
To remove a web crawl data source, simply delete it from your Knowledge Base. All indexed pages will be removed from your bot's knowledge.

