Website & URL Crawling

Crawl websites and documentation sites to automatically index their content. Perfect for help centers, docs sites, and public knowledge bases.

How It Works

The web crawler starts from a URL you provide and follows links to discover and index pages. It automatically:

Extracts clean text content from HTML pages
Follows links within the same domain
Respects robots.txt directives
Removes navigation, headers, footers, and scripts
Handles pagination and multi-page content

Setting Up Web Crawling

Add Web Source

From Knowledge Base, click + Add Source → Websites & URLs.

Enter Starting URL

Provide the URL where crawling should begin. This is typically a homepage, docs landing page, or sitemap.

Best Starting Points

Documentation index: https://docs.example.com
Help center: https://help.example.com
Specific section: https://example.com/guides/

Configure Crawl Settings

Set depth and page limits based on your needs:

Max depth: How many link levels to follow (1-5)
Max pages: Total pages to crawl (up to 500)
URL patterns: Include or exclude specific paths

Start Crawl

Click Connect to begin crawling. Progress is shown in real time as pages are discovered and indexed.

Crawl Settings

Start URL (string, required)

The URL where crawling begins

Max depth (integer, default: 2)

How many link levels deep to crawl (1 = start page only, 5 = maximum)

Max pages (integer, default: 50)

Maximum number of pages to crawl (up to 500)

Include patterns (array, default: none)

Regex patterns for URLs to include (optional)

Exclude patterns (array, default: none)

Regex patterns for URLs to exclude (optional)

Respect robots.txt (boolean, default: true)

Honor website crawling rules
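
As a rough illustration, the settings above can be modeled as a small Python data structure with the documented defaults and limits. The class name CrawlSettings and its field names are hypothetical and chosen only for this sketch; they are not the product's actual API.

```python
from dataclasses import dataclass, field


@dataclass
class CrawlSettings:
    """Hypothetical container mirroring the documented crawl settings."""

    start_url: str                      # required
    max_depth: int = 2                  # 1 = start page only, 5 = maximum
    max_pages: int = 50                 # up to 500
    include_patterns: list[str] = field(default_factory=list)
    exclude_patterns: list[str] = field(default_factory=list)
    respect_robots_txt: bool = True

    def __post_init__(self) -> None:
        # Enforce the ranges stated in the settings reference.
        if not self.start_url.startswith(("http://", "https://")):
            raise ValueError("start_url must be an http(s) URL")
        if not 1 <= self.max_depth <= 5:
            raise ValueError("max_depth must be between 1 and 5")
        if not 1 <= self.max_pages <= 500:
            raise ValueError("max_pages must be between 1 and 500")


# A focused crawl of one docs section with a raised page limit.
settings = CrawlSettings(
    start_url="https://example.com/guides/",
    max_depth=3,
    max_pages=200,
    exclude_patterns=[r"/changelog.*"],
)
```

Validating in one place makes it easy to see which combinations are accepted before a crawl starts.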

URL Patterns

Use regex patterns to control which pages are crawled. This is useful for large sites where you only need specific sections.

Include Patterns

When set, only URLs matching at least one pattern are crawled:

# Only crawl docs pages
/docs/.*

# Only crawl specific versions
/v2\.0/.*

# Only English content
/en/.*

Exclude Patterns

URLs matching any exclude pattern are skipped:

# Skip changelog and release notes
/changelog.*
/releases.*

# Skip API reference (too large)
/api-reference/.*

# Skip login/auth pages
/login.*
/signup.*
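
The include and exclude rules can be sketched as a single filter function. This is a minimal sketch under assumptions the documentation does not confirm: patterns are searched against the URL path (not the full URL), and an exclude match takes precedence over an include match. The function name should_crawl is hypothetical.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Return True if the URL passes the include/exclude rules.

    Assumptions for this sketch: patterns are matched against the URL
    path with re.search, and exclude rules are checked first.
    """
    path = urlparse(url).path
    # URLs matching any exclude pattern are skipped.
    if any(re.search(p, path) for p in exclude_patterns):
        return False
    # When include patterns are set, at least one must match.
    if include_patterns:
        return any(re.search(p, path) for p in include_patterns)
    # No include patterns: everything not excluded is allowed.
    return True


include = [r"/docs/.*"]
exclude = [r"/changelog.*", r"/login.*"]

print(should_crawl("https://example.com/docs/intro", include, exclude))
print(should_crawl("https://example.com/blog/post", include, exclude))
print(should_crawl("https://example.com/docs/changelog", include, exclude))
```

Testing patterns this way against a handful of known URLs is a quick check that an include list is not more restrictive than intended.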

Understanding Crawl Depth

Depth	What Gets Crawled	Best For
1	Start URL only	Single page content
2	Start URL + directly linked pages	Small docs sections
3	Two levels of links from start	Most documentation sites
4-5	Deep crawl of entire site sections	Large knowledge bases

Depth vs Pages

Higher depth doesn't always mean more pages. A depth of 3 on a small site might find only 20 pages, while a depth of 2 on a large site could hit the 500-page limit.
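
To make the interaction between depth and page limits concrete, here is a minimal breadth-first sketch over an in-memory link graph. It assumes the crawler visits pages level by level and stops as soon as the page limit is reached; the real crawler's traversal order is not documented here. The crawl function and the links dictionary are hypothetical.

```python
from collections import deque


def crawl(links, start, max_depth=2, max_pages=50):
    """Breadth-first traversal where depth 1 means the start page only."""
    visited = [start]
    seen = {start}
    queue = deque([(start, 1)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # links on this page are beyond the depth limit
        for target in links.get(page, []):
            if target in seen:
                continue
            if len(visited) >= max_pages:
                return visited  # page limit reached before depth limit
            seen.add(target)
            visited.append(target)
            queue.append((target, depth + 1))
    return visited


# Toy site: the start page links to two pages, which link further down.
links = {
    "/": ["/a", "/b"],
    "/a": ["/a1"],
    "/b": ["/b1", "/"],
    "/a1": ["/deep"],
}

print(crawl(links, "/", max_depth=1))  # ['/']
print(crawl(links, "/", max_depth=2))  # ['/', '/a', '/b']
print(crawl(links, "/", max_depth=3))  # ['/', '/a', '/b', '/a1', '/b1']
print(crawl(links, "/", max_depth=3, max_pages=2))  # ['/', '/a']
```

The last call shows the page limit cutting the crawl short even though the depth limit would allow more pages.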

Content Extraction

The crawler automatically extracts meaningful content by:

Removing boilerplate: Navigation menus, footers, sidebars, and headers are stripped
Cleaning scripts: JavaScript, CSS, and tracking code are removed
Preserving structure: Headings and paragraphs are maintained for context
Extracting titles: Page titles become document titles in search results
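
The extraction steps above can be approximated with Python's standard html.parser module. This is a simplified sketch, not the crawler's actual extraction logic: it assumes boilerplate lives in nav, header, footer, and aside elements, and the names TextExtractor, SKIPPED_TAGS, and extract are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: boilerplate and code live inside these elements.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class TextExtractor(HTMLParser):
    """Collects the page title and visible text outside skipped elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a skipped element
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def extract(html):
    """Return (title, text) for an HTML document."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, "\n".join(parser.chunks)


page = """<html><head><title>Getting Started</title><style>body{}</style></head>
<body><nav><a href="/">Home</a></nav>
<h1>Install</h1><p>Run the installer.</p>
<script>track()</script><footer>Copyright</footer></body></html>"""

title, text = extract(page)
print(title)
print(text)
```

Content that only appears after JavaScript runs never reaches a parser like this, which is why heavily scripted pages may extract poorly.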

Robots.txt Compliance

By default, the crawler respects robots.txt rules. This means:

Pages disallowed by robots.txt are skipped
Crawl delays are honored
The crawler identifies itself as a bot
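
The rules above can be checked ahead of time with Python's standard urllib.robotparser module. The sketch below parses an inline robots.txt instead of fetching one, and the user agent string ExampleCrawlerBot is a placeholder; the crawler's real user agent is not stated in this document.

```python
from urllib.robotparser import RobotFileParser

# Inline example rules; a real check would load the site's own robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "ExampleCrawlerBot"  # placeholder, not the real bot name

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def allowed(url):
    """Return True if robots.txt permits fetching the URL."""
    return parser.can_fetch(USER_AGENT, url)


# Seconds to wait between requests, or None if no Crawl-delay is set.
delay = parser.crawl_delay(USER_AGENT)

print(allowed("https://example.com/docs/intro"))
print(allowed("https://example.com/private/page"))
print(delay)
```

Running a check like this against the start URL is a quick way to confirm whether robots.txt is the reason a crawl finds no pages.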

Private Content

The crawler can only access publicly available pages. For authenticated content, use direct integrations like Notion or Google Drive.

Best Practices

Start specific: Begin with a focused section rather than the entire site
Use patterns wisely: Exclude large sections you don't need (changelogs, API specs)
Monitor page count: Start with lower limits and increase as needed
Re-crawl periodically: Trigger manual syncs when source content updates
Check robots.txt: Ensure the content you need isn't blocked

Troubleshooting

No pages found

Verify the start URL is accessible in a browser
Check if robots.txt blocks crawling
Ensure the site doesn't require authentication
Try disabling "Respect robots.txt" temporarily

Too few pages crawled

Increase max depth setting
Check if include patterns are too restrictive
Verify links use standard HTML anchor tags
Check whether the site relies on JavaScript navigation, which is not supported

Content not extracting properly

The site may use heavy JavaScript rendering
Content might be in iframes (not crawled)
PDF or image-only pages won't have text

Crawl taking too long

Reduce max pages limit
Lower crawl depth
Add exclude patterns for large sections

Limitations

JavaScript rendering: Single-page apps (SPAs) that require JavaScript to render content may not crawl properly
Authentication: Login-protected content cannot be accessed
Rate limiting: The crawler pauses between requests to be polite; very large sites take time
Same domain only: Links to external domains are not followed
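
The same-domain rule can be sketched with urllib.parse. This sketch assumes "same domain" means an exact host match, so a link from docs.example.com to help.example.com would count as external; the document does not say how subdomains are treated. The function name same_domain_links is hypothetical.

```python
from urllib.parse import urljoin, urlparse


def same_domain_links(page_url, hrefs):
    """Resolve hrefs against page_url and keep only same-host http(s) links."""
    host = urlparse(page_url).netloc
    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # resolves relative links
        parsed = urlparse(absolute)
        if parsed.scheme in ("http", "https") and parsed.netloc == host:
            result.append(absolute)
    return result


found = same_domain_links(
    "https://docs.example.com/guide/",
    ["intro", "/api", "https://other.com/x", "mailto:a@b.com"],
)
print(found)
```

If the content you need spans several hosts, adding one web source per host avoids depending on how cross-host links are handled.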

Removing the Integration

To remove a web crawl data source, simply delete it from your Knowledge Base. All indexed pages will be removed from your bot's knowledge.