Website & URL Crawling

Crawl websites and documentation sites to automatically index their content. Perfect for help centers, docs sites, and public knowledge bases.

How It Works

The web crawler starts from a URL you provide and follows links to discover and index pages. It automatically:

  • Extracts clean text content from HTML pages
  • Follows links within the same domain
  • Respects robots.txt directives
  • Removes navigation, headers, footers, and scripts
  • Handles pagination and multi-page content
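The link-following step above can be pictured with a short sketch. This is not the crawler's actual implementation: the names `LinkCollector` and `same_domain_links` are hypothetical, and "same domain" is assumed here to mean an exact host match, which may differ from how the product treats subdomains.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects href values from standard <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def same_domain_links(page_url, html):
    """Return absolute, de-duplicated links on the same host as page_url."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        # Resolve relative links and drop #fragments so duplicates collapse.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found


sample_html = """
<a href="/guides/setup">Setup</a>
<a href="https://docs.example.com/guides/setup#install">Install</a>
<a href="https://other.example.org/">External</a>
"""
print(same_domain_links("https://docs.example.com/", sample_html))
# → ['https://docs.example.com/guides/setup']
```

The external link is dropped and the two links to the same page collapse into one, which is why the discovered page count is usually lower than the raw link count.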

Setting Up Web Crawling

Add Web Source

From Knowledge Base, click + Add Source, then select Websites & URLs.

Enter Starting URL

Provide the URL where crawling should begin. This is typically a homepage, docs landing page, or sitemap.

Best Starting Points

  • Documentation index: https://docs.example.com
  • Help center: https://help.example.com
  • Specific section: https://example.com/guides/

Configure Crawl Settings

Set depth and page limits based on your needs:

  • Max depth: How many link levels to follow (1-5)
  • Max pages: Total pages to crawl (up to 500)
  • URL patterns: Include or exclude specific paths

Start Crawl

Click Connect to begin crawling. Progress is shown in real time as pages are discovered and indexed.

Crawl Settings

Start URL (string, required)

The URL where crawling begins

Max depth (integer, default: 2)

How many link levels deep to crawl (1 = start page only, 5 = maximum)

Max pages (integer, default: 50)

Maximum number of pages to crawl (up to 500)

Include patterns (array, default: none)

Regex patterns for URLs to include (optional)

Exclude patterns (array, default: none)

Regex patterns for URLs to exclude (optional)

Respect robots.txt (boolean, default: true)

Honor website crawling rules
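For readers who think in code, the settings above can be modeled as a small validated structure. The crawler is configured through the Knowledge Base UI, so the class and field names below are hypothetical, and the lower bound of 1 page and the http(s) check are assumptions rather than documented rules.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CrawlSettings:
    """Hypothetical model of the settings listed above."""

    start_url: str                       # required
    max_depth: int = 2                   # documented range: 1-5
    max_pages: int = 50                  # documented maximum: 500
    include_patterns: list = field(default_factory=list)
    exclude_patterns: list = field(default_factory=list)
    respect_robots_txt: bool = True

    def __post_init__(self):
        # Assumption: only http(s) start URLs are meaningful for a web crawl.
        if not self.start_url.startswith(("http://", "https://")):
            raise ValueError("start_url must be an http(s) URL")
        if not 1 <= self.max_depth <= 5:
            raise ValueError("max_depth must be between 1 and 5")
        # Assumption: at least one page must be allowed.
        if not 1 <= self.max_pages <= 500:
            raise ValueError("max_pages must be between 1 and 500")
        for pattern in self.include_patterns + self.exclude_patterns:
            re.compile(pattern)  # raises re.error on an invalid regex


settings = CrawlSettings("https://docs.example.com", max_depth=3, max_pages=200)
print(settings.max_depth, settings.max_pages, settings.respect_robots_txt)
# → 3 200 True
```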

URL Patterns

Use regex patterns to control which pages are crawled. This is useful for large sites where you only need specific sections.

Include Patterns

When set, only URLs matching at least one pattern are crawled:

# Only crawl docs pages
/docs/.*

# Only crawl specific versions
/v2\.0/.*

# Only English content
/en/.*

Exclude Patterns

URLs matching any exclude pattern are skipped:

# Skip changelog and release notes
/changelog.*
/releases.*

# Skip API reference (too large)
/api-reference/.*

# Skip login/auth pages
/login.*
/signup.*
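To see how include and exclude patterns might combine, here is a minimal sketch. The function name `should_crawl` is hypothetical, and two behaviors are assumptions not confirmed above: patterns are matched anywhere in the URL path, and an exclude match wins over an include match.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, include_patterns=(), exclude_patterns=()):
    """Decide whether a URL passes the include/exclude filters."""
    path = urlparse(url).path
    # Assumption: any exclude match skips the URL, even if it is also included.
    if any(re.search(pattern, path) for pattern in exclude_patterns):
        return False
    # With include patterns set, at least one must match.
    if include_patterns:
        return any(re.search(pattern, path) for pattern in include_patterns)
    # No include patterns: everything not excluded is allowed.
    return True


include = [r"/docs/.*"]
exclude = [r"/changelog.*"]

print(should_crawl("https://example.com/docs/setup", include, exclude))           # → True
print(should_crawl("https://example.com/docs/changelog/2024", include, exclude))  # → False
print(should_crawl("https://example.com/blog/post", include, exclude))            # → False
print(should_crawl("https://example.com/blog/post"))                              # → True
```

Testing patterns this way against a handful of real URLs from the target site is a quick check before starting a crawl.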

Understanding Crawl Depth

Depth | What Gets Crawled                  | Best For
1     | Start URL only                     | Single page content
2     | Start URL + directly linked pages  | Small docs sections
3     | Two levels of links from start     | Most documentation sites
4-5   | Deep crawl of entire site sections | Large knowledge bases

Depth vs Pages

Higher depth doesn't always mean more pages. A depth of 3 on a small site might find 20 pages, while depth 2 on a large site could hit the 500-page limit.
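The interaction between depth and page limits can be sketched with a small in-memory link graph. This assumes a breadth-first traversal in which the start page counts as depth 1, matching the depth table; the actual traversal order is not documented, and the `crawl` function and `site` graph are hypothetical.

```python
from collections import deque


def crawl(links, start, max_depth=2, max_pages=50):
    """Breadth-first traversal of a link graph; the start page is depth 1."""
    visited = [start]
    queue = deque([(start, 1)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at the depth limit are indexed but not expanded
        for target in links.get(url, []):
            if len(visited) >= max_pages:
                return visited  # page limit reached before depth limit
            if target not in visited:
                visited.append(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical site: each key links to the pages in its list.
site = {
    "/": ["/a", "/b"],
    "/a": ["/a/1", "/a/2"],
    "/b": ["/b/1"],
    "/a/1": ["/a/1/x"],
}

print(crawl(site, "/", max_depth=1))               # → ['/']
print(crawl(site, "/", max_depth=2))               # → ['/', '/a', '/b']
print(len(crawl(site, "/", max_depth=3)))          # → 6
print(crawl(site, "/", max_depth=3, max_pages=4))  # → ['/', '/a', '/b', '/a/1']
```

In the last call the page limit stops the crawl before depth 3 is fully explored, which is the situation described above for large sites.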

Content Extraction

The crawler automatically extracts meaningful content by:

  • Removing boilerplate: Navigation menus, footers, sidebars, and headers are stripped
  • Cleaning scripts: JavaScript, CSS, and tracking code are removed
  • Preserving structure: Headings and paragraphs are maintained for context
  • Extracting titles: Page titles become document titles in search results
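The extraction steps above can be approximated with Python's standard HTML parser. This is a simplified sketch, not the crawler's real extractor: the `TextExtractor` class and the `SKIPPED_TAGS` set are hypothetical, and real boilerplate detection is likely more sophisticated than matching tag names.

```python
from html.parser import HTMLParser

# Assumption: boilerplate lives in these tags; real pages are messier.
SKIPPED_TAGS = {"nav", "header", "footer", "aside", "script", "style"}


class TextExtractor(HTMLParser):
    """Keeps the page title and body text, dropping boilerplate tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped tag
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self.skip_depth += 1
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_title:
            self.title += text
        elif self.skip_depth == 0:
            self.chunks.append(text)


def extract(html):
    """Return (title, text) with one heading or paragraph per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, "\n".join(parser.chunks)


page = (
    "<html><head><title>Setup Guide</title><style>p{color:red}</style></head>"
    "<body><nav><a href='/'>Home</a></nav><h1>Install</h1>"
    "<p>Run the installer.</p><footer>Copyright</footer>"
    "<script>track()</script></body></html>"
)
print(extract(page))
# → ('Setup Guide', 'Install\nRun the installer.')
```

Content that only appears after JavaScript runs never reaches a parser like this, which is why heavily scripted pages may extract poorly.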

Robots.txt Compliance

By default, the crawler respects robots.txt rules. This means:

  • Pages disallowed by robots.txt are skipped
  • Crawl delays are honored
  • The crawler identifies itself as a bot
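To check in advance how a site's robots.txt would affect a crawl, Python's standard `urllib.robotparser` can evaluate the rules locally. The robots.txt content and the `example-crawler` user agent below are made up for illustration; the crawler's real user agent string is not stated here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "example-crawler"  # assumption: not the crawler's real user agent

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())

print(robots.can_fetch(USER_AGENT, "https://docs.example.com/guides/setup"))  # → True
print(robots.can_fetch(USER_AGENT, "https://docs.example.com/private/page"))  # → False
print(robots.crawl_delay(USER_AGENT))                                         # → 2
```

If pages you expect to index fall under a Disallow rule, they will be skipped while "Respect robots.txt" is enabled.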

Private Content

The crawler can only access publicly available pages. For authenticated content, use direct integrations like Notion or Google Drive.

Best Practices

  • Start specific: Begin with a focused section rather than the entire site
  • Use patterns wisely: Exclude large sections you don't need (changelogs, API specs)
  • Monitor page count: Start with lower limits and increase as needed
  • Re-crawl periodically: Trigger manual syncs when source content updates
  • Check robots.txt: Ensure the content you need isn't blocked

Troubleshooting

No pages found

  • Verify the start URL is accessible in a browser
  • Check if robots.txt blocks crawling
  • Ensure the site doesn't require authentication
  • Try disabling "Respect robots.txt" temporarily

Too few pages crawled

  • Increase max depth setting
  • Check if include patterns are too restrictive
  • Verify links use standard HTML anchor tags
  • Some sites use JavaScript navigation (not supported)

Content not extracting properly

  • The site may use heavy JavaScript rendering
  • Content might be in iframes (not crawled)
  • PDF and image-only pages yield no extractable text

Crawl taking too long

  • Reduce max pages limit
  • Lower crawl depth
  • Add exclude patterns for large sections

Limitations

  • JavaScript rendering: Single-page apps (SPAs) that require JavaScript to render content may not crawl properly
  • Authentication: Login-protected content cannot be accessed
  • Rate limiting: The crawler pauses between requests to be polite; very large sites take time
  • Same domain only: Links to external domains are not followed

Removing the Integration

To remove a web crawl data source, simply delete it from your Knowledge Base. All indexed pages will be removed from your bot's knowledge.