Limit Amazon Q web crawler to single domain


How can I configure an Amazon Q web crawler to index only the single domain that I specify?

I have set it to "Sync domains only", using a source URL like site.example.com. I only want to index pages like https://site.example.com/* (and https://www.site.example.com/*).

The site.example.com page includes links to URLs like https://example.com/page and https://other.example.com/page, and I do not want to crawl or index those pages. However, the logs show those pages being crawled and indexed, and the crawl spreads across the whole example.com domain. I have the crawl depth set to 6 because our site is fairly deep.

Adding a URL crawl inclusion pattern like https://site.example.com/.* results in no pages being crawled at all.
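As a side note on the inclusion pattern: if the patterns are evaluated as regular expressions against the full URL (an assumption; the exact matching semantics are not spelled out in the question), then unescaped dots are harmless here but the pattern must cover every URL variant you want kept. A small sketch of that matching logic, with hypothetical URLs:

```python
import re

# Assumption: inclusion patterns behave like regexes matched against the full URL.
# The pattern below covers both the bare and the www. variant of the subdomain.
pattern = re.compile(r"https://(www\.)?site\.example\.com/.*")

urls = [
    "https://site.example.com/docs/page",       # should be included
    "https://www.site.example.com/docs/page",   # should be included
    "https://example.com/page",                 # should be excluded
    "https://other.example.com/page",           # should be excluded
]

included = [u for u in urls if pattern.fullmatch(u)]
# included keeps only the two site.example.com URLs
```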

Thanks!

Shayne
asked 2 months ago · 288 views
1 Answer

Hi Shayne.

Although I have not tried this scenario, a couple of possible options are:

  1. Use Sitemap. Per the documentation:

If you want to crawl a sitemap, check that the base or root URL is the same as the URLs listed on your sitemap page. For example, if your sitemap URL is https://example.com/sitemap-page.html, the URLs listed on this sitemap page should also use the base URL "https://example.com/".
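For illustration, a minimal sitemap for this case might look like the following (the hostname comes from the question; the file name sitemap.xml and the listed paths are assumptions):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sitemap hosted at https://site.example.com/sitemap.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Every listed URL shares the same base URL as the sitemap itself -->
  <url><loc>https://site.example.com/</loc></url>
  <url><loc>https://site.example.com/docs/getting-started</loc></url>
</urlset>
```

Because the crawler follows only the URLs listed, a sitemap scoped to site.example.com would keep the crawl off example.com and other.example.com.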

  2. Include/Exclude explicitly using robots.txt. From the documentation:

Amazon Q Web Crawler respects standard robots.txt directives like Allow and Disallow. You can modify the robots.txt file of your website to control how Amazon Q Web Crawler crawls your website. Use the user-agent to make entries designed for Amazon Q: User-agent: amazon-QBusiness
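As a sketch, a robots.txt served from example.com (and from any other subdomain you want excluded) could block the Amazon Q crawler entirely. Note that robots.txt is per-host, so each hostname serves its own file:

```text
# Hypothetical robots.txt served at https://example.com/robots.txt
# Blocks only the Amazon Q Business crawler; other crawlers are unaffected.
User-agent: amazon-QBusiness
Disallow: /
```

This keeps site.example.com crawlable (it serves no such rule) while the rest of the example.com domain turns the crawler away.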

I hope this helps.

AWS EXPERT · answered 2 months ago
EXPERT · reviewed 25 days ago
