Hi Shayne.
Although I have not tried this scenario, a couple of possible options are:
- Use Sitemap. Per the documentation:
If you want to crawl a sitemap, check that the base or root URL is the same as the URLs listed on your sitemap page. For example, if your sitemap URL is https://example.com/sitemap-page.html, the URLs listed on this sitemap page should also use the base URL "https://example.com/".
- Include/Exclude explicitly using robots.txt. From the documentation:
Amazon Q Web Crawler respects standard robots.txt directives like Allow and Disallow. You can modify the robots.txt file of your website to control how Amazon Q Web Crawler crawls your website. Use the user-agent to make entries designed for Amazon Q.
User-agent: amazon-QBusiness
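To get a feel for how Allow and Disallow entries under that user-agent would play out, here is a minimal sketch using Python's standard `urllib.robotparser`. The `/docs/` path, the `example.com` URLs, and the `q_may_fetch` helper are made up for illustration, and Python's parser may not evaluate rules exactly the way Amazon Q Web Crawler does, so treat it as a local sanity check only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: open only /docs/ to the Amazon Q crawler.
# Python's parser applies the first rule that matches, so the Allow
# line is listed before the broader Disallow line.
ROBOTS_TXT = """\
User-agent: amazon-QBusiness
Allow: /docs/
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def q_may_fetch(url: str) -> bool:
    """Return True if the rules above let the amazon-QBusiness agent fetch url."""
    return parser.can_fetch("amazon-QBusiness", url)


if __name__ == "__main__":
    for url in ("https://example.com/docs/guide.html",
                "https://example.com/blog/post.html"):
        print(url, q_may_fetch(url))
```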
I hope this helps.
Use a regular expression to restrict the crawl to the domain you're interested in. The crawler will then ignore any pages outside that domain.
This blog post has a couple of examples.
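As a rough sketch of what such a pattern could look like, the snippet below tests a candidate regular expression against a few URLs in Python before you paste it into the crawler configuration. The domain `example.com`, the pattern, and the `in_scope` helper are assumptions for illustration, and the regex dialect Amazon Q accepts may differ from Python's `re`, so check the pattern against the service documentation.

```python
import re

# Hypothetical pattern: example.com and any of its subdomains, any path.
# The trailing (/.*)?$ stops look-alike hosts such as
# example.com.evil.org from matching.
DOMAIN_PATTERN = re.compile(r"^https?://([a-z0-9-]+\.)*example\.com(/.*)?$")


def in_scope(url: str) -> bool:
    """Return True if url belongs to the domain we want crawled."""
    return DOMAIN_PATTERN.match(url) is not None


if __name__ == "__main__":
    for url in ("https://example.com/docs",
                "https://www.example.com/",
                "https://example.com.evil.org/",
                "https://notexample.com/"):
        print(url, in_scope(url))
```

Escaping the dots and anchoring both ends matters here: an unanchored `example.com` would also match unrelated hosts that merely contain that string.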
