Limit Amazon Q web crawler to single domain


How can I configure an Amazon Q web crawler to index only the single domain that I specify?

I have set it to "Sync domains only", using a source URL like site.example.com. I only want to index pages like https://site.example.com/* (and https://www.site.example.com/*).

The site.example.com pages include links to URLs like https://example.com/page and https://other.example.com/page, and I do not want to crawl or index those pages. However, I can see in the logs that those pages are being crawled and indexed, and the crawler just spreads out across the whole example.com domain. I have the crawl depth set to 6 because our site is fairly deep.

Adding a URL crawl inclusion pattern like https://site.example.com/.* results in no pages being crawled at all.

Thanks!

Shayne
Asked 2 months ago · 311 views
1 Answer

Hi Shayne.

Although I have not tried this scenario, a couple of possible options are:

  1. Use a sitemap. Per the documentation:

If you want to crawl a sitemap, check that the base or root URL is the same as the URLs listed on your sitemap page. For example, if your sitemap URL is https://example.com/sitemap-page.html, the URLs listed on this sitemap page should also use the base URL "https://example.com/".
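As a rough sketch of what that means for your setup (the filename sitemap.xml and the listed paths are made up for illustration), every <loc> entry stays on the same https://site.example.com/ base URL, so the crawler is never handed a link to example.com or other.example.com via the sitemap:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical sitemap hosted at https://site.example.com/sitemap.xml -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Every URL shares the base https://site.example.com/ -->
  <url>
    <loc>https://site.example.com/</loc>
  </url>
  <url>
    <loc>https://site.example.com/docs/getting-started</loc>
  </url>
  <url>
    <loc>https://site.example.com/docs/faq</loc>
  </url>
</urlset>
```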

  2. Include/exclude explicitly using robots.txt. From the documentation:

Amazon Q Web Crawler respects standard robots.txt directives like Allow and Disallow. You can modify the robots.txt file of your website to control how Amazon Q Web Crawler crawls your website. Use the user-agent to make entries designed for Amazon Q: User-agent: amazon-QBusiness
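As a sketch (I haven't tested this against the crawler, and the file placement is only illustrative), you would add a rule on the hosts you want to keep out of the index, since robots.txt is served per host. The user-agent string is the one quoted above:

```text
# Hypothetical robots.txt served at https://example.com/robots.txt
# (and likewise at https://other.example.com/robots.txt).
# These two lines tell the Amazon Q crawler to skip this host entirely,
# while other crawlers are unaffected.
User-agent: amazon-QBusiness
Disallow: /
```

site.example.com itself would not need an entry, since the absence of a Disallow rule for that host already permits crawling.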

I hope this helps.

AWS Expert
Answered 2 months ago · reviewed a month ago
