Limit Amazon Q web crawler to single domain


How can I configure an Amazon Q web crawler to index only the single domain that I specify?

I have set it to "Sync domains only", using a source URL like site.example.com. I only want to index pages like https://site.example.com/* (and https://www.site.example.com/*).

The site.example.com page includes links to URLs like https://example.com/page and https://other.example.com/page, and I do not want to crawl or index those pages. However, I see those pages being crawled and indexed in the logs, and the crawler spreads out across the whole example.com domain. I have the crawl depth set to 6 because our site is fairly deep.

Adding a URL Crawl Inclusion pattern like https://site.example.com/.* results in no pages being crawled at all.
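For reference, a quick local way to sanity-check an inclusion regex against sample URLs before putting it into the connector (the pattern and URLs below are illustrative, not the connector's actual matching logic):

```python
import re

# Illustrative inclusion pattern: dots escaped so "." matches literally,
# optional "www." prefix so both hostnames are covered.
pattern = re.compile(r"https://(www\.)?site\.example\.com/.*")

urls = [
    "https://site.example.com/docs/page1",
    "https://www.site.example.com/",
    "https://example.com/page",
    "https://other.example.com/page",
]

for url in urls:
    # fullmatch requires the whole URL to match, like an inclusion filter would.
    print(url, "->", "include" if pattern.fullmatch(url) else "exclude")
```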

Thanks!

Shayne
asked 2 months ago · 311 views
1 Answer

Hi Shayne.

Although I have not tried this scenario, a couple of possible options are:

  1. Use Sitemap. Per the documentation:

If you want to crawl a sitemap, check that the base or root URL is the same as the URLs listed on your sitemap page. For example, if your sitemap URL is https://example.com/sitemap-page.html, the URLs listed on this sitemap page should also use the base URL "https://example.com/".
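As a sketch of what that means in practice, a minimal sitemap in which every listed URL shares the base URL could look like this (paths are hypothetical):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- All URLs share the base https://site.example.com/ -->
  <url><loc>https://site.example.com/</loc></url>
  <url><loc>https://site.example.com/docs/getting-started</loc></url>
  <url><loc>https://site.example.com/docs/faq</loc></url>
</urlset>
```

Pointing the crawler at a sitemap like this keeps the crawl scoped to the URLs you enumerate, rather than whatever the pages link to.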

  2. Include/Exclude explicitly using robots.txt. From the documentation:

Amazon Q Web Crawler respects standard robots.txt directives like Allow and Disallow. You can modify the robots.txt file of your website to control how Amazon Q Web Crawler crawls your website. Use the user-agent to make entries designed for Amazon Q: User-agent: amazon-QBusiness
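For example, if you control the other hosts, a robots.txt on the domains you do not want indexed could block only the Amazon Q crawler while leaving other crawlers unaffected (a sketch, assuming you can serve this file from those hosts):

```text
# Served from e.g. https://example.com/robots.txt and
# https://other.example.com/robots.txt
User-agent: amazon-QBusiness
Disallow: /
```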

I hope this helps.

AWS
EXPERT
answered 2 months ago
EXPERT
reviewed a month ago
