Limit Amazon Q web crawler to single domain


How can I configure an Amazon Q web crawler to index only the single domain that I specify?

I have it set to "Sync domains only", with a source URL like site.example.com. I only want to index pages like https://site.example.com/* (and https://www.site.example.com/*).

The site.example.com pages include links to URLs like https://example.com/page and https://other.example.com/page, and I do not want to crawl or index those pages. However, I see those pages being crawled and indexed in the logs, and the crawler spreads out across the whole example.com domain. I have crawl depth set to 6 because our site is fairly deep.

Adding a URL Crawl Inclusion pattern like https://site.example.com/.* results in no pages being crawled at all.

Thanks!

Shayne
asked 2 months ago · 311 views
1 Answer

Hi Shayne.

Although I have not tried this scenario, a couple of possible options are:

  1. Use Sitemap. Per the documentation:

If you want to crawl a sitemap, check that the base or root URL is the same as the URLs listed on your sitemap page. For example, if your sitemap URL is https://example.com/sitemap-page.html, the URLs listed on this sitemap page should also use the base URL "https://example.com/".
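As an illustration of that requirement, a minimal sitemap for your case might look like the following (the paths shown are hypothetical; the key point is that every `<loc>` entry shares the base URL https://site.example.com/):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Every listed URL uses the same base as the sitemap itself,
       so the crawler stays on site.example.com -->
  <url><loc>https://site.example.com/</loc></url>
  <url><loc>https://site.example.com/docs/getting-started</loc></url>
  <url><loc>https://site.example.com/docs/configuration</loc></url>
</urlset>
```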

  2. Include/Exclude explicitly using robots.txt. From the documentation:

Amazon Q Web Crawler respects standard robots.txt directives like Allow and Disallow. You can modify the robots.txt file of your website to control how Amazon Q Web Crawler crawls your website. Use the user-agent to make entries designed for Amazon Q. User-agent: amazon-QBusiness
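Note that robots.txt applies per host, so to keep the crawler off the domains you don't want indexed you would add a Disallow rule to the robots.txt served by each of those hosts (assuming you control them). A minimal sketch, for the robots.txt at https://example.com/robots.txt and https://other.example.com/robots.txt:

```
# robots.txt on the hosts you do NOT want crawled
# (e.g. example.com and other.example.com)
User-agent: amazon-QBusiness
Disallow: /
```

The robots.txt on site.example.com itself would be left permissive so your intended pages are still crawled.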

I hope this helps.

AWS
EXPERT
answered 2 months ago
EXPERT
reviewed a month ago
