Limit Amazon Q web crawler to single domain


How can I configure an Amazon Q web crawler to index only the single domain that I specify?

I have set it to "Sync domains only", using a source URL like site.example.com. I only want to index pages like https://site.example.com/* (and https://www.site.example.com/*).

The site.example.com pages include links to URLs like https://example.com/page and https://other.example.com/page, which I do not want to crawl or index. However, I see those pages being crawled and indexed in the logs, and the crawler spreads out all over the example.com domain. I have crawl depth set to 6 because our site is fairly deep.

Adding a URL Crawl Inclusion pattern like https://site.example.com/.* fails to crawl any pages.

Thanks!

Shayne
asked 2 months ago · 311 views
1 Answer

Hi Shayne.

Although I have not tried this scenario, a couple of possible options are:

  1. Use Sitemap. Per the documentation:

If you want to crawl a sitemap, check that the base or root URL is the same as the URLs listed on your sitemap page. For example, if your sitemap URL is https://example.com/sitemap-page.html, the URLs listed on this sitemap page should also use the base URL "https://example.com/".
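As a sketch of what that means in practice: a sitemap hosted at https://site.example.com/sitemap.xml should list only URLs under that same base, so the crawler has no reason to wander onto other example.com hosts. The paths below are hypothetical placeholders:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hosted at https://site.example.com/sitemap.xml -->
<!-- Every <loc> shares the base URL https://site.example.com/ -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://site.example.com/docs/getting-started</loc></url>
  <url><loc>https://site.example.com/docs/faq</loc></url>
</urlset>
```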

  2. Include/Exclude explicitly using robots.txt. From the documentation:

Amazon Q Web Crawler respects standard robots.txt directives like Allow and Disallow. You can modify the robots.txt file of your website to control how Amazon Q Web Crawler crawls your website. Use the user-agent to make entries designed for Amazon Q: User-agent: amazon-QBusiness
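For example, if you control the other hosts (which is an assumption here), a robots.txt on example.com and other.example.com could block the Amazon Q crawler from those hosts entirely while leaving other crawlers unaffected:

```text
# robots.txt served from https://example.com/robots.txt
# (and likewise on https://other.example.com/)
# Blocks only the Amazon Q crawler from this host
User-agent: amazon-QBusiness
Disallow: /
```

Note this approach requires the ability to edit robots.txt on the hosts you want excluded, not just on site.example.com.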

I hope this helps.

AWS
EXPERT
answered 2 months ago
reviewed a month ago
