AWS Bedrock website crawler breaks on relative URLs


When crawling a website that contains relative paths like this

href="./pageb"

Bedrock cannot resolve the URL properly, which leads to this warning:

Skipped document: https://www.domain.de/./pageb. Status code 404 received. The resource you are requesting doesn't exist.

I am asking here because this used to work a few weeks ago. We are using framer.com, and relative URLs are quite common in SPA websites (among other things, to allow page preloading).

Can we somehow modify the URL parsing so we can normalize the URLs ourselves?
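For reference, this is the kind of normalization I mean. A small sketch using only Python's standard library (the function name is mine, not part of any Bedrock API) that removes dot segments from an already-assembled URL:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Remove dot segments ("." and "..") from the path of an absolute URL."""
    parts = urlsplit(url)
    # posixpath.normpath collapses "." and ".." segments in a slash-separated
    # path; leave an empty path untouched.
    path = posixpath.normpath(parts.path) if parts.path else parts.path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))

print(normalize_url("https://www.domain.de/./pageb"))
# https://www.domain.de/pageb
```

Note that `posixpath.normpath` also drops trailing slashes, so this is a rough sketch rather than a full RFC 3986 dot-segment removal, but it shows how trivial the fix would be if we could hook into the crawler.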

1 Answer

There seems to be a misunderstanding in your question: the Amazon Bedrock Web Crawler does not currently have a known issue with relative URLs as described. The Web Crawler is designed to handle various URL formats, including relative paths.

However, if you're experiencing this specific issue, it could be due to a few factors:

  1. Website Structure: The structure of the website you're crawling might have changed, causing issues with relative URL resolution.

  2. Crawler Configuration: Your crawler configuration might need adjustment. Ensure that your seed URLs and scope settings are correctly set to capture all the pages you intend to crawl.

  3. Website Rendering: If the website heavily relies on client-side rendering (as is common with SPAs), the crawler might not be able to resolve all URLs correctly.

Unfortunately, there isn't a direct way to modify the URL parsing within the Bedrock Web Crawler. The crawler is designed to handle URL resolution internally.

To address this issue, you could try the following:

  1. Review your crawler configuration, especially the scope and filter settings.
  2. Ensure that your robots.txt file isn't inadvertently blocking certain paths.
  3. If possible, provide absolute URLs in your website structure.
  4. Consider using the inclusion filters in your crawler configuration to explicitly include patterns that match your relative URL structure.
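As an illustration of point 4, a web crawler data source configuration with inclusion filters might look like the following sketch. The field names follow the WebCrawlerConfiguration documentation, but the seed URL and filter pattern are placeholders; verify the exact shape against the current API reference for your setup:

```json
{
  "webConfiguration": {
    "sourceConfiguration": {
      "urlConfiguration": {
        "seedUrls": [{ "url": "https://www.domain.de/" }]
      }
    },
    "crawlerConfiguration": {
      "scope": "HOST_ONLY",
      "inclusionFilters": [".*domain\\.de/page.*"]
    }
  }
}
```

Inclusion filters are regular expressions matched against discovered URLs, so a pattern broad enough to cover the resolved form of your relative links may help the crawler keep those pages in scope.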

If the problem persists, contact AWS Support: they can provide guidance based on your exact setup and investigate whether the crawler's behavior has recently changed.
Sources
Crawl web pages for your knowledge base - Amazon Bedrock
Crawl Depth for the Bedrock Knowledge Base Web Crawler data source | AWS re:Post
WebCrawlerConfiguration - Amazon Bedrock

answered a month ago
