I apologize, but there seems to be a misunderstanding in your question. The Amazon Bedrock Web Crawler does not currently have a known issue with relative URLs as you've described. The Web Crawler is designed to handle various URL formats, including relative paths.
However, if you're experiencing this specific issue, it could be due to a few factors:
- Website Structure: The structure of the website you're crawling might have changed, causing issues with relative URL resolution.
- Crawler Configuration: Your crawler configuration might need adjustment. Ensure that your seed URLs and scope settings are correctly set to capture all the pages you intend to crawl.
- Website Rendering: If the website heavily relies on client-side rendering (as is common with SPAs), the crawler might not be able to resolve all URLs correctly.
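To check what a relative link on one of your pages should resolve to, you can reproduce standard URL resolution locally. The sketch below uses Python's `urllib.parse.urljoin`; the page URL and the hrefs are made-up examples, and it is an assumption that the crawler resolves links the same standard way (a `<base href>` tag or client-side routing on your site would change the result).

```python
from urllib.parse import urljoin


def resolve_links(page_url, hrefs):
    """Resolve each href against the URL of the page it appears on."""
    return [urljoin(page_url, href) for href in hrefs]


# Hypothetical page and links, for illustration only.
page = "https://example.com/docs/guide/index.html"
links = ["intro.html", "../api/", "/pricing", "https://other.example.com/x"]

for href, absolute in zip(links, resolve_links(page, links)):
    print(f"{href!r} -> {absolute}")
```

If the absolute URLs printed here differ from what you see the crawler ingesting, that points at the site's markup or rendering rather than at your configuration.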
Unfortunately, there isn't a direct way to modify the URL parsing within the Bedrock Web Crawler. The crawler is designed to handle URL resolution internally.
To address this issue, you could try the following:
- Review your crawler configuration, especially the scope and filter settings.
- Ensure that your robots.txt file isn't inadvertently blocking certain paths.
- If possible, use absolute URLs for the links in your site's pages.
- Consider using the inclusion filters in your crawler configuration to explicitly include patterns that match your relative URL structure.
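As a rough illustration of the scope and inclusion filter suggestions, the sketch below builds a web data source configuration as a plain dictionary and tests URLs against the filter patterns locally. The field names follow the WebCrawlerConfiguration reference as I understand it, so verify them against the current documentation. The helper names, the seed URL, and the pattern are hypothetical, and `re.search` only approximates how the service applies filters.

```python
import re


def build_web_config(seed_urls, inclusion_patterns, scope="HOST_ONLY"):
    """Build a data source configuration dict for a web crawler source.

    Assumed shape; check it against the WebCrawlerConfiguration reference.
    """
    return {
        "type": "WEB",
        "webConfiguration": {
            "sourceConfiguration": {
                "urlConfiguration": {
                    "seedUrls": [{"url": url} for url in seed_urls]
                }
            },
            "crawlerConfiguration": {
                "scope": scope,
                "inclusionFilters": list(inclusion_patterns),
            },
        },
    }


def matches_any(url, patterns):
    """Local approximation: True if any regex pattern matches the URL."""
    return any(re.search(pattern, url) for pattern in patterns)


# Hypothetical seed URL and filter, for illustration only.
patterns = [r"^https://example\.com/docs/.*"]
config = build_web_config(["https://example.com/docs/"], patterns)

for url in ["https://example.com/docs/api/", "https://example.com/pricing"]:
    print(url, "included" if matches_any(url, patterns) else "not included")
```

Write the patterns against the absolute URLs your relative links resolve to, not against the relative paths themselves. A dictionary like this would be passed as `dataSourceConfiguration` when creating the data source, for example through the `bedrock-agent` client's `create_data_source` call in boto3.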
If the problem persists, contact AWS Support. They can give guidance specific to your setup and investigate whether the crawler's behavior has changed recently.
Sources
Crawl web pages for your knowledge base - Amazon Bedrock
Crawl Depth for the Bedrock Knowledge Base Web Crawler data source | AWS re:Post
WebCrawlerConfiguration - Amazon Bedrock