Web crawling on public site shows 403 Error in Kendra


I tried to index the website www.infosys.com in Kendra using the web crawler, and I got the error message below in the CloudWatch logs.

INFO: { "DocumentID": "https://www.infosys.com", "DocumentHashedID": "1a537edf7325d606eedf964aec7dae443234ccfd075ac86195a9273c8ad526be", "ErrorCode": "site-not-found", "ErrorMessage": "The seed url '1a537edf7325d606eedf964aec7dae443234ccfd075ac86195a9273c8ad526be' couldn't be crawled. Response status code: 403" }

However, there is no restriction in the infosys.com/robots.txt file.
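For reference, a robots.txt check of this kind can be sketched with Python's standard `urllib.robotparser`. This is a minimal sketch, not Kendra's own logic: the sample rules and the `is_allowed` helper are hypothetical, and the user-agent token `amazon-kendra` is an assumption about the name the crawler matches against in robots.txt.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only, not the real infosys.com file.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    # "amazon-kendra" is an assumed user-agent token.
    print(is_allowed(SAMPLE_ROBOTS, "amazon-kendra", "https://www.example.com/"))
    print(is_allowed(SAMPLE_ROBOTS, "amazon-kendra", "https://www.example.com/private/page"))
```

A clean result here only rules out robots.txt as the cause; a 403 can still come from server-side filtering that robots.txt does not describe.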

I was able to crawl the same website two months ago, but now it throws this error. I also tried scraping it with a custom Python library: it extracted text from the Infosys website, and the status code was 200.

How can I resolve this issue?

Karthik
Asked 1 year ago · 706 views
1 Answer

The website might implement IP blocking or rate limiting, which prevents automated access or limits the number of requests from a particular IP address. This could explain why you receive a 403 error when crawling with Kendra but can access the site with a custom Python library.

Some websites also restrict requests based on the User-Agent header. If Kendra's web crawler sends a User-Agent that the website blocks or restricts, the result could be a 403 error, whereas your custom Python library may send a different User-Agent that the website allows.

So you can try the following:

Ensure that the IP address used by Kendra's web crawler is not being blocked or rate limited by the website. If necessary, consider contacting the website administrator to whitelist the IP address used by Kendra.

If the website enforces User-Agent restrictions, you can try changing the User-Agent header sent by Kendra's web crawler so that it presents a different user agent. This depends on Kendra offering a setting or configuration option for the User-Agent value.
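One way to check whether the User-Agent header is what triggers the 403 is to request the same URL with several User-Agent strings and compare the status codes. The following is a minimal sketch using only the Python standard library. The helper names (`fetch_status`, `probe_user_agents`, `blocked_agents`) and the User-Agent strings in the usage comment are hypothetical, and this does not reproduce the headers or IP range that Kendra's crawler actually uses.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Send one GET request with the given User-Agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses raise HTTPError; the status code is on the exception.
        return error.code


def probe_user_agents(
    url: str,
    user_agents: Iterable[str],
    fetch: Callable[[str, str], int] = fetch_status,
) -> Dict[str, int]:
    """Map each User-Agent string to the status code returned for url."""
    return {agent: fetch(url, agent) for agent in user_agents}


def blocked_agents(results: Dict[str, int]) -> List[str]:
    """Return the User-Agent strings that received a 403 response."""
    return sorted(agent for agent, status in results.items() if status == 403)


# Example usage (performs real network requests; the strings are placeholders):
# results = probe_user_agents(
#     "https://www.example.com/",
#     ["example-crawler/1.0", "Mozilla/5.0 (X11; Linux x86_64)"],
# )
# print(results, blocked_agents(results))
```

If every User-Agent returns 200 from your own machine, the block is more likely tied to the crawler's source IP range than to the header.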

Expert
Answered 1 year ago
  • Kendra has no option to modify the User-Agent value. And how would I get the IP address of the web crawler? It won't be publicly disclosed, right? Is there any other option to let Kendra crawl the website?
