Web crawling on public site shows 403 Error in Kendra


I tried to index the website "www.infosys.com" in Kendra using the web crawler, and I got the error message below in the CloudWatch logs.

INFO: { "DocumentID": "https://www.infosys.com", "DocumentHashedID": "1a537edf7325d606eedf964aec7dae443234ccfd075ac86195a9273c8ad526be", "ErrorCode": "site-not-found", "ErrorMessage": "The seed url '1a537edf7325d606eedf964aec7dae443234ccfd075ac86195a9273c8ad526be' couldn't be crawled. Response status code: 403" }

However, there is no restriction in the infosys.com/robots.txt file.

I was able to crawl the same website two months ago, but now it throws this error. I also tried scraping the site with a custom Python library; it extracted text from the Infosys website and the status code was 200.

How can I resolve this issue?

Karthik
asked a year ago · 680 views
1 Answer

The website might be applying IP blocking or rate limiting, which prevents automated access or caps the number of requests from a particular IP address. That would explain why you receive a 403 error when crawling with Kendra but can access the site with a custom Python library, which sends its requests from a different IP address.

Some websites also restrict requests based on the User-Agent header. If Kendra's web crawler sends a User-Agent that the website blocks or restricts, the result can be a 403 error, whereas your custom Python library may send a different User-Agent that the website allows.

So you can try the following:

Ensure that the IP address used by Kendra's web crawler is not being blocked or rate limited by the website. If necessary, consider contacting the website administrator to allowlist the IP address used by Kendra.

If the website enforces User-Agent restrictions, you can try modifying the User-Agent header sent by Kendra's web crawler to mimic a different user agent, if Kendra's settings or configuration allow that value to be changed.
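One way to check whether the 403 depends on the User-Agent is to request the same URL with different header values and compare the status codes. The sketch below uses only the Python standard library. `probe_status` and `compare_user_agents` are hypothetical helper names, not part of Kendra or any AWS SDK. It only tests the User-Agent hypothesis; it cannot reproduce IP-based blocking, because the requests come from your own address rather than from Kendra's.

```python
import urllib.error
import urllib.request
from typing import Dict


def probe_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code received for url with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx responses; the status code
        # is still exactly what we want to report.
        return err.code


def compare_user_agents(url: str, user_agents: Dict[str, str]) -> Dict[str, int]:
    """Probe url once per User-Agent and map each label to its status code."""
    return {label: probe_status(url, ua) for label, ua in user_agents.items()}
```

For example, you could call `compare_user_agents("https://www.infosys.com", {"browser": "Mozilla/5.0", "kendra-like": "amazon-kendra"})`. The `amazon-kendra` value here is an assumption about what the crawler sends, so treat a 403 for that label as a hint rather than proof. If every label returns 200 from your machine, IP-based blocking or rate limiting becomes the more likely cause.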

EXPERT
answered a year ago
  • There is no option in Kendra to modify the User-Agent value. And how would I get the IP address of the web crawler? It won't be publicly disclosed, right? Are there any other options to let Kendra crawl the website?
