The Amazon crawler always contains the string "Amazonbot" in the user-agent that it presents when it crawls your site. Typically the user-agent will look as follows:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
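For example, a quick way to see which IP addresses are presenting an Amazonbot user-agent is to search your access logs. The log path below is just an assumption, and the command assumes the common/combined log format where the client IP is the first field:
$ grep -i "amazonbot" /var/log/nginx/access.log | awk '{print $1}' | sort | uniq -c | sort -rn
This lists the IPs claiming to be Amazonbot by request count, which you can then check with the DNS verification described below.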
Amazonbot honors robots.txt, so you can control whether or not it crawls your site. For example:
User-agent: Amazonbot # Amazon's user agent
Disallow: /do-not-crawl/ # disallow this directory
User-agent: * # any robot
Disallow: /not-allowed/ # disallow this directory
You can verify that a crawler claiming to be Amazonbot is actually the official Amazonbot with the following procedure:
- Locate the accessing IP address from your server logs
- Use the host command to run a reverse DNS lookup on the IP address
- Verify the retrieved domain name is a subdomain of crawl.amazonbot.amazon
- Use the host command to run a forward DNS lookup on the retrieved domain name
- Verify the returned IP address is identical to the original IP address from your server logs
So for example:
$ host 12.34.56.789
789.56.34.12.in-addr.arpa domain name pointer 12-34-56-789.crawl.amazonbot.amazon.
$ host 12-34-56-789.crawl.amazonbot.amazon
12-34-56-789.crawl.amazonbot.amazon has address 12.34.56.789
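If you want to run this check regularly, here is a minimal shell sketch of the two lookups. It uses the same host command as above and assumes a single A record per hostname; the script name and usage are only illustrative:
#!/bin/sh
# usage: ./verify-amazonbot.sh <ip-address>
IP="$1"
# reverse lookup: extract the PTR record and strip the trailing dot
PTR=$(host "$IP" | awk '/domain name pointer/ {print $NF}' | sed 's/\.$//')
case "$PTR" in
  *.crawl.amazonbot.amazon)
    # forward lookup: the hostname must resolve back to the original IP
    FWD=$(host "$PTR" | awk '/has address/ {print $NF}')
    if [ "$FWD" = "$IP" ]; then
      echo "Verified: $IP is Amazonbot ($PTR)"
    else
      echo "Mismatch: $PTR resolves to $FWD, not $IP"
    fi
    ;;
  *)
    echo "Not Amazonbot: reverse DNS for $IP is '$PTR'"
    ;;
esac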
Hi, just want to add that you should never handle bots based solely on their user-agent or headers. It's not a reliable approach, especially when you are concerned about "bad" bots, which can easily impersonate a valid crawler's user-agent. However, as recommended in the answer above, you can handle them based on whether the bot/crawler respects your robots.txt guidelines. A simple bot trap is to specify a path in your robots.txt that should never be crawled. Then you can assume that bots ignoring that directive are not legitimate crawlers and act accordingly.
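A trap entry can be as simple as the following; the path name is only an illustration, and it should be a path you never link to or serve real content from:
User-agent: * # any robot
Disallow: /bot-trap/ # never linked anywhere; only misbehaving bots will request it
Any client that requests /bot-trap/ anyway has ignored robots.txt, so you can log its IP address and treat it as suspect.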
Furthermore, you could:
- Add a rate-limit rule with a challenge or CAPTCHA action (see the sample rules after this list). This is a broader approach that goes beyond any specific type of bot and helps ensure that bots do not have a large impact on your site's performance. The downside is that excessively blocking valid crawlers can hurt your site's SEO, so the recommendation is to start by understanding the normal patterns in your traffic and how important SEO is to you relative to security.
- You can also implement Bot Control, which uses machine learning to identify even bots that do not identify themselves properly. Bot Control has costs you need to consider, so you may want to place the Bot Control rules at the bottom of your rule priority order (as in the sample after this list) to reduce the number of requests it inspects, as well as the number of challenges/CAPTCHAs presented. You can find the official documentation for Bot Control here, and the pricing here.
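As a sketch of how these two recommendations can fit together in an AWS WAF (wafv2) web ACL, the rules below pair a rate-based rule using the Captcha action with the AWSManagedRulesBotControlRuleSet managed rule group placed last. The rule names, priorities, and the 1000-request limit are illustrative values, not recommendations:
"Rules": [
  {
    "Name": "rate-limit-per-ip",
    "Priority": 0,
    "Statement": {
      "RateBasedStatement": { "Limit": 1000, "AggregateKeyType": "IP" }
    },
    "Action": { "Captcha": {} },
    "VisibilityConfig": {
      "SampledRequestsEnabled": true,
      "CloudWatchMetricsEnabled": true,
      "MetricName": "rate-limit-per-ip"
    }
  },
  {
    "Name": "bot-control",
    "Priority": 1,
    "Statement": {
      "ManagedRuleGroupStatement": {
        "VendorName": "AWS",
        "Name": "AWSManagedRulesBotControlRuleSet"
      }
    },
    "OverrideAction": { "None": {} },
    "VisibilityConfig": {
      "SampledRequestsEnabled": true,
      "CloudWatchMetricsEnabled": true,
      "MetricName": "bot-control"
    }
  }
]
Because the Bot Control rule has the higher priority number, it is evaluated after the rate-based rule, which keeps the number of requests it inspects (and therefore its cost) down.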