How to Identify Amazonbot

I am analysing my WAF logs and want to ignore any requests that come from Amazon's web crawler. Could someone help me with that? So far I have tried to verify an IP with a reverse and forward DNS lookup, using these steps:

  1. Run a reverse DNS lookup on the IP address using the host command.
  2. Verify that the retrieved domain name is a subdomain of crawl.amazonbot.amazon.
  3. Run a forward DNS lookup on the retrieved domain name.
  4. Verify that the returned IP address is identical to the original IP address.

However, I am sceptical about the accuracy: there is a chance of false positives, right? An attacker could potentially set the reverse DNS of an IP address to point to a name under crawl.amazonbot.amazon in order to impersonate Amazonbot. And even when the forward DNS lookup confirms that the domain name resolves back to the original IP address, that does not guarantee that the IP address is not being spoofed or used maliciously.

2 Answers

The Amazon crawler always contains the string "Amazonbot" in the user-agent that it presents when it crawls your site. Typically the user-agent will look as follows:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 (Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)
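As a first-pass filter on log entries, the user-agent check can be sketched as below. This is a minimal illustration, not an official method: the function name claims_amazonbot is hypothetical, and matching case-insensitively is an assumption. A match only means the client claims to be Amazonbot; it proves nothing about who actually sent the request.

```python
def claims_amazonbot(user_agent: str) -> bool:
    """Return True if the User-Agent string contains the "Amazonbot" token.

    Case-insensitive matching is an assumption made for this sketch.
    """
    return "amazonbot" in user_agent.lower()


# Sample user-agent taken from the answer above.
SAMPLE_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 "
    "(KHTML, like Gecko) Version/8.0.2 Safari/600.2.5 "
    "(Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot)"
)

if __name__ == "__main__":
    print(claims_amazonbot(SAMPLE_UA))
```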

Amazonbot honors robots.txt, so you can control whether it crawls your site as follows:

User-agent: Amazonbot             # Amazon's user agent
Disallow: /do-not-crawl/          # disallow this directory

User-agent: *                     # any robot
Disallow: /not-allowed/           # disallow this directory
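If you want to check how these rules are interpreted before deploying them, one option is Python's standard urllib.robotparser module. The sketch below feeds it the rules shown above; the helper name build_parser, the test paths, and the "SomeOtherBot" agent are made up for illustration, and Amazonbot's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The rules from the answer above, inline comments included.
ROBOTS_TXT = """\
User-agent: Amazonbot             # Amazon's user agent
Disallow: /do-not-crawl/          # disallow this directory

User-agent: *                     # any robot
Disallow: /not-allowed/           # disallow this directory
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # Hypothetical paths and agents, used only to exercise the rules.
    for agent, path in [
        ("Amazonbot", "/do-not-crawl/page.html"),
        ("Amazonbot", "/not-allowed/page.html"),
        ("SomeOtherBot", "/not-allowed/page.html"),
    ]:
        print(agent, path, parser.can_fetch(agent, path))
```

Note that with this parser a crawler matched by its own User-agent group does not also fall through to the `*` group, so the Amazonbot group above does not restrict /not-allowed/ for Amazonbot.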

You can verify that a crawler that claims to be Amazonbot is actually the official Amazonbot using the following steps:

  1. Locate the accessing IP address from your server logs
  2. Use the host command to run a reverse DNS lookup on the IP address
  3. Verify the retrieved domain name is a subdomain of crawl.amazonbot.amazon
  4. Use the host command to run a forward DNS lookup on the retrieved domain name
  5. Verify the returned IP address is identical to the original IP address from your server logs

So for example:

$ host 12.34.56.789
789.56.34.12.in-addr.arpa domain name pointer 12-34-56-789.crawl.amazonbot.amazon.

$ host 12-34-56-789.crawl.amazonbot.amazon
12-34-56-789.crawl.amazonbot.amazon has address 12.34.56.789
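The same steps can be automated when processing logs. The following is a minimal sketch using Python's standard socket module, not an official tool: the function names are hypothetical, it handles IPv4 only (gethostbyname_ex does not return IPv6 addresses), and it treats any DNS failure as "not verified".

```python
import socket

# Domain named in the steps above; the leading dot forces a label boundary.
AMAZONBOT_SUFFIX = ".crawl.amazonbot.amazon"


def is_amazonbot_hostname(hostname: str) -> bool:
    """Step 3: is the name a subdomain of crawl.amazonbot.amazon?"""
    return hostname.rstrip(".").lower().endswith(AMAZONBOT_SUFFIX)


def is_verified_amazonbot(ip: str) -> bool:
    """Steps 2-5: reverse lookup, suffix check, forward lookup, IP match."""
    try:
        hostname, _aliases, _addrs = socket.gethostbyaddr(ip)  # reverse DNS
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not is_amazonbot_hostname(hostname):
        return False
    try:
        _name, _aliases, forward_ips = socket.gethostbyname_ex(hostname)  # forward DNS
    except OSError:
        return False
    return ip in forward_ips


if __name__ == "__main__":
    # Replace with an address from your own server logs.
    print(is_verified_amazonbot("192.0.2.1"))
```

The suffix check on its own only inspects a name returned by whoever controls reverse DNS for that IP; the forward lookup is the step answered by the crawl.amazonbot.amazon zone itself, which is why both directions are checked. Since DNS lookups are slow, caching results per IP is probably worthwhile when scanning large log files.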
AWS
EXPERT
answered 5 months ago
EXPERT
reviewed 5 months ago

Hi, just want to add that you should never handle bots based on their user-agent or other headers. It is not a reliable signal, especially when you are concerned about "bad" bots, which can easily impersonate a valid crawler's user-agent. However, as recommended in the answer above, you can handle them based on whether the bot/crawler respects your robots.txt rules. A simple bot trap for this is to specify a path in your robots.txt that should never be crawled. Then you can assume that bots ignoring that rule are not legitimate crawlers and act accordingly.
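To illustrate the bot-trap idea, here is a rough sketch that scans already-parsed log records for clients that requested the trap path. Everything in it is an assumption for illustration: the trap path /do-not-crawl/, the (ip, path) record shape, and the function name find_trap_visitors. Adapt the parsing to your actual WAF log format.

```python
from typing import Iterable, Set, Tuple

# Hypothetical trap path; it must be listed under Disallow in robots.txt
# and should not be linked from anywhere a human visitor would click.
TRAP_PREFIX = "/do-not-crawl/"


def find_trap_visitors(
    requests: Iterable[Tuple[str, str]],
    trap_prefix: str = TRAP_PREFIX,
) -> Set[str]:
    """Return the client IPs that requested any path under the trap prefix.

    `requests` is an iterable of (client_ip, request_path) pairs.
    """
    return {ip for ip, path in requests if path.startswith(trap_prefix)}


if __name__ == "__main__":
    # Made-up sample records using documentation-reserved addresses.
    sample = [
        ("198.51.100.7", "/index.html"),
        ("203.0.113.9", "/do-not-crawl/secret.html"),
        ("198.51.100.7", "/about"),
    ]
    print(find_trap_visitors(sample))
```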

Furthermore, you could:

  • Add a rate-limit rule with challenge or CAPTCHA actions. This is a more comprehensive approach that works regardless of the type of bot and helps ensure that no bot has a big impact on your site's performance. There is a downside: if you block valid crawlers too aggressively, it may hurt your site's SEO, so the recommendation is always to start by understanding the normal patterns in your traffic and how important SEO is relative to security.
  • You can also implement Bot Control, which uses machine learning to identify even bots that do not identify themselves properly. Bot Control has costs you need to consider, so you might want to place the Bot Control rules at the bottom of your rules to reduce the number of requests it inspects, as well as the number of challenges/CAPTCHAs presented. The official Bot Control documentation and pricing pages have the details.
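Before picking a rate limit, it can help to measure what normal traffic looks like in your logs. The sketch below computes, for each client IP, its busiest fixed-size time window. The 300-second window, the (timestamp, ip) record shape, and the function name are assumptions chosen for illustration, and fixed buckets are only an approximation of however your rate-limit rule actually evaluates its window.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

WINDOW_SECONDS = 300  # assumed window size; adjust to match your rule


def peak_requests_per_window(
    events: Iterable[Tuple[float, str]],
    window_seconds: int = WINDOW_SECONDS,
) -> Dict[str, int]:
    """Return, per client IP, the highest request count in any one window.

    `events` is an iterable of (unix_timestamp, client_ip) pairs.
    Windows are fixed buckets aligned to multiples of `window_seconds`.
    """
    counts: Counter = Counter()
    for timestamp, ip in events:
        bucket = int(timestamp // window_seconds)
        counts[(ip, bucket)] += 1

    peaks: Dict[str, int] = {}
    for (ip, _bucket), n in counts.items():
        peaks[ip] = max(peaks.get(ip, 0), n)
    return peaks


if __name__ == "__main__":
    # Made-up events: (seconds, client IP).
    sample = [(0, "a"), (10, "a"), (299, "a"), (300, "a"), (5, "b")]
    print(peak_requests_per_window(sample))
```

Comparing these peaks for known-good crawlers against a candidate limit gives a rough idea of whether the rule would challenge legitimate traffic.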
AWS
xavi
answered 5 months ago
