What do I do if AWS resources are being used to crawl my website?

AWS resources are being used to crawl my website. What do I do?

Short description

It's a best practice to protect your website against crawlers by creating or modifying your robots.txt file. The robots.txt file is a generally accepted standard for regulating web crawler activity.

By modifying your robots.txt file, you can control the following:

  • Which crawlers can crawl your website.
  • Which pages the crawlers can crawl.
  • The rate at which pages can be crawled.

If a crawler running on AWS resources isn't abiding by your robots.txt file, submit an abuse report.

Resolution

  1. Create or modify the robots.txt file

The robots.txt file lists all restrictions in place for crawlers. When the file is placed at the root of your website's domain, it can stop or slow down crawlers that honor it.

Check your logs for the User-agent name of the crawlers that you want to stop. To block that crawler from crawling any pages in your domain, add the User-agent name to your robots.txt file:

User-agent: crawler
Disallow: /

Note: Replace crawler with the User-agent name of the crawler.
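To find the User-agent names in your logs, you can tally them per request. The following is a minimal sketch that assumes your web server writes the Combined Log Format, where the User-agent is the last quoted field. The function name count_user_agents and the sample log entries are hypothetical and only serve as illustration.

```python
import re
from collections import Counter

# Assumption: Combined Log Format, where the User-agent is the last quoted field.
USER_AGENT_FIELD = re.compile(r'"(?P<agent>[^"]*)"\s*$')


def count_user_agents(lines):
    """Count requests per User-agent string across access-log lines."""
    counts = Counter()
    for line in lines:
        match = USER_AGENT_FIELD.search(line)
        if match:
            counts[match.group("agent")] += 1
    return counts


# Made-up sample entries; in practice, iterate over your own log file.
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "crawler1/1.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /a HTTP/1.1" 200 512 "-" "crawler1/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Print the busiest User-agents first.
for agent, hits in count_user_agents(sample_log).most_common():
    print(f"{hits:>6}  {agent}")
```

The value that you put in robots.txt is typically the product name at the start of the header (crawler1 in this sample), not the full User-agent string.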

You can define different rules for each crawler in a separate text block. For example, assume that you want to block crawler1 from crawling your website entirely, but allow crawler2 to crawl it at a reduced rate:

User-agent: crawler1
Disallow: /

User-agent: crawler2
Crawl-delay: 60

Note: Replace crawler1 and crawler2 with the User-agent names of the crawlers.

crawler2 is now allowed to crawl your domain, but only at a rate of one request every 60 seconds. Crawl-delay is a non-standard directive, so some crawlers ignore it.
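Before you publish rules like these, you can check how a standards-following parser reads them. The following sketch uses Python's standard urllib.robotparser module against the two-crawler example. The page URL is a placeholder, and a real crawler might interpret the file differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# The two-crawler example from this article, held in memory for testing.
ROBOTS_TXT = """\
User-agent: crawler1
Disallow: /

User-agent: crawler2
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Placeholder URL; replace with a page on your own domain.
page = "https://www.example.com/index.html"

crawler1_allowed = parser.can_fetch("crawler1", page)
crawler2_allowed = parser.can_fetch("crawler2", page)
crawler2_delay = parser.crawl_delay("crawler2")  # value in seconds, or None

print("crawler1 allowed:", crawler1_allowed)
print("crawler2 allowed:", crawler2_allowed)
print("crawler2 delay (seconds):", crawler2_delay)
```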

To block all crawlers from your web content, use a wildcard character:

User-agent: *
Disallow: /

Note: Many search engines use crawlers to index pages for use in search results. Blocking all crawlers from crawling your website can make your page harder for users to find.

You can define rules that specify which directories or pages a crawler can access. For example, assume that you want to block crawler from crawling directory1 and directory2, except for the page example.html inside directory2:

User-agent: crawler
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/example.html

Note: Replace crawler with the User-agent name of the crawler.
Replace directory1 and directory2 with the names of your directories.
Replace example.html with the name of your page.
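The Allow line takes effect here because, under the Robots Exclusion Protocol (RFC 9309), the longest matching rule wins, and Allow wins a tie. The following is a simplified sketch of that precedence rule. It assumes plain path prefixes only and doesn't handle the * and $ wildcards. The function name is_allowed and the RULES list are hypothetical.

```python
# Rules for one crawler as (directive, path prefix) pairs, taken from the example.
RULES = [
    ("Disallow", "/directory1/"),
    ("Disallow", "/directory2/"),
    ("Allow", "/directory2/example.html"),
]


def is_allowed(path, rules):
    """Apply longest-match precedence; a path with no matching rule is allowed."""
    best_length = -1
    allowed = True
    for directive, prefix in rules:
        # Skip empty rules and rules that don't match the start of the path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        longer = len(prefix) > best_length
        tie_favoring_allow = len(prefix) == best_length and is_allow
        if longer or tie_favoring_allow:
            best_length = len(prefix)
            allowed = is_allow
    return allowed


for path in ("/directory1/page.html", "/directory2/other.html",
             "/directory2/example.html", "/index.html"):
    print(path, "->", "allowed" if is_allowed(path, RULES) else "blocked")
```

Some parsers that predate the RFC apply the first matching rule in file order instead, so results can differ between crawlers.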

  2. Add your robots.txt file to your domain

Add the robots.txt file to your root domain. For example, if your domain is example.com, then add the file in the following path:

www.example.com/robots.txt
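Crawlers look for the file at the root of each host, regardless of which page they plan to fetch. As a small illustration, the following sketch derives the expected robots.txt location from any page URL by using Python's standard urllib.parse module. The function name robots_url and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url):
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host; replace the path and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://www.example.com/directory2/example.html?page=2"))
```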

  3. Contact AWS Abuse

Malicious crawlers might ignore your robots.txt file. If you believe that a crawler running on AWS resources isn't abiding by your robots.txt file, submit an abuse report with complete logs. These logs must include the date, timestamp (including time zone), and the source IP address of the crawling activity. Be aware that the AWS Trust and Safety team must review your robots.txt file to confirm the non-compliance of the implicated customer.
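To gather the date, timestamp with time zone, and source IP address for a report, you can filter your access logs by the crawler's User-agent. The following is a minimal sketch that assumes the Combined Log Format with English month abbreviations. The function name abuse_evidence and the sample entries are hypothetical. Attach the original, unmodified log lines to your report as well.

```python
import re
from datetime import datetime

# Assumption: Combined Log Format, for example
# IP - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 512 "-" "agent"
ENTRY = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def abuse_evidence(lines, agent_substring):
    """Return (ISO 8601 timestamp with offset, source IP, request) per matching entry."""
    rows = []
    for line in lines:
        match = ENTRY.match(line)
        if not match:
            continue
        if agent_substring.lower() not in match.group("agent").lower():
            continue
        # %z keeps the time zone offset that the log recorded.
        when = datetime.strptime(match.group("time"), "%d/%b/%Y:%H:%M:%S %z")
        rows.append((when.isoformat(), match.group("ip"), match.group("request")))
    return rows


# Made-up sample entries; in practice, read these from your own log file.
sample_log = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET /directory1/ HTTP/1.1" 200 512 "-" "crawler1/1.0"',
    '198.51.100.7 - - [10/Oct/2023:13:56:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

for timestamp, ip, request in abuse_evidence(sample_log, "crawler1"):
    print(timestamp, ip, request)
```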


Related information

How do I report abuse of AWS resources?

AWS OFFICIAL
Updated 4 years ago