What do I do if AWS resources are used to crawl my website?

I want to prevent AWS resources from being used to crawl my website.

Short description

Create or modify your robots.txt file to protect your website against crawlers. The robots.txt file is the accepted standard that regulates web crawler activity.

Modify your robots.txt file to control the following:

  • Which crawlers can crawl your website.
  • Which pages the crawlers can crawl.
  • The rate at which pages can be crawled.

For more information about the robots.txt file, see What is robots.txt? on the Cloudflare website.

Resolution

If you don't have a robots.txt file associated with your website, then use a text editor to create a new file. Name the file robots.txt. Otherwise, open your robots.txt file.

Disallow a specific web crawler

Check your logs for the User-agent name of the crawlers that you want to stop. To block that crawler from crawling any pages in your domain, add the User-agent name to your robots.txt file:

User-agent: crawler
Disallow: /

Note: Replace crawler with the User-agent name of the crawler.
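You can check the effect of the rule locally with Python's standard urllib.robotparser module. The User-agent name examplebot and the URLs below are placeholders:

```python
from urllib.robotparser import RobotFileParser

# Parse the robots.txt rules directly from a list of lines.
rules = [
    "User-agent: examplebot",   # placeholder crawler name
    "Disallow: /",
]
rp = RobotFileParser()
rp.parse(rules)

# examplebot is blocked everywhere; crawlers without a matching
# User-agent block are unaffected.
print(rp.can_fetch("examplebot", "https://www.example.com/page.html"))  # False
print(rp.can_fetch("otherbot", "https://www.example.com/page.html"))    # True
```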

Manage multiple crawlers

You can define different rules for each crawler in a new text block. The following example blocks crawler1 from crawling your page at all, but allows crawler2 to crawl your page at a reduced rate:

User-agent: crawler1
Disallow: /
User-agent: crawler2
Crawl-delay: 60

This directive lets crawler2 crawl your domain, but only at a rate of one request every 60 seconds (the Crawl-delay value is specified in seconds).
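A quick way to confirm both rules is Python's standard urllib.robotparser module, which also exposes the Crawl-delay value (the crawler names match the example above):

```python
from urllib.robotparser import RobotFileParser

# The two per-crawler blocks from the example above.
rules = [
    "User-agent: crawler1",
    "Disallow: /",
    "User-agent: crawler2",
    "Crawl-delay: 60",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("crawler1", "https://www.example.com/"))  # False: fully blocked
print(rp.can_fetch("crawler2", "https://www.example.com/"))  # True: allowed
print(rp.crawl_delay("crawler2"))                            # 60 (seconds)
```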

Block all crawlers

If you want to block all crawlers from your web content, then use a wildcard character:

User-agent: *
Disallow: /

Note: Search engines use crawlers to index pages for use in search results. If you block all crawlers from your website, then your page will be harder for users to find.
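The following sketch, again using Python's standard urllib.robotparser, shows that the wildcard rule applies to every User-agent name (the bot names are placeholders):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /"])

# The wildcard rule matches any crawler, whatever its name.
for bot in ("googlebot", "examplebot", "anybot"):
    print(bot, rp.can_fetch(bot, "https://www.example.com/index.html"))  # all False
```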

Control which directory a crawler can access

You can define rules that specify which directories or pages a crawler can access. The following example blocks crawler from crawling directory1 and directory2, except for the example.html page inside directory2:

User-agent: crawler
Disallow: /directory1/
Disallow: /directory2/
Allow: /directory2/example.html

Note: Replace directory1 and directory2 with the names of your directories. Replace example.html with the name of your page.
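You can test these rules with Python's standard urllib.robotparser module. Note that this parser applies rules in the order they are listed (first match wins), so the Allow line is placed first in this check; crawlers that follow RFC 9309 instead use longest-match precedence. The name examplebot is a placeholder:

```python
from urllib.robotparser import RobotFileParser

rules = [
    "User-agent: examplebot",              # placeholder crawler name
    "Allow: /directory2/example.html",     # listed first: urllib uses first-match-wins
    "Disallow: /directory1/",
    "Disallow: /directory2/",
]
rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("examplebot", "https://www.example.com/directory1/page.html"))    # False
print(rp.can_fetch("examplebot", "https://www.example.com/directory2/page.html"))    # False
print(rp.can_fetch("examplebot", "https://www.example.com/directory2/example.html")) # True
```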

Add your robots.txt file to your domain

Add the robots.txt file to your root domain. For example, if your domain is example.com, then add the file in the following path:

www.example.com/robots.txt

Contact AWS Abuse

Malicious crawlers might ignore your robots.txt file. If you believe that a crawler running on AWS resources is ignoring your robots.txt file, then submit an abuse report with complete logs. These logs must include the date, timestamp (including time zone), and source IP address of the crawler activity. Be aware that the AWS Trust and Safety team must review your robots.txt file to confirm that the implicated customer is noncompliant.

Related information

How do I report abuse of AWS resources?