Using Lambda to direct crawler traffic to S3 with a prerendered index.html for each route.


Overview

I have a Lambda function that attempts to redirect social media crawlers, based on their User-Agent header, to an S3 bucket containing prerendered HTML (the bucket is called 'prerendered-routes'). If the user is not a crawler, they are directed to the S3 bucket hosting my static website. I'm doing this so that I can have social media meta tags prerendered; eventually I'll figure out how to implement SSR for my SPA.

The bucket with prerendered routes has a directory structure that mirrors the URL segments. So, for a route on my website like 'https://example.com/this/route', there would be a prerendered index.html at 'https://prerendered-routes.s3.amazonaws.com/this/route/index.html'.

Error

The Lambda function (called 'prerender') is effectively differentiating between social media crawlers and regular users based on User-Agent, but I am getting an access denied error, specifically when I send a GET request with a bot as my user-agent.

Update

Thank you for your response, Piotrek!

I have updated my code as follows:

  • authMethod is now set to 'origin-access-identity'
  • region is set to 'us-east-1', which is the region of my prerendered-routes bucket

I have configured the Lambda to trigger on 'viewer request' events because I want bot traffic to always be sent to the prerendered-routes bucket. From what I understand, if I use 'origin request' instead, a bot would receive the website bucket's content for any cached routes, in which case it would not get the prerendered meta tags. Is my understanding correct?

I have updated the Lambda, pushed the new version to CloudFront, and ran an invalidation for all files ('/*'). I'm still getting an error when I try to access my site as a bot:

502 ERROR
The request could not be satisfied.
The Lambda function returned an invalid origin configuration: For an S3 origin, the value of either AuthMethod or Region is invalid.
We can't connect to the server for this app or website at this time. There might be too much traffic or a configuration error. Try again later, or contact the app or website owner.

If I don't use a bot user agent, I have no issues accessing my website, and looking at the Lambda logs I can see that it is indeed routing away from the primary bucket when the requester is a bot, as expected.

Here is the full Lambda function:

Code

'use strict';

exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    const botUserAgents = [
        'googlebot',
        'twitterbot',
        'applebot',
        'facebookexternalhit',
        'linkedinbot',
        'bingbot',
        'yandex',
        'duckduckbot',
    ];

    const userAgent = headers['user-agent'] && headers['user-agent'][0]
        ? headers['user-agent'][0].value.toLowerCase()
        : '';
    console.log(`User-Agent: ${userAgent}`);
    console.log(headers);
    const isBot = botUserAgents.some(botAgent => userAgent.includes(botAgent));

    if (isBot) {
        // Normalize the URI so it always ends with '/index.html'
        let newUri = request.uri.endsWith('/') ? request.uri : `${request.uri}/`;
        newUri += 'index.html';

        // Modify the request to serve from the prerendered bucket
        request.origin = {
            s3: {
                domainName: 'current-prerendered.s3.amazonaws.com',
                path: '',
                region: 'us-east-1',
                authMethod: 'origin-access-identity',
                customHeaders: {
                    'X-Forwarded-Host': [{ key: 'X-Forwarded-Host', value: 'current-prerendered.s3.amazonaws.com' }]
                }
            }
        };
        request.uri = newUri;
        console.log(`Modified Request URI: ${request.uri}`);
    }
    console.log('Modified request:', JSON.stringify(request));

    callback(null, request);
};

Configuration

The Lambda function is deployed with Lambda@Edge for 'viewer request' on the CloudFront distribution associated with my website. I assumed the issue might be that the 'prerendered-routes' bucket was blocking requests from the CloudFront distribution, but I believe the bucket policy I'm using should allow CloudFront access. Here it is:

{ "Version": "2012-10-17", "Id": "PolicyForCloudFrontPrivateContent", "Statement": [ { "Sid": "1", "Effect": "Allow", "Principal": { "AWS": "arn:aws:iam::cloudfront:user/CloudFront Origin Access Identity E38V2FTUZDIHJE" }, "Action": "s3:GetObject", "Resource": "arn:aws:s3:::prerendered-routes/*" } ] }

Lastly, I am using a legacy OAI for my CloudFront distribution, which is referenced in the above policy for the prerendered-routes bucket.

The console.log statements I have set in the Lambda have not helped me understand the error any better. If you have any ideas for more effective ways to debug this, I would appreciate the guidance.

Thank you.

2 Answers

Take a look at this blog post, which discusses how you might use AWS WAF to achieve this without having to inspect the user-agent header yourself.

AWS
EXPERT
Paul_L
answered 2 months ago

Hi, thanks for the detailed explanation! Unfortunately, overriding the origin won't work on the viewer-request event; for that, you'll need to use the origin-request event. To make the best use of caching, you will actually need both of these triggers for different purposes. Here's how you can approach it holistically:

  • Use a function attached to the viewer-request event to detect whether the request was made by a bot. The code that you wrote for that is great. Add a custom header (for example, x-bot with value true) to the request and return it for further processing by CloudFront (see the sketch after this list). Tip: because you're only manipulating headers, you can stay with Lambda@Edge to implement this, or use CloudFront Functions instead.

  • Make use of the newly added header in a custom cache policy. Create a new cache policy and include that header in the cache key (a sample configuration is sketched further below). This way, requests made by bots will be served a different version of the page than regular users: the SPA itself (index.html, JS, CSS, and other files accessed without the magic header) stays cached independently from the prerendered objects for bots (served from the CloudFront cache with the magic header set earlier). Tip: in theory you could just add the User-Agent header to the cache policy, but because there are so many different values of that field, it would most likely decrease cache performance substantially. With just two values of the bot flag, you can serve two versions of your objects (for bots and non-bots) while still keeping a high cache hit ratio.

  • Finally, use an origin-request Lambda@Edge function to set the correct origin (override the origin for requests made by bots; also sketched below). The origin property exists only on origin-request events; you can't set it in the viewer-request function. That's why you need the second function association.
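
For the first bullet, here's a minimal sketch of what that viewer-request handler could look like, reusing the bot list from your question (the x-bot header name is just an example; any name works, as long as the cache policy and origin-request function use the same one):

'use strict';

// Viewer-request trigger: tag bot traffic with a custom header so that
// CloudFront can cache bot and non-bot responses separately.
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    const botUserAgents = [
        'googlebot', 'twitterbot', 'applebot', 'facebookexternalhit',
        'linkedinbot', 'bingbot', 'yandex', 'duckduckbot',
    ];
    const userAgent = headers['user-agent'] && headers['user-agent'][0]
        ? headers['user-agent'][0].value.toLowerCase()
        : '';
    const isBot = botUserAgents.some(botAgent => userAgent.includes(botAgent));

    // Lambda@Edge expects lowercase keys in the headers object.
    headers['x-bot'] = [{ key: 'X-Bot', value: isBot ? 'true' : 'false' }];

    callback(null, request);
};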

You can see all properties exposed by CloudFront events at Lambda event structure; it confirms that the origin property can be read and written in origin-request events only.
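
For the cache-policy bullet above, here's a sketch of a one-off script that creates such a policy with the AWS SDK for JavaScript v3 (the policy name and TTL values are placeholders; you can create the same policy in the CloudFront console instead):

'use strict';

const { CloudFrontClient, CreateCachePolicyCommand } = require('@aws-sdk/client-cloudfront');

const client = new CloudFrontClient({ region: 'us-east-1' });

const command = new CreateCachePolicyCommand({
    CachePolicyConfig: {
        Name: 'BotAwareCachePolicy', // placeholder name
        MinTTL: 1,
        DefaultTTL: 86400,
        MaxTTL: 31536000,
        ParametersInCacheKeyAndForwardedToOrigin: {
            EnableAcceptEncodingGzip: true,
            EnableAcceptEncodingBrotli: true,
            // Include the bot flag in the cache key; headers in the cache key
            // are also forwarded to the origin-request function.
            HeadersConfig: {
                HeaderBehavior: 'whitelist',
                Headers: { Quantity: 1, Items: ['x-bot'] },
            },
            CookiesConfig: { CookieBehavior: 'none' },
            QueryStringsConfig: { QueryStringBehavior: 'none' },
        },
    },
});

client.send(command).then((res) => console.log(`Cache policy created: ${res.CachePolicy.Id}`));

Once created, attach this policy to the relevant cache behavior on your distribution.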

One last thing: the domainName you're using is 'current-prerendered.s3.amazonaws.com', but the example on the event structure page uses a regional S3 endpoint, such as awsexamplebucket.s3.eu-west-1.amazonaws.com. Please update the domainName field to include the region too.
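
Putting the last two points together, a sketch of the origin-request function, assuming the bucket named in your policy ('prerendered-routes') in us-east-1 (swap in whichever bucket actually holds your prerendered pages):

'use strict';

// Origin-request trigger: for bot traffic, rewrite the URI and point the
// request at the prerendered bucket via its regional endpoint.
exports.handler = (event, context, callback) => {
    const request = event.Records[0].cf.request;
    const headers = request.headers;

    const isBot = headers['x-bot'] && headers['x-bot'][0]
        && headers['x-bot'][0].value === 'true';

    if (isBot) {
        const bucketDomain = 'prerendered-routes.s3.us-east-1.amazonaws.com';

        const newUri = request.uri.endsWith('/') ? request.uri : `${request.uri}/`;
        request.uri = `${newUri}index.html`;

        request.origin = {
            s3: {
                domainName: bucketDomain, // regional endpoint, as noted above
                path: '',
                region: 'us-east-1',
                authMethod: 'origin-access-identity',
                customHeaders: {}
            }
        };
        // Keep the Host header in sync with the new origin.
        headers['host'] = [{ key: 'host', value: bucketDomain }];
    }

    callback(null, request);
};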

Let us know if it worked for you, or if additional guidance is needed!

AWS
Piotrek
answered 2 months ago
EXPERT
reviewed 2 months ago
