VPCE endpoint issues using webcrawler blueprint Synthetic Canary Lambda


I am using the webcrawler blueprint Synthetics canary (Lambda), deployed to two subnets in a VPC with no internet access, to monitor an internal application endpoint. The canary never gets as far as testing the website, because the initial setup code that accesses the Amazon S3 (Gateway) and CloudWatch (Interface) private VPC endpoints fails. Here is what shows up in the log group's log events for a run:

ERROR Unable to fetch S3 bucket location: Inaccessible host: cw-syn-results-acountxxx-us-gov-west-1.s3.us-gov-west-1.amazonaws.com' at port undefined'. This service may not be available in the us-gov-west-1' region.. Fallback to S3 client in current region: us-gov-west-1.

ERROR Exception calling ListBuckets. Unable to determine who owns the S3 bucket: cw-syn-results-acountxxx-us-gov-west-1 UnknownEndpoint: Inaccessible host: s3.us-gov-west-1.amazonaws.com' at port undefined'. This service may not be available in the `us-gov-west-1' region.

INFO Publishing result and duration CloudWatch metrics with timestamp: Thu Dec 15 2022 23:11:05 GMT+0000 (Coordinated Universal Time) for canaryName: devlnkscheck stepName: null result: ERROR startDateTimeInUTC: Thu Dec 15 2022 23:13:48 GMT+0000 (Coordinated Universal Time) endDateTimeInUTC: Thu Dec 15 2022 23:13:48 GMT+0000 (Coordinated Universal Time)

ERROR Could not PutMetricData. Error:{ "message": "Inaccessible host: monitoring.us-gov-west-1.amazonaws.com' at port undefined'. This service may not be available in the us-gov-west-1' region.", "code": "UnknownEndpoint", "region": "us-gov-west-1", "hostname": "monitoring.us-gov-west-1.amazonaws.com", "retryable": true, "originalError": { "message": "getaddrinfo EAI_AGAIN monitoring.us-gov-west-1.amazonaws.com", "errno": -3001, "code": "NetworkingError", "syscall": "getaddrinfo", "hostname": "monitoring.us-gov-west-1.amazonaws.com", "region": "us-gov-west-1", "retryable": true, "time": "2022-12-15T23:15:09.284Z" }, "time": "2022-12-15T23:15:09.284Z" }

Most likely the same issue affects both AWS services: the Lambda function, even though its role has all the correct policies, can't reach the endpoints because of a networking problem. The IAM role for the canary, the security group, the subnet ACLs, and the VPC and VPC endpoint settings all look correct according to the documentation. The canary's security group allows inbound and outbound traffic on port 443 with 0.0.0.0/0 as source and destination respectively, as well as the pl- prefix list for the S3 endpoint (com.amazonaws.us-gov-west-1.s3).

I'll focus on the troubleshooting I did for the CloudWatch interface endpoint, which made me think some other networking issue is going on. The private DNS name of the CloudWatch monitoring VPC endpoint is monitoring.us-gov-west-1.amazonaws.com. Using dig on a server in the VPC, this name resolves to a 52.xx.xxx.xxx AWS IP. Using dig on the endpoint-specific DNS name, vpce-0a6a9286xxxxxxxxxxxxxxx.monitoring.us-gov-west-1.vpce.amazonaws.com, returns two private IPs in the subnet CIDRs, one in each subnet. On a server in the subnets, I can use https://0a6a9286xxxxxxxxxxxxxxx.monitoring.us-gov-west-1.vpce.amazonaws.com as the endpoint URL to run cloudwatch list-metrics, but I cannot use https://monitoring.us-gov-west-1.amazonaws.com as the endpoint URL because it times out.
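The dig comparison above can also be scripted, which is handy when the same check has to be repeated from several subnets. The following is only a minimal sketch using the Python standard library; the function names are made up for illustration, and it assumes that "private" in Python's `ipaddress` sense is a good enough proxy for "resolves to the endpoint's network interfaces".

```python
import ipaddress
import socket


def resolve_ipv4(hostname, port=443):
    """Return the sorted, de-duplicated IPv4 addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(
        hostname, port, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    return sorted({info[4][0] for info in infos})


def classify(addresses):
    """Label a list of IP address strings as 'private', 'public', 'mixed' or 'none'."""
    if not addresses:
        return "none"
    flags = {ipaddress.ip_address(address).is_private for address in addresses}
    if flags == {True}:
        return "private"
    if flags == {False}:
        return "public"
    return "mixed"


def report(hostname):
    """Resolve hostname and summarize the result in one line of text."""
    try:
        addresses = resolve_ipv4(hostname)
    except socket.gaierror as exc:
        return f"{hostname}: resolution failed ({exc})"
    return f"{hostname}: {classify(addresses)} {addresses}"
```

Calling `report("monitoring.us-gov-west-1.amazonaws.com")` from a host in the affected subnets would be expected to say "public" while private DNS is not taking effect, and "private" once it is.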

This may only be a workaround for the real issue of not being able to use the private DNS name, but is there a way to specify the VPC endpoint URL for CloudWatch monitoring in the Lambda and/or the canary webcrawler code, so that it uses https://0a6a9286xxxxxxxxxxxxxxx.monitoring.us-gov-west-1.vpce.amazonaws.com instead of monitoring.us-gov-west-1.amazonaws.com?
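For reference, this is roughly what an endpoint override looks like for a client that a script creates itself. It is a sketch in Python with boto3 (an assumption, since the blueprint's own code is not shown here), and the helper names are hypothetical. It only covers clients built this way; it does not show how to change the clients the Synthetics runtime creates internally for S3 and PutMetricData.

```python
def vpce_endpoint_url(vpce_dns_name):
    """Build the HTTPS endpoint URL for an interface endpoint's DNS name."""
    name = vpce_dns_name.strip().rstrip(".")
    if "://" in name:
        raise ValueError("pass the bare DNS name, not a URL")
    return f"https://{name}"


def cloudwatch_client(vpce_dns_name, region="us-gov-west-1"):
    """Create a CloudWatch client that sends requests to the given VPC endpoint."""
    # boto3 is imported lazily so the URL helper above works without it installed.
    import boto3

    return boto3.client(
        "cloudwatch",
        region_name=region,
        endpoint_url=vpce_endpoint_url(vpce_dns_name),
    )
```

A client built with `cloudwatch_client("vpce-0a6a9286xxxxxxxxxxxxxxx.monitoring.us-gov-west-1.vpce.amazonaws.com")` would then address the endpoint by its endpoint-specific name, the same way the `--endpoint-url` test from the server did.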

1 Answer
Accepted Answer

I definitely think that DNS is the issue here. Your Lambda function needs to be able to resolve the endpoint names to the endpoints' private IP addresses, and DNS is the way to do that. You can confirm this by launching an EC2 instance in the same subnet that the Lambda function(s) are using and testing name resolution from it.

In short: Not having DNS is going to be a big headache for using just about everything.
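One thing worth checking on such a test instance is which resolvers it was actually handed. The sketch below parses resolv.conf-style text and checks whether any listed nameserver is the Amazon-provided resolver, which sits at the VPC's primary CIDR base address plus two (and is also reachable at 169.254.169.253). The function names and the 10.0.0.0/16 CIDR are assumptions for illustration, and on hosts that use a local stub such as systemd-resolved the file may not list the real upstream servers.

```python
import ipaddress

# Link-local address at which the Amazon-provided resolver is also reachable.
LINK_LOCAL_RESOLVER = "169.254.169.253"


def amazon_resolver_for(vpc_cidr):
    """Return the Amazon-provided resolver address: the VPC network base plus two."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


def nameservers(resolv_conf_text):
    """Extract nameserver addresses, in order, from resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        parts = line.split("#", 1)[0].split()  # drop trailing comments
        if len(parts) >= 2 and parts[0] == "nameserver":
            servers.append(parts[1])
    return servers


def uses_amazon_dns(resolv_conf_text, vpc_cidr):
    """True if any configured nameserver is the Amazon-provided resolver."""
    expected = {amazon_resolver_for(vpc_cidr), LINK_LOCAL_RESOLVER}
    return any(server in expected for server in nameservers(resolv_conf_text))
```

On the instance, something like `uses_amazon_dns(open("/etc/resolv.conf").read(), "10.0.0.0/16")` (with the real VPC CIDR substituted) gives a quick yes or no.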

AWS EXPERT, answered a year ago; reviewed by an EXPERT 3 days ago
  • So yes, unfortunately, after much troubleshooting I asked our platform management group. The VPCs have a custom DHCP option set that does not have routing configured to the AmazonProvidedDNS server, so the private DNS names of the endpoints don't resolve. I am currently trying to configure the Lambda to override the private DNS URL and use the VPCE URL for CloudWatch monitoring instead, but I'm up against a learning curve with customizing the canary. I also think this workaround will not apply in other cases where I have observed timeout issues. Follow-up question: I can create a DHCP option set with both the Amazon DNS and the current agency-specific domain name servers, but the documentation suggests that this may cause unexpected behavior (https://docs.aws.amazon.com/vpc/latest/userguide/DHCPOptionSet.html). Without knowing any specifics of the platform networking, do you know in general why that would be?

    "Domain name servers (optional): Enter the DNS servers that will be used to resolve the IP address of a host from the host's name. You can enter either AmazonProvidedDNS or custom domain name servers. Using both might cause unexpected behavior. You can enter the IP addresses of up to four IPv4 domain name servers (or up to three IPv4 domain name servers and AmazonProvidedDNS)"

  • In general, adding multiple DNS servers to the DHCP option set (or manually in a resolver configuration file) won't help. Most operating systems use the first working DNS server that they find. So if you have three DNS servers (A, B and C), each of which resolves different things, the operating system will use A until A doesn't respond and only then move on to B, and only after that to A or C (it's up to the OS which it chooses next). So if you add B because it resolves a few things that A doesn't, in general it will never get used (assuming A responds all the time). That includes a query to A where A responds with "host or domain unknown": that is a valid response, so the OS thinks A is doing fine and won't go on to query B. The way to solve this is with a single resolver (A) that always resolves everything that you expect it to.
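The failover behavior described in the last comment can be illustrated with a toy model. This is a simplified simulation, not how any particular operating system implements its resolver; the server labels, host names, and addresses are hypothetical.

```python
def stub_resolve(name, servers):
    """Ask servers in order and stop at the first one that responds.

    Each server is a dict with 'label', 'responds' (bool) and 'records'
    (a mapping of name to address). A missing name counts as a valid
    NXDOMAIN answer, so later servers are not consulted.
    """
    for server in servers:
        if not server["responds"]:
            continue  # no response (timeout): only now try the next server
        address = server["records"].get(name)
        status = "NOERROR" if address is not None else "NXDOMAIN"
        return server["label"], status, address
    return None, "TIMEOUT", None


# Hypothetical setup: A is an agency DNS server, B knows the private endpoint address.
SERVER_A = {
    "label": "A",
    "responds": True,
    "records": {"app.agency.example": "10.20.0.15"},
}
SERVER_B = {
    "label": "B",
    "responds": True,
    "records": {"monitoring.us-gov-west-1.amazonaws.com": "10.0.1.25"},
}
```

With both servers up, `stub_resolve("monitoring.us-gov-west-1.amazonaws.com", [SERVER_A, SERVER_B])` stops at A with NXDOMAIN and never reaches B, which is why listing both in one DHCP option set does not give the combined view one might hope for.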
