How do I troubleshoot Lambda triggers that poll from Amazon MSK and self-managed Apache Kafka clusters?

I designed my AWS Lambda function to process records from my Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster or self-managed Apache Kafka cluster. However, the event source mapping doesn't invoke my Lambda function.

Short description

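To invoke a Lambda function, the Apache Kafka event source mapping must be able to perform the following actions:

  • Connect to the cluster's broker endpoints over the network.
  • Authenticate with the cluster and be authorized to poll records from the topic.
  • Call the Lambda Invoke API to deliver the records to your function.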

If an event source mapping's networking, authentication, or authorization settings prevent the preceding actions, then the event source mapping can't invoke the function. Instead, you receive an error.

Resolution

After you configure a Lambda function with an Amazon MSK trigger or a self-managed Kafka trigger, Lambda automatically creates a new event source mapping resource. This event source mapping is a separate resource from the Lambda function. The event source mapping polls records from the Kafka cluster and bundles the records into a payload. Then, the Lambda Invoke API is called to deliver the payload to your Lambda function to be processed. To troubleshoot issues with failed polls, complete the following troubleshooting steps for each error that you receive.
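To confirm that invocations reach your function after you fix a trigger, it can help to log what the payload contains. The following is a minimal sketch of a handler, assuming the usual Kafka event shape where "records" maps a topic-partition key to a list of records with base64-encoded values. The handler name and return value are illustrative, and records without a value (for example, tombstones) aren't handled here.

```python
import base64


def handler(event, context):
    """Decode each Kafka record value in the batch and return the record count."""
    decoded = []
    # "records" maps a "topic-partition" key to the records polled from it.
    for topic_partition, records in event.get("records", {}).items():
        for record in records:
            # Record values arrive base64-encoded; this sketch assumes UTF-8 text.
            value = base64.b64decode(record["value"]).decode("utf-8")
            decoded.append((topic_partition, record["offset"], value))
    for item in decoded:
        print(item)
    return {"processed": len(decoded)}
```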

Important: Lambda event source mappings don't inherit the virtual private cloud (VPC) network configuration of the Lambda function. This is true for both Amazon MSK and self-managed Kafka triggers. An Amazon MSK event source mapping uses the subnet and security group configurations that you configured on the target MSK cluster. A self-managed Kafka trigger has wide area network (WAN) access by default. However, you can also configure network access to a VPC in the same AWS account and AWS Region. Because the network configuration is separated, you can configure your Lambda function within a network that doesn't have a route to the Kafka cluster.

To configure an Amazon MSK event source mapping to poll records from a cross-account MSK cluster, set up multi-VPC private connectivity. Note that you can create a self-managed Kafka trigger that consumes from an MSK cluster in another account. However, there are downsides to this solution. For example, you can't use AWS Identity and Access Management (IAM) authentication with a self-managed Kafka trigger, even when the target cluster is an MSK cluster. Also, to connect to the MSK cluster over a VPC-peered connection, you must set up VPC workarounds. For an example architecture, see How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink.

Note: To use multi-VPC connectivity, verify that you adhere to the requirements.

Troubleshoot network communication issues between the trigger and the cluster

The event source mapping sends multiple different requests to the cluster broker endpoints to complete a single invocation of your Lambda function. Before an invocation, the event source mapping asks the cluster broker endpoints for the cluster metadata information and the records from the topic. After a successful invocation, the event source mapping communicates with the broker endpoints to commit the processed records. When the event source mapping sends a request to the broker endpoints and doesn't receive a response, the request times out. You receive the following error:

"PROBLEM: Connection error. Please check your event source connection configuration. If your event source lives in a VPC, try setting up a new Lambda function or EC2 instance with the same VPC, Subnet, and Security Group settings. Connect the new device to the Kafka cluster and consume messages to ensure that the issue is not related to VPC or Endpoint configuration. If the new device is able to consume messages, please contact Lambda customer support for further investigation."

Broker requests can time out before or after the request reaches the broker endpoint. Pre-broker timeouts occur when network and security group settings block the event source mapping's requests to the broker endpoints. Post-broker timeouts occur when the broker receives the event source mapping's request but can't complete it.

To investigate a post-broker timeout, check the broker status at the time of failure. If the cluster was offline when the issue occurred, then reactivate the event source mapping when the cluster is back online and available. Timeouts also occur when the cluster is out of disk space, when it reaches 100% CPU usage, or when a broker endpoint fails. To resolve these issues, set the event source mapping's batch size to 1, and then reactivate the trigger. Note that when you set the batch size to a higher value, you experience longer cluster response time.
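The batch size change can also be scripted. The following is a minimal sketch that assumes you use the AWS SDK for Python (Boto3) and pass in a Lambda client that you created yourself. The function name is hypothetical, and the mapping UUID is a value that you supply.

```python
def shrink_batch_and_reactivate(lambda_client, mapping_uuid):
    """Set the batch size to 1 and turn the event source mapping back on."""
    return lambda_client.update_event_source_mapping(
        UUID=mapping_uuid,  # the event source mapping's identifier
        BatchSize=1,        # smallest batch, to reduce load on the cluster
        Enabled=True,       # reactivate the trigger
    )


# Example call (assumes boto3 is installed and credentials are configured):
#   import boto3
#   shrink_batch_and_reactivate(boto3.client("lambda"), "<your mapping UUID>")
```

After the cluster recovers, you can raise the batch size again with the same API call.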

To troubleshoot timed out errors, examine the broker's access logs and system logs for more information.

If the request times out before that request reaches the broker endpoint, then check your networking configuration.

Check the networking configuration for an Amazon MSK event source mapping

To communicate with the MSK cluster, the Amazon MSK event source mapping creates a Hyperplane elastic network interface in each subnet that the cluster uses. The event source mapping is a Lambda-owned resource, but it doesn't use the Lambda function's VPC settings. Instead, it automatically uses the subnet and security group settings that are configured on the target MSK cluster. As a result, the network interfaces use the same security group that the MSK cluster uses.

To check whether your security groups allow the required traffic and ports, complete the following steps:

  1. To list all security groups and subnets that the MSK cluster uses, run the describe-cluster AWS CLI command.
  2. To show all inbound and outbound rules, run the describe-security-groups command on the security groups listed in the output of the describe-cluster command.
  3. Configure the rules in the listed security groups to allow traffic between the security group and the MSK cluster. You must also allow traffic over the following open authentication ports that the broker uses:
    9092 for plaintext
    9094 for TLS
    9096 for SASL
    9098 for IAM
    443 (outbound rule) for all configurations

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshooting errors for the AWS CLI. Also, make sure that you're using the most recent AWS CLI version.
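The rule check in step 3 can be sketched in Python. The following is a minimal sketch, not an official AWS tool. It assumes that you already loaded one security group from the JSON output of describe-security-groups into a dictionary. The function names are hypothetical. The sketch checks only inbound rules that reference a source security group and ignores rules that are based on CIDR ranges.

```python
# Broker ports listed in the steps above.
KAFKA_PORTS = {9092: "plaintext", 9094: "TLS", 9096: "SASL", 9098: "IAM"}


def rule_covers_port(rule, port):
    """Return True if a single IpPermissions entry covers the TCP port."""
    protocol = rule.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    return rule.get("FromPort", -1) <= port <= rule.get("ToPort", -1)


def ports_allowed_from_group(security_group, source_group_id, ports):
    """Map each port to True if an inbound rule allows it from source_group_id."""
    result = {}
    for port in ports:
        result[port] = any(
            rule_covers_port(rule, port)
            and any(
                pair.get("GroupId") == source_group_id
                for pair in rule.get("UserIdGroupPairs", [])
            )
            for rule in security_group.get("IpPermissions", [])
        )
    return result
```

Because the event source mapping's network interfaces use the cluster's own security group, you typically pass the cluster's security group ID as both the group to inspect and the source group.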

Check the networking configuration for a self-managed Kafka event source mapping

By default, a self-managed Kafka event source mapping can access the WAN but can't access the VPC. You can manually configure VPC access to specific subnets and security groups that can reach the Kafka cluster. However, the VPC must be in the same account and Region as the Lambda function. You can create a self-managed Kafka event source mapping for a Kafka cluster that's in one of the following locations:

  • An on-premises data center
  • Another cloud provider
  • The Amazon MSK brokers of a Kafka cluster that's located in the VPC of a different account

Troubleshoot issues that occur during initialization, polling, or invocation

If you experience issues during initialization, polling, or invocation, then you receive the following error:

"PROBLEM: Connection error. Your event source VPC must be able to connect to Lambda and STS, Secrets Manager (if event source authentication is required), and the OnFailure Destination (if one is configured). You can provide access by configuring PrivateLink or a NAT Gateway. For how to setup VPC endpoints/NAT gateway, please check https://aws.amazon.com/blogs/compute/setting-up-aws-lambda-with-an-apache-kafka-cluster-within-a-vpc/".

The preceding error occurs for any of the following reasons:

  • The event source mapping is configured in a VPC, and calls to the AWS STS API fail or time out.
  • The event source mapping is configured to use Secrets Manager cluster authentication, but calls to the Secrets Manager API fail or time out.
  • The event source mapping can access your Kafka cluster and poll records successfully, but calls to the Lambda API fail or time out.
  • You configured the event source mapping with an on-failure destination, such as Amazon Simple Storage Service (Amazon S3) or Amazon Simple Notification Service (Amazon SNS). However, when your function invocations end with an error, calls to the API of the on-failure destination fail or time out.

The preceding issues occur when the configuration of security groups or route tables doesn't allow your event source mapping to reach other services, such as AWS STS, Lambda, AWS Secrets Manager, or your on-failure destination. To correctly configure your VPC settings, complete the steps in Setting up AWS Lambda with an Apache Kafka cluster within a VPC.

To resolve these issues for a self-managed Kafka event source mapping, take the following actions:

  • Create a Lambda VPC endpoint and STS VPC endpoint in the VPC that contains the subnets that the self-managed Kafka event source mapping uses.
  • If you configured the event source mapping with a secret, then create a VPC endpoint for Secrets Manager.
  • If you configured the event source mapping with an on-failure destination, then create a VPC endpoint for your on-failure destination. Example destinations include Amazon SNS or Amazon S3.
  • Configure the VPC endpoints with a security group that allows inbound traffic on port 443 from the self-managed Kafka event source mapping's security group.
  • Configure the self-managed Kafka event source mapping's security group to allow outbound traffic on port 443 to the VPC endpoints' security group.

To resolve these issues for an Amazon MSK event source mapping, take the following actions:

  • Create a Lambda VPC endpoint and STS VPC endpoint in the VPC that contains the MSK cluster.
  • If the event source mapping uses a secret, or the cluster uses SASL IAM authentication, then create a VPC endpoint for Secrets Manager. This endpoint must be in the VPC that contains the MSK cluster.
  • If you configured the event source mapping with an on-failure destination, then create a VPC endpoint for your on-failure destination. Example destinations include Amazon SNS or Amazon S3. This VPC endpoint must be in the VPC that contains the MSK cluster.
  • Configure the VPC endpoints with a security group that allows inbound traffic on port 443 from the security group that the MSK cluster uses.
    Important: Allow inbound traffic from the MSK cluster security group, not the Lambda function's security group.
  • Configure the MSK cluster's security group to allow outbound traffic on port 443 to the VPC endpoints' security group.
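The two checklists above can be summarized as a list of VPC endpoint service names to create. The following is a minimal sketch with a hypothetical helper function. It assumes the standard com.amazonaws.REGION.SERVICE naming for interface endpoints. The uses_iam_auth flag reflects the Amazon MSK guidance above, and on_failure is the short service name of your destination, such as "sns" or "s3".

```python
def required_endpoint_services(region, uses_secret=False, uses_iam_auth=False,
                               on_failure=None):
    """Return the VPC endpoint service names that the event source mapping needs."""
    services = ["lambda", "sts"]  # always required
    if uses_secret or uses_iam_auth:
        services.append("secretsmanager")
    if on_failure is not None:
        services.append(on_failure)  # for example "sns" or "s3"
    return [f"com.amazonaws.{region}.{service}" for service in services]


print(required_endpoint_services("us-east-1", uses_secret=True, on_failure="sns"))
```

Compare the result with the endpoints that already exist in the VPC to find the ones that are missing.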

Troubleshoot issues with your VPC policy or execution role

Check for issues with the execution role

If there are configuration issues in your STS VPC endpoint resource policy, then you receive the following error:

"PROBLEM: Lambda failed to assume your function execution role."

To resolve this issue, take the following actions:

  • Make sure that the lambda.amazonaws.com service principal is listed as a trusted service in the IAM role's trust policy.
  • Make sure that the STS VPC endpoint policy allows the Lambda service principal to call sts:AssumeRole. For more information about how to configure your VPC, see Configure network security.
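The trust policy check in the first action can be sketched as follows. This is a simplified illustration with a hypothetical function name. It assumes that you already loaded the role's trust policy document into a dictionary, and it ignores Condition elements.

```python
def _as_list(value):
    """Policy fields can hold a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trusts_lambda(trust_policy):
    """Return True if an Allow statement lets lambda.amazonaws.com assume the role."""
    for statement in _as_list(trust_policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        if "sts:AssumeRole" not in _as_list(statement.get("Action", [])):
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # for example "*", which this sketch doesn't treat as a match
        if "lambda.amazonaws.com" in _as_list(principal.get("Service", [])):
            return True
    return False
```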

If you have a restrictive VPC endpoint policy for your Lambda VPC endpoint, then you receive the following error:

"No VPC endpoint policy allows the lambda:InvokeFunction action"

To resolve this issue, make sure that the Lambda VPC endpoint policy allows the Lambda service principal to call lambda:InvokeFunction.

Note: You can configure on-failure destinations to an Amazon Simple Queue Service (Amazon SQS) queue, an Amazon SNS topic, or an Amazon S3 bucket. When you use these destinations, make sure that the VPC endpoint policy allows the required actions from the Lambda execution role.
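As an illustration of the endpoint policies described above, the following sketch builds a policy document that allows the Lambda service principal to call one action. The helper name is hypothetical, and the "Resource": "*" value is an assumption that you might want to narrow for your environment.

```python
import json


def service_principal_policy(action, service="lambda.amazonaws.com"):
    """Build a VPC endpoint policy that allows one action for a service principal."""
    return {
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": [service]},
                "Action": action,
                "Resource": "*",  # assumption: tighten this where possible
            }
        ]
    }


sts_endpoint_policy = service_principal_policy("sts:AssumeRole")
lambda_endpoint_policy = service_principal_policy("lambda:InvokeFunction")
print(json.dumps(sts_endpoint_policy, indent=2))
print(json.dumps(lambda_endpoint_policy, indent=2))
```

You can paste the printed JSON into the policy editor for the matching VPC endpoint.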

Check for issues with access to secrets

If you have issues with access to the Secrets Manager secret, then you receive the following error:

"PROBLEM: Lambda is unable to call secretsmanager:GetSecretValue. Reason: User: Lambda execution role is not authorized to perform: secretsmanager:GetSecretValue on resource: Secret in Secrets Manager with an explicit deny in a VPC endpoint policy."

To resolve this issue, make sure that the VPC endpoint resource policy allows the Lambda execution role to call secretsmanager:GetSecretValue for the secret. To get secrets from Secrets Manager, Lambda uses your execution role, not the Lambda service principal.
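To locate the explicit deny that the error mentions, you can scan the Secrets Manager VPC endpoint policy for matching Deny statements. The following is a simplified sketch with hypothetical function names. It assumes that the policy is already loaded into a dictionary, and it matches only on the Action element. It doesn't evaluate Principal, Resource, NotAction, or Condition, so treat the result as a starting point.

```python
from fnmatch import fnmatchcase


def _as_list(value):
    """Policy fields can hold a single item or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def explicit_denies(policy, action="secretsmanager:GetSecretValue"):
    """Return the Deny statements whose Action patterns match the given action."""
    hits = []
    for statement in _as_list(policy.get("Statement", [])):
        if statement.get("Effect") != "Deny":
            continue
        for pattern in _as_list(statement.get("Action", [])):
            # Action names are compared case-insensitively; "*" is a wildcard.
            if fnmatchcase(action.lower(), pattern.lower()):
                hits.append(statement)
                break
    return hits
```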

Troubleshoot issues with your secrets

Check the configuration of your secret

If your secret isn't in a format that the event source mapping can use, then you receive the following error:

"PROBLEM: Certificate and/or private key must be in PEM format."

To resolve this issue, make sure that your certificate is an X.509 certificate in PEM format and that your private key is also in PEM format. To verify that your certificate is in the right format, run the following command:

openssl x509 -in PEM_FILE -text

Note: Replace PEM_FILE with your .pem file name.

Also, make sure that the private key encryption uses a PBES1 algorithm, not a PBES2 algorithm.

For more information, see Provided certificate or private key is not valid for Amazon MSK or Configuring the client certificate secret for self-managed Kafka.
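Before you store the secret, you can also run a quick structural check on the text. The following is a minimal sketch with a hypothetical function name. It lists the PEM block labels that it finds and confirms that each body is valid base64. It doesn't validate the certificate contents or the key encryption algorithm, so it complements the openssl command rather than replacing it.

```python
import base64
import re

# Matches "-----BEGIN <LABEL>-----", a body, and the matching END line.
PEM_BLOCK = re.compile(
    r"-----BEGIN ([A-Z0-9 ]+)-----\r?\n(.*?)\r?\n-----END \1-----",
    re.DOTALL,
)


def pem_block_labels(text):
    """Return the label of each PEM block; raise ValueError on a bad base64 body."""
    labels = []
    for match in PEM_BLOCK.finditer(text):
        body = "".join(match.group(2).split())  # drop line breaks and spaces
        base64.b64decode(body, validate=True)   # raises a ValueError subclass if invalid
        labels.append(match.group(1))
    return labels
```

An empty list means that the text contains no PEM blocks, which is consistent with the preceding error.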

Check the format of your secret values

If your server login attempts fail, then you receive the following error:

"PROBLEM: SASL authentication failed."

When an Amazon MSK topic invokes a Lambda function, the function can access usernames and passwords that Secrets Manager secures with SASL/SCRAM. If Lambda doesn't recognize your username and password as valid, then you encounter the preceding error. To resolve this issue, log in to the broker, and then check the access logs. For more information, see SASL/SCRAM authentication for Amazon MSK or SASL/SCRAM authentication for self-managed Kafka.

Troubleshoot issues with your event source mapping's server settings

Make sure that the event source mapping can reach your DNS server

If your event source mapping can't turn the hostname into an IP address, then you receive the following error:

"PROBLEM: The provided Kafka broker endpoints cannot be resolved."

To resolve this issue, make sure that the event source mapping can reach the DNS server that translates the hostname. If the endpoint's hostname is in a private network, then configure the event source mapping to use a VPC with DNS settings that can resolve hostnames.
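To reproduce the lookup, you can resolve the broker endpoints from a host in the same VPC and subnets that the event source mapping uses. The following is a minimal sketch with a hypothetical function name. It assumes that each endpoint is written as host:port, and the resolve parameter exists only so that you can substitute a different resolver.

```python
import socket


def unresolvable_endpoints(endpoints, resolve=socket.getaddrinfo):
    """Return the endpoints whose hostname can't be resolved."""
    failed = []
    for endpoint in endpoints:
        host, _, port = endpoint.rpartition(":")
        try:
            resolve(host, int(port))
        except (socket.gaierror, ValueError):
            # gaierror: DNS lookup failed. ValueError: not in host:port form.
            failed.append(endpoint)
    return failed
```

A result from your own workstation doesn't prove that the event source mapping sees the same answer, because private hosted zones and VPC DNS settings apply only inside the VPC.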

Check the configuration of your event source mapping server settings

If the server is different from the server that you configured in the event source mapping settings, then you receive the following error:

"PROBLEM: Server failed to authenticate Lambda or Lambda failed to authenticate server."

To resolve this issue, verify that the server hostname in your settings matches the internal server name of the server that you're connecting to.

Verify that the event source mapping has permissions to poll records from the cluster's topic

If the event source mapping doesn't have access to poll records, then you receive the following error:

"PROBLEM: Cluster failed to authorize Lambda."

To resolve this issue, configure the required permissions to authorize Lambda for your MSK cluster or self-managed Kafka cluster.

Related information

Authentication and authorization errors

AWS OFFICIAL
Updated 4 months ago