How do I resolve the error "Failed to start the job flow due to an internal error" in Amazon EMR?


My Amazon EMR cluster fails to launch, and I get the error message "Failed to start the job flow due to an internal error."

Short description

Internal errors are often resolved quickly. Retry your request. If the problem persists, confirm that the cluster's networking and security settings are configured correctly.

Resolution

Open the Amazon EMR console, and then try launching the cluster again. If you still get the "Failed to start the job flow due to an internal error" message, then verify the following settings.

Permissions for the Amazon EMR service role

If your security configuration encrypts the Amazon Elastic Block Store (Amazon EBS) root device and storage volumes, then the Amazon EMR service role (EMR_DefaultRole) must have permissions to use the specified AWS Key Management Service (AWS KMS) key.

With this type of security configuration, the service role needs the following permissions to launch EMR clusters. In the Resource element, replace the example Region, <account-id>, and <key-id> with the values for your KMS key:

{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "EmrDiskEncryptionPolicy",
    "Effect": "Allow",
    "Action": [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:CreateGrant",
      "kms:GenerateDataKeyWithoutPlaintext",
      "kms:DescribeKey"
    ],
    "Resource": [
      "arn:aws:kms:us-west-2:<account-id>:key/<key-id>"
    ]
  }]
}
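
To check an existing policy document against the list above, a minimal sketch like the following might help. It uses only the Python standard library. The function name missing_kms_actions and the matching rules are assumptions made for illustration: the sketch compares exact action names and exact resource ARNs (or "*") only, and it does not evaluate wildcards such as "kms:*", Deny statements, or conditions.

```python
import json

# KMS actions listed in the policy above.
REQUIRED_KMS_ACTIONS = {
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:CreateGrant",
    "kms:GenerateDataKeyWithoutPlaintext",
    "kms:DescribeKey",
}


def _as_list(value):
    """IAM allows either a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def missing_kms_actions(policy_json, key_arn):
    """Return the required actions that no Allow statement grants on key_arn.

    Simplification (assumption): only exact action names and exact
    resource ARNs (or "*") are matched. Wildcard actions, Deny
    statements, and conditions are not evaluated.
    """
    policy = json.loads(policy_json)
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]

    granted = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        resources = _as_list(statement.get("Resource", []))
        if key_arn in resources or "*" in resources:
            granted.update(_as_list(statement.get("Action", [])))

    return sorted(REQUIRED_KMS_ACTIONS - granted)
```

An empty list means that every action from the policy above appears in an Allow statement for the key. Any names returned are candidates to add to the service role's policy.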

If the EMR cluster instances fail to start, then you might see errors similar to the following:

2022-10-17 15:59:24,736 attempt 12/1000: http://repo.eu-west-1.amazonaws.com/2018.03/main/mirror.list
2022-10-17 15:59:34,741 exception: [Errno 12] Timeout on http://repo.eu-west-1.amazonaws.com/2018.03/main/mirror.list: (28, 'Connection timed out after 10001 milliseconds')
2022-10-17 15:59:34,741 attempt 13/1000: http://repo.eu-west-1.amazonaws.com/2018.03/main/mirror.list
2022-10-17 15:59:44,749 exception: [Errno 12] Timeout on http://repo.eu-west-1.amazonaws.com/2018.03/main/mirror.list: (28, 'Connection timed out after 10000 milliseconds')

To troubleshoot these errors, review the system log by following these steps:

1.    Log in to the Amazon Elastic Compute Cloud (Amazon EC2) console.

2.    Select the EC2 node that's terminated due to cluster failure.

        Note: A terminated node remains visible in the EC2 console for only 1-2 hours.

3.    Select the Actions dropdown list, and then select Monitor.

4.    Select Troubleshoot, and then select Get system log.
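
After you copy the system log text, a short script can summarize which repository hosts are timing out. The following is a minimal sketch that uses only the Python standard library. The function name count_repo_timeouts is hypothetical, and the regular expression is an assumption based on the format of the example log lines shown earlier, so other log formats might need a different pattern.

```python
import re
from collections import Counter

# Matches lines such as:
# "... exception: [Errno 12] Timeout on http://host/path: (28, 'Connection timed out ...')"
# and captures the scheme and host of the URL that timed out.
TIMEOUT_PATTERN = re.compile(r"exception: \[Errno 12\] Timeout on (https?://[^/\s]+)")


def count_repo_timeouts(log_text):
    """Count timeout exceptions per repository host in a system log."""
    counts = Counter()
    for line in log_text.splitlines():
        match = TIMEOUT_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return dict(counts)
```

Repeated timeouts against the same host suggest that the instance can't reach that endpoint from its subnet, which points to the network settings in the following sections.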

Virtual private cloud (VPC) subnet routes

Make sure that the VPC subnet routes are configured correctly for the data source that your cluster is using. Follow the steps in Set up a VPC to host clusters.
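
As one example of a route check, the following sketch tests whether a route table has an active default route through an internet gateway or a NAT gateway. It assumes a dictionary shaped like one entry of RouteTables in the EC2 DescribeRouteTables response. The function name has_active_default_route is hypothetical. Whether your cluster needs such a route depends on your data source: for example, a private subnet that reaches its data only through VPC endpoints would not match this check and might still be configured correctly.

```python
def has_active_default_route(route_table):
    """Return True if the route table has an active 0.0.0.0/0 route
    through an internet gateway or a NAT gateway.

    Assumption: route_table is shaped like one entry of "RouteTables"
    in the EC2 DescribeRouteTables response.
    """
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") != "0.0.0.0/0":
            continue
        if route.get("State") != "active":
            continue
        gateway_id = route.get("GatewayId", "")
        if gateway_id.startswith("igw-") or route.get("NatGatewayId"):
            return True
    return False
```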

Security groups

Make sure that the master and core/task security groups are configured correctly for the subnet. For more information, see Working with Amazon EMR-managed security groups.

Also make sure that the default Amazon EMR roles and the instance profile role allow every action that your EMR cluster requires.

After the VPC subnet routes, security groups, and roles are configured, launch a new cluster.


Related information

Configure networking

AWS OFFICIAL
Updated a year ago