
EMR Spark Submit Spark History Server Internal Error


Since upgrading from EMR 6.X to EMR 7.X, the Spark History Server produces three distinct Error 500s. The first is a basic JSON exception:

com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input in field name at [Source: (String)"{"Event":"SparkListenerTaskEnd","Stage ID":7,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":28745,"Index":4983,"Attempt":0,"Partition ID":4983,"Launch Time":1732563560720,""; line: 1, column: 238]

The second is a JSON parse failure caused by log truncation:

com.fasterxml.jackson.core.io.JsonEOFException: Unexpected end-of-input: expected close marker for Object (start marker at [Source: (String)"{"Event":"SparkListenerTaskEnd","Stage ID":7,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":33240,"Index":9478,"Attempt":0,"Partition ID":9478,"Launch Time":1732563920000,"Executor ID":"2","Host":"ip-10-0-104-23.ec2.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1732563922955,"Failed":false,"Killed":false,"Accumulables":[{"ID":271,"Name":"data size","Update":"16","Value":"151696","Int"[truncated 2126 chars]; line: 1, column: 2520]) at [Source: (String)"{"Event":"SparkListenerTaskEnd","Stage ID":7,"Stage Attempt ID":0,"Task Type":"ShuffleMapTask","Task End Reason":{"Reason":"Success"},"Task Info":{"Task ID":33240,"Index":9478,"Attempt":0,"Partition ID":9478,"Launch Time":1732563920000,"Executor ID":"2","Host":"ip-10-0-104-23.ec2.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1732563922955,"Failed":false,"Killed":false,"Accumulables":[{"ID":271,"Name":"data size","Update":"16","Value":"151696","Int"[truncated 2126 chars]; line: 1, column: 2627]

The third is an s3a access error:

org.apache.hadoop.fs.s3a.AWSUnsupportedFeatureException: open s3a://prod.us-east-1.appinfo.src/j-REDACTED/sparklogs/application_1732562560484_0002_1 at 0 on s3a://prod.us-east-1.appinfo.src/j-REDACTED/sparklogs/application_1732562560484_0002_1: software.amazon.awssdk.services.s3.model.S3Exception: At least one of the pre-conditions you specified did not hold (Service: S3, Status Code: 412, Request ID: QEXK1G0EXEHF61TR, Extended Request ID: REDACTED):PreconditionFailed: At least one of the pre-conditions you specified did not hold (Service: S3, Status Code: 412, Request ID: QEXK1G0EXEHF61TR, Extended Request ID: REDACTED)

The only resolution I can find is updating the configuration (Classification: spark-env --> SPARK_DAEMON_MEMORY: 4g; the default is 1g). This doesn't work for the second issue and shouldn't affect the others.
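
For reference, the spark-env change I applied looks roughly like this as cluster configuration JSON (the nested "export" classification is how EMR exports environment variables; 4g is simply the value I tried):

  [
    {
      "Classification": "spark-env",
      "Configurations": [
        {
          "Classification": "export",
          "Properties": {
            "SPARK_DAEMON_MEMORY": "4g"
          }
        }
      ]
    }
  ]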

I have also tuned some spark-defaults (e.g., the spark.history.fs.* and spark.history.store.* settings).

Any support would be great, and I'm sure it would be helpful for others in my situation, as this makes debugging Spark pipelines incredibly difficult.

asked 13 days ago · 58 views
1 Answer
Accepted Answer

The issues you're experiencing with the Spark History Server after upgrading from EMR 6.X to EMR 7.X are complex and can be attributed to several factors. Let's address each problem and provide some potential solutions:

  1. JSON Exception: This error suggests that there is incomplete or malformed JSON data in the Spark event logs, possibly due to abrupt termination of tasks or issues with log writing. To mitigate this:
  • Ensure that your Spark jobs are completing successfully and not being terminated unexpectedly.
  • Check whether disk space issues on the nodes might be causing log truncation.
  2. JSON and Log Truncation: This error indicates that the JSON data is being truncated, possibly due to size limitations. To address this (see the example configuration after this list):
  • Enable rolling event logs and increase the maximum size of each event log file by setting spark.eventLog.rolling.maxFileSize to a larger value.
  • Adjust spark.eventLog.buffer.kb to increase the buffer size for event logging.
  3. S3A Issue: This error suggests problems with accessing S3 when reading Spark history logs. To resolve this:
  • Ensure that the EMR cluster has the necessary permissions to access the S3 bucket where the logs are stored.
  • Check whether any S3 bucket policies or ACLs are restricting access.
  • Verify that the S3 endpoint is correctly configured and accessible from your EMR cluster.
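
As a rough sketch only, those event-log settings could be applied through a spark-defaults configuration classification; the values below are illustrative placeholders, not tested recommendations:

  [
    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.eventLog.rolling.enabled": "true",
        "spark.eventLog.rolling.maxFileSize": "256m",
        "spark.eventLog.buffer.kb": "1024k"
      }
    }
  ]

Note that the rolling and buffer settings change how new applications write their event logs, so they will not repair history files that were already truncated.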

Additional recommendations:

  1. Updating SPARK_DAEMON_MEMORY to 4g is a good start, but you might need to increase it further depending on the size of your logs and the number of applications.

  2. Consider adjusting the following Spark configurations (a short sketch of how these might look follows this list):

    • spark.history.fs.cleaner.enabled: Set to true to enable automatic cleaning of old logs.
    • spark.history.fs.cleaner.maxAge: Adjust to control how long to retain old logs.
    • spark.history.fs.cleaner.interval: Set the frequency of the cleaning process.
  3. If you're using EMR on EKS, ensure that your Kubernetes cluster has sufficient resources allocated for the Spark History Server.

  4. Review your Spark application configurations to ensure they're optimized for EMR 7.X, as there might be changes in default settings or behavior compared to 6.X.

  5. Check the EMR and Spark release notes for any known issues or changes that might affect the History Server in version 7.X.
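
For illustration, the History Server cleaner settings from point 2 might look like this in spark-defaults (the retention values are placeholders to adjust to how long you need history available):

  spark.history.fs.cleaner.enabled    true
  spark.history.fs.cleaner.maxAge     7d
  spark.history.fs.cleaner.interval   1d

On EMR, these are typically applied through the spark-defaults configuration classification at cluster creation rather than by editing spark-defaults.conf on the node directly.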

If these solutions don't resolve the issues, you may need to engage AWS support for a more in-depth investigation, as there could be specific interactions between your workload characteristics and the EMR 7.X environment causing these problems.

Remember to test any configuration changes in a non-production environment first to ensure they don't negatively impact your Spark jobs or cluster performance.

Sources
Amazon EMR: Custom JAR cluster errors - Amazon EMR
AWS re:Post Knowledge Center Spotlight: Amazon EMR | AWS re:Post
5.1 - Spark General | AWS Open Data Analytics
Common Errors - Amazon EMR on EKS

answered 13 days ago
reviewed by an EXPERT 13 days ago
  • Tuning those parameters helped; it is still failing to render intermittently, but I imagine I just need to turn up the tolerances. Since Spark was upgraded from 3.3.X to 3.5.0 between those EMR versions, a root cause may be the event log buffer actually being set in bytes instead of kilobytes (see this JIRA ticket for Spark 3.5.0: https://issues.apache.org/jira/browse/SPARK-45333 ). Tuning that parameter alone did not solve the problem, however.
