The issues you're experiencing with the Spark History Server after upgrading from EMR 6.X to EMR 7.X are complex and can be attributed to several factors. Let's address each problem and provide some potential solutions:
- JSON Exception: This error suggests that there might be incomplete or malformed JSON data in the Spark event logs. This could be due to abrupt termination of tasks or issues with log writing. To mitigate this:
- Ensure that your Spark jobs are completing successfully and not being terminated unexpectedly.
- Check if there are any disk space issues on the nodes that might be causing log truncation.
- JSON and Log Truncation: This error indicates that the JSON data is being truncated, possibly due to size limitations. To address this (see the configuration sketch after this list):
- Increase the maximum size of rolling event log files by setting spark.eventLog.rolling.maxFileSize to a larger value.
- Adjust spark.eventLog.buffer.kb to increase the buffer size for event logging.
- S3A Issue: This error suggests problems with accessing S3 for reading Spark history logs. To resolve this:
- Ensure that the EMR cluster has the necessary permissions to access the S3 bucket where the logs are stored (a minimal IAM policy sketch follows this list).
- Check if there are any S3 bucket policies or ACLs that might be restricting access.
- Verify that the S3 endpoint is correctly configured and accessible from your EMR cluster.
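As a rough sketch of how the event-log settings above could be applied, here is an EMR configuration using the spark-defaults classification; the values 256m and 1000k are illustrative assumptions, not tuned recommendations for your workload:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.eventLog.rolling.enabled": "true",
      "spark.eventLog.rolling.maxFileSize": "256m",
      "spark.eventLog.buffer.kb": "1000k"
    }
  }
]
```

These classifications can be supplied when the cluster is created or applied later through EMR reconfiguration; test the values in a non-production cluster first.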
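For the S3 access problem, the role used by the History Server (the EC2 instance profile on EMR on EC2, or the job execution role on EMR on EKS) needs at least list and read access to the event-log location. A minimal policy sketch, with a placeholder bucket name and prefix that you would replace with your own:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::your-spark-event-log-bucket"
    },
    {
      "Effect": "Allow",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-spark-event-log-bucket/spark-history/*"
    }
  ]
}
```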
Additional recommendations:
- Updating SPARK_DAEMON_MEMORY to 4g is a good start, but you might need to increase it further depending on the size of your logs and the number of applications.
- Consider adjusting the following Spark configurations (these and the SPARK_DAEMON_MEMORY setting are shown together in the sketch after this list):
- spark.history.fs.cleaner.enabled: Set to true to enable automatic cleaning of old logs.
- spark.history.fs.cleaner.maxAge: Adjust to control how long to retain old logs.
- spark.history.fs.cleaner.interval: Set the frequency of the cleaning process.
- If you're using EMR on EKS, ensure that your Kubernetes cluster has sufficient resources allocated for the Spark History Server.
- Review your Spark application configurations to ensure they're optimized for EMR 7.X, as there might be changes in default settings or behavior compared to 6.X.
- Check the EMR and Spark release notes for any known issues or changes that might affect the History Server in version 7.X.
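A combined sketch of the SPARK_DAEMON_MEMORY and cleaner recommendations above, again as EMR configuration classifications; 4g comes from your current setting, while the 7d retention and 1d interval are illustrative assumptions:

```json
[
  {
    "Classification": "spark-env",
    "Configurations": [
      {
        "Classification": "export",
        "Properties": {
          "SPARK_DAEMON_MEMORY": "4g"
        }
      }
    ],
    "Properties": {}
  },
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.history.fs.cleaner.enabled": "true",
      "spark.history.fs.cleaner.maxAge": "7d",
      "spark.history.fs.cleaner.interval": "1d"
    }
  }
]
```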
If these solutions don't resolve the issues, you may need to engage AWS support for a more in-depth investigation, as there could be specific interactions between your workload characteristics and the EMR 7.X environment causing these problems.
Remember to test any configuration changes in a non-production environment first to ensure they don't negatively impact your Spark jobs or cluster performance.
Tuning those parameters helped; it is still failing to render intermittently, but I imagine I just need to turn up the tolerances further. Since Spark was upgraded from 3.3.X to 3.5.0 between those EMR versions, a possible root cause is the event log buffer setting actually being applied in bytes instead of kilobytes (see this JIRA ticket for Spark 3.5.0: https://issues.apache.org/jira/browse/SPARK-45333). Tuning that parameter alone did not solve the problem, however.