
EMR Spark error on df.count


Hello! We're trying to migrate from a stand-alone Hive Metastore to AWS Glue. We've modified the definition of some EMR clusters (v7.0.0) to use Glue as the metastore; we use Spark on Hadoop to process data. No other change was applied. Since the change, some jobs (3 of 36) have failed. The error is not in the logic or the data: the scripts are the same and the data is the same (the Hive Metastore and Glue hold only the metadata; the data itself lives in S3). The error comes up from some df.count() operations, and the error trace is purely from Spark; none of it is our code. While looking for clues on Spark's side, we only found a related bug in an older version of Spark (3.3.0) that was already fixed. In some cases the df has only one string field, so it's definitely not a data issue. Do you have any hint on how to debug or fix this?

This is the error trace:

py4j.protocol.Py4JJavaError: An error occurred while calling o1531.count.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 28 in stage 0.0 failed 4 times, most recent failure: Lost task 28.3 in stage 0.0 (TID 121) ([redacted] executor 4): java.lang.NumberFormatException: For input string: ""
    at java.base/java.lang.NumberFormatException.forInputString(NumberFormatException.java:67)
    at java.base/java.lang.Long.parseLong(Long.java:721)
    at java.base/java.lang.Long.parseLong(Long.java:836)
    at java.base/java.text.DigitList.getLong(DigitList.java:195)
    at java.base/java.text.DecimalFormat.parse(DecimalFormat.java:2197)
    at java.base/java.text.SimpleDateFormat.subParse(SimpleDateFormat.java:2244)
    at java.base/java.text.SimpleDateFormat.parse(SimpleDateFormat.java:1545)
    at java.base/java.text.DateFormat.parse(DateFormat.java:397)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringDateObjectInspector.parse(JavaStringDateObjectInspector.java:63)
    at org.openx.data.jsonserde.objectinspector.primitive.JavaStringDateObjectInspector.getPrimitiveJavaObject(JavaStringDateObjectInspector.java:54)
    at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$15(TableReader.scala:528)
    at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$15$adapted(TableReader.scala:527)
    at org.apache.spark.sql.hive.HadoopTableReader$.$anonfun$fillObject$18(TableReader.scala:551)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:461)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.UnsafeRowInterceptor.hasNext(UnsafeRowInterceptor.java:24)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.hasNext(Unknown Source)
    at org.apache.spark.sql.execution.WholeStageCodegenEvaluatorFactory$WholeStageCodegenPartitionEvaluator$$anon$1.hasNext(WholeStageCodegenEvaluatorFactory.scala:43)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.start(UnsafeShuffleWriter.java:228)
    at org.apache.spark.shuffle.DirectShuffleWriteProcessor.doWrite(DirectShuffleWriteProcessor.scala:44)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:68)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:104)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:54)
    at org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:161)
    at org.apache.spark.scheduler.Task.run(Task.scala:143)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:629)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)
    at org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:95)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:632)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)

asked 8 months ago · 192 views
2 Answers

The error you're encountering appears to be related to date parsing during the table scan. The specific java.lang.NumberFormatException: For input string: "" means an empty string is being handed to a strict date parser, and the stack trace places the failure inside the OpenX JsonSerDe (JavaStringDateObjectInspector), which Spark invokes while deserializing rows from your table.

This issue might be related to how date fields are handled differently between Hive Metastore and AWS Glue Catalog. When you migrated from Hive Metastore to Glue, the metadata for date columns might be interpreted differently, causing parsing errors when empty strings are encountered.
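One quick way to check this is to compare the column types Glue recorded against what the Hive Metastore had: a column that Hive stored as string but Glue lists as date would route every read through JavaStringDateObjectInspector, exactly as in the trace. A minimal sketch of pulling the types from Glue with boto3 (the helper itself is generic; the database and table names in the commented-out live call are placeholders):

```python
def column_types(get_table_response: dict) -> dict:
    """Map column name -> Glue type from a glue.get_table response."""
    storage = get_table_response["Table"]["StorageDescriptor"]
    return {col["Name"]: col["Type"] for col in storage["Columns"]}

# Live usage (requires AWS credentials; names below are placeholders):
# import boto3
# glue = boto3.client("glue", region_name="eu-central-1")
# resp = glue.get_table(DatabaseName="my_db", Name="my_table")
# print(column_types(resp))  # compare against DESCRIBE output under Hive MS
```

Any column that comes back as date or timestamp here but was string under the old metastore is a likely culprit.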

There are a few approaches you can try to resolve this:

  1. Check date column formats: Review the schema of the tables that are causing issues, particularly focusing on date columns. The error trace shows the issue is in JavaStringDateObjectInspector, which handles date conversions.

  2. Handle null/empty values: You might need to add data quality checks or transformations to handle empty strings in date fields before performing operations like count().

  3. Adjust date format settings: You may need to specify date formats explicitly in your Spark configuration to ensure consistent parsing between Hive and Glue.

  4. Use dynamic frames: If you're working with AWS Glue, consider converting your Spark dataframe to a dynamic frame after operations like exceptAll() and then performing the count. This approach has helped others with similar issues.

  5. Check for Spark version issues: While you mentioned EMR 7.0.0, it's worth confirming the exact Spark version. There was a known bug in Spark 3.3.0 related to operations like count() after exceptAll() that was fixed in Spark 3.3.1.

For immediate debugging, you could try adding data validation steps before the count() operation to identify and handle problematic records, or implement error handling around the date parsing logic.
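Note that count() is affected even though it never looks at the values: it still materializes every row through the SerDe, so one empty string in a column typed as date aborts the whole scan. The failure mode, and the guard that the SerDe's date inspector lacks, illustrated in plain Python (strptime raises on "" just as SimpleDateFormat does in the trace; parse_hive_date is a hypothetical name used only for this sketch):

```python
from datetime import date, datetime
from typing import Optional

def parse_hive_date(raw: Optional[str]) -> Optional[date]:
    """Illustrative guard: return None for null/empty input instead of
    letting strict date parsing raise, mirroring the check that the
    JsonSerDe's JavaStringDateObjectInspector does not perform."""
    if raw is None or raw.strip() == "":
        return None
    return datetime.strptime(raw, "%Y-%m-%d").date()
```

In Spark terms, the equivalent defensive move is to keep the column typed as string in the catalog and cast it with to_date() in the query, which yields NULL for unparseable input instead of failing the task.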
Sources
Unable to do count or isempty call in dataframe with exceptall | AWS re:Post
EMR with container images on yarn python version error | AWS re:Post

answered 8 months ago
  • I've checked the recommended steps.

    1. Check date column formats: Review the schema of the tables that are causing issues, particularly focusing on date columns. The error trace shows the issue is in JavaStringDateObjectInspector, which handles date conversions. --> The df has only one string field, it's not a conversion issue.

    2. Handle null/empty values: You might need to add data quality checks or transformations to handle empty strings in date fields before performing operations like count(). --> There are no null/empty values in the df. The step that fails is actually a data quality check and it logs some data from the df.

    3. Adjust date format settings: You may need to specify date formats explicitly in your Spark configuration to ensure consistent parsing between Hive and Glue. --> It's not related to date / timestamp fields.

    4. Use dynamic frames: If you're working with AWS Glue, consider converting your Spark dataframe to a dynamic frame after operations like exceptAll() and then performing the count. This approach has helped others with similar issues. --> I'll test this approach, it seems overkill to convert a dataframe into a dynamic frame just to run a count().

    5. Check for Spark version issues: While you mentioned EMR 7.0.0, it's worth confirming the exact Spark version. There was a known bug in Spark 3.3.0 related to operations like count() after exceptAll() that was fixed in Spark 3.3.1. --> The version of Spark is 3.5.0. The same code doesn't fail with Hive MS.


Hey,

In order to help us investigate further, can you please share the region and job run ID?

AWS
SUPPORT ENGINEER
answered 7 months ago
  • The region is eu-central-1. One example of such errors happened in application_1749122512341_0006 (cluster id j-28BL5ST6S261J), executed on 2025-06-05.

