The error you're encountering appears to be related to date parsing in your Spark job. The specific error, `java.lang.NumberFormatException: For input string: ""`, suggests that Spark is trying to parse an empty string as a date value when reading data from your table.
This issue might be related to how date fields are handled differently between Hive Metastore and AWS Glue Catalog. When you migrated from Hive Metastore to Glue, the metadata for date columns might be interpreted differently, causing parsing errors when empty strings are encountered.
There are a few approaches you can try to resolve this:
- Check date column formats: Review the schema of the tables that are causing issues, particularly focusing on date columns. The error trace shows the issue is in `JavaStringDateObjectInspector`, which handles date conversions.
- Handle null/empty values: You might need to add data quality checks or transformations to handle empty strings in date fields before performing operations like `count()`.
- Adjust date format settings: You may need to specify date formats explicitly in your Spark configuration to ensure consistent parsing between Hive and Glue.
- Use dynamic frames: If you're working with AWS Glue, consider converting your Spark DataFrame to a DynamicFrame after operations like `exceptAll()` and then performing the count. This approach has helped others with similar issues.
- Check for Spark version issues: While you mentioned EMR 7.0.0, it's worth confirming the exact Spark version. There was a known bug in Spark 3.3.0 related to operations like `count()` after `exceptAll()` that was fixed in Spark 3.3.1.
For immediate debugging, you could try adding data validation steps before the `count()` operation to identify and handle problematic records, or implement error handling around the date parsing logic.
Sources
Unable to do count or isempty call in dataframe with exceptall | AWS re:Post
EMR with container images on yarn python version error | AWS re:Post
Hey,
In order to help us investigate further, can you please share the region and job run ID?
The region is eu-central-1. One example of such errors happened in application_1749122512341_0006 (cluster id j-28BL5ST6S261J), executed on 2025-06-05.

I've checked the recommended steps.

- Check date column formats: Review the schema of the tables that are causing issues, particularly focusing on date columns. The error trace shows the issue is in `JavaStringDateObjectInspector`, which handles date conversions. --> The df has only one string field; it's not a conversion issue.
- Handle null/empty values: You might need to add data quality checks or transformations to handle empty strings in date fields before performing operations like `count()`. --> There are no null/empty values in the df. The step that fails is actually a data quality check, and it logs some data from the df.
- Adjust date format settings: You may need to specify date formats explicitly in your Spark configuration to ensure consistent parsing between Hive and Glue. --> It's not related to date/timestamp fields.
- Use dynamic frames: If you're working with AWS Glue, consider converting your Spark DataFrame to a DynamicFrame after operations like `exceptAll()` and then performing the count. This approach has helped others with similar issues. --> I'll test this approach, though it seems overkill to convert a DataFrame into a DynamicFrame just to run a `count()`.
- Check for Spark version issues: While you mentioned EMR 7.0.0, it's worth confirming the exact Spark version. There was a known bug in Spark 3.3.0 related to operations like `count()` after `exceptAll()` that was fixed in Spark 3.3.1. --> The version of Spark is 3.5.0. The same code doesn't fail with Hive MS.