org.apache.spark.SparkUpgradeException when using AWS Glue DQ


Hi, we are testing AWS Glue DQ by running some checks on our tables. We have configured a check to run against a Data Catalog table.

When running a check on the date column, we get the error below. We had seen this issue previously when running Spark on EMR, and we were able to resolve it there by setting a Spark property to LEGACY.

With AWS Glue DQ, we don't have access to the Spark properties. What is the recommended fix for this?

Rule: Rules = [ ColumnValues "date" > (now() - 1 days) ]

Error: Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 15 in stage 1.0 failed 4 times, most recent failure: Lost task 15.3 in stage 1.0 (TID 22) (172.35.90.225 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.

asked a year ago · 1,170 views
2 Answers

Hello,

Thanks for reaching out.

As you already know, the issue relates to the Parquet date/timestamp format differences between Spark 2.x and Spark 3.x.

Regarding the "LEGACY" setting, you can do that in Glue as well; the setting is accessible under "Job details" -> "Advanced properties" -> "Job parameters", as documented at [1].
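For illustration, the job parameter would look something like this (a sketch — "--conf" is the Glue special job parameter for passing Spark configuration, and AWS recommends using it with care):

```
Key:   --conf
Value: spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY
```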

Also, as documented at [2], Glue 2.0 uses Spark 2.4.3 while Glue 3.0+ uses Spark 3.x, so as a workaround you could also try running on Glue 2.0 to stay on Spark 2.4.3.

Hope the above information is helpful.

References:

[1] - https://docs.aws.amazon.com/glue/latest/dg/migrating-version-40.html#migrating-version-40-from-20

[2] - https://docs.aws.amazon.com/glue/latest/dg/release-notes.html

AWS
Thi_N
answered a year ago
  • Hi,

    Thanks for the responses.

    I'm specifically using AWS Glue DQ to run data quality rulesets, which doesn't expose configuration for these parameters.


The issue you're experiencing stems from the upgrade to Spark 3.0, which uses the Proleptic Gregorian calendar, while Spark 2.x and legacy versions of Hive used a hybrid Julian/Gregorian calendar. This makes dates before 1582-10-15 and timestamps before 1900-01-01T00:00:00Z ambiguous when read from Parquet files written by those older systems. The error message suggests two solutions: set spark.sql.legacy.parquet.datetimeRebaseModeInRead to LEGACY to rebase the datetime values to account for the calendar difference during reading, or set it to CORRECTED to read the datetime values as they are.

In AWS Glue there are multiple ways to set Spark properties, but the Glue DQ ruleset interface doesn't directly expose Spark SQL properties. A potential workaround is to run your quality checks from a Glue ETL script instead, where you can set these properties on the Spark session before evaluating the ruleset. Unfortunately, due to time constraints, I was unable to confirm the exact steps to do this.

One way you could potentially solve this issue is by transforming your data so that it no longer contains dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z. This might mean adding a preprocessing step before your AWS Glue DQ checks.
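As a sketch of that preprocessing idea (plain Python here for illustration; in a real Glue job you would express the same filter in PySpark before the DQ evaluation runs):

```python
from datetime import date

# Dates at or after this cutoff are unambiguous across the legacy
# hybrid Julian/Gregorian calendar and the Proleptic Gregorian
# calendar used by Spark 3.x, so they are safe to read either way.
PARQUET_DATE_CUTOFF = date(1582, 10, 15)

def is_ambiguous(d: date) -> bool:
    """Return True if Spark 3.x may rebase this date differently."""
    return d < PARQUET_DATE_CUTOFF

rows = [date(1581, 1, 1), date(2023, 6, 30), date(1582, 10, 14)]
clean = [d for d in rows if not is_ambiguous(d)]
# clean now holds only date(2023, 6, 30)
```

Whether dropping (or clamping) such rows is acceptable depends on your data; for most operational tables, pre-1582 dates are usually data-entry errors anyway.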

EXPERT
answered a year ago
