org.apache.spark.SparkUpgradeException when using AWS Glue DQ

Hi, we are testing AWS Glue DQ to run some checks on our tables. We have configured the check to run against a Data Catalog table.

When running a check on the date column, we get the error below. We had seen this issue previously when running Spark on EMR, but we were able to resolve it by adding a Spark property and setting its value to LEGACY.

With AWS Glue, we don't have access to the Spark properties. What is the recommended fix for this?

Rule: Rules = [ ColumnValues "date" > (now() - 1 days) ]

Error: Exception in User Class: org.apache.spark.SparkException : Job aborted due to stage failure: Task 15 in stage 1.0 failed 4 times, most recent failure: Lost task 15.3 in stage 1.0 (TID 22) (172.35.90.225 executor 3): org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files can be ambiguous, as the files may be written by Spark 2.x or legacy versions of Hive, which uses a legacy hybrid calendar that is different from Spark 3.0+'s Proleptic Gregorian calendar. See more details in SPARK-31404. You can set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'LEGACY' to rebase the datetime values w.r.t. the calendar difference during reading. Or set spark.sql.legacy.parquet.datetimeRebaseModeInRead to 'CORRECTED' to read the datetime values as it is.
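
For context, this is roughly how we worked around it on EMR (a minimal sketch; the session setup is simplified and the S3 path is a placeholder):

    # Sketch of the EMR-side workaround: rebase ancient dates written by
    # Spark 2.x / legacy Hive while reading Parquet (see SPARK-31404).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

    df = spark.read.parquet("s3://my-bucket/my-table/")  # placeholder path
    df.filter("date > date_sub(current_date(), 1)").count()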

Asked a year ago, 1,214 views
2 Answers

Hello,

Thanks for reaching out.

As you already know, the issue is related to the difference in Parquet date/timestamp handling between Spark 2.x and Spark 3.x.

Regarding the "LEGACY" setting, you can set it in Glue as well; the setting is accessible under "Job details" -> "Advanced properties" -> "Job parameters", as documented at [1].
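
For example, assuming your job runs on Glue 3.0+ and accepts the standard --conf job parameter, the entry would look roughly like this (key and value as entered in the console):

    Key:   --conf
    Value: spark.sql.legacy.parquet.datetimeRebaseModeInRead=LEGACY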

In addition, as documented at [2], Glue 2.0 uses Spark 2.4.3, while Glue 3.0+ uses Spark 3.x, so running the job on Glue 2.0 is another possible workaround.

Hope the above information is helpful.

================ References:

[1] - https://docs.aws.amazon.com/glue/latest/dg/migrating-version-40.html#migrating-version-40-from-20

[2] - https://docs.aws.amazon.com/glue/latest/dg/release-notes.html

AWS
Thi_N
Answered a year ago
  • Hi,

    Thanks for the responses.

    I'm specifically using AWS Glue DQ to run data quality rulesets, which don't have configuration options for these parameters.

The issue you're experiencing stems from the upgrade to Spark 3.0, which uses a different calendar system (Proleptic Gregorian) compared to Spark 2.x and legacy versions of Hive. This can cause ambiguity when reading dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z from Parquet files, as these files may have been written by Spark 2.x or legacy versions of Hive that use a legacy hybrid calendar. The error message you've provided suggests two solutions: setting spark.sql.legacy.parquet.datetimeRebaseModeInRead to LEGACY to rebase the datetime values with respect to the calendar difference during reading, or setting spark.sql.legacy.parquet.datetimeRebaseModeInRead to CORRECTED to read the datetime values as they are.

In AWS Glue, there are multiple ways to set Spark properties, but none of them seems to expose these Spark SQL properties for Glue DQ rulesets directly. One potential workaround is to set the properties from within a Glue ETL script (via the Spark session or a Spark SQL SET command) before reading the data; unfortunately, due to time constraints, I was unable to confirm the exact steps for Glue DQ itself.
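
To illustrate the idea, here is a rough sketch of such a Glue ETL script (unverified; the database and table names are placeholders, and it runs the equivalent check with plain Spark rather than through a DQ ruleset):

    # Rough sketch of a Glue ETL script that sets the rebase mode before
    # reading the catalog table. Database/table names are placeholders.
    import sys
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext())
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Rebase legacy hybrid-calendar dates while reading Parquet (SPARK-31404).
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")

    df = glue_context.create_dynamic_frame.from_catalog(
        database="my_database",   # placeholder
        table_name="my_table",    # placeholder
    ).toDF()

    # Equivalent of the DQ rule: ColumnValues "date" > (now() - 1 days)
    failing_rows = df.filter("date <= date_sub(current_date(), 1)").count()
    print(f"Rows failing the date check: {failing_rows}")

    job.commit()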

One way you could potentially solve this is to transform your data so that it no longer contains dates before 1582-10-15 or timestamps before 1900-01-01T00:00:00Z. This might involve adding a preprocessing step that rewrites the table before your AWS Glue DQ checks run.
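
A minimal sketch of such a preprocessing step, assuming plain PySpark and placeholder S3 paths (the write-side property name may differ slightly by Spark version):

    # Read with LEGACY rebase, then rewrite so the output Parquet uses the
    # proleptic Gregorian calendar and no longer triggers the exception.
    from pyspark.sql import SparkSession, functions as F

    spark = (
        SparkSession.builder.appName("rebase-legacy-dates")
        .config("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
        .config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
        .getOrCreate()
    )

    df = spark.read.parquet("s3://my-bucket/raw/my-table/")  # placeholder path

    # Optionally drop rows with clearly out-of-range ancient dates before DQ runs.
    df = df.filter(F.col("date") >= F.lit("1900-01-01").cast("date"))

    df.write.mode("overwrite").parquet("s3://my-bucket/clean/my-table/")  # placeholder path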

EXPERT
Answered a year ago
