EMR Serverless jar configuration


I'm using EMR Serverless to validate data stored in S3 with the deequ library, but the job fails with this error:

Traceback (most recent call last):
  File "/tmp/spark-f2b9d6f8-9bb9-4879-a398-1f67f9ec5e70/app3.py", line 179, in <module>
    .addConstraintRule(UniqueIfApproximatelyUniqueRule())
  File "/home/hadoop/environment/lib64/python3.7/site-packages/pydeequ/suggestions.py", line 81, in run
    result = self._ConstraintSuggestionRunBuilder.run()
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o177.run.
: com.amazon.deequ.analyzers.runners.MetricCalculationRuntimeException: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 19) ([2600:1f18:2d85:5e03:ba20:a78b:a26c:61ab] executor 1): java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.aggregate.HashAggregateExec.aggregateExpressions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.aggregate.HashAggregateExec

How can I resolve this error?

asked a year ago, 990 views
1 Answer

Hello,

Thank you for raising this question on re:Post.

From the stack trace you shared, I can see a ClassCastException during serialization, which points to an incompatibility between the classes being serialized in the Spark application. However, this alone is not enough to identify the root cause. Please provide the following so that we can assist you further:

  1. Are you able to run this job successfully on an EMR on EC2 cluster?
  2. Are you able to run a test job without deequ, to confirm that the EMR Serverless job works as expected without this dependency?
  3. Please share how you are adding the deequ libraries to the EMR Serverless runtime environment. Your start-job-run command should contain these details.
  4. Please share reproduction steps, including a link to download the deequ library if it is publicly available.
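
For item 3, a submission that attaches the deequ JAR might look roughly like the sketch below. It builds the arguments for the boto3 `emr-serverless` client's `start_job_run` call. The bucket, JAR path, script path, application ID, and role ARN are hypothetical placeholders, not values from the question. Deequ artifacts are published per Spark version, so the JAR and the Spark version of the EMR release are worth stating when you share these details.

```python
# Hypothetical placeholders: none of these values come from the question.
DEEQU_JAR = "s3://example-bucket/jars/deequ.jar"
ENTRY_POINT = "s3://example-bucket/scripts/app3.py"
APPLICATION_ID = "example-application-id"
EXECUTION_ROLE_ARN = "arn:aws:iam::123456789012:role/example-emr-serverless-role"


def build_job_run_request(application_id, execution_role_arn, entry_point,
                          jars, spark_conf=None):
    """Build keyword arguments for the EMR Serverless StartJobRun API."""
    # --jars takes a comma-separated list; Spark ships these JARs to the
    # driver and the executors.
    params = ["--jars", ",".join(jars)]
    for key, value in sorted((spark_conf or {}).items()):
        params += ["--conf", f"{key}={value}"]
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": entry_point,
                "sparkSubmitParameters": " ".join(params),
            }
        },
    }


def submit(request):
    """Submit the job run. Requires boto3, AWS credentials, and a region."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr-serverless")
    response = client.start_job_run(**request)
    return response["jobRunId"]


request = build_job_run_request(
    APPLICATION_ID, EXECUTION_ROLE_ARN, ENTRY_POINT, [DEEQU_JAR]
)
print(request["jobDriver"]["sparkSubmit"]["sparkSubmitParameters"])
```

Calling `submit(request)` would start the run. Sharing the equivalent of the printed `sparkSubmitParameters` string from your own job shows exactly how the deequ JAR reaches the runtime.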
AWS
SUPPORT ENGINEER
answered a year ago
