No Such Method Error with Spark and Hudi

I have been fighting this problem for several days now and am at my wits' end.

We have an EMR cluster that we launch to process data into a Hudi dataset. We start the cluster with an API call, specifying Spark, Hive, Tez, and EMR release label emr-5.29.0. We set a few configurations, notably "hive.metastore.client.factory.class", which we point at Glue with "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory". We then run our little script, which runs a spark.sql query and then writes to the Hudi dataset. We follow the configuration steps laid out here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.

The script works in Scala in an EMR Notebook, as well as from the spark-shell. But when we try to spark-submit it, we get the following error:
'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
To compound the frustration, we originally wrote the script in Python, and there we hit this same error everywhere: in the EMR Notebook, in the pyspark shell, and with spark-submit.

Some preliminary reading suggests that this is caused by incompatible jars. By process of elimination, we think the conflict is between the Hudi jars and the AWS Glue functionality.
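
One way to narrow down a jar conflict like this is to list every jar that ships the class named in the error. The following is only a rough sketch, not something taken from the EMR or Hudi docs: the helper name find_jars_with_class is made up for illustration, and the directories scanned are assumptions about where the relevant jars live on the cluster. Note that it only matches the class under its original package name, so a copy that a bundle jar has relocated (shaded) would not show up.

```python
import zipfile
from pathlib import Path


def find_jars_with_class(directories, class_name):
    """Return sorted paths of jars under `directories` that contain `class_name`.

    Hypothetical helper for illustration; a jar is just a zip archive, so we
    look for the class file entry by its path inside the archive.
    """
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for directory in directories:
        root = Path(directory)
        if not root.is_dir():
            continue  # skip directories that do not exist on this machine
        for jar in root.rglob("*.jar"):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if entry in archive.namelist():
                        matches.append(str(jar))
            except (zipfile.BadZipFile, OSError):
                continue  # unreadable or corrupt jar, ignore it
    return sorted(matches)


if __name__ == "__main__":
    # Assumed locations; adjust to wherever your cluster keeps its jars.
    dirs = ["/usr/lib/spark/jars", "/usr/lib/hudi", "/usr/lib/spark/external/lib"]
    cls = "org.apache.http.conn.ssl.SSLConnectionSocketFactory"
    for path in find_jars_with_class(dirs, cls):
        print(path)
```

If more than one jar is printed, those are the candidates for the version clash, and the httpclient version on each is worth comparing.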

Does anyone have any thoughts?

The whole stack trace, if it helps any:
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:58)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:93)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:66)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:59)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:237)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:223)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:177)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:163)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:61)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:27)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.amazonaws.glue.catalog.metastore.AWSGlueClientFactory.newClient(AWSGlueClientFactory.java:72)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:146)
at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:16)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3007)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3042)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1235)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:175)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:167)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 77 more

JCubeta
asked 4 years ago, 790 views
2 answers

Got the answer for this from a kind person on the Hudi Slack channel.

The issue is with httpclient. Once we explicitly passed a specific httpclient version in the jars list, everything resolved:
--jars /usr/lib/spark/jars/httpclient-4.5.9.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
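
For anyone launching the job from a script, here is a minimal sketch of assembling the same spark-submit invocation in Python. The jar paths are the ones from the flag above; the helper name build_spark_submit and the script name my_hudi_job.py are placeholders I made up, and the httpclient version on your cluster may differ from 4.5.9.

```python
import shlex

# Jar paths copied from the --jars flag above; check that they exist on your cluster.
JARS = [
    "/usr/lib/spark/jars/httpclient-4.5.9.jar",
    "/usr/lib/hudi/hudi-spark-bundle.jar",
    "/usr/lib/spark/external/lib/spark-avro.jar",
]


def build_spark_submit(script, jars=JARS):
    """Return the spark-submit command as an argument list (hypothetical helper)."""
    # --jars takes a single comma-separated value with no spaces.
    return ["spark-submit", "--jars", ",".join(jars), script]


if __name__ == "__main__":
    # my_hudi_job.py is a placeholder for your own script.
    print(shlex.join(build_spark_submit("my_hudi_job.py")))
```

The returned list can be handed to subprocess.run as is, which avoids shell quoting problems with the comma-separated jar list.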

Hope it helps someone else.

JCubeta
answered 4 years ago

Absolute lifesaver. The AWS Hudi docs do not mention the httpclient jar, only the other two. Thank you!

ilya745
answered 4 years ago
