NoSuchMethodError with Spark and Hudi


I have been colliding with this problem for several days now, and am at my wits' end.

We have an EMR cluster that we launch to process data into a Hudi data set. We start the cluster with an API call, specifying Spark, Hive, Tez, and EMR release label emr-5.29.0. We set a few configurations, notably "hive.metastore.client.factory.class", set to "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" for Glue. We then run our little script, which calls a spark.sql query, then writes to the Hudi set. We follow the configuration steps laid out here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.
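For reference, the Glue metastore setting described above is typically passed as an EMR configuration classification when the cluster is launched. A minimal sketch (the exact classifications you need depend on whether Hive, Spark, or both should use Glue):

```json
[
  {
    "Classification": "hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  },
  {
    "Classification": "spark-hive-site",
    "Properties": {
      "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
    }
  }
]
```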

The script works in Scala on an EMR Notebook, as well as from the spark-shell. But when we try to spark-submit it, we get the following error:
'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
To compound the frustration, we had originally written the script in Python, but there we encountered this same error within the EMR Notebook, the pyspark shell, and via spark-submit.

Some preliminary reading suggests that this is caused by incompatible jars; by process of elimination we think it is something with the Hudi jars and the AWS Glue functionality.

Does anyone have any thoughts?

The whole stack trace, if it helps any:
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:58)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:93)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:66)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:59)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:237)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:223)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:177)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:163)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:61)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:27)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.amazonaws.glue.catalog.metastore.AWSGlueClientFactory.newClient(AWSGlueClientFactory.java:72)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:146)
at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:16)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3007)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3042)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1235)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:175)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:167)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 77 more

JCubeta
asked 4 years ago · 790 views
2 answers

Got the answer for this from a kindly person on the Hudi slack channel.

The issue is with httpclient. By specifying a version of that jar when we passed in the jars, everything resolved:
--jars /usr/lib/spark/jars/httpclient-4.5.9.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
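Put together, the full invocation looks roughly like the sketch below. The jar paths are the EMR 5.29.0 defaults quoted above; the job jar name and the two `--conf` flags (the Kryo serializer and Parquet setting recommended in the AWS Hudi docs) are illustrative assumptions, not part of the original answer:

```shell
# job-name.jar is a placeholder for your application artifact
spark-submit \
  --jars /usr/lib/spark/jars/httpclient-4.5.9.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --conf "spark.sql.hive.convertMetastoreParquet=false" \
  job-name.jar
```

Pinning httpclient explicitly on the driver/executor classpath ensures the AWS SDK resolves the 4.5.x `SSLConnectionSocketFactory` constructor instead of an older shaded copy.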

Hope it helps someone else.

JCubeta
answered 4 years ago

Absolute lifesaver. The AWS Hudi docs do not specify the httpclient jar, only the other two. Thank you!

ilya745
answered 4 years ago
