No Such Method Error with Spark and Hudi

I have been wrestling with this problem for several days now and am at my wits' end.

We have an EMR cluster that we launch to process data into a Hudi dataset. We start the cluster with an API call, specifying the Spark, Hive, and Tez applications and the EMR release label emr-5.29.0. We set a few configurations, notably "hive.metastore.client.factory.class" set to "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" so the Hive metastore points at AWS Glue. We then run our little script, which executes a spark.sql query and writes the result to the Hudi dataset. We follow the configuration steps laid out here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.
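For context, the launch looks roughly like the following sketch using the AWS CLI (the cluster name, instance settings, and count are illustrative placeholders; the part that matters is the --configurations JSON with the Glue metastore factory):

```shell
# Sketch of an EMR launch with the Glue catalog wired in as the Hive
# metastore. Cluster name and instance settings are placeholders.
aws emr create-cluster \
  --name "hudi-processing" \
  --release-label emr-5.29.0 \
  --applications Name=Spark Name=Hive Name=Tez \
  --instance-type m5.xlarge --instance-count 3 \
  --use-default-roles \
  --configurations '[
    {
      "Classification": "hive-site",
      "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    },
    {
      "Classification": "spark-hive-site",
      "Properties": {
        "hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
      }
    }
  ]'
```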

The script works in Scala in an EMR Notebook, as well as from the spark-shell. But when we try to spark-submit it, we get the following error:
'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
To compound the frustration, we had originally written the script in Python, and there we encountered this same error in the EMR Notebook, in the pyspark shell, and via spark-submit.

Some preliminary reading suggests that this is caused by incompatible jars. By process of elimination, we think the conflict is between the Hudi jars and the AWS Glue client functionality.
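One way to check the jar-conflict theory, assuming shell access to the master node, is to look for duplicate httpclient jars: the two-argument SSLConnectionSocketFactory(SSLContext, HostnameVerifier) constructor named in the error was, as far as I know, only added in httpclient 4.4, so an older copy shadowing a newer one on the classpath would produce exactly this NoSuchMethodError.

```shell
# List every httpclient jar EMR ships; multiple versions under different
# application directories are a hint that classpath ordering decides
# which one actually loads.
find /usr/lib -name 'httpclient-*.jar' 2>/dev/null

# To see which jar the driver really loaded, class loading can be traced:
#   --conf "spark.driver.extraJavaOptions=-verbose:class"
# and then grep the driver log for SSLConnectionSocketFactory.
```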

Does anyone have any thoughts?

The whole stack trace, if it helps any:
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:58)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:93)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:66)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:59)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:237)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:223)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:177)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:163)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:61)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:27)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.amazonaws.glue.catalog.metastore.AWSGlueClientFactory.newClient(AWSGlueClientFactory.java:72)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:146)
at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:16)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3007)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3042)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1235)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:175)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:167)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 77 more

JCubeta
Asked 4 years ago · 790 views
2 Answers

Got the answer to this from a kindly person on the Hudi Slack channel.

The issue is with httpclient. By explicitly passing a pinned version of the jar along with the others, everything resolved:
--jars /usr/lib/spark/jars/httpclient-4.5.9.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar
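Put together, the full submit looks something like the sketch below (the job jar name is a placeholder, and the KryoSerializer setting comes from the AWS Hudi docs linked in the question); listing httpclient-4.5.9 explicitly ensures that version is on the driver and executor classpath alongside the Hudi bundle:

```shell
# Sketch of the working spark-submit; our-hudi-job.jar is a placeholder
# for the actual application jar.
spark-submit \
  --jars /usr/lib/spark/jars/httpclient-4.5.9.jar,/usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  our-hudi-job.jar
```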

Hope it helps someone else.

JCubeta
Answered 4 years ago

Absolute lifesaver. The AWS Hudi docs do not specify the httpclient jar, only the other two. Thank you!

ilya745
Answered 4 years ago
