I have been struggling with this problem for several days now, and am at my wits' end.
We have an EMR cluster that we launch to process data into a Hudi dataset. We start the cluster with an API call, specifying Spark, Hive, Tez, and EMR release label emr-5.29.0. We set a few configurations, notably "hive.metastore.client.factory.class", which we set to "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" for use with Glue. We then run our little script, which runs a spark.sql query and then writes the result to the Hudi dataset. We follow the configuration steps laid out here: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hudi-work-with-dataset.html.
The script works in Scala in an EMR Notebook, as well as from the spark-shell. But when we try to spark-submit it, we get the following error:
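For reference, the launch request looks roughly like the sketch below. This is not our exact code: it assumes the request shape that boto3's `run_job_flow` accepts, the cluster name is made up, and applying the property under both the `hive-site` and `spark-hive-site` classifications is an assumption about how the Glue setting is wired in. Nothing is sent to AWS here; the helper only builds and prints the request.

```python
import json

# Value of the metastore client factory property quoted above.
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_cluster_request(name="hudi-processing"):
    """Build keyword arguments shaped like an EMR run_job_flow request.

    `name` and the choice of classifications are illustrative assumptions.
    """
    glue_props = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return {
        "Name": name,
        "ReleaseLabel": "emr-5.29.0",
        "Applications": [{"Name": app} for app in ("Spark", "Hive", "Tez")],
        "Configurations": [
            # Assumed: one entry for Hive itself, one for Spark SQL.
            {"Classification": "hive-site", "Properties": dict(glue_props)},
            {"Classification": "spark-hive-site", "Properties": dict(glue_props)},
        ],
    }


if __name__ == "__main__":
    # Print the request so it can be compared with what is actually sent.
    print(json.dumps(build_cluster_request(), indent=2))
```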
'java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V;'
To compound the frustration, we had originally written the script in Python, but there we hit this same error in the EMR Notebook, in the pyspark shell, and with spark-submit.
Some preliminary reading suggests that this is caused by incompatible jars. By process of elimination, we think the conflict is between the Hudi jars and the AWS Glue client.
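To narrow down the jar theory, something like the following sketch could list every jar that bundles the class named in the error, so we can see whether more than one copy of Apache HttpClient is around. The directories in the `__main__` block are guesses at where EMR keeps the relevant jars, not confirmed paths, and classes relocated (shaded) under a different package name will not be matched.

```python
import glob
import os
import zipfile

# Zip entry name of the class from the NoSuchMethodError above.
TARGET = "org/apache/http/conn/ssl/SSLConnectionSocketFactory.class"


def jars_containing(class_entry, directories):
    """Return the sorted paths of .jar files under `directories`
    (searched recursively) whose archive contains `class_entry`."""
    hits = set()
    for directory in directories:
        pattern = os.path.join(directory, "**", "*.jar")
        for jar in glob.glob(pattern, recursive=True):
            try:
                with zipfile.ZipFile(jar) as zf:
                    if class_entry in zf.namelist():
                        hits.add(jar)
            except (zipfile.BadZipFile, OSError):
                # Skip unreadable files, broken symlinks, and non-zip files.
                continue
    return sorted(hits)


if __name__ == "__main__":
    # Assumed locations on the EMR master node; adjust as needed.
    candidate_dirs = ["/usr/lib/hudi", "/usr/lib/spark/jars", "/usr/share/aws"]
    for path in jars_containing(TARGET, candidate_dirs):
        print(path)
```

If this prints more than one jar, comparing the HttpClient versions in those jars would be the next thing to check.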
Does anyone have any thoughts?
The full stack trace, in case it helps:
at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:58)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:93)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:66)
at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:59)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:50)
at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:38)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:324)
at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:308)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:237)
at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:223)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:177)
at com.amazonaws.services.glue.AWSGlueClient.<init>(AWSGlueClient.java:163)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:61)
at com.amazonaws.services.glue.AWSGlueClientBuilder.build(AWSGlueClientBuilder.java:27)
at com.amazonaws.client.builder.AwsSyncClientBuilder.build(AwsSyncClientBuilder.java:46)
at com.amazonaws.glue.catalog.metastore.AWSGlueClientFactory.newClient(AWSGlueClientFactory.java:72)
at com.amazonaws.glue.catalog.metastore.AWSCatalogMetastoreClient.<init>(AWSCatalogMetastoreClient.java:146)
at com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory.createMetaStoreClient(AWSGlueDataCatalogHiveClientFactory.java:16)
at org.apache.hadoop.hive.ql.metadata.Hive.createMetaStoreClient(Hive.java:3007)
at org.apache.hadoop.hive.ql.metadata.Hive.getMSC(Hive.java:3042)
at org.apache.hadoop.hive.ql.metadata.Hive.getAllDatabases(Hive.java:1235)
at org.apache.hadoop.hive.ql.metadata.Hive.reloadFunctions(Hive.java:175)
at org.apache.hadoop.hive.ql.metadata.Hive.<clinit>(Hive.java:167)
at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:503)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:117)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:271)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:384)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:286)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:66)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:65)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply$mcZ$sp(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$databaseExists$1.apply(HiveExternalCatalog.scala:215)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
... 77 more