-
The Issue:
We have a Spark EMR cluster that connects to a remote Hive metastore to use our EMR Hive data warehouse.
When executing this PySpark statement in a Zeppelin notebook: sc.sql("create table userdb_emr_search.test_table (id int, attr string)")
we got this exception:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
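One diagnostic we can run in the same notebook is asking the driver JVM to load the class directly. This is a sketch that relies on py4j's internal _jvm handle and assumes sc is the live SparkSession used in the statement above:
# Diagnostic sketch: ask the driver JVM to load the class via py4j.
# If this succeeds, the driver classpath is fine and the failure likely
# happens elsewhere (e.g. on the metastore side of the create-table call).
sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")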
-
EMR Spark cluster configuration:
Release label: emr-6.3.0
Hadoop distribution: Amazon 3.2.1
Applications: Spark 3.1.1, JupyterHub 1.2.0, Ganglia 3.7.2, Zeppelin 0.9.0
-
The JAR containing the class org.apache.hadoop.fs.s3a.S3AFileSystem is correctly on the Spark classpath:
'spark.executor.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....
'spark.driver.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....
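To confirm these settings are actually applied in the running session, they can be read back from the SparkConf. A quick check, again assuming sc is the SparkSession from the statement above:
# Read back the effective classpath settings from the running context.
conf = sc.sparkContext.getConf()
print(conf.get("spark.driver.extraClassPath"))
print(conf.get("spark.executor.extraClassPath"))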
-
The JAR files are in the right places:
Under /usr/lib/hadoop:
-rw-r--r-- 1 root root 501704 Mar 30 2021 hadoop-aws-3.2.1-amzn-3.jar
lrwxrwxrwx 1 root root 27 Sep 8 01:50 hadoop-aws.jar -> hadoop-aws-3.2.1-amzn-3.jar
-rw-r--r-- 1 root root 4175105 Mar 30 2021 hadoop-common-3.2.1-amzn-3.jar
lrwxrwxrwx 1 root root 30 Sep 8 01:50 hadoop-common.jar -> hadoop-common-3.2.1-amzn-3.jar
Under /usr/share/aws/aws-java-sdk/:
-rw-r--r-- 1 root root 216879203 Apr 1 2021 aws-java-sdk-bundle-1.11.977.jar
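To double-check that the class file is physically present inside the JAR, a small standalone check with Python's zipfile, using the paths from the listing above:
# Standalone check: confirm S3AFileSystem.class is inside hadoop-aws.jar.
import zipfile

jar = "/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-3.jar"
target = "org/apache/hadoop/fs/s3a/S3AFileSystem.class"
with zipfile.ZipFile(jar) as zf:
    print(target in zf.namelist())  # expect True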
-
Hadoop storage:
We use Amazon S3 for Hadoop storage instead of HDFS.
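Because the warehouse lives on S3, a CREATE TABLE has to resolve an S3 location for the new table. A sketch for inspecting what location would be used, assuming the same session object sc:
# The warehouse dir and the database's default location must both be
# resolvable by whichever process creates the table directory.
print(sc.conf.get("spark.sql.warehouse.dir"))
sc.sql("describe database userdb_emr_search").show(truncate=False)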
-
Error log when executing the Spark SQL CREATE TABLE in the Zeppelin notebook:
WARN [2022-09-05 03:24:11,785] ({SchedulerFactory3} NotebookServer.java[onStatusChange]:1928) - Job paragraph_1662330571651_66787638 is finished, status: ERROR, exception: null, result: %text Fail to execute line 2: sc.sql("create table userdb_emr_search.test_table (id int, attr string)")
Traceback (most recent call last):
File "/tmp/1662348163304-0/zeppelin_python.py", line 158, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)
INFO [2022-09-05 03:24:11,785] ({SchedulerFactory3} VFSNotebookRepo.java[save]:144) - Saving note 2HDK22P2Z to Untitled Note 1_2HDK22P2Z.zpln
Please help investigate why Spark SQL cannot see the class org.apache.hadoop.fs.s3a.S3AFileSystem even though its JAR files are in the right place and on the correct classpath.
Update: Thanks for the responses. The ClassNotFoundException only occurs when using spark.sql to create a Hive table; we can insert data into an existing Hive table without issues. We also tried the s3:// scheme and got the same error.
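For reference, one narrowing test would be to create the table with an explicit LOCATION, to see whether the failure is tied to resolving the database's default S3 location. This is only a sketch; the bucket path below is a placeholder, not our real bucket:
# Hypothetical narrowing test: point the new table at an explicit location.
# "s3://some-bucket/tmp/test_table_loc/" is a placeholder path.
sc.sql("""
    create table userdb_emr_search.test_table_loc (id int, attr string)
    location 's3://some-bucket/tmp/test_table_loc/'
""")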