Executing a Hive CREATE TABLE via spark.sql -- java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

  1. The issue: We have a Spark EMR cluster that connects to a remote Hive metastore to use our EMR Hive data warehouse. When executing this PySpark statement in a Zeppelin notebook: sc.sql("create table userdb_emr_search.test_table (id int, attr string)") we got this exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

  2. EMR Spark cluster configuration: Release label: emr-6.3.0; Hadoop distribution: Amazon 3.2.1; Applications: Spark 3.1.1, JupyterHub 1.2.0, Ganglia 3.7.2, Zeppelin 0.9.0

  3. The class org.apache.hadoop.fs.s3a.S3AFileSystem is on the Spark classpath correctly, via both settings: 'spark.executor.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....' and 'spark.driver.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....'
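As a point of comparison, the s3a binding can also be made explicit through Spark's Hadoop configuration rather than the classpath alone. The following is a hedged sketch of spark-defaults.conf entries; fs.s3a.impl is a standard Hadoop configuration key and spark.jars is a standard Spark property, but whether setting them helps on this particular EMR release is an assumption, not something confirmed by the post:

```
# Sketch only: explicitly pin the s3a implementation and ship the jars,
# instead of relying solely on extraClassPath. Paths are the ones listed
# in the post.
spark.hadoop.fs.s3a.impl  org.apache.hadoop.fs.s3a.S3AFileSystem
spark.jars                /usr/lib/hadoop/hadoop-aws.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.11.977.jar
```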

  4. The jar files are in the right places. Under /usr/lib/hadoop:

-rw-r--r-- 1 root root  501704 Mar 30  2021 hadoop-aws-3.2.1-amzn-3.jar
lrwxrwxrwx 1 root root      27 Sep  8 01:50 hadoop-aws.jar -> hadoop-aws-3.2.1-amzn-3.jar
-rw-r--r-- 1 root root 4175105 Mar 30  2021 hadoop-common-3.2.1-amzn-3.jar
lrwxrwxrwx 1 root root      30 Sep  8 01:50 hadoop-common.jar -> hadoop-common-3.2.1-amzn-3.jar

    Under /usr/share/aws/aws-java-sdk/:

-rw-r--r-- 1 root root 216879203 Apr  1  2021 aws-java-sdk-bundle-1.11.977.jar
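A quick way to confirm the class is actually packaged in the jar (and not just that the jar file exists) is to look inside it, since a jar is an ordinary zip archive. This is a hedged, self-contained sketch: it builds a tiny stand-in jar so it runs anywhere, and the stand-in path is hypothetical; on the cluster you would point `jar_contains_class` at /usr/lib/hadoop/hadoop-aws.jar instead.

```python
import os
import tempfile
import zipfile


def jar_contains_class(jar_path: str, fqcn: str) -> bool:
    """Return True if the fully qualified class name is packaged in the jar."""
    # org.apache.hadoop.fs.s3a.S3AFileSystem -> org/apache/hadoop/fs/s3a/S3AFileSystem.class
    entry = fqcn.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Self-contained demo: build a stand-in jar with an empty class entry.
# On the EMR node you would skip this and pass the real jar path.
demo_jar = os.path.join(tempfile.mkdtemp(), "demo.jar")
with zipfile.ZipFile(demo_jar, "w") as jar:
    jar.writestr("org/apache/hadoop/fs/s3a/S3AFileSystem.class", b"")

print(jar_contains_class(demo_jar, "org.apache.hadoop.fs.s3a.S3AFileSystem"))  # True
print(jar_contains_class(demo_jar, "org.apache.hadoop.fs.s3a.NoSuchClass"))    # False
```

If the real hadoop-aws.jar passes this check, the class is present on disk and the failure is a classloader/classpath visibility problem rather than a missing artifact.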

  5. Hadoop storage: We use Amazon S3 for Hadoop storage instead of HDFS.

  6. Error log when executing the Spark SQL CREATE TABLE in the Zeppelin notebook:

WARN [2022-09-05 03:24:11,785] ({SchedulerFactory3} NotebookServer.java[onStatusChange]:1928) - Job paragraph_1662330571651_66787638 is finished, status: ERROR, exception: null, result: %text Fail to execute line 2: sc.sql("create table userdb_emr_search.test_table (id int, attr string)")
Traceback (most recent call last):
  File "/tmp/1662348163304-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 2, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)

INFO [2022-09-05 03:24:11,785] ({SchedulerFactory3} VFSNotebookRepo.java[save]:144) - Saving note 2HDK22P2Z to Untitled Note 1_2HDK22P2Z.zpln

Please help investigate why Spark SQL cannot see the class org.apache.hadoop.fs.s3a.S3AFileSystem even though its jar files are in the right place and on the correct classpath.

asked 2 years ago · 2,599 views
1 Answer

Hi,

Thanks for writing to re:Post.

As I understand it, you are facing an issue with a Spark job that is failing with the exception "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found".

As stated in this document [1], use of the s3a filesystem is not recommended on EMR; the suggestion is to leverage EMRFS instead (i.e., use the s3:// scheme in place of the s3a:// scheme). EMRFS will give you the best performance, security, and reliability.

That being said, in order to resolve the issue, I would suggest replacing the URI scheme "s3a://" with "s3://". Let us know how it goes on your end.
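One way to apply this suggestion is to give the table an explicit EMRFS (s3://) LOCATION instead of relying on a warehouse default that may resolve to an s3a:// URI. The sketch below only builds the DDL string; the bucket and path are placeholders I introduced for illustration, not values from the original post:

```python
# Hedged sketch: construct a CREATE TABLE statement with an explicit
# s3:// location. "my-bucket" and the warehouse path are hypothetical;
# substitute your actual S3 warehouse location.
def create_table_ddl(db: str, table: str, location: str) -> str:
    return (
        f"CREATE TABLE {db}.{table} (id int, attr string) "
        f"LOCATION '{location}'"
    )


ddl = create_table_ddl(
    "userdb_emr_search",
    "test_table",
    "s3://my-bucket/warehouse/userdb_emr_search.db/test_table/",  # placeholder
)
print(ddl)
```

In the Zeppelin paragraph this would then be submitted as sc.sql(ddl), matching the statement shown in the question.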

Thanks.

[1] Work with storage and file systems - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

AWS
SUPPORT ENGINEER
answered 2 years ago
  • Thanks for your response. The ClassNotFoundException only occurs when using spark.sql to create a Hive table; we can insert data into an existing Hive table without issues. We also tried s3:// and got the same error.
