Executing Hive CREATE TABLE in spark.sql -- java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

  1. The issue: We have a Spark EMR cluster that connects to a remote Hive metastore to use our EMR Hive data warehouse. When executing this PySpark statement in a Zeppelin notebook: sc.sql("create table userdb_emr_search.test_table (id int, attr string)") we get this exception: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

  2. EMR Spark cluster configuration: Release label: emr-6.3.0, Hadoop distribution: Amazon 3.2.1, Applications: Spark 3.1.1, JupyterHub 1.2.0, Ganglia 3.7.2, Zeppelin 0.9.0

  3. The class org.apache.hadoop.fs.s3a.S3AFileSystem is on the Spark class path correctly:

     'spark.executor.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....
     'spark.driver.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....

  4. The jar files are in the right places. Under /usr/lib/hadoop:

     -rw-r--r-- 1 root root  501704 Mar 30  2021 hadoop-aws-3.2.1-amzn-3.jar
     lrwxrwxrwx 1 root root      27 Sep  8 01:50 hadoop-aws.jar -> hadoop-aws-3.2.1-amzn-3.jar
     -rw-r--r-- 1 root root 4175105 Mar 30  2021 hadoop-common-3.2.1-amzn-3.jar
     lrwxrwxrwx 1 root root      30 Sep  8 01:50 hadoop-common.jar -> hadoop-common-3.2.1-amzn-3.jar

     Under /usr/share/aws/aws-java-sdk/:

     -rw-r--r-- 1 root root 216879203 Apr  1  2021 aws-java-sdk-bundle-1.11.977.jar

  5. Hadoop storage: Amazon S3 is used for Hadoop storage instead of HDFS.

  6. Error log when executing spark sql create table in Zeppelin notebook:

WARN [2022-09-05 03:24:11,785] ({SchedulerFactory3} NotebookServer.java[onStatusChange]:1928) - Job paragraph_1662330571651_66787638 is finished, status: ERROR, exception: null, result: %text Fail to execute line 2: sc.sql("create table userdb_emr_search.test_table (id int, attr string)")
Traceback (most recent call last):
  File "/tmp/1662348163304-0/zeppelin_python.py", line 158, in <module>
    exec(code, _zcUserQueryNameSpace)
  File "<stdin>", line 2, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)

INFO [2022-09-05 03:24:11,785] ({SchedulerFactory3} VFSNotebookRepo.java[save]:144) - Saving note 2HDK22P2Z to Untitled Note 1_2HDK22P2Z.zpln

Please help investigate why Spark SQL cannot see the class org.apache.hadoop.fs.s3a.S3AFileSystem even though its jar files are in the right place and on the correct class path.

Asked 2 years ago, 2,629 views
1 Answer

Hi,

Thanks for writing to re:Post.

As I understand it, you are facing an issue with your Spark job failing with the exception "java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found".

As stated in this document [1], use of the s3a filesystem is not recommended on EMR; the suggestion is to leverage EMRFS (i.e. use the s3:// scheme in place of the s3a:// scheme). EMRFS gives you the best performance, security, and reliability.

That being said, in order to solve your issue, I would suggest replacing the URI scheme "s3a://" with "s3://". Let us know how it goes on your end.
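As a sketch of that change, any table or warehouse locations would simply switch schemes. The helper below illustrates the rewrite; the bucket and path used are hypothetical and only for illustration:

```python
def to_emrfs_scheme(uri):
    """Rewrite an s3a:// URI to the EMRFS s3:// scheme recommended on EMR."""
    if uri.startswith("s3a://"):
        return "s3://" + uri[len("s3a://"):]
    return uri  # non-s3a URIs (s3://, hdfs://, ...) pass through unchanged

# Hypothetical warehouse location used only for illustration:
print(to_emrfs_scheme("s3a://my-bucket/warehouse/test_table"))
# → s3://my-bucket/warehouse/test_table
```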

Thanks.

[1] Work with storage and file systems - https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-file-systems.html

AWS
SUPPORT ENGINEER
Answered 2 years ago
  • Thanks for your response. The ClassNotFoundException only occurs when using spark.sql to create a Hive table. We can insert data into an existing Hive table without issues. We also tried s3:// and got the same error.
