-
The Issue:
We have a Spark EMR cluster that connects to a remote Hive metastore to use our EMR Hive data warehouse.
When executing this PySpark statement in a Zeppelin notebook: sc.sql("create table userdb_emr_search.test_table (id int, attr string)")
we got this exception:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
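One diagnostic we can run in the same notebook is asking the driver JVM to load the class directly. This is a sketch that relies on py4j's internal _jvm handle and assumes sc is the live SparkSession used in the statement above:
# Diagnostic sketch: ask the driver JVM to load the class via py4j.
# If this succeeds, the driver classpath is fine and the failure likely
# happens elsewhere (e.g. on the metastore side of the create-table call).
sc._jvm.java.lang.Class.forName("org.apache.hadoop.fs.s3a.S3AFileSystem")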
-
EMR Spark cluster configuration:
Release label: emr-6.3.0
Hadoop distribution: Amazon 3.2.1
Applications: Spark 3.1.1, JupyterHub 1.2.0, Ganglia 3.7.2, Zeppelin 0.9.0
-
The JAR containing the class org.apache.hadoop.fs.s3a.S3AFileSystem is correctly on the Spark classpath:
'spark.executor.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....
'spark.driver.extraClassPath', '....:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/:....
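To confirm these settings are actually applied in the running session, they can be read back from the SparkConf. A quick check, again assuming sc is the SparkSession from the statement above:
# Read back the effective classpath settings from the running context.
conf = sc.sparkContext.getConf()
print(conf.get("spark.driver.extraClassPath"))
print(conf.get("spark.executor.extraClassPath"))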
-
The JAR files are in the right places:
Under /usr/lib/hadoop:
-rw-r--r-- 1 root root 501704 Mar 30 2021 hadoop-aws-3.2.1-amzn-3.jar
lrwxrwxrwx 1 root root 27 Sep 8 01:50 hadoop-aws.jar -> hadoop-aws-3.2.1-amzn-3.jar
-rw-r--r-- 1 root root 4175105 Mar 30 2021 hadoop-common-3.2.1-amzn-3.jar
lrwxrwxrwx 1 root root 30 Sep 8 01:50 hadoop-common.jar -> hadoop-common-3.2.1-amzn-3.jar
Under /usr/share/aws/aws-java-sdk/:
-rw-r--r-- 1 root root 216879203 Apr 1 2021 aws-java-sdk-bundle-1.11.977.jar
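To double-check that the class file is physically present inside the JAR, a small standalone check with Python's zipfile, using the paths from the listing above:
# Standalone check: confirm S3AFileSystem.class is inside hadoop-aws.jar.
import zipfile

jar = "/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-3.jar"
target = "org/apache/hadoop/fs/s3a/S3AFileSystem.class"
with zipfile.ZipFile(jar) as zf:
    print(target in zf.namelist())  # expect True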
-
Hadoop storage:
We use Amazon S3 for Hadoop storage instead of HDFS.
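Because the warehouse lives on S3, a CREATE TABLE has to resolve an S3 location for the new table. A sketch for inspecting what location would be used, assuming the same session object sc:
# The warehouse dir and the database's default location must both be
# resolvable by whichever process creates the table directory.
print(sc.conf.get("spark.sql.warehouse.dir"))
sc.sql("describe database userdb_emr_search").show(truncate=False)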
-
Error log when executing the Spark SQL CREATE TABLE in the Zeppelin notebook:
WARN [2022-09-05 03:24:11,785] ({SchedulerFactory3} NotebookServer.java[onStatusChange]:1928) - Job paragraph_1662330571651_66787638 is finished, status: ERROR, exception: null, result: %text Fail to execute line 2: sc.sql("create table userdb_emr_search.test_table (id int, attr string)")
Traceback (most recent call last):
File "/tmp/1662348163304-0/zeppelin_python.py", line 158, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 2, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 723, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in call
answer, self.gateway_client, self.target_id, self.name)
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found)
INFO [2022-09-05 03:24:11,785] ({SchedulerFactory3} VFSNotebookRepo.java[save]:144) - Saving note 2HDK22P2Z to Untitled Note 1_2HDK22P2Z.zpln
Please help investigate why Spark SQL cannot see the class org.apache.hadoop.fs.s3a.S3AFileSystem even though its JAR files are in the right place and on the correct classpath.
Update: Thanks for the responses. The ClassNotFoundException only occurs when using spark.sql to create a Hive table; we can insert data into an existing Hive table without issues. We also tried the s3:// scheme and got the same error.
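For reference, one narrowing test would be to create the table with an explicit LOCATION, to see whether the failure is tied to resolving the database's default S3 location. This is only a sketch; the bucket path below is a placeholder, not our real bucket:
# Hypothetical narrowing test: point the new table at an explicit location.
# "s3://some-bucket/tmp/test_table_loc/" is a placeholder path.
sc.sql("""
    create table userdb_emr_search.test_table_loc (id int, attr string)
    location 's3://some-bucket/tmp/test_table_loc/'
""")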