How do I resolve the "java.lang.ClassNotFoundException" error in Spark on Amazon EMR?
I get a "java.lang.ClassNotFoundException" error when I use custom JAR files in a spark-submit or PySpark job on Amazon EMR.
Short description
This error occurs when either of the following conditions is true:
- The spark-submit job can't find the relevant files in the class path.
- A bootstrap action or custom configuration overrides the class paths. When this happens, the class loader picks up only the JAR files that exist in the location that you specified in your configuration.
Resolution
Check the stack trace to find the name of the missing class. Then, add the path of the custom JAR that contains the missing class to the Spark class path. You can do this while the cluster is running, when you launch a new cluster, or when you submit a job.
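For example, the missing class appears in the driver or executor logs in a trace similar to the following (the class name com.example.MyCustomClass is illustrative):

```
java.lang.ClassNotFoundException: com.example.MyCustomClass
        at java.net.URLClassLoader.findClass(URLClassLoader.java)
        at java.lang.ClassLoader.loadClass(ClassLoader.java)
```

The fully qualified name after the colon is the class whose JAR you must add to the class path.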
On a running cluster
In /etc/spark/conf/spark-defaults.conf, append the path of your custom JAR to the class path properties named in the error stack trace. In the following example, /home/hadoop/extrajars/* is the custom JAR path.
```shell
sudo vim /etc/spark/conf/spark-defaults.conf

spark.driver.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
spark.executor.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
```
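If you prefer a non-interactive edit, the same change can be scripted with the sed expression used in the bootstrap-action section later in this article. The following is a sketch that runs against a scratch copy of the file; on a cluster, target /etc/spark/conf/spark-defaults.conf with sudo instead (the sample property values are placeholders):

```shell
# Create a scratch copy to demonstrate the edit safely.
conf=./spark-defaults.conf.scratch
printf '%s\n' \
  'spark.driver.extraClassPath   /usr/lib/hadoop-lzo/lib/*' \
  'spark.executor.extraClassPath /usr/lib/hadoop-lzo/lib/*' > "$conf"

# Append :/home/hadoop/extrajars/* to every extraClassPath line.
sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' "$conf"
cat "$conf"
```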
On a new cluster
Append the custom JAR path to the existing class paths in /etc/spark/conf/spark-defaults.conf by supplying a configuration object when you create the cluster.
**Note:** To use this option, you must create the cluster with Amazon EMR release version 5.14.0 or later.
For Amazon EMR 5.14.0 through Amazon EMR 5.17.0, include the following:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
For Amazon EMR 5.17.0 through Amazon EMR 5.18.0, add /usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar as an additional JAR path:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
For Amazon EMR 5.19.0 through Amazon EMR 5.32.0, update the JAR paths as follows:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
For Amazon EMR 5.33.0 through Amazon EMR 5.34.0, update the JAR paths as follows:
```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]
```
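Whichever release-specific JSON you use, it helps to confirm that the file parses before you pass it to the cluster. The following is a minimal sketch; the property values are abbreviated placeholders, so paste in the full class paths from the matching example above:

```shell
# Write the classification to a file and check that it parses as JSON.
cat > spark-config.json <<'EOF'
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "<existing paths>:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "<existing paths>:/home/hadoop/extrajars/*"
    }
  }
]
EOF
python3 -m json.tool spark-config.json > /dev/null && echo "valid JSON"
```

You can then supply the file when you create the cluster, for example with `aws emr create-cluster ... --configurations file://spark-config.json`, or paste the JSON into the software settings when you create the cluster in the console.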
For Amazon EMR release version 6.0.0 and later, you can't update the JAR path with a configuration object. With these versions, the .conf file contains several JAR file paths, and the value of each property that you update can't be longer than 1,024 characters. However, you can add a bootstrap action to pass the custom JAR location to spark-defaults.conf. For more information, see How can I update all Amazon EMR nodes after the bootstrap phase?
Create a bash script similar to the following:
**Note:**
- Be sure to replace s3://doc-example-bucket/Bootstraps/script_b.sh with the Amazon Simple Storage Service (Amazon S3) path of your choice.
- Be sure to replace /home/hadoop/extrajars/* with your custom JAR file path.
- Be sure that the Amazon EMR execution role has permissions to access this S3 bucket.
```shell
#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Now the file is available, do your work here
#
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0
```
Launch the EMR cluster, and then add a bootstrap action similar to the following:
```shell
#!/bin/bash
pwd
aws s3 cp s3://doc-example-bucket/Bootstraps/script_b.sh .
chmod +x script_b.sh
nohup ./script_b.sh &
```
For a single job
Use the --jars option to pass the custom JAR path when you run spark-submit. Note that spark-submit options such as --jars must come before the application JAR; anything after the application JAR is passed to the application as an argument.
Example:
```shell
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi --master yarn --jars /home/hadoop/extrajars/* spark-examples.jar 100
```
**Note:** To prevent class conflicts, don't include standard JARs when you use the --jars option. For example, don't include spark-core.jar because it already exists in the cluster.
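The --jars option expects a comma-separated list of JAR paths rather than a shell glob or a colon-separated class path. A small sketch like the following can build that list from a directory of JARs (the directory path is the same placeholder used above):

```shell
# Build a comma-separated --jars argument from every JAR in a directory.
JAR_DIR=/home/hadoop/extrajars            # your custom JAR directory
JARS=$(printf '%s,' "$JAR_DIR"/*.jar)
JARS=${JARS%,}                            # drop the trailing comma
echo "$JARS"
# Then: spark-submit ... --jars "$JARS" your-app.jar <args>
```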
For more information about configuration classifications, see Configure Spark.