How do I resolve the "java.lang.ClassNotFoundException" error in Spark on Amazon EMR?

I get a java.lang.ClassNotFoundException error when I use custom JAR files in a spark-submit or PySpark job on Amazon EMR.

Short description

This error occurs when either of the following conditions is true:

  • The spark-submit job can't find the relevant files in the class path.
  • A bootstrap action or a custom configuration overrides the class paths. When this happens, the class loader picks up only the JAR files that exist in the locations that you specified in your configuration.

Resolution

Check the stack trace to find the name of the missing class. Then, add the path of the custom JAR that contains the missing class to the Spark class path. You can do this while the cluster is running, when you launch a new cluster, or when you submit a job.
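
For example, a stack trace that reports the missing class might look similar to the following. The class name com.example.MyCustomClass and the frame line numbers are illustrative placeholders, not output from a real job:

java.lang.ClassNotFoundException: com.example.MyCustomClass
        at java.net.URLClassLoader.findClass(URLClassLoader.java:387)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
        ...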

On a running cluster

In /etc/spark/conf/spark-defaults.conf, append the path of the custom JAR that contains the missing class to the spark.driver.extraClassPath and spark.executor.extraClassPath properties. In the following example, /home/hadoop/extrajars/* is the custom JAR path.

sudo vim /etc/spark/conf/spark-defaults.conf
spark.driver.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
spark.executor.extraClassPath <other existing jar locations>:/home/hadoop/extrajars/*
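
The updated file applies to applications that you submit after the change; applications that are already running keep their original class path. As an optional check, you can confirm that a new application picks up the appended path by inspecting the effective configuration, for example from spark-shell:

spark-shell
scala> sc.getConf.get("spark.driver.extraClassPath")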

On a new cluster

When you create the cluster, supply a configuration object that appends the custom JAR path to the existing class paths in /etc/spark/conf/spark-defaults.conf.

Note: To use this option, you must create the cluster with Amazon EMR release 5.14.0 or later.

For Amazon EMR 5.14.0 to Amazon EMR 5.17.0, include the following:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*"
    }
  }
]
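
As an illustration, you might supply a classification file like the one above when you launch the cluster with the AWS CLI. In this hedged sketch, the cluster name, instance settings, and the spark-config.json file name are placeholders:

# Launch a cluster and apply the spark-defaults classification from a local file
aws emr create-cluster \
  --name "spark-custom-jars" \
  --release-label emr-5.14.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://spark-config.json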

For Amazon EMR 5.17.0 to Amazon EMR 5.18.0, also include /usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar as an additional JAR path:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]

For Amazon EMR 5.19.0 to Amazon EMR 5.32.0, update the JAR paths as follows:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/*"
    }
  }
]

For Amazon EMR 5.33.0 to Amazon EMR 5.34.0, update the JAR paths as follows:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/",
      "spark.executor.extraClassPath": "/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/goodies/lib/emr-spark-goodies.jar:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/usr/share/aws/emr/s3select/lib/emr-s3-select-spark-connector.jar:/home/hadoop/extrajars/"
    }
  }
]

For Amazon EMR releases 6.0.0 and later, you can't update the JAR path with a configuration. In these releases, the .conf file contains several JAR file paths, and the length of the configuration for each property that you update can't exceed 1,024 characters. However, you can add a bootstrap action that passes your custom JAR location to spark-defaults.conf. For more information, see How do I update all Amazon EMR nodes after the bootstrap phase?

Create a bash script similar to the following:

Note:

  • Be sure to replace s3://doc-example-bucket/Bootstraps/script_b.sh with an Amazon Simple Storage Service (Amazon S3) path of your choice.
  • Be sure to replace /home/hadoop/extrajars/* with the path to your custom JAR files.
  • Make sure that the Amazon EMR execution role has permissions to access this S3 bucket.

#!/bin/bash
#
# This is an example of script_b.sh for changing /etc/spark/conf/spark-defaults.conf
#
while [ ! -f /etc/spark/conf/spark-defaults.conf ]
do
  sleep 1
done
#
# Now the file is available, do your work here
#
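# Append the custom JAR directory to both spark.driver.extraClassPath
# and spark.executor.extraClassPath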
sudo sed -i '/spark.*.extraClassPath/s/$/:\/home\/hadoop\/extrajars\/\*/' /etc/spark/conf/spark-defaults.conf
exit 0

Launch the EMR cluster, and add a bootstrap action similar to the following:

#!/bin/bash
pwd
aws s3 cp s3://doc-example-bucket/Bootstraps/script_b.sh .
chmod +x script_b.sh
nohup ./script_b.sh &
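
The wrapper runs script_b.sh in the background with nohup so that the bootstrap action can return before Spark writes spark-defaults.conf. As an illustration, you might attach the wrapper as a bootstrap action at launch. In this hedged sketch, the wrapper's S3 path (bootstrap_wrapper.sh) and the cluster settings are placeholders that you must adjust:

# Launch an EMR 6.x cluster with the wrapper script as a bootstrap action
aws emr create-cluster \
  --name "spark-custom-jars-6x" \
  --release-label emr-6.2.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path="s3://doc-example-bucket/Bootstraps/bootstrap_wrapper.sh"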

For a single job

When you run spark-submit, use the --jars option to pass the custom JAR path.

Example:

spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi --master yarn --jars /home/hadoop/extrajars/* spark-examples.jar 100

Note: spark-submit options such as --jars must appear before the application JAR; anything after it is passed to the application as an argument. Also, to prevent class conflicts, don't include standard JARs when you use the --jars option. For example, don't include spark-core.jar because it already exists in the cluster.
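
The same option works for a PySpark job. In this hedged sketch, my_job.py and the dependency JAR names are placeholders; note that --jars takes a comma-separated list when you enumerate files explicitly:

spark-submit --deploy-mode client --master yarn \
  --jars /home/hadoop/extrajars/dep1.jar,/home/hadoop/extrajars/dep2.jar \
  my_job.py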

For more information about configuration classifications, see Configure Spark.


Related information

Spark Configuration

How do I resolve the "Container killed by YARN for exceeding memory limits" error in Spark on Amazon EMR?
