Environment variables not found in Spark env with EMR 6.x

Hi,

We are currently using EMR 5.x with Spark 2.4.x.
We are running PySpark jobs and everything works fine.

I have tried multiple times to move to EMR 6.x, but the Spark jobs always fail to read environment variables that are set during cluster creation.

Inside the run job flow configurations, we have something like:
{
    "Classification": "hadoop-env",
    "Configurations": [
        {"Classification": "export", "Properties": {"someKey": "someValue"}}
    ]
}
I've tried with "Classification": "spark-env" as well, but it still does not work.
The PySpark job always raises a KeyError that we never see on EMR 5.x.
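
For context, the job reads the variable with a plain environment lookup, roughly like this (the key name just mirrors the placeholder above):

import os

# Works on EMR 5.x; on EMR 6.x the variable is missing from the container
# environment, so this line raises KeyError: 'someKey'.
some_value = os.environ["someKey"]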

I already tried this back in spring 2020, but I'm writing now because we would like to start using Spark 3.x.

I hope someone can help us.
Best regards,
Claude

claude8
Asked 4 years ago · 2,658 views
1 Answer

I found out how to fix this issue.
I had to use the "spark-defaults" classification and set a property like {"spark.yarn.appMasterEnv.YOUR_ENV_VARIABLE": "the value"}.
This is explained in the Spark documentation (https://spark.apache.org/docs/latest/configuration.html#environment-variables), though I still don't understand why it worked differently on EMR 5.x.
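
For reference, a minimal sketch of how this can look end to end, assuming the cluster is started with boto3's run_job_flow (names and values are placeholders):

import boto3

emr = boto3.client("emr")

# spark-defaults entries that make YARN inject the variable into the
# application master (driver) environment. spark.executorEnv.* is the
# analogous setting for the executors, per the same Spark docs.
configurations = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            "spark.yarn.appMasterEnv.YOUR_ENV_VARIABLE": "the value",
            "spark.executorEnv.YOUR_ENV_VARIABLE": "the value",
        },
    }
]

# Passed along with the rest of the cluster definition, e.g.:
# emr.run_job_flow(Name="my-cluster", Configurations=configurations, ...)

# Inside the PySpark job the variable then comes back as a normal lookup:
# import os
# value = os.environ["YOUR_ENV_VARIABLE"]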

claude8
Answered 3 years ago
  • Ran into an issue with EMR 6.6.0 where the Spark encoding in Jupyter was, for some reason, ASCII instead of UTF-8, unlike in Zeppelin and elsewhere. The problem only existed in Jupyter. The settings I used on EMR 5.x for the same issue stopped working; using your suggestion of spark-defaults fixed it. Thanks!
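
For anyone hitting the same encoding problem, the equivalent spark-defaults entry would presumably look something like this (PYTHONIOENCODING is just one plausible variable; the exact setting depends on your notebook setup):

configurations = [
    {
        "Classification": "spark-defaults",
        "Properties": {
            # Hypothetical example: force UTF-8 for Python stdio in the driver
            "spark.yarn.appMasterEnv.PYTHONIOENCODING": "utf-8",
        },
    }
]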
