Environment variables not found in Spark env with EMR 6.x


Hi,

We are currently using EMR 5.x with Spark 2.4.x.
We run PySpark jobs and they work fine.

I have tried multiple times to move to EMR 6.x, but the Spark jobs always fail to read environment variables that are set during cluster creation.

Inside the run_job_flow configurations, we have something like:
{
    "Classification": "hadoop-env",
    "Configurations": [
        {"Classification": "export", "Properties": {"someKey": "someValue"}}
    ]
}
I've tried with "Classification": "spark-env" as well, but it still does not work.
The PySpark job always raises a KeyError that we don't get on EMR 5.x.
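For reference, the job reads the variable roughly like this (the variable name is just a placeholder):

import os

some_value = os.environ["someKey"]  # fine on EMR 5.x, raises KeyError on EMR 6.x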

I already tried back in spring 2020, but I'm writing now because we would like to start using Spark 3.x.

I hope someone can help us.
Best regards,
Claude


claude8
asked 4 years ago · 2,626 views
1 Answer

I found how to fix this issue.
I had to use the "spark-defaults" classification and set properties like {"spark.yarn.appMasterEnv.YOUR_ENV_VARIABLE": "the value"}.
This is explained in the Spark documentation (https://spark.apache.org/docs/latest/configuration.html#environment-variables); however, I don't understand why it worked differently in EMR 5.x.
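For example, something along these lines (the variable name and value are placeholders; spark.executorEnv.* is the corresponding setting from the Spark docs if the executors also need the variable):

{
    "Classification": "spark-defaults",
    "Properties": {
        "spark.yarn.appMasterEnv.YOUR_ENV_VARIABLE": "the value",
        "spark.executorEnv.YOUR_ENV_VARIABLE": "the value"
    }
}

The PySpark job can then read it as usual with os.environ["YOUR_ENV_VARIABLE"].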

claude8
answered 3 years ago
  • Ran into an issue with EMR 6.6.0 where the encoding of Jupyter Spark was for some reason ASCII instead of UTF-8, unlike in Zeppelin and others. The problem only existed in Jupyter for whatever reason. The settings I used on EMR 5.x for the same issue stopped working; your suggestion of spark-defaults fixed it. Thanks!
