1 Answer
Hello Harish, the behavior you observed is expected. I tested this myself, and one workaround is to append the Spark libraries that ship with EMR to the Python path before importing PySpark:
```python
#!/usr/bin/python
import sys
import os

# Make the PySpark and Py4J zips bundled with EMR importable
sys.path.append('/usr/lib/spark/python/lib/pyspark.zip')
sys.path.append('/usr/lib/spark/python/lib/py4j-src.zip')
os.environ['SPARK_HOME'] = '/usr/lib/spark'

import pyspark.sql.types as spark_type
import pyspark.sql.functions as spark_func
from pyspark.sql import Row
from pyspark.sql import SparkSession
```
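One caveat with the hack above: on some EMR releases the Py4J zip is versioned (e.g. `py4j-0.10.9-src.zip`), and the unversioned `py4j-src.zip` name may not exist. A minimal sketch that globs for whatever zip is present instead of hard-coding the name; the `add_spark_to_path` helper is my own illustration, assuming the standard `/usr/lib/spark` layout:

```python
import glob
import os
import sys

SPARK_HOME = "/usr/lib/spark"  # assumed EMR install location

def add_spark_to_path(spark_home=SPARK_HOME):
    """Append the bundled pyspark and py4j zips to sys.path.

    Globbing for py4j-*.zip covers versioned file names such as
    py4j-0.10.9-src.zip when the unversioned symlink is absent.
    Returns the list of paths that were actually appended.
    """
    os.environ["SPARK_HOME"] = spark_home
    lib_dir = os.path.join(spark_home, "python", "lib")
    candidates = [os.path.join(lib_dir, "pyspark.zip")]
    candidates += sorted(glob.glob(os.path.join(lib_dir, "py4j-*.zip")))
    added = []
    for zip_path in candidates:
        if zip_path not in sys.path:
            sys.path.append(zip_path)
            added.append(zip_path)
    return added
```

Call `add_spark_to_path()` at the top of the script, before any `import pyspark` line, so the interpreter can resolve the modules.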
My tests:
- On the EMR master node, I created a script `test.py`:
```
[hadoop@ip-172-31-41-141 ~]$ cat test.py
#!/usr/bin/python
import sys
import os

sys.path.append('/usr/lib/spark/python/lib/pyspark.zip')
sys.path.append('/usr/lib/spark/python/lib/py4j-src.zip')
os.environ['SPARK_HOME'] = '/usr/lib/spark'

import pyspark.sql.types as spark_type
import pyspark.sql.functions as spark_func
from pyspark.sql import Row
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('yarn') \
    .appName('pythonSpark') \
    .enableHiveSupport() \
    .getOrCreate()

data = [("Java", "20000"), ("Python", "100000"), ("Scala", "3000")]
df = spark.createDataFrame(data)
df.show()
```
- I ran the same code from the notebook.
- I verified the application from the YARN ResourceManager UI.
The reason is that the notebook runs on JupyterEnterpriseGateway (JEG), and the EMR cluster is accessed via Livy.
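For background, the notebook kernel does not run inside the cluster's Python environment: JEG (via Sparkmagic) talks to the cluster through Livy's REST API, starting a remote Spark session with a POST to `/sessions` on the Livy endpoint (port 8998 by default on EMR). A minimal sketch of what that request body looks like; the helper name and the endpoint placeholder are illustrative, not part of any official client:

```python
import json

LIVY_ENDPOINT = "http://<emr-master>:8998"  # placeholder for the master node

def livy_session_payload(name="pythonSpark", kind="pyspark"):
    """Build the JSON body a Livy client POSTs to /sessions to start
    a remote PySpark session on the cluster."""
    return {"name": name, "kind": kind}

# A client would POST this body to LIVY_ENDPOINT + "/sessions"
body = json.dumps(livy_session_payload())
```

Because the session lives on the cluster side of this API, the notebook's local `sys.path` tricks behave differently than they do in a script run directly on the master node.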
In many cases, `%run` is used to execute a different notebook (see here) instead of calling the Python file directly. But with EMR, it is generally recommended to use `%execute_notebook` to execute .ipynb files.
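As an illustration of that recommendation (the notebook path below is hypothetical, and I am assuming the EMR notebook magics are available in the kernel), a cell invoking the magic might look like:

```
%execute_notebook ./child_notebook.ipynb
```

Unlike `%run`, this executes the child notebook through the EMR notebook tooling rather than in the local kernel namespace.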