EMR Studio workspace error using PySpark


Hi,

I have a workspace that successfully attaches to an EMR cluster (a Spark cluster with applications = ["Spark", "JupyterEnterpriseGateway"]). But when I run any command from the notebook, I get errors. For example,

from pyspark.sql import SparkSession

returns:

The code failed because of a fatal error:
	Session 0 unexpectedly reached final status 'dead'. See logs:
stdout: 

stderr: 
23/05/22 12:35:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/22 12:35:05 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-1-0-235.eu-west-2.compute.internal/10.1.0.235:8032
23/05/22 12:35:07 INFO Configuration: resource-types.xml not found
23/05/22 12:35:07 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/05/22 12:35:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1792 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required AM memory (1494+384 MB) is above the max threshold (1792 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
	at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:431)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:218)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1339)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1776)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1006)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1095)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1104)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/05/22 12:35:07 INFO ShutdownHookManager: Shutdown hook called
23/05/22 12:35:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-5661aa05-db9f-493a-bb39-3b7ce9906c5b

YARN Diagnostics: 
spark-submit start failed.

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
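
(Note: the request that fails here comes from the Livy session that sparkmagic starts, and in YARN cluster mode the AM size is essentially the driver memory plus overhead, which matches the 1494+384 MB in the error. So one way to get under the 1792 MB cap without touching the cluster would be a %%configure cell before any other code runs; driverMemory and executorMemory are standard Livy session parameters, and the values below are only illustrative:)

%%configure -f
{
    "driverMemory": "1g",
    "executorMemory": "1g"
}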
1 Answer

Scaled up to a bigger master and core instance group size to get that bit working! I still get errors on different things that I'd expect to work.
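
(In case it is useful to anyone hitting the same wall: the core instance group of a running cluster can also be resized from code. A rough sketch with boto3 follows; the cluster ID and target count are placeholders, and as far as I know only the instance count can be changed this way, so moving to larger instance types still means recreating the group or the cluster.)

import boto3

emr = boto3.client('emr')
cluster_id = 'j-XXXXXXXXXXXXX'  # placeholder cluster ID

# Find the CORE instance group of the cluster...
groups = emr.list_instance_groups(ClusterId=cluster_id)['InstanceGroups']
core = next(g for g in groups if g['InstanceGroupType'] == 'CORE')

# ...and ask EMR to grow it to 4 instances (illustrative count).
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{'InstanceGroupId': core['Id'], 'InstanceCount': 4}],
)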

The AWS Spark example code:

df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

I get this error:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in execute(self, session)
     60             statement_id = response["id"]
---> 61             output = self._get_statement_output(session, statement_id)
     62         except KeyboardInterrupt as e:

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in _get_statement_output(self, session, statement_id)
    133                 progress.value = statement.get("progress", 0.0)
--> 134                 session.sleep(retries)
    135                 retries += 1

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/livysession.py in sleep(self, retries)
    362     def sleep(self, retries):
--> 363         sleep(self._policy.seconds_to_sleep(retries))
    364 

KeyboardInterrupt: 

During handling of the above exception, another exception occurred:

LivyClientTimeoutException                Traceback (most recent call last)
/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in execute(self, session)
     77                     )
---> 78                     session.wait_for_idle()
     79             except:

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/livysession.py in wait_for_idle(self, seconds_to_wait)
    336                 self.logger.error(error)
--> 337                 raise LivyClientTimeoutException(error)
    338 

LivyClientTimeoutException: Session 0 did not reach idle status in time. Current status is busy.

During handling of the above exception, another exception occurred:

SparkStatementCancellationFailedException Traceback (most recent call last)
<ipython-input-9-8c1806c06a24> in <module>
----> 1 get_ipython().run_cell_magic('spark', '', "df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')\n")

/mnt/notebook-env/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2417             with self.builtin_trap:
   2418                 args = (magic_arg_s, cell)
-> 2419                 result = fn(*args, **kwargs)
   2420             return result
   2421 

/mnt/notebook-env/lib/python3.7/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

/mnt/notebook-env/lib/python3.7/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/exceptions.py in wrapped(self, *args, **kwargs)
    163     def wrapped(self, *args, **kwargs):
    164         try:
--> 165             out = f(self, *args, **kwargs)
    166         except Exception as err:
    167             if conf.all_errors_are_fatal():

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/exceptions.py in wrapped(self, *args, **kwargs)
    124     def wrapped(self, *args, **kwargs):
    125         try:
--> 126             out = f(self, *args, **kwargs)
    127         except exceptions_to_handle as err:
    128             if conf.all_errors_are_fatal():

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/kernels/kernelmagics.py in spark(self, line, cell, local_ns)
    483             args.samplefraction,
    484             None,
--> 485             coerce,
    486         )
    487 

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/magics/sparkmagicsbase.py in execute_spark(self, cell, output_var, samplemethod, maxrows, samplefraction, session_name, coerce, output_handler, cell_kind)
    128 
    129         (success, out, mimetype) = self.spark_controller.run_command(
--> 130             Command(cell, None, cell_kind), session_name
    131         )
    132         if not success:

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/sparkcontroller.py in run_command(self, command, client_name)
     39     def run_command(self, command, client_name=None):
     40         session_to_use = self.get_session_by_name_or_default(client_name)
---> 41         return command.execute(session_to_use)
     42 
     43     def run_sqlquery(self, sqlquery, client_name=None):

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in execute(self, session)
     79             except:
     80                 raise SparkStatementCancellationFailedException(
---> 81                     COMMAND_CANCELLATION_FAILED_MSG
     82                 )
     83             else:

SparkStatementCancellationFailedException: Interrupted by user but Livy failed to cancel the Spark statement. The Livy session might have become unusable.

But I can run all the steps of the AWS example word-count-with-pyspark...
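
(A quick way to tell a hung session apart from a read that is simply slow is to ask for the schema and a small sample first, instead of the whole category; a sketch against the same path, assuming the bucket is still reachable from the cluster:)

df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/')
df.printSchema()      # reads only Parquet footers, should come back quickly if the session is healthy
df.limit(10).show()   # small sample before attempting anything over the full dataset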

Running boto3 calls never seems to finish either:

example:

import boto3

s3 = boto3.client('s3')
response = s3.list_buckets()

# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

I get an orange field in the notebook and no response ever comes. As soon as I press run it shows a "Last executed at" timestamp, but that's it; it never finishes.
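
(One thing worth knowing here: in the PySpark kernel every cell is sent to the cluster through Livy, so a hang like this can simply mean the statement is queued behind the still-busy session from the failed read above, or that the cluster-side call is stalling, for example on network or IAM access to S3. sparkmagic's %%local magic runs a cell in the Studio notebook environment instead, which helps narrow it down; a sketch, assuming the Studio execution role is allowed to list buckets:)

%%local
import boto3

# Runs in the EMR Studio notebook environment (not on the cluster via Livy),
# so it uses the Studio workspace credentials rather than the cluster's instance profile.
s3 = boto3.client('s3')
response = s3.list_buckets()

print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')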

