EMR Studio workspace error using PySpark

Hi,

I have a workspace that successfully attaches to an EMR cluster (a Spark cluster with applications = ["Spark", "JupyterEnterpriseGateway"]). But when I run any command from the notebook, I get errors. For example,

from pyspark.sql import SparkSession

returns:

The code failed because of a fatal error:
	Session 0 unexpectedly reached final status 'dead'. See logs:
stdout: 

stderr: 
23/05/22 12:35:03 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
23/05/22 12:35:05 INFO DefaultNoHARMFailoverProxyProvider: Connecting to ResourceManager at ip-10-1-0-235.eu-west-2.compute.internal/10.1.0.235:8032
23/05/22 12:35:07 INFO Configuration: resource-types.xml not found
23/05/22 12:35:07 INFO ResourceUtils: Unable to find 'resource-types.xml'.
23/05/22 12:35:07 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (1792 MB per container)
Exception in thread "main" java.lang.IllegalArgumentException: Required AM memory (1494+384 MB) is above the max threshold (1792 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
	at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:431)
	at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:218)
	at org.apache.spark.deploy.yarn.Client.run(Client.scala:1339)
	at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1776)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:1006)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1095)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1104)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
23/05/22 12:35:07 INFO ShutdownHookManager: Shutdown hook called
23/05/22 12:35:07 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-5661aa05-db9f-493a-bb39-3b7ce9906c5b

YARN Diagnostics: 
spark-submit start failed.

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
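
For the memory error itself, one alternative to resizing the cluster may be to ask Livy for a smaller driver so the application master fits under the 1792 MB cap reported above. A minimal sketch, assuming the sparkmagic %%configure cell magic is available in the EMR Studio PySpark kernel (the memory values are only illustrative):

%%configure -f
{
    "driverMemory": "1g",
    "executorMemory": "1g"
}

The -f flag drops the current Livy session and recreates it with these settings, so it should be run in its own cell before any other Spark code.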
posted a year ago · 812 views
1 Answer
Scaled up to a bigger master and core instance group size to get that bit working! Still getting errors on other things that I'd expect to work.
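
The other knobs the exception points at (yarn.scheduler.maximum-allocation-mb and yarn.nodemanager.resource.memory-mb) can also be raised through an EMR configuration classification instead of scaling up. A rough sketch of the configurations JSON, assuming the instance type actually has the memory to back these illustrative values:

[
    {
        "Classification": "yarn-site",
        "Properties": {
            "yarn.scheduler.maximum-allocation-mb": "4096",
            "yarn.nodemanager.resource.memory-mb": "4096"
        }
    }
]

This is applied when the cluster is created (or by reconfiguring the instance group on a running cluster); it is not something that can be changed from the notebook itself.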

Example:

The AWS Spark example code:

df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')

I get the error:

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in execute(self, session)
     60             statement_id = response["id"]
---> 61             output = self._get_statement_output(session, statement_id)
     62         except KeyboardInterrupt as e:

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in _get_statement_output(self, session, statement_id)
    133                 progress.value = statement.get("progress", 0.0)
--> 134                 session.sleep(retries)
    135                 retries += 1

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/livysession.py in sleep(self, retries)
    362     def sleep(self, retries):
--> 363         sleep(self._policy.seconds_to_sleep(retries))
    364 

KeyboardInterrupt: 

During handling of the above exception, another exception occurred:

LivyClientTimeoutException                Traceback (most recent call last)
/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in execute(self, session)
     77                     )
---> 78                     session.wait_for_idle()
     79             except:

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/livysession.py in wait_for_idle(self, seconds_to_wait)
    336                 self.logger.error(error)
--> 337                 raise LivyClientTimeoutException(error)
    338 

LivyClientTimeoutException: Session 0 did not reach idle status in time. Current status is busy.

During handling of the above exception, another exception occurred:

SparkStatementCancellationFailedException Traceback (most recent call last)
<ipython-input-9-8c1806c06a24> in <module>
----> 1 get_ipython().run_cell_magic('spark', '', "df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/*.parquet')\n")

/mnt/notebook-env/lib/python3.7/site-packages/IPython/core/interactiveshell.py in run_cell_magic(self, magic_name, line, cell)
   2417             with self.builtin_trap:
   2418                 args = (magic_arg_s, cell)
-> 2419                 result = fn(*args, **kwargs)
   2420             return result
   2421 

/mnt/notebook-env/lib/python3.7/site-packages/decorator.py in fun(*args, **kw)
    230             if not kwsyntax:
    231                 args, kw = fix(args, kw, sig)
--> 232             return caller(func, *(extras + args), **kw)
    233     fun.__name__ = func.__name__
    234     fun.__doc__ = func.__doc__

/mnt/notebook-env/lib/python3.7/site-packages/IPython/core/magic.py in <lambda>(f, *a, **k)
    185     # but it's overkill for just that one bit of state.
    186     def magic_deco(arg):
--> 187         call = lambda f, *a, **k: f(*a, **k)
    188 
    189         if callable(arg):

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/exceptions.py in wrapped(self, *args, **kwargs)
    163     def wrapped(self, *args, **kwargs):
    164         try:
--> 165             out = f(self, *args, **kwargs)
    166         except Exception as err:
    167             if conf.all_errors_are_fatal():

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/exceptions.py in wrapped(self, *args, **kwargs)
    124     def wrapped(self, *args, **kwargs):
    125         try:
--> 126             out = f(self, *args, **kwargs)
    127         except exceptions_to_handle as err:
    128             if conf.all_errors_are_fatal():

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/kernels/kernelmagics.py in spark(self, line, cell, local_ns)
    483             args.samplefraction,
    484             None,
--> 485             coerce,
    486         )
    487 

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/magics/sparkmagicsbase.py in execute_spark(self, cell, output_var, samplemethod, maxrows, samplefraction, session_name, coerce, output_handler, cell_kind)
    128 
    129         (success, out, mimetype) = self.spark_controller.run_command(
--> 130             Command(cell, None, cell_kind), session_name
    131         )
    132         if not success:

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/sparkcontroller.py in run_command(self, command, client_name)
     39     def run_command(self, command, client_name=None):
     40         session_to_use = self.get_session_by_name_or_default(client_name)
---> 41         return command.execute(session_to_use)
     42 
     43     def run_sqlquery(self, sqlquery, client_name=None):

/mnt/notebook-env/lib/python3.7/site-packages/sparkmagic/livyclientlib/command.py in execute(self, session)
     79             except:
     80                 raise SparkStatementCancellationFailedException(
---> 81                     COMMAND_CANCELLATION_FAILED_MSG
     82                 )
     83             else:

SparkStatementCancellationFailedException: Interrupted by user but Livy failed to cancel the Spark statement. The Livy session might have become unusable.
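
The timeout above does not necessarily mean the read itself failed; the Livy statement was still busy when sparkmagic stopped waiting after the interrupt. A quick way to confirm the session and S3 access work before scanning the whole category is to inspect only the schema and a handful of rows; a minimal sketch using the same dataset path from the question:

# Read the partition directory (equivalent to the /*.parquet glob) but only
# inspect the schema and a few rows so the statement returns quickly.
df = spark.read.parquet('s3://amazon-reviews-pds/parquet/product_category=Books/')
df.printSchema()
df.limit(5).show()

If that returns promptly, the original full read is probably just slow for the cluster size rather than broken.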

But I can run all the steps of the AWS example word-count-with-pyspark...

Running boto3 calls never seems to finish either.

Example:

import boto3

s3 = boto3.client('s3')
response = s3.list_buckets()

# Output the bucket names
print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')

I get an orange field in the notebook and no response ever comes. As soon as I press run it shows a "Last executed at" timestamp, but that's it; it never finishes.
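
One thing to keep in mind: in the PySpark kernel every ordinary cell is sent to the cluster through Livy, so that boto3 call executes on the EMR nodes and can hang if they cannot reach the S3/IAM endpoints or the instance profile lacks permissions. A minimal sketch, assuming sparkmagic's %%local cell magic is available in the EMR Studio kernel, runs the same call in the notebook environment instead:

%%local
import boto3

# Executed in the Studio notebook environment rather than on the cluster,
# so it uses the workspace credentials instead of the EMR instance profile.
s3 = boto3.client('s3')
response = s3.list_buckets()

print('Existing buckets:')
for bucket in response['Buckets']:
    print(f'  {bucket["Name"]}')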

answered a year ago
