AWS Glue Pyspark job is not able to save a Dataframe as csv format into an S3 Bucket (error `py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv`)

  • Glue version: 4.0
  • The Python code that raises the error:
df.select([col(c).cast("string") for c in df.columns]).repartition(1).write.mode('overwrite').option('header', 'true').csv(tmp_dir)
  • Error Stack:
2024-08-22 23:24:09,120 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 473, in <module>
    s3_key = csv_utils.save_output_csv_to_s3(output_df,output_s3_bucket,output_s3_folder,file_name)
  File "/tmp/utilities-1-py3-none-any.whl/utilities/csv_utils.py", line 16, in save_output_csv_to_s3
    .mode('overwrite').option('header', 'true').csv(tmp_dir)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
	...
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 35.0 failed 4 times, most recent failure: Lost task 1.3 in stage 35.0 (TID 42) (192.168.120.150 executor 2): java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
	...
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
	... 48 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 479, in <module>
    sys.exit(1)
SystemExit: 1
  • File size we are trying to save: 47 MB
  • Please note that this error occurred only once in our environment. We tried to reproduce the same error with the same file, but it did not occur.
Chahin
asked 2 months ago · 72 views
1 Answer

Hello Chahin,

Since the issue occurred only once and you couldn't reproduce it, it is likely due to a transient network issue (the root cause in your stack trace is `java.net.SocketException: Connection reset` on an executor). Monitoring your environment for network reliability and adding retry logic around the write would be prudent steps to make the job robust against similar occurrences in the future.
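As an illustration only (none of these names come from your job), a minimal retry wrapper with linear backoff could look like this. It takes the write as a zero-argument callable, so it works with your existing `df....csv(tmp_dir)` call unchanged:

```python
import time

def with_retry(write_fn, max_attempts=3, backoff_seconds=30, sleep=time.sleep):
    """Call write_fn(), retrying on failure to ride out transient errors
    such as the 'Connection reset' seen in the stack trace above."""
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the last attempt
            sleep(backoff_seconds * attempt)  # linear backoff before the next try

# Usage with the original write (df and tmp_dir come from your job):
# with_retry(lambda: df.repartition(1).write.mode('overwrite')
#                      .option('header', 'true').csv(tmp_dir))
```

Note that Spark already retries failed tasks (your trace shows "failed 4 times"), so a job-level retry like this mainly helps when the whole stage is aborted by a one-off network blip.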

EXPERT
answered 2 months ago
