AWS Glue Pyspark job is not able to save a Dataframe as csv format into an S3 Bucket (error `py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv`)

0
  • Glue version: 4.0
  • the Python codes that occurs the error:
df.select([col(c).cast("string") for c in df.columns]).repartition(1).write.mode('overwrite').option('header', 'true').csv(tmp_dir)
  • Error Stack:
2024-08-22 23:24:09,120 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 473, in <module>
    s3_key = csv_utils.save_output_csv_to_s3(output_df,output_s3_bucket,output_s3_folder,file_name)
  File "/tmp/utilities-1-py3-none-any.whl/utilities/csv_utils.py", line 16, in save_output_csv_to_s3
    .mode('overwrite').option('header', 'true').csv(tmp_dir)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
	...
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 35.0 failed 4 times, most recent failure: Lost task 1.3 in stage 35.0 (TID 42) (192.168.120.150 executor 2): java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
	...
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
	... 48 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 479, in <module>
    sys.exit(1)
SystemExit: 1
  • File size we are trying to save: 47 Mo
  • Please note that this error occurred only once in our environment. We tried to reproduce the same error with the same file, but it did not occur.
Chahin
asked 16 days ago29 views
1 Answer
1

Hello Chain ,

Since the issue occurred only once and you couldn't reproduce it, it's likely due to a transient network issue. Monitoring your environment for network reliability and adding retry logic in your code would be prudent steps to ensure robustness against similar future occurrences

profile picture
EXPERT
answered 16 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions