AWS Glue PySpark job fails to save a DataFrame as CSV to an S3 bucket (error `py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv`)

  • Glue version: 4.0
  • The Python code that raises the error:
df.select([col(c).cast("string") for c in df.columns]).repartition(1).write.mode('overwrite').option('header', 'true').csv(tmp_dir)
  • Error Stack:
2024-08-22 23:24:09,120 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 473, in <module>
    s3_key = csv_utils.save_output_csv_to_s3(output_df,output_s3_bucket,output_s3_folder,file_name)
  File "/tmp/utilities-1-py3-none-any.whl/utilities/csv_utils.py", line 16, in save_output_csv_to_s3
    .mode('overwrite').option('header', 'true').csv(tmp_dir)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
	...
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 35.0 failed 4 times, most recent failure: Lost task 1.3 in stage 35.0 (TID 42) (192.168.120.150 executor 2): java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
	...
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
	... 48 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 479, in <module>
    sys.exit(1)
SystemExit: 1
  • Size of the file we are trying to save: 47 MB
  • Please note that this error occurred only once in our environment. We tried to reproduce the same error with the same file, but it did not occur.
asked a year ago, 459 views
1 Answer

Hello Chain,

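Since the issue occurred only once and you could not reproduce it with the same file, the most likely cause is a transient network issue. The root exception in the stack trace is `java.net.SocketException: Connection reset`, raised on an executor after the task had already failed 4 times. Monitoring your environment for network reliability and adding retry logic around the write in your code would be prudent steps to guard against similar occurrences.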
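
The retry logic suggested above could be sketched roughly as follows. This is only an illustration, not an AWS Glue or PySpark API: `with_retries` and its parameters (`max_attempts`, `base_delay_s`, `retry_on`, `sleep`) are hypothetical names, and the defaults are arbitrary. It assumes the write is safe to repeat, which should hold for `mode('overwrite')` since each attempt replaces the previous output.

```python
import time


def with_retries(action, max_attempts=3, base_delay_s=5.0,
                 retry_on=(Exception,), sleep=time.sleep):
    """Call action() until it succeeds or max_attempts is exhausted.

    Hypothetical helper: waits base_delay_s, 2*base_delay_s, 4*base_delay_s, ...
    between attempts and re-raises the last error if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the original error
            # Exponential backoff before the next attempt.
            sleep(base_delay_s * 2 ** (attempt - 1))


# Possible usage with the write from the question (not executed here):
# with_retries(
#     lambda: df.repartition(1).write.mode("overwrite")
#               .option("header", "true").csv(tmp_dir)
# )
```

Catching `Exception` is deliberately broad for the sketch. In a Glue job it may be preferable to pass `retry_on=(Py4JJavaError,)` (imported from `py4j.protocol`, as seen in the traceback) so that unrelated programming errors are not retried.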

EXPERT
answered a year ago
