AWS Glue PySpark job fails to save a DataFrame as CSV to an S3 bucket (error `py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv`)

  • Glue version: 4.0
  • The Python code that raises the error:
```python
df.select([col(c).cast("string") for c in df.columns]).repartition(1).write.mode('overwrite').option('header', 'true').csv(tmp_dir)
```
  • Error stack:
```
2024-08-22 23:24:09,120 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 473, in <module>
    s3_key = csv_utils.save_output_csv_to_s3(output_df,output_s3_bucket,output_s3_folder,file_name)
  File "/tmp/utilities-1-py3-none-any.whl/utilities/csv_utils.py", line 16, in save_output_csv_to_s3
    .mode('overwrite').option('header', 'true').csv(tmp_dir)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
	...
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 35.0 failed 4 times, most recent failure: Lost task 1.3 in stage 35.0 (TID 42) (192.168.120.150 executor 2): java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
	...
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
	... 48 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 479, in <module>
    sys.exit(1)
SystemExit: 1
```
  • Size of the file we are trying to save: 47 MB
  • Note that this error occurred only once in our environment. We tried to reproduce it with the same file, but it did not recur.
Chahin
Asked 1 month ago · 63 views

1 Answer

Hello Chahin,

Since the issue occurred only once and you couldn't reproduce it, it is likely due to a transient network issue. Monitoring your environment for network reliability and adding retry logic to your code would be prudent steps to ensure robustness against similar occurrences in the future.
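Such retry logic can live in a small wrapper around the write call. Below is a minimal sketch, assuming the `df` and `tmp_dir` names from the question; the `write_csv_with_retries` helper, its parameters, and the backoff values are illustrative, not part of any Spark or Glue API.

```python
import time

from py4j.protocol import Py4JJavaError


def write_csv_with_retries(df, path, max_attempts=3, backoff_seconds=10):
    """Retry a Spark CSV write on transient failures such as connection resets.

    max_attempts and backoff_seconds are illustrative defaults, not values
    prescribed by Spark or AWS Glue.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            df.write.mode('overwrite').option('header', 'true').csv(path)
            return
        except Py4JJavaError:
            if attempt == max_attempts:
                raise  # retries exhausted; let the job fail as before
            time.sleep(backoff_seconds * attempt)  # simple linear backoff


# Usage mirroring the failing call from the question:
# write_csv_with_retries(
#     df.select([col(c).cast("string") for c in df.columns]).repartition(1),
#     tmp_dir,
# )
```

Because the write uses `mode('overwrite')`, a retry replaces whatever output a failed attempt left behind rather than appending to it.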

Expert · Answered 1 month ago
