AWS Glue PySpark job fails to save a DataFrame as CSV to an S3 bucket (error `py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv`)

  • Glue version: 4.0
  • The Python code that triggers the error:
df.select([col(c).cast("string") for c in df.columns]).repartition(1).write.mode('overwrite').option('header', 'true').csv(tmp_dir)
  • Error Stack:
2024-08-22 23:24:09,120 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(77)): Error from Python:Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 473, in <module>
    s3_key = csv_utils.save_output_csv_to_s3(output_df,output_s3_bucket,output_s3_folder,file_name)
  File "/tmp/utilities-1-py3-none-any.whl/utilities/csv_utils.py", line 16, in save_output_csv_to_s3
    .mode('overwrite').option('header', 'true').csv(tmp_dir)
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 1240, in csv
    self._jwrite.csv(path)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
    return_value = get_return_value(
  File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 190, in deco
    return f(*a, **kw)
  File "/opt/amazon/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling o1257.csv.
: org.apache.spark.SparkException: Job aborted.
	at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:638)
	...
	at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 35.0 failed 4 times, most recent failure: Lost task 1.3 in stage 35.0 (TID 42) (192.168.120.150 executor 2): java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2863)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2799)
	...
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:209)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:213)
	... 48 more
Caused by: java.net.SocketException: Connection reset
	at java.net.SocketInputStream.read(SocketInputStream.java:210)
	...
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/tmp/find-existing-patients-job.script.py", line 479, in <module>
    sys.exit(1)
SystemExit: 1
  • File size we are trying to save: 47 MB
  • Please note that this error occurred only once in our environment. We tried to reproduce it with the same file, but it did not recur.
Chahin
asked a month ago · 63 views
1 Answer

Hello Chahin,

Since the issue occurred only once and you couldn't reproduce it, it is most likely a transient network issue: the root cause in the stack trace is `java.net.SocketException: Connection reset` on an executor. Monitoring your environment for network reliability and adding retry logic to your code would be prudent steps to guard against similar failures in the future.
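To illustrate the suggested retry logic, here is a minimal sketch of a generic retry wrapper. The helper name `call_with_retry` and its parameters are illustrative, not part of any Glue or Spark API. Because the write uses `mode('overwrite')`, retrying is safe: each attempt replaces any partial output left by the previous one.

```python
import time

def call_with_retry(write_fn, max_attempts=3, backoff_seconds=30):
    """Invoke write_fn(), retrying on any exception.

    A failed Spark write surfaces on the driver as Py4JJavaError, which
    does not expose a convenient subclass hierarchy for network errors,
    so we catch Exception broadly and re-raise after the final attempt.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return write_fn()
        except Exception:
            if attempt == max_attempts:
                raise
            # Linear backoff before the next attempt.
            time.sleep(backoff_seconds * attempt)
```

In the Glue job this would wrap the existing write, for example: `call_with_retry(lambda: df.select([col(c).cast("string") for c in df.columns]).repartition(1).write.mode('overwrite').option('header', 'true').csv(tmp_dir))`.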

EXPERT
answered a month ago
