This error most likely occurs when large datasets move back and forth between Spark tasks and pure Python operations, because the data has to be serialized between Spark's JVMs and the Python processes. My suggestion is to process your datasets in separate batches. In other words, process less data per job run so that the Spark-to-Python serialization doesn't take too long or fail.
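As a rough illustration of the batching idea, here is a minimal sketch. It assumes `df` is a Spark DataFrame that can be sliced by some key column; `chunked`, `process_in_batches`, `key_column`, `keys`, and `handle` are hypothetical names for illustration, not part of Glue or Spark.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence, size: int) -> Iterator[List]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def process_in_batches(df, key_column: str, keys: Sequence,
                       batch_size: int, handle: Callable) -> None:
    """Convert one slice of a Spark DataFrame to pandas at a time.

    Assumption: `df` is a pyspark.sql.DataFrame and `keys` lists the
    distinct values of `key_column` (for example, dates) to process.
    """
    for batch in chunked(keys, batch_size):
        # Only the rows for this batch cross the JVM-to-Python boundary.
        pdf = df.filter(df[key_column].isin(batch)).toPandas()
        handle(pdf)
```

Each call to `toPandas()` then brings only one batch into driver memory, so a smaller `batch_size` trades more Spark jobs for a lower peak memory footprint.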
I understand that your Glue job fails while the same code works in a Glue notebook. In general, though, there is no difference between using the Spark session in a Glue job and in a Glue notebook. To compare, you can run your Glue notebook as a Glue job. To understand this behavior better, I suggest opening a support case with AWS.
Further, you can try the following workaround for df.toPandas() by setting this Spark configuration on the Glue job. Pass it as a key-value pair in the Glue job parameters:
Key : --conf
Value : spark.sql.execution.arrow.pyspark.enabled=true, --conf spark.driver.maxResultSize=0
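To check whether these settings actually reached the job (and to compare a job run with a notebook session), you could log the effective values at the start of the script. This is only a sketch: `effective_settings` is a hypothetical helper, and it assumes `spark` is the job's active SparkSession.

```python
def effective_settings(spark, keys):
    """Return the configured value for each key, or '<not set>' if absent.

    Assumption: `spark` is an active pyspark.sql.SparkSession.
    """
    conf = spark.sparkContext.getConf()
    return {key: conf.get(key, "<not set>") for key in keys}


# Example usage inside the job script:
# print(effective_settings(spark, [
#     "spark.sql.execution.arrow.pyspark.enabled",
#     "spark.driver.maxResultSize",
# ]))
```

Printing the same keys from the notebook and from the job makes it easier to spot a configuration difference between the two environments.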
Thanks, that worked.
Bear in mind that this is not optimal: you are still bringing all the data into driver memory, and setting spark.driver.maxResultSize to 0 disables the memory safety mechanism.
Very likely you are running out of memory when converting with toPandas(). Why not save the CSV using the DataFrame API instead? Even if you coalesce to generate a single file (so the write is processed by a single thread), it won't run out of memory.
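A minimal sketch of that suggestion, assuming `df` is the Spark DataFrame in question; `write_single_csv` and the example output path are hypothetical.

```python
def write_single_csv(df, output_path: str) -> None:
    """Write a Spark DataFrame as CSV without collecting it on the driver.

    Assumption: `df` is a pyspark.sql.DataFrame. coalesce(1) yields a
    single part file, written by one task.
    """
    (
        df.coalesce(1)
        .write.mode("overwrite")   # replace any existing output
        .option("header", True)    # include column names
        .csv(output_path)
    )


# Example usage (hypothetical bucket and prefix):
# write_single_csv(df, "s3://example-bucket/output/")
```

Note that Spark writes a directory at `output_path` containing a part file, rather than a single file with that exact name.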
I tried that, and it did not work either. I can try various other options, but I'm puzzled that the same code works in a Glue notebook without any extra capacity.
Excellent response. I was able to get around the issue by adding the suggested Spark configuration as the --conf Glue job parameter. Thanks a lot.
Good to hear. Happy to help you.
I also tried coalesce(1); it still results in the same error when run as a Glue job.