Grouped Map Pandas UDFs not working correctly on EMR


I have a Grouped Map Pandas UDF that I'm trying to apply to a Spark DataFrame, running PySpark through spark-submit on Python 2.7. However, every time I run the UDF like so:
spark_df.groupby(spark_column).apply(UDF)
the EMR step hangs indefinitely without raising an error (even with a very small dataset), and there are no useful logs on the driver or executors to tell me what is going on.

I've tested the UDF on a pandas DataFrame and it works fine (UDF(spark_df.toPandas())). It only fails when applied to the Spark DataFrame.

I also tried this in both client mode and cluster mode, to no effect. However, when I run this through a Jupyter notebook shell, this exact method does work. So I'm very curious why PySpark in Jupyter handles it differently than PySpark run as a plain Python script.

Edited by: TKToday on Apr 16, 2020 2:06 PM

TKToday
Asked 4 years ago · 405 views
1 Answer

So I found the answer, and I'm putting it up for anyone who comes looking for it.

When using a Grouped Map Pandas UDF, it must be defined within the same Spark session as the main function. This is especially tricky in Python, where Spark will not behave as expected if you import the UDF into your main console. If you start a separate Spark session, define your UDF there, and then import it into your main session, the job will not raise an error; it will just run endlessly.
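For anyone who wants to see the layout that worked, here is a minimal sketch of one reading of the above. The pandas logic stays a plain function, and it is only wrapped as a Grouped Map UDF inside main(), after the session has been created in the same script. The names subtract_group_mean, center_udf, the id/v columns and the app name are made up for illustration and are not from my actual job. It assumes pandas and pyarrow are installed on the driver and the executors.

```python
import pandas as pd


def subtract_group_mean(pdf):
    """Plain pandas logic (hypothetical example): center column `v` within one group."""
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def main():
    # pyspark is imported here so the pandas function above stays testable without Spark
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # One session, created in the same script that defines and applies the UDF
    spark = SparkSession.builder.appName("grouped-map-example").getOrCreate()

    # Wrap the plain function as a Grouped Map UDF only after the session exists.
    # The schema string is an assumption that matches the toy data below.
    center_udf = pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)(
        subtract_group_mean
    )

    toy = pd.DataFrame({"id": [1, 1, 2, 2], "v": [1.0, 2.0, 3.0, 5.0]})
    spark_df = spark.createDataFrame(toy)

    # Same call shape as in the question
    spark_df.groupby("id").apply(center_udf).show()
    spark.stop()


if __name__ == "__main__":
    main()
```

Keeping the plain function separate also means it can still be checked directly on a pandas DataFrame, as in the question. On Spark 3.x, groupby(...).applyInPandas(subtract_group_mean, schema="id long, v double") is the recommended replacement for this decorator style, but I have not tested that on EMR.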

TKToday
Answered 4 years ago
