Grouped Map Pandas UDFs not working correctly on EMR


I have a Grouped Map Pandas UDF that I'm trying to apply to a Spark DataFrame through PySpark spark-submit in Python 2.7. However, every time I run the UDF like so:
spark_df.groupby(spark_column).apply(UDF)
the EMR step just hangs endlessly without erroring out (even with a very minimal dataset), and there are no useful logs in the driver or executors to tell me what is going on.

I've tested the UDF on a pandas DataFrame and it works fine (UDF(spark_df.toPandas())). It only fails when applied to the Spark DataFrame.
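For context, here is a minimal sketch of the pattern described above (the column names, schema, and sample data are hypothetical, not from my actual job):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.appName("grouped-map-repro").getOrCreate()

# Hypothetical input: one grouping column and one numeric column.
spark_df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "value"]
)

def demean(pdf):
    # pdf is a plain pandas DataFrame containing every row of one group.
    pdf["value"] = pdf["value"] - pdf["value"].mean()
    return pdf

# Grouped Map Pandas UDF (Spark 2.x API); the output schema matches the input here.
demean_udf = pandas_udf(demean, spark_df.schema, PandasUDFType.GROUPED_MAP)

# Works when run directly on pandas, as described above:
demean(spark_df.toPandas())

# Hangs when run as an EMR step via spark-submit:
spark_df.groupby("group").apply(demean_udf).show()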

I also tried this in client mode and cluster mode, with no effect. HOWEVER, when I run this exact code through a Jupyter notebook shell, it does work. So I'm very curious why Jupyter PySpark handles it differently than PySpark through python spark-submit.

Edited by: TKToday on Apr 16, 2020 2:06 PM

TKToday
asked 4 years ago · 395 views
1 Answer

So I found the answer and am putting this up for anyone who comes looking for it.

When using a Grouped Map Pandas UDF, it must be defined within the same Spark session as the main function. This is especially tricky in Python, where Spark will not behave as expected if you import the UDF into your main script from elsewhere. If you start a separate Spark session, define your UDF there, and then import it into your main session, the job will not error out; it will just run endlessly.
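A sketch of the layout that worked (file names, schema, and the function itself are illustrative, not my actual job): the SparkSession is created and the UDF is decorated in the same driver module, rather than importing a UDF that was decorated under a session created in another module.

# udfs.py -- the layout that hung (shown here only as comments):
#   spark = SparkSession.builder.getOrCreate()   # separate session created at import time
#   @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
#   def demean(pdf): ...
# main.py then did `from udfs import demean`, and the EMR step ran forever.

# main.py -- the layout that works; submitted with spark-submit.
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, StringType, DoubleType


def main():
    spark = SparkSession.builder.appName("grouped-map-job").getOrCreate()

    schema = StructType([
        StructField("group", StringType()),
        StructField("value", DoubleType()),
    ])

    # Decorate the Grouped Map UDF here, after this session exists,
    # instead of importing one defined under a different SparkSession.
    @pandas_udf(schema, PandasUDFType.GROUPED_MAP)
    def demean(pdf):
        pdf["value"] = pdf["value"] - pdf["value"].mean()
        return pdf

    spark_df = spark.createDataFrame(
        [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "value"]
    )
    spark_df.groupby("group").apply(demean).show()


if __name__ == "__main__":
    main()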

TKToday
answered 4 years ago
