Grouped Map Pandas UDFs not working correctly on EMR

Question

I have a Grouped Map Pandas UDF that I'm trying to apply to a Spark DataFrame through PySpark spark-submit in Python 2.7. However, every time I run the UDF like so:  
spark_df.groupby(spark_column).apply(UDF)  
the EMR step hangs indefinitely without raising an error (even with a very small dataset), and there are no useful logs in the driver or executors to tell me what is going on.  
  
I've tested the UDF on a pandas DataFrame and it works fine (UDF(spark_df.toPandas())). It only fails when applied to the Spark DataFrame.  
  
I also tried this in both client mode and cluster mode, to no effect. However, when I run it through a Jupyter notebook shell, this exact method does work. So I'm very curious why PySpark in Jupyter handles it differently than PySpark run from plain Python.  
  
Edited by: TKToday on Apr 16, 2020 2:06 PM

Answer

I found the answer and am posting it here for anyone who comes looking for it.  
  
When using a Grouped Map Pandas UDF, it must be defined within the same Spark session as the main function. This is especially tricky in Python, because Spark will not behave as expected if you import the UDF into your main console. If you start a separate Spark session, define your UDF there, and then import it into your main session, the job will not raise an error; it will just run endlessly.
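For anyone who wants something concrete, here is a minimal sketch of one way to arrange this, not the exact code from my job. The function name subtract_group_mean, the columns key and v, and the app name are all made up for illustration. The idea is that the plain pandas function can live anywhere, but it only gets wrapped as a Grouped Map UDF inside main(), after the one Spark session the job uses has been created.

```python
import pandas as pd


def subtract_group_mean(pdf):
    # Plain pandas logic (hypothetical example): receives all rows of one
    # group as a pandas DataFrame and returns a pandas DataFrame.
    return pdf.assign(v=pdf["v"] - pdf["v"].mean())


def main():
    # PySpark is imported inside main() so the pandas function above can be
    # tested on its own, without a cluster.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf, PandasUDFType

    # The single Spark session used by the whole job.
    spark = SparkSession.builder.appName("grouped-map-sketch").getOrCreate()

    # Wrap the function as a Grouped Map UDF here, in the same session,
    # rather than importing an already-wrapped UDF from another module.
    demean = pandas_udf(
        subtract_group_mean,
        "key string, v double",  # assumed output schema for this example
        PandasUDFType.GROUPED_MAP,
    )

    spark_df = spark.createDataFrame(
        [("a", 1.0), ("a", 3.0), ("b", 5.0)], ["key", "v"]
    )
    spark_df.groupby("key").apply(demean).show()
    spark.stop()
```

Calling main() at the bottom of the file you hand to spark-submit runs the job. Keeping the pandas function separate from the wrapping step also means you can still check it against a plain pandas DataFrame first, as described in the question.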
