I need to run Java code in my (PySpark) Glue job, so I tried to follow the tutorial at https://xctorres.github.io/2021/11/07/jar-in-pyspark/. (By the way, there seems to be no Glue-specific documentation on how to work with JARs configured as dependent JARs in the Glue job details.)
I uploaded my JAR to S3 and configured the S3 URI as a dependent JAR in my job details. Using print(spark_context.getConf().getAll()) I also verified that it appears as spark.glue.extra-jars in the Spark config.
But when I try to register my UDF as
spark_session.udf.registerJavaFunction(name="square_test", javaClassName="SquareTest", returnType=pyspark.sql.types.IntegerType())
I get the error
AnalysisException: Can not load class SquareTest, please make sure it is on the classpath.
How do I add my JAR to the classpath?
Here is my Java code:
import org.apache.spark.sql.api.java.UDF1;

public class SquareTest implements UDF1<Integer, Integer> {
    @Override
    public Integer call(Integer number) throws Exception {
        return number * number;
    }
}
and this is how I compiled it:
javac -classpath lib/spark-sql_2.12-3.4.1.jar -d ./build SquareTest.java
jar cvf SquareTest.jar build/*
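Note that jar cvf SquareTest.jar build/* records the build/ prefix in the entry paths, so SquareTest.class ends up at build/SquareTest.class inside the JAR rather than at the root where the classloader looks for it (the thread below confirms the packaging was the problem; jar cvf SquareTest.jar -C build . would strip the prefix). A quick way to check the layout before uploading, sketched here with Python's zipfile module (file names are illustrative dummies, not real class files):

```python
import zipfile

def classes_at_root(jar_path: str) -> list:
    """Return the .class entries that sit at the root of the JAR."""
    with zipfile.ZipFile(jar_path) as jar:
        return [n for n in jar.namelist()
                if n.endswith(".class") and "/" not in n]

# Build two tiny JARs to illustrate the difference in entry paths.
with zipfile.ZipFile("nested.jar", "w") as jar:
    jar.writestr("build/SquareTest.class", b"dummy")  # like: jar cvf x.jar build/*
with zipfile.ZipFile("flat.jar", "w") as jar:
    jar.writestr("SquareTest.class", b"dummy")        # like: jar cvf x.jar -C build .

print(classes_at_root("nested.jar"))  # []
print(classes_at_root("flat.jar"))    # ['SquareTest.class']
```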
Unfortunately, this does not solve my problem. I get the same error:
In what package is your Java class? Maybe it is not in the
org.apache.spark.sql.api.java
package.
@acmanjon, I changed my Java code to
and the Python registration to
but I get the same error:
Indeed, it was a problem with my packaging. It now works with the latest version of my code from above, with the JAR contents looking like
Thanks for your help and fast replies, @acmanjon and Gonzalo Herreros!
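As a follow-up for anyone hitting the same error: the javaClassName passed to registerJavaFunction must exactly match the package path of the .class entry inside the JAR. A small helper sketch that derives the fully qualified class name from a JAR entry (the com/example paths are made-up examples, not names from this thread):

```python
def entry_to_class_name(entry: str) -> str:
    """Map a JAR entry like 'com/example/SquareTest.class'
    to the fully qualified name 'com.example.SquareTest'."""
    if not entry.endswith(".class"):
        raise ValueError("not a class entry: " + entry)
    return entry[:-len(".class")].replace("/", ".")

print(entry_to_class_name("com/example/SquareTest.class"))  # com.example.SquareTest
print(entry_to_class_name("SquareTest.class"))              # SquareTest
```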