Set classpath for dependent JARs in AWS Glue


I need to run Java code in my (PySpark) Glue job, and I tried to follow the tutorial https://xctorres.github.io/2021/11/07/jar-in-pyspark/. (By the way, there seems to be no Glue-specific documentation on how to work with JARs configured as dependent JARs in the Glue job details.)

I uploaded my JAR to S3 and configured its S3 URI as a dependent JAR in my job details. I also checked with print(spark_context.getConf().getAll()) that it appears as spark.glue.extra-jars in the Spark config.

But when I try to register my UDF as

spark_session.udf.registerJavaFunction(name="square_test", javaClassName="SquareTest", returnType=pyspark.sql.types.IntegerType())

I get the error

AnalysisException: Can not load class SquareTest, please make sure it is on the classpath.

How do I add my JAR to the classpath?

Here is my Java code:

import org.apache.spark.sql.api.java.UDF1;

public class SquareTest implements UDF1<Integer, Integer>{

    @Override
    public Integer call(Integer number) throws Exception {
        return number*number;
    }
}

and this is how I compiled it:

javac -classpath lib/spark-sql_2.12-3.4.1.jar -d ./build SquareTest.java
jar cvf SquareTest.jar build/*
asked 10 months ago, 1056 views
1 Answer
Accepted Answer

According to the Spark documentation, javaClassName must be the fully qualified class name, but you are only specifying the simple class name.
Try with javaClassName="org.apache.spark.sql.api.java.UDF1.SquareTest"

AWS EXPERT
answered 10 months ago
  • Unfortunately, this does not solve my problem. I get the same error:

    2023-07-20 10:21:23,724 ERROR [main] glue.ProcessLauncher (Logging.scala:logError(73)): Error from Python:Traceback (most recent call last):
      File "/tmp/PySparkTesting2.py", line 37, in <module>
        spark_session.udf.registerJavaFunction(name="square_test", javaClassName="org.apache.spark.sql.api.java.UDF1.SquareTest", returnType=pyspark.sql.types.IntegerType())
      File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 409, in registerJavaFunction
        self.sparkSession._jsparkSession.udf().registerJava(name, javaClassName, jdt)
      File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/opt/amazon/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
        raise converted from None
    pyspark.sql.utils.AnalysisException: Can not load class org.apache.spark.sql.api.java.UDF1.SquareTest, please make sure it is on the classpath
    
  • In what package is your Java class?

    Maybe it is not in the org.apache.spark.sql.api.java package.

  • @acmanjon, I changed my Java code to

    package mytest;
    
    import org.apache.spark.sql.api.java.UDF1;
    
    public class SquareTest implements UDF1<Integer, Integer>{
    
        @Override
        public Integer call(Integer number) throws Exception {
            return number*number;
        }
    }
    

    and the Python registration to

    spark_session.udf.registerJavaFunction(name="square_test", javaClassName="mytest.SquareTest", returnType=pyspark.sql.types.IntegerType())
    

    but I get the same error:

    AnalysisException: Can not load class mytest.SquareTest, please make sure it is on the classpath.
    
  • Indeed, it was a problem with my packaging. It now works with the latest version of my code from above, with the JAR contents looking like this:

    $ jar -tf SquareTest.jar
    META-INF/
    META-INF/MANIFEST.MF
    mytest/
    mytest/SquareTest.class
    

    Thanks for your help and fast replies, @acmanjon and Gonzalo Herreros!
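
The fix in this thread comes down to the entry path inside the JAR matching the fully qualified class name: mytest.SquareTest has to be stored as mytest/SquareTest.class. Below is a minimal sketch of how one might check that locally before uploading the JAR to S3. It uses only Python's standard zipfile module; the helper names (class_entry_path, jar_has_class, misplaced_entries) are made up for this illustration and are not part of Glue or PySpark.

```python
import zipfile


def class_entry_path(java_class_name: str) -> str:
    """Map a fully qualified class name to the entry path a JAR must contain."""
    return java_class_name.replace(".", "/") + ".class"


def jar_has_class(jar_path: str, java_class_name: str) -> bool:
    """Return True if the JAR stores the class at the path the class loader looks up."""
    with zipfile.ZipFile(jar_path) as jar:
        return class_entry_path(java_class_name) in jar.namelist()


def misplaced_entries(jar_path: str, java_class_name: str) -> list:
    """List entries that hold the class under an extra leading directory (e.g. build/)."""
    expected = class_entry_path(java_class_name)
    with zipfile.ZipFile(jar_path) as jar:
        return [
            name
            for name in jar.namelist()
            if name != expected and name.endswith("/" + expected)
        ]
```

For example, jar_has_class("SquareTest.jar", "mytest.SquareTest") should be True for the working JAR listed above. A JAR built with jar cvf SquareTest.jar build/* stores its entries under a build/ prefix, which is probably why the class could not be loaded; misplaced_entries would report such an entry. One way to avoid the prefix is jar cvf SquareTest.jar -C build . which archives the contents of build/ rather than the directory itself.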
