AWS Glue job/notebook access to Spark package (Apache Sedona, etc.)

1

Using AWS Glue (notebooks for testing and jobs for production), how can Spark packages such as Apache Sedona be set up properly?

For reference, on a local machine, the following are needed to use the Apache Sedona package with Apache Spark:

  • pip install apache-sedona
  • Spark session config settings:
        .config(
            "spark.jars.packages",
            "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.1-incubating,org.datasyslab:geotools-wrapper:1.3.0-27.2",
        )
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
  • Imports and registration
from sedona.register import SedonaRegistrator

SedonaRegistrator.registerAll(spark)
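Put together, the local setup looks roughly like this (a sketch; the app name is just a placeholder and the versions are the ones listed above):

    from pyspark.sql import SparkSession
    from sedona.register import SedonaRegistrator

    spark = (
        SparkSession.builder
        .appName("sedona-local")  # placeholder app name
        .config(
            "spark.jars.packages",
            "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.3.1-incubating,"
            "org.datasyslab:geotools-wrapper:1.3.0-27.2",
        )
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.kryo.registrator", "org.apache.sedona.core.serde.SedonaKryoRegistrator")
        .getOrCreate()
    )

    # Register Sedona's SQL types and functions on the session
    SedonaRegistrator.registerAll(spark)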

With Glue notebooks, I have tried using the %spark_conf option. With Glue jobs, I have tried using --spark-conf and --extra-jars.

Help is appreciated as Glue has very minimal documentation.

  • @Jaime, can you please expand on which versions of Glue/Spark, Apache Sedona, and additional jars you used? I'm trying the exact same thing in a Glue 4.0 job (Spark 3.3) with Sedona 1.5.0, which I installed using the job parameters:

    --additional-python-modules | apache-sedona==1.5.0

    --extra-jars | s3://<bucket>/jars/sedona-spark-shaded-3.0_2.12-1.5.0.jar,s3://<bucket>/jars/geotools-wrapper-1.5.0-28.2.jar

    But I can't seem to get it to work, even for just starting the SedonaContext, though I don't get any import errors.

    from sedona.spark import *
    
    config = SedonaContext.builder(). \
        config('spark.jars.packages',
               's3://<bucket>/jars/sedona-spark-shaded-3.0_2.12-1.5.0.jar,'
               's3://<bucket>/jars/geotools-wrapper-1.5.0-28.2.jar'). \
        getOrCreate()
    

    For this I get: NameError: name 'SedonaContext' is not defined. Your help would be most appreciated!
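    One hedged way to narrow down that NameError (a sketch only, not verified on Glue) is to import the class explicitly instead of via the wildcard, so a missing or mismatched apache-sedona install fails right at the import:

    # If apache-sedona 1.5.0 was actually installed by --additional-python-modules,
    # this explicit import should succeed; an ImportError here points at the
    # Python module rather than the jars.
    from sedona.spark import SedonaContext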

Jaime
asked a year ago · 1,482 views
3 Answers
1

Glue doesn't allow dynamic loading of packages using "spark.jars.packages".
To add dependencies you need to use the magics %additional_python_modules and %extra_jars (more info: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html).

In the case of Python you can reference pip modules directly, but for the jars it unfortunately doesn't accept Maven coordinates; you need to download the jars, put them on S3, and then reference them using %extra_jars.
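For example, in a notebook cell (a sketch; the jar file names are the ones used elsewhere in this thread and <bucket> is a placeholder):

    %additional_python_modules apache-sedona==1.5.0
    %extra_jars s3://<bucket>/jars/sedona-spark-shaded-3.0_2.12-1.5.0.jar,s3://<bucket>/jars/geotools-wrapper-1.5.0-28.2.jar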

AWS
EXPERT
answered a year ago
  • Thank you for this helpful info, Gonzalo.

    As for the Glue interactive notebook "magics": is it a bug that %additional_python_modules only allows a single package and not multiple packages separated by commas? Also, in the notebook it does not allow installing a package while specifying a custom index URL (even wrapped in double quotes) because it complains about the space.

  • And in a Glue job, the --extra-jars param only allows 256 characters, which is only enough for one or two jar S3 paths.

  • Also, the --extra-jars key and value disappear after saving the Glue job (in the console) and running the job. When I go to try again, the param is gone.

  • The Python modules parameter expects modules (and optionally versions); I believe there was another parameter when using jobs to specify a custom index, but I don't remember it. The reason --extra-jars disappears in the job is that it has its own configuration box and is moved there. Maybe the magic has a length limitation; I've seen people add lots of jars directly in jobs. Most plugins offer an "uberjar" version for convenience, or you can build one yourself.

  • Gonzalo, I was able to get this to work in my notebook using the following:

    • %spark_conf (only allows a single value; no comma-separated list)
    • %extra_jars (comma-separated S3 paths pointing directly to the two jars; a .zip did not work)

    However, the same notebook run as a job fails because it cannot find the sedona module. Is there perhaps a bug in AWS Glue jobs? If I use the equivalent params in a separate test job, it also fails to find the Spark jars.

    • --spark-conf
    • --extra-jars
0

Better to avoid using %spark_conf for libraries; Glue has a special way of handling dependencies. For the job, use --extra-jars for Java/Scala jars and --extra-py-files for Python zip files. More info here: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html
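A hedged example of the job parameters, following the key | value convention used earlier in this thread (the jar paths are the ones from the question's comments; the .zip path is a made-up placeholder for any pure-Python dependencies):

    --extra-jars | s3://<bucket>/jars/sedona-spark-shaded-3.0_2.12-1.5.0.jar,s3://<bucket>/jars/geotools-wrapper-1.5.0-28.2.jar

    --extra-py-files | s3://<bucket>/python/<your-python-deps>.zip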

AWS
EXPERT
answered a year ago
  • --extra-py-files totally worked for the Sedona-related .jar files I downloaded. Thank you so much!

    The primary things left to figure out are:

    • custom Spark conf settings (e.g., spark.serializer=org.apache.spark.serializer.KryoSerializer, spark.logConf=true, spark.sql.sources.partitionOverwriteMode=dynamic, etc.)
    • installing Sedona works via --additional-python-modules, but what about private packages with a custom index URL?
  • My preferred option to set that is using SparkSession.builder (since the serializer cannot be changed once the session is created), but you can also use --conf or the magic if you want. I think you can do a custom index using --python-modules-installer-option (see further info in the docs; a sketch follows the next comment), and remember you can always download the whl to S3 and reference it there.

  • When requiring multiple Spark config settings in a Glue job, the following works. However, I am not aware of how to specify multiple Spark config settings in a Glue notebook; there, %spark_conf works but only for a single config setting.

        from awsglue.context import GlueContext
        from pyspark.context import SparkContext

        # Reuse the SparkContext/GlueContext that Glue provides
        spark_context = SparkContext.getOrCreate()
        glue_context = GlueContext(spark_context)

        # Layer the extra Spark settings on top of Glue's session
        spark_session = (
            glue_context.spark_session.builder.appName("<session_name>")  # placeholder name
            .config("spark.logConf", "true")
            .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
            .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
            .config(
                "spark.kryo.registrator",
                "org.apache.sedona.core.serde.SedonaKryoRegistrator",
            )
            .enableHiveSupport()
            .getOrCreate()
        )
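    On the custom-index question above, a hedged sketch of those job parameters (the package name and index URL are placeholders, and the exact installer-option syntax should be double-checked against the Glue docs):

        --additional-python-modules | <private-package>==1.0.0

        --python-modules-installer-option | --index-url=https://<private-index>/simple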
    
0

Finally I made it work. The only option to set multiple configurations is to concatenate them into a single string, the same way as spark-shell, for example:

%spark_conf spark.sql.shuffle.partitions=4 --conf spark.executor.memory=2g

The other alternatives suggested above simply have no effect.
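Applied to the Sedona-related settings discussed earlier in this thread, that pattern would look roughly like this (a sketch; values copied from the comments above, chained with --conf):

%spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.kryo.registrator=org.apache.sedona.core.serde.SedonaKryoRegistrator --conf spark.sql.sources.partitionOverwriteMode=dynamic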

Maor
answered 4 months ago
