Glue Job Spark SQL Issue

0

I am trying to complete an SCD2 type load in glue job. I am currently using glue studio notebook for development. For this process I am reading in the source and target data sources. The source is an athena catalog table being read in using

input_data = glueContext.create_dynamic_frame_from_catalog(source_database, source_table)

and the target data is a RDS Aurora MySQL public instance with connection information stored in glue connections and referenced in the job parameters which I am reading in as

sbv = glueContext.create_dynamic_frame.from_options(connection_type="mysql",
                                    connection_options =  {"url": target_connection_options["url"],
                                                           "user": target_connection_options["user"],
                                                           "password": target_connection_options["password"],
                                                           "dbtable": "xxxxxxxxxx"},
                                                           transformation_ctx = "sbv")

I am then converting each of the dynamic frames into data frames using the toDF() function and finally into views using the createOrReplaceTempView() function. When I execute a show() on either DF I am presented with the data, but when I try to query either of the views using something like spark.sql("select * from source").show(5) the script hangs up.

In the logs I am seeing the below warnings when I try to run these queries.

1) URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory
2) NoSuchMethodException was thrown when disabling normalizeUri.

Attempted Work Arounds:

  1. Instead of reading the target into a dynamic frame I read it into a data frame using the below. That did not resolve the issue.
sbv = spark.read.format("jdbc").option("url",  target_connection_options["url"]).option("dbtable","sales_by_vehicle").option("user",target_connection_options["user"]).option("password",target_connection_options["password"]).load()

2)I have tried removing the RDS glue connection info from the job and I was able to query the view of the catalog, but could not get data out of RDS.

What is causing this response and how can I get around it? I had previously done this for a mysql to mysql glue job, but had not run into this issue. Thank you for any responses!

asked 2 years ago929 views
1 Answer
0

Hi,

not sure about the error you see in the logs, you could try to use the new job run insights feature to see if it helps pinpoint better where the error happens and which command is causing it.

Since the issue seems related only to the SparkSQL portion, could you please confirm that in the job details under advanced properties the the check box "Use Glue data catalog as Hive metastore" is indeed checked?

In Glue Studio this is the default but if by chance it got unchecked then SparkSQL will not be able to create the view and run the query.

Are you setting a database to use to create the views? If not do you have access to the "default" database?

Last question are you using a custom transform or the SQL transform?

any additional details might help. Thank you

AWS
EXPERT
answered 2 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions