Glue Job Spark SQL Issue


I am trying to complete an SCD2-type load in a Glue job. I am currently using a Glue Studio notebook for development. For this process I am reading in the source and target data sources. The source is an Athena catalog table, which I am reading in using

input_data = glueContext.create_dynamic_frame_from_catalog(source_database, source_table)

and the target is an RDS Aurora MySQL public instance, with connection information stored in Glue connections and referenced in the job parameters, which I am reading in as

sbv = glueContext.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options={"url": target_connection_options["url"],
                        "user": target_connection_options["user"],
                        "password": target_connection_options["password"],
                        "dbtable": "xxxxxxxxxx"},
    transformation_ctx="sbv")

I am then converting each of the dynamic frames into data frames using toDF() and finally into views using createOrReplaceTempView(). When I execute a show() on either data frame I am presented with the data, but when I try to query either of the views with something like spark.sql("select * from source").show(5), the script hangs.
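For context, the conversion and query steps look roughly like this (a simplified sketch of what I am running; the view names "source" and "target" are just placeholders):

# Convert the DynamicFrames to Spark DataFrames
source_df = input_data.toDF()
target_df = sbv.toDF()

# show() on either DataFrame returns data as expected
source_df.show(5)
target_df.show(5)

# Register session-scoped temp views so they can be queried with Spark SQL
source_df.createOrReplaceTempView("source")
target_df.createOrReplaceTempView("target")

# This is the point where the script hangs
spark.sql("select * from source").show(5)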

In the logs I am seeing the below warnings when I try to run these queries.

1) URL.setURLStreamHandlerFactory failed to set FsUrlStreamHandlerFactory
2) NoSuchMethodException was thrown when disabling normalizeUri.

Attempted workarounds:

1. Instead of reading the target into a dynamic frame, I read it into a data frame using the code below. That did not resolve the issue.
sbv = spark.read.format("jdbc") \
    .option("url", target_connection_options["url"]) \
    .option("dbtable", "sales_by_vehicle") \
    .option("user", target_connection_options["user"]) \
    .option("password", target_connection_options["password"]) \
    .load()

2. I tried removing the RDS Glue connection info from the job; I was then able to query the view of the catalog table, but of course could not get data out of RDS.

What is causing this behavior and how can I get around it? I had previously done this in a MySQL-to-MySQL Glue job and did not run into this issue. Thank you for any responses!

Asked 2 years ago · 990 views
1 Answer

Hi,

I am not sure about the errors you see in the logs. You could try the new job run insights feature to see if it helps pinpoint where the error happens and which command is causing it.

Since the issue seems related only to the Spark SQL portion, could you please confirm that in the job details, under advanced properties, the check box "Use Glue data catalog as the Hive metastore" is indeed checked?

In Glue Studio this is the default, but if by chance it got unchecked then Spark SQL will not be able to create the view and run the query.
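As a quick sanity check from the notebook (a minimal sketch, assuming the SparkSession is already available as spark), you can look at which catalog implementation the session was started with; "hive" indicates Hive support is on, while "in-memory" suggests the Data Catalog integration is not active. In a job this corresponds to the --enable-glue-datacatalog job parameter.

# Read-only check of the catalog implementation the session was started with
print(spark.conf.get("spark.sql.catalogImplementation", "in-memory"))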

Are you setting a database in which to create the views? If not, do you have access to the "default" database?
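If it helps to narrow that down, something like the following (just a sketch) lists the databases the session can see and the tables and temp views registered in the current database:

# Databases visible to the Spark session (Glue Data Catalog databases
# should appear here when the catalog integration is enabled)
spark.sql("SHOW DATABASES").show()

# Tables in the current database plus any session-scoped temp views
for t in spark.catalog.listTables():
    print(t.name, t.tableType, t.isTemporary)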

Last question: are you using a custom transform or the SQL transform?

Any additional details might help. Thank you.

AWS
EXPERT
Answered 2 years ago
