Output from Athena to RDS


I need to build a solution where, for a given Athena query, the output is inserted into RDS (PostgreSQL). From what I have read, it makes sense to use Glue. However, I don't fully understand the exact integration. Each query run produces output that is supposed to end up in RDS. Would that mean Athena creating a table in Glue? I would rather insert new records each time than create a new table.

Przemek
Asked 7 months ago · 232 views
2 Answers

Hello,

To store the results of a query run in Amazon Athena in an RDS table, you can certainly make use of the AWS Glue service. In a Glue job, you can run an Athena query to read from a catalog table using 'awswrangler' and load the results into a DataFrame like below:

import awswrangler as wr
# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

The data in this DataFrame 'df' can then be written into RDS using either awswrangler or a PySpark write. Please refer to the references below:

[+] awswrangler: https://pypi.org/project/awswrangler/
[+] Write records stored in a DataFrame into MySQL using awswrangler: https://aws-sdk-pandas.readthedocs.io/en/stable/stubs/awswrangler.mysql.to_sql.html#awswrangler.mysql.to_sql
[+] Write PySpark DataFrame to MySQL Database Table: https://sparkbyexamples.com/pyspark/pyspark-read-and-write-mysql-database-table/

Please note that both the PySpark writer and awswrangler provide multiple write modes, such as append and overwrite. Depending on your use case, you can either append the records to the data already in the table or overwrite the entire dataset.
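As an illustration, here is a minimal sketch of the full read-and-write step using awswrangler's PostgreSQL module. The Glue connection name, schema, and table names are placeholders for this example, not values from your environment:

```python
def copy_athena_to_rds(sql, database, glue_connection, schema, table, mode="append"):
    """Run an Athena query and load its results into an RDS Postgres table.

    mode="append" adds new records on each run; mode="overwrite"
    replaces the table contents instead.
    """
    import awswrangler as wr  # imported here so the sketch stays importable without awswrangler installed

    # Read the Athena query output into a pandas DataFrame
    df = wr.athena.read_sql_query(sql, database=database)

    # Connect via a Glue connection that points at the RDS instance
    con = wr.postgresql.connect(glue_connection)
    try:
        wr.postgresql.to_sql(df=df, con=con, schema=schema, table=table, mode=mode)
    finally:
        con.close()
    return len(df)

# Example call (all names are placeholders):
# copy_athena_to_rds("SELECT * FROM my_table", "my_db",
#                    "my-rds-connection", "public", "athena_results")
```

The Glue connection keeps the database credentials out of the job script; alternatively, `wr.postgresql.connect` can resolve credentials from a Secrets Manager secret.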

Would that be Athena creating a table in Glue? --> No, creating a Glue table is not required when writing to a JDBC target such as RDS.

AWS
Anu_C
Answered 7 months ago

Unless you are talking about a big, complex query that Spark SQL cannot run as efficiently, I would do both steps in Glue. Using both tools at the same time is not ideal in terms of either integration or efficiency (while Athena runs the query, Glue is just waiting).
You don't specify what you are querying; perhaps you can use Glue job bookmarks to do incremental updates.
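A Glue-only job along those lines might look like the sketch below: Spark SQL reads the catalog table directly (assuming the job is configured to use the Glue Data Catalog as its Hive metastore), and the DataFrame is written to RDS over JDBC. The database, table, and JDBC option values are placeholders:

```python
def jdbc_options(url, user, password, table="public.athena_results"):
    # Options for Spark's JDBC writer; the driver class assumes the
    # PostgreSQL JDBC jar is available on the Glue job's classpath
    return {
        "url": url,          # e.g. jdbc:postgresql://host:5432/mydb
        "dbtable": table,
        "user": user,
        "password": password,
        "driver": "org.postgresql.Driver",
    }

def run_job(spark, jdbc_url, db_user, db_password):
    # Spark SQL against the Glue Data Catalog replaces the Athena query
    df = spark.sql("SELECT * FROM my_db.my_table")

    writer = df.write.format("jdbc").mode("append")  # or .mode("overwrite")
    for key, value in jdbc_options(jdbc_url, db_user, db_password).items():
        writer = writer.option(key, value)
    writer.save()
```

In a real Glue job you would obtain `spark` from the `GlueContext` and pull the credentials from a Glue connection or Secrets Manager rather than passing them in as plain arguments.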

An easier way to integrate would be a Lambda function that runs the Athena query and then starts a Glue job, passing the S3 location where Athena stored the query results.
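That orchestration could be sketched roughly as below with boto3. The query string, workgroup, output bucket, Glue job name, and the `--input_path` argument are all placeholders; a Step Functions state machine would be more robust than in-Lambda polling for long-running queries:

```python
import time

def build_glue_args(results_s3_path):
    # Glue job arguments must be prefixed with "--"
    return {"--input_path": results_s3_path}

def lambda_handler(event, context):
    import boto3  # imported lazily so the sketch stays importable without boto3
    athena = boto3.client("athena")
    glue = boto3.client("glue")

    # Start the Athena query; Athena writes its results to S3
    qid = athena.start_query_execution(
        QueryString="SELECT * FROM my_db.my_table",
        WorkGroup="primary",
        ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        status = athena.get_query_execution(QueryExecutionId=qid)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query ended in state {state}")

    # Hand the S3 results location to the Glue job that loads RDS
    output = status["QueryExecution"]["ResultConfiguration"]["OutputLocation"]
    glue.start_job_run(JobName="load-to-rds", Arguments=build_glue_args(output))
```

The Glue job then reads its `--input_path` argument (e.g. via `awsglue.utils.getResolvedOptions`) and loads that CSV output into RDS.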

AWS
Expert
Answered 7 months ago
