Continuously replicate AWS Aurora MySQL to Redshift

0

Is there an easy way to tell Glue to copy entire Aurora MySQL schema/all tables only over to Redshift? Also, how do I tell it to run continuously and update Redshift?

jeffph
질문됨 5년 전678회 조회
1개 답변
0
수락된 답변

This answer assumes you already have Glue connections to both Aurora and Redshift set up and have created a crawler for the Aurora database with an include path similar to "schema/%" to fetch all tables in that schema.

Given that, when you create a Glue Job in the console, you select a single table from the Glue Data Catalog as the source. As the target, you choose to "Create tables in your data target" and select the Redshift connection previously created and specify the target table. Once that Glue Job is created, you can create a Trigger to run it on a scheduled basis.

Now you're set up for one table, but if you wanted to do this for every table in that schema, you would have to modify the Glue script and either hard-code the list of tables that you want to sync and loop through them in the code, or you could also fetch the list of tables from the Glue API using the boto3 client if you're using pyspark.

If you want to run this continuously, there are a couple things to mention:

  1. For tables that have sequential primary keys, you can enable job bookmarks as mentioned here such that only new data is inserted into Redshift: https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html
  2. If that's not an option, you'll likely need to truncate the table on every run. You can do this by adding a "preactions" key to the connection options when writing the dynamic frame.
"preactions":"truncate table target_table;"

Without doing one of those, duplicate data will continue to get loaded into Redshift.

For ongoing replication, Database Migration Service may be an option here. I realize there's more infrastructure to set up, but it replicates data using the binary logs and can be more efficient as a result. It can also do this continuously if near-real-time replication is important to you.

AWS
dacort
답변함 5년 전
profile picture
전문가
검토됨 9달 전
  • Hi, I have the duplication problem, but I am using MySQL instead of Redshift as a target, I tried to use preactions but that doesn't work, is there any similar option or another way to avoid this problem?

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠