S3 Hudi Replication and Failover

Two regions, with S3 replication (including replica modification sync) configured on the prefix where the Hudi dataset is located. Hudi writes are exclusive to a single region.

  • Will S3 replication maintain a consistent Hudi dataset in the replica region?
  • If we move Hudi writes to the replica region (failover), will the Hudi dataset stay consistent in the original region, maintained by replica modification sync from region 2 (failback)?
AWS
Asked 2 years ago · 448 views
2 Answers

Amazon S3 Cross-Region Replication has no knowledge of dependencies between objects, and there is no ordering guarantee in the replication process (older objects are not necessarily replicated before newer ones). As a result, the Hudi metadata can be replicated before the data, and consumers will fail when they try to read data objects that have not arrived yet.
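The ordering problem can be illustrated with a toy model (hypothetical file names and a simplified commit format, not Hudi's actual `.hoodie` layout): because each object replicates independently, some arrival orders let a reader see commit metadata that references data files which have not landed yet.

```python
# Toy model of unordered replication. File names and the commit format are
# hypothetical, not Hudi's actual .hoodie layout. S3 CRR replicates each
# object independently, so any arrival order is possible on the target.
import itertools
import json

source = {
    "data/part-001.parquet": "rows",
    "data/part-002.parquet": "rows",
    ".hoodie/commit-001.json": json.dumps(
        {"files": ["data/part-001.parquet", "data/part-002.parquet"]}
    ),
}

def is_consistent(target):
    """A reader trusts commit metadata: every file it lists must exist."""
    meta = target.get(".hoodie/commit-001.json")
    if meta is None:
        return True  # no commit visible yet -> nothing to read, which is fine
    return all(f in target for f in json.loads(meta)["files"])

broken = 0
for order in itertools.permutations(source):
    target = {}
    saw_broken_state = False
    for key in order:  # objects arrive one at a time, in this order
        target[key] = source[key]
        if not is_consistent(target):
            saw_broken_state = True
    broken += saw_broken_state
print(f"{broken} of 6 possible arrival orders expose a broken commit to readers")
```

Only the orders where the commit metadata arrives last are safe; every other order exposes a window in which readers see a commit pointing at missing objects.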

To ensure consistent replication you need a pipeline, written in Spark or Flink, that reads from the source region and writes to the target region. In that case the transaction log may differ on the target if the pipeline runs at a different frequency than the writer on the source.
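The key property such a pipeline provides, and raw CRR does not, is write ordering on the target. A stdlib-only sketch of that rule, with hypothetical key names (a real pipeline would use Hudi's incremental queries in Spark or Flink and the S3 API): copy a commit's data files first, and publish the commit metadata last, so a reader on the target never sees metadata referencing missing files.

```python
# Sketch of the ordering rule a consistency-preserving copy must enforce.
# Dicts stand in for buckets; key names are hypothetical, not Hudi's layout.
import json

def copy_commit(source, target, commit_key):
    listed = json.loads(source[commit_key])["files"]
    for data_key in listed:                  # 1. copy every referenced data file
        target[data_key] = source[data_key]
    target[commit_key] = source[commit_key]  # 2. publish the commit metadata last

source = {
    "data/part-001.parquet": "rows",
    ".hoodie/commit-001.json": json.dumps({"files": ["data/part-001.parquet"]}),
}
target = {}
copy_commit(source, target, ".hoodie/commit-001.json")
print(sorted(target))  # data and metadata both present; metadata was written last
```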

AWS
Answered 2 years ago

A potential solution could look like this:

[Architecture diagram from the original answer not available]

With S3 CRR enabled on your source bucket, your files will be continuously replicated to the target bucket in a different region. Even though this is not strongly consistent, it is eventually consistent over time. Hudi has a built-in savepoint feature, which you can trigger on the source bucket at a regular interval, say every hour (these savepoints are copied to the target region as well):

connect --path s3://source-bucket-name/hudi_trips_cow/
commits show
set --conf SPARK_HOME=/usr/lib/spark
savepoint create --commit <latest_time_from_commits_show> --sparkMaster local[2]

And when disaster strikes and your source region goes down, you can use Hudi's rollback feature through hudi-cli and restore to the latest known good savepoint. This replicated savepoint will be consistent with what you had in your source bucket up to that point in time.

connect --path s3://replication-bucket-name/hudi_trips_cow/
savepoints show
savepoint rollback --savepoint <time_from_savepoints_show> --sparkMaster local[2]
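The restore decision behind those commands can be sketched in a few lines (a simplified stand-in for Hudi's timeline, not the real hudi-cli or API): pick the latest savepoint and discard every commit newer than it, since commits after the savepoint may be only partially replicated.

```python
# Sketch of the failover restore logic. Plain timestamp strings stand in
# for Hudi timeline instants; this is not Hudi's actual API.
def restore_to_latest_savepoint(commits, savepoints):
    if not savepoints:
        raise ValueError("no savepoint to restore to")
    target = max(savepoints)                    # latest known-good instant
    kept = [c for c in commits if c <= target]  # drop newer, possibly partial commits
    return target, kept

commits = ["20240101T0000", "20240101T0100", "20240101T0200", "20240101T0230"]
savepoints = ["20240101T0000", "20240101T0100", "20240101T0200"]  # hourly cadence
target, kept = restore_to_latest_savepoint(commits, savepoints)
print("restore to", target, "keeping", len(kept), "commits")
```

The commit at 20240101T0230 is discarded: it happened after the last savepoint, so there is no guarantee all of its objects reached the replica bucket.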
Saawgr
Answered 5 months ago
