Amazon S3 cross-region replication has no knowledge of object dependencies, and there is no ordering guarantee in the replication process (older objects are not necessarily replicated before newer ones). As a result the Hudi metadata can land in the target region before the data files, and consumers will fail when they try to read data objects that are not there yet.
To guarantee consistent replication you need a pipeline, written in Spark or Flink, that reads from the source region and writes to the target region. In that case the transaction log (timeline) on the target may differ from the source if the pipeline runs at a different frequency than the writer on the source; a minimal sketch of such a pipeline is shown below.
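As an illustration only, here is a hedged PySpark sketch of that pipeline approach. The bucket names are the ones used later in this answer, the record key, precombine and partition fields are taken from the Hudi quickstart schema, and a production job would more likely use Hudi's incremental query to copy only new commits rather than re-upserting a full snapshot.

# Minimal sketch of a cross-region copy job (requires the Hudi Spark bundle on the
# classpath, e.g. --packages org.apache.hudi:hudi-spark3-bundle_2.12:<version>).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-cross-region-copy")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Snapshot-read the Hudi table from the source-region bucket and drop the Hudi
# metadata columns so the target table regenerates them on write.
src_df = (
    spark.read.format("hudi")
    .load("s3://source-bucket-name/hudi_trips_cow/")
    .drop("_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
          "_hoodie_partition_path", "_hoodie_file_name")
)

# Upsert the snapshot into a Hudi table in the target-region bucket.
(
    src_df.write.format("hudi")
    .option("hoodie.table.name", "hudi_trips_cow")
    .option("hoodie.datasource.write.recordkey.field", "uuid")            # adjust to your schema
    .option("hoodie.datasource.write.precombine.field", "ts")             # adjust to your schema
    .option("hoodie.datasource.write.partitionpath.field", "partitionpath")  # adjust to your schema
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("s3://replication-bucket-name/hudi_trips_cow/")
)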
An alternative solution, based on S3 CRR combined with Hudi savepoints, could look like this:
With S3 CRR enabled on your source bucket, your files will be continuously replicated over to the target bucket in a different region. Even though replication is not strongly consistent, it is eventually consistent over time. Hudi has a built-in savepoint feature that you can run against the source table at a regular interval, say every hour (these savepoints will be copied to the target region as well):
connect --path s3://source-bucket-name/hudi_trips_cow/
commits show
set --conf SPARK_HOME=/usr/lib/spark
savepoint create --commit <latest_time_from_commits_show> --sparkMaster local[2]
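If you would rather automate the hourly savepoint from a Spark job instead of the interactive CLI, newer Hudi releases also expose this as a Spark SQL call procedure. The snippet below is a hedged sketch only: it assumes Hudi 0.11+ with the Hudi SQL extension enabled, that the table is registered in the catalog as hudi_trips_cow, and the commit time is a placeholder for the value you would look up from the timeline (e.g. via commits show above).

# Hypothetical automation of the hourly savepoint via Hudi's create_savepoint procedure.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-hourly-savepoint")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    # The Hudi SQL extension is what makes the `call ...` procedures available.
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

latest_commit = "20240521093000000"  # placeholder for <latest_time_from_commits_show>

spark.sql(
    f"call create_savepoint(table => 'hudi_trips_cow', commit_time => '{latest_commit}')"
).show()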
And when stuff hits the fan and your source region goes down, you can use the rollback feature from Hudi through hudi-cli to restore the latest known good savepoint. This replicated savepoint will be consistent with what you had in your source bucket up until that point in time.
connect --path s3://replication-bucket-name/hudi_trips_cow/
savepoints show
savepoint rollback --savepoint <time_from_savepoints_show> --sparkMaster local[2]
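The same restore can also be scripted; here is a hedged sketch using the rollback_to_savepoint call procedure, with the same assumptions as the savepoint snippet above (Hudi 0.11+, SQL extension enabled, table registered in the target-region catalog, placeholder savepoint time).

# Hypothetical programmatic equivalent of `savepoint rollback` in hudi-cli, run against
# the replicated copy in the target region.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-restore-from-savepoint")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

savepoint_time = "20240521093000000"  # placeholder for <time_from_savepoints_show>

# Rolls the table back to the savepointed instant, discarding any commits that were only
# partially replicated after it.
spark.sql(
    f"call rollback_to_savepoint(table => 'hudi_trips_cow', instant_time => '{savepoint_time}')"
).show()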