1 Answer
The catalog versions don't tell you anything about the data values. It's also not clear how Redshift relates to the Hudi table timeline. It sounds like you are looking for something like the MERGE INTO command to get more control over the upsert.
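For context, Hudi exposes MERGE INTO through its Spark SQL extensions. Below is a minimal sketch; the table name `target_table`, the staged view `updates_view`, and the key column `record_id` are all hypothetical placeholders:

```python
# Minimal sketch of a Hudi upsert via MERGE INTO (Spark SQL).
# Names used here (target_table, updates_view, record_id) are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-merge-sketch")
    # Hudi's Spark SQL extension is required for MERGE INTO on Hudi tables.
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# The incoming changes are assumed to be registered as a temp view beforehand:
# updates_df.createOrReplaceTempView("updates_view")
spark.sql("""
    MERGE INTO target_table AS t
    USING updates_view AS s
    ON t.record_id = s.record_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```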
Thank you for your response! Sorry, I think my question needs a bit of clarity!

I already have different parquet files in the S3 bucket, which I saved using

```
'hoodie.cleaner.commits.retained': 5, 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS'
```

but when I load the same table, I am getting only the latest commit:

```python
df = (
    spark.read.format('org.apache.hudi')
    .option("hoodie.datasource.read.begin.instanttime", "_some_early_commits_time")
    .load("s3://bucket/path-to-hudi-table/")
)
```

but I was hoping to get all the commits when loading!
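As a point of reference, hoodie.datasource.read.begin.instanttime is only honored for incremental queries; a plain snapshot read returns the latest state regardless. A minimal sketch under that assumption, with the path and instant time as placeholders:

```python
# Sketch of an incremental read that returns records written after a given
# commit instant, assuming those commits are still retained by the cleaner.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-read-sketch").getOrCreate()

incremental_df = (
    spark.read.format("org.apache.hudi")
    # Without this, Hudi runs a snapshot query and begin.instanttime is ignored.
    .option("hoodie.datasource.query.type", "incremental")
    # "000" is the conventional placeholder for "all retained commits after the beginning".
    .option("hoodie.datasource.read.begin.instanttime", "000")
    .load("s3://bucket/path-to-hudi-table/")
)
incremental_df.show()
```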
You normally only read one commit: on COW each new commit overwrites the file, and on MOR you have a base file plus deltas. To view the history you need to use the timeline: https://hudi.apache.org/docs/timeline/. This is purely a Hudi question.
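To make that concrete, here is a minimal sketch of inspecting the timeline from Spark. It assumes a recent Hudi build with the Spark SQL extensions enabled, and the catalog table name `hudi_table` is hypothetical:

```python
# Sketch: listing commit instants from the Hudi timeline.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-timeline-sketch")
    # The extension exposes SQL procedures such as show_commits in recent Hudi versions.
    .config("spark.sql.extensions", "org.apache.spark.sql.hudi.HoodieSparkSessionExtension")
    .getOrCreate()
)

# "hudi_table" is a hypothetical catalog table pointing at the S3 path above.
spark.sql("CALL show_commits(table => 'hudi_table', limit => 10)").show()

# Alternatively, the timeline is just metadata files under the table path,
# e.g. s3://bucket/path-to-hudi-table/.hoodie/<instant>.commit, so listing
# that prefix also shows which commits the cleaner has retained.
```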
Great! Thank you for your answer! Timeline was exactly what I was looking for :)