AWS Glue Data Catalog Table Version comparison


I'm inquiring whether it's possible to access the previous version of the catalog table in the ETL job to examine a specific column's content. Presently, as I'm updating the table from the raw bucket to the processed one, the older values of a record are being replaced by the new values. The code segment for this process is:


Within the Spark-based ETL job responsible for writing data to Redshift, my goal is to compare the previous and current versions of a record each time and verify whether the values have changed or not.

asked 4 months ago, 289 views
1 Answer
Accepted Answer

The catalog versions don't tell you anything about the data values. It's also not clear how Redshift relates to the Hudi table timeline. It sounds like you are looking for something like the "MERGE INTO" command, which gives you more control over the upsert.

AWS
answered 4 months ago
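
For illustration, the "MERGE INTO" idea mentioned in the answer could look roughly like the sketch below. It only builds the Spark SQL statement as a string; the helper name `build_merge_sql`, the table and view names, and the column names are all hypothetical and not from the original post. It assumes the Hudi Spark SQL extensions are enabled in the job, and the exact clauses supported may differ between Hudi versions.

```python
from typing import Sequence


def build_merge_sql(target: str, source: str, key: str,
                    compare_cols: Sequence[str]) -> str:
    """Build a MERGE INTO statement that updates a row only when at least
    one of the compared columns differs between target (t) and source (s).

    All names passed in are placeholders chosen by the caller.
    """
    if not compare_cols:
        raise ValueError("compare_cols must contain at least one column")
    # `<=>` is Spark SQL's null-safe equality, so NULL vs NULL counts as unchanged.
    changed = " OR ".join(f"NOT (t.{c} <=> s.{c})" for c in compare_cols)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        f"WHEN MATCHED AND ({changed}) THEN UPDATE SET *\n"
        f"WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical usage inside the ETL job (needs a SparkSession named `spark`):
# spark.sql(build_merge_sql("processed.customers", "incoming_batch",
#                           "id", ["email", "status"]))
```

The condition on the matched branch is what gives control over the upsert: rows whose compared columns are unchanged are left alone instead of being overwritten.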
  • Thank you for your response! Sorry, I think my question needs a bit of clarification!

    I already have different parquet files in the S3 bucket, which I saved using
    'hoodie.cleaner.commits.retained': 5, 'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS'

    but when I load the same table, I am getting only the latest commit:

    df = spark.read.format('org.apache.hudi').option("", "_some_early_commits_time").load("s3://bucket/path-to-hudi-table/")

    but I was hoping to get all the commits when loading!

  • You normally only use one commit, since COW overwrites the file and on MOR you have the commit plus deltas. To view the history you need to use the timeline. This is purely a Hudi question.

  • Great! Thank you for your answer! Timeline was exactly what I was looking for :)
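
As a rough sketch of what reading older commits can look like from PySpark: Hudi exposes an incremental query type and a time-travel read through datasource options. The option keys below are Hudi's documented read options as far as I know; the helper names (`hudi_incremental_options`, `hudi_time_travel_options`, `read_hudi`) and the instant values are hypothetical. Whether older file versions are still readable depends on the cleaner settings (for example the retained commits mentioned above).

```python
from typing import Dict, Optional


def hudi_incremental_options(begin_instant: str,
                             end_instant: Optional[str] = None) -> Dict[str, str]:
    """Options for an incremental query: records changed after begin_instant
    (and up to end_instant, if given)."""
    opts = {
        "hoodie.datasource.query.type": "incremental",
        "hoodie.datasource.read.begin.instanttime": begin_instant,
    }
    if end_instant is not None:
        opts["hoodie.datasource.read.end.instanttime"] = end_instant
    return opts


def hudi_time_travel_options(as_of_instant: str) -> Dict[str, str]:
    """Options for a time-travel read: the table as of a given instant."""
    return {"as.of.instant": as_of_instant}


def read_hudi(spark, path: str, options: Dict[str, str]):
    """Load a Hudi table with the given read options.

    `spark` is an existing SparkSession with the Hudi bundle on the classpath.
    """
    return spark.read.format("hudi").options(**options).load(path)


# Hypothetical usage (instant times are placeholders):
# old_df = read_hudi(spark, "s3://bucket/path-to-hudi-table/",
#                    hudi_time_travel_options("20230101000000"))
# changes_df = read_hudi(spark, "s3://bucket/path-to-hudi-table/",
#                        hudi_incremental_options("20230101000000"))
```

Joining a time-travel snapshot with the latest snapshot on the record key is one way to check whether a record's values changed between two commits.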
