Does Redshift Spectrum support Incremental Ingestion

0

I have created an External Table using Redshift Spectrum and using AWS Glue to crawl deeply nested json files coming into s3 bucket every second.

I was able to populate a redshift table by extracting values from external table. But facing an issue with the incremental ingestion as all the values are loaded into Redshift everytime for new files being crawled.

How to capture the new data in s3 bucket so that only new columns got loaded into Redshift table.

Sample file: { "Messages":[ { "Attributes":{"docType":"Test"}, "Data":{ "key1":"value1", "key2":["a1":"v1", "a2":"v2"], "key3":{{"b1":"u1"}, "b2":[{"c1":"u2"}, {"c2":"u3"} ] } } }, { "Attributes":{"docType":"Test2"}, "Data":{ "key1":"value1", "key2":["a1":"v1", "a2":"v2"], "key3":{{"b1":"u1"}, "b2":[{"c1":"u2"}, {"c2":"u3"} ] } } } ] }

sravan
已提問 2 個月前檢視次數 207 次
2 個答案
0

You cannot just read the new columns, for that you would need a columnar format like parquet.
Also incremental ingestion normally refers to loading new files, for that you could use Glue bookmarks (running a Glue job instead of Spectrum) or putting new files on different folders(partitions) and telling Spectrum to load just that)

profile pictureAWS
專家
已回答 2 個月前
  • How to dynamically change the partition values so that we could automate this job !

  • If you mean filtering partitions, you would need to build your query with the values you need, for instance using the current date for date related columns

0
profile pictureAWS
專家
已回答 2 個月前
  • We want to explore the options to load the data without flattening the json using AWS Glue job to reduce the billing

您尚未登入。 登入 去張貼答案。

一個好的回答可以清楚地回答問題並提供建設性的意見回饋,同時有助於提問者的專業成長。

回答問題指南