Does Redshift Spectrum support incremental ingestion?

Question

I have created an external table with Redshift Spectrum, and I use AWS Glue to crawl deeply nested JSON files that arrive in an S3 bucket every second.

I was able to populate a Redshift table by extracting values from the external table. However, I am facing an issue with incremental ingestion: every time new files are crawled, all the values are loaded into Redshift again.

How can I capture only the new data in the S3 bucket, so that only the new columns are loaded into the Redshift table?

Sample file:
{
	"Messages":[
			{
				"Attributes":{"docType":"Test"},
				"Data":{
					"key1":"value1",
					"key2":[{"a1":"v1"},
							{"a2":"v2"}],
					"key3":{"b1":"u1",
							"b2":[{"c1":"u2"},
								  {"c2":"u3"}
							]
					}
				}
			},
			{
				"Attributes":{"docType":"Test2"},
				"Data":{
					"key1":"value1",
					"key2":[{"a1":"v1"},
							{"a2":"v2"}],
					"key3":{"b1":"u1",
							"b2":[{"c1":"u2"},
								  {"c2":"u3"}
							]
					}
				}
			}
	]
}

Answer

You cannot read just the new columns from JSON files; for that you would need a columnar format such as Parquet.
Also, incremental ingestion normally refers to loading only new files. For that you could use Glue job bookmarks (running a Glue job instead of Spectrum), or put new files in different folders (partitions) and tell Spectrum to load just those.
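
To illustrate the partition approach, here is a minimal sketch in Python that only builds the SQL text; it does not connect to Redshift. It assumes the files land under Hive-style prefixes such as dt=2024-01-02/hr=03/ and that the external table was created with partition columns dt and hr. The function name, the table names, the bucket path and the column list are all hypothetical placeholders, not something taken from the question.

```python
from datetime import datetime


def partition_statements(external_table, target_table, s3_prefix, select_list, when):
    """Build SQL that registers one hourly partition and loads only that partition.

    All names passed in are placeholders chosen by the caller (assumption:
    the external table is partitioned by string columns dt and hr).
    """
    dt = when.strftime("%Y-%m-%d")
    hr = when.strftime("%H")
    location = f"{s3_prefix.rstrip('/')}/dt={dt}/hr={hr}/"

    # Register the new folder as a partition of the external table.
    add_partition = (
        f"ALTER TABLE {external_table} "
        f"ADD IF NOT EXISTS PARTITION (dt='{dt}', hr='{hr}') "
        f"LOCATION '{location}';"
    )

    # Load only rows from that partition, so Spectrum scans just the new files.
    load = (
        f"INSERT INTO {target_table} "
        f"SELECT {select_list} FROM {external_table} "
        f"WHERE dt = '{dt}' AND hr = '{hr}';"
    )
    return add_partition, load


if __name__ == "__main__":
    for statement in partition_statements(
        "spectrum.messages",          # hypothetical external table
        "public.messages_flat",       # hypothetical Redshift target table
        "s3://my-bucket/messages/",   # hypothetical S3 prefix
        "key1",                       # hypothetical projection of extracted values
        datetime(2024, 1, 2, 3),
    ):
        print(statement)
```

The two statements would be run in order for each new hour (or whatever partition granularity fits the arrival rate). Because the WHERE clause filters on the partition columns, previously loaded folders are not read again.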

Answer

Have you configured an ETL job to merge data? https://github.com/sinemozturk/INCREMENTAL-DATA-LOADING-FROM-AWS-S3-BUCKET-TO-REDSHIFT-BY-USING-AWS-GLUE-ETL-JOB
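
For reference, a common way to merge newly loaded rows into an existing Redshift table is the staging-table pattern: delete the target rows that also appear in the staging table, then insert everything from staging. The sketch below only builds the SQL text in Python and is not taken from the linked repository; the function name, the table names and the key column are hypothetical.

```python
def merge_statements(target_table, staging_table, key_column):
    """Build the statements for a delete-then-insert merge from a staging table.

    Assumes both tables have the same column layout and that key_column
    uniquely identifies a row (both are assumptions for illustration).
    """
    return [
        "BEGIN;",
        # Remove target rows that are about to be replaced by newer copies.
        f"DELETE FROM {target_table} USING {staging_table} "
        f"WHERE {target_table}.{key_column} = {staging_table}.{key_column};",
        # Insert every staged row, both replacements and brand-new rows.
        f"INSERT INTO {target_table} SELECT * FROM {staging_table};",
        "COMMIT;",
    ]


if __name__ == "__main__":
    # Hypothetical names: public.messages_flat, public.messages_stage, message_id.
    for statement in merge_statements(
        "public.messages_flat", "public.messages_stage", "message_id"
    ):
        print(statement)
```

Running the delete and the insert inside one transaction keeps readers from seeing the table in a half-merged state.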
