Redshift auto-copy interval and streaming ingestion storage

Hi, I was testing two of Redshift's new features: auto-copy from S3 and streaming ingestion. I have a few questions about them.

  1. I understand that Redshift automatically decides how many files to load in a batch. How often does Redshift detect new files and try to load them? Does it decide when to load based on a specific file size or a specific time interval?
  2. When streaming data is transferred from Kinesis Data Streams to Redshift with streaming ingestion, where is the data stored? Is it kept in the Kinesis queue for 24 hours, or is it not stored anywhere?

Thanks.

  • Outside of the auto-copy feature (a copy job created with AUTO ON), Redshift does not detect and load new files on its own. Without it, you need a separate tool or process to load data into Redshift from external sources.

  • COPY command: You can use the COPY command to load data from Amazon S3, Amazon EMR, or other sources into Redshift. The COPY command is designed to load large amounts of data efficiently, in parallel, from multiple files. It works out which files to load in a batch and divides the workload among the nodes in your Redshift cluster.

  • Redshift Data Transfer Task: You can use the Redshift Data Transfer Task to schedule data loads into Redshift from a variety of sources, including Amazon S3, Amazon RDS, and Amazon DynamoDB. The Data Transfer Task can be configured to load data at specific intervals or on a one-time basis.

  • Third-party tools: Several third-party tools and services can also load data into Redshift, such as Talend, Fivetran, and Stitch. These tools typically offer a range of features and configuration options.

  • In general, Redshift does not have a specific file size or time interval that it uses to determine when to load data. Instead, the decision of how and when to load data into Redshift will depend on the specific requirements of your application and the tools or processes that you are using to load the data.

  • When you use the streaming ingestion feature of Amazon Redshift to load data from an Amazon Kinesis data stream, the data is not staged in Amazon S3 or in any other intermediate location. Instead, Redshift reads the data directly from the Kinesis data stream and stores it in a materialized view that you define in Redshift.

    New data is pulled from the stream each time the materialized view is refreshed, either manually or through auto refresh if you enable it, so ingestion is near real time rather than instantaneous.

  • It is also worth noting that Kinesis data streams retain data for a configurable period of time, known as the retention period. By default, the retention period is 24 hours, and it can be extended up to 365 days (8,760 hours). Data added to the stream stays there for the whole retention period, even if it has already been processed and transferred to Redshift. Once the retention period has expired, the data is deleted from the Kinesis data stream and is no longer available.
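To make the retention point concrete, here is a minimal Python sketch that checks whether a record would still be readable from the stream. It assumes the default 24-hour retention unless you pass another value; the function name `still_in_stream` and its parameters are hypothetical and not part of any AWS SDK.

```python
from datetime import datetime, timedelta, timezone

# Default Kinesis Data Streams retention, in hours.
DEFAULT_RETENTION_HOURS = 24


def still_in_stream(arrival, now, retention_hours=DEFAULT_RETENTION_HOURS):
    """Return True if a record that arrived at `arrival` is still inside
    the retention window at time `now` (both timezone-aware datetimes)."""
    return now - arrival < timedelta(hours=retention_hours)


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A record from 30 hours ago has aged out under the default retention,
    # regardless of whether Redshift already ingested it.
    print(still_in_stream(now - timedelta(hours=30), now))
    # With retention raised to 48 hours, the same record is still available.
    print(still_in_stream(now - timedelta(hours=30), now, retention_hours=48))
```

Whether Redshift has already read a record plays no part in the check: only the record's age relative to the retention period matters.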

Sojeong
asked a year ago, 328 views
1 Answer
  1. The auto-copy feature tries to balance incoming data velocity against ingestion performance. There is no set threshold of N files or N seconds that triggers auto-copy to process the files. If files arrive rapidly, Redshift might wait a while and process a bunch of them as a single batch. If files arrive slowly, it's possible that Redshift will trigger a load even for a single file.

  2. With streaming ingestion, the data is stored in the target Redshift materialized view (MV). New data from Kinesis is pulled into Redshift when this MV is refreshed.
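For reference, both features are set up with SQL statements. The Python sketch below only assembles those statements as strings and does not connect to a cluster. The table, bucket, role ARN, job, schema, stream, and view names are hypothetical placeholders. The statement shapes follow my reading of the Redshift documentation for copy jobs and streaming ingestion, so check them against the current docs before running anything.

```python
def auto_copy_job_sql(table, s3_prefix, iam_role, job_name):
    """Build a COPY ... JOB CREATE statement that turns on auto-copy.

    No file-count or time-interval option appears here: per the answer
    above, Redshift chooses the batching itself.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"JOB CREATE {job_name}\n"
        f"AUTO ON;"
    )


def streaming_ingestion_sql(schema, stream, view, iam_role):
    """Build the statements for Kinesis streaming ingestion.

    The stream is exposed through an external schema, and the data lands
    in a materialized view that is filled on each refresh.
    """
    return [
        f"CREATE EXTERNAL SCHEMA {schema}\nFROM KINESIS\nIAM_ROLE '{iam_role}';",
        (
            f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS\n"
            f"SELECT approximate_arrival_timestamp,\n"
            f"       JSON_PARSE(kinesis_data) AS payload\n"
            f'FROM {schema}."{stream}";'
        ),
        # A manual refresh pulls any new records from the stream.
        f"REFRESH MATERIALIZED VIEW {view};",
    ]


if __name__ == "__main__":
    role = "arn:aws:iam::123456789012:role/ExampleRole"  # placeholder ARN
    print(auto_copy_job_sql("public.target_table",
                            "s3://example-bucket/staging/", role, "example_job"))
    for stmt in streaming_ingestion_sql("kds", "example-stream",
                                        "example_view", role):
        print(stmt)
```

The `AUTO REFRESH YES` clause is optional. Without it, new records appear in the view only when you run `REFRESH MATERIALIZED VIEW` yourself.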

AWS
answered 10 months ago
