Why does Firehose create multiple small files in Amazon S3?


When I push data from Amazon Data Firehose to Amazon Simple Storage Service (Amazon S3), Firehose creates small files in my S3 bucket.

Short description

Firehose delivers smaller files than the BufferingHints API configuration specifies for the following reasons:

  • You turned on compression.
  • The Firehose delivery stream scaled.
  • You listed Amazon Kinesis Data Streams as the data source.

Resolution

You turned on compression

If you turn on compression for your Firehose delivery stream, then Firehose applies the SizeInMBs and IntervalInSeconds parameters of the BufferingHints API before compression.

Firehose applies these parameters to the uncompressed data as each batch of records buffers. Then, Firehose compresses the data records, so the files that it creates in Amazon S3 are smaller than the configured buffer size.
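Because the buffer fills with uncompressed data, the object written to S3 is roughly the buffer size divided by the compression ratio. The following sketch illustrates the arithmetic; the 5x GZIP ratio is an assumption for illustration only, because real ratios depend on your data.

```python
# Sketch: estimate the on-S3 object size when compression is turned on.
# Firehose fills the buffer to SizeInMBs *before* compressing, so the
# delivered object is roughly buffer size / compression ratio.

def estimated_object_size_mb(size_in_mbs: float, compression_ratio: float) -> float:
    """Approximate size of each compressed S3 object."""
    return size_in_mbs / compression_ratio

# A 128 MiB buffer with an assumed ~5x GZIP ratio yields ~25.6 MiB objects.
print(estimated_object_size_mb(128, 5.0))  # -> 25.6
```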

The Firehose delivery stream scaled

The Firehose delivery stream scales when you request a quota increase or when Firehose scales automatically. By default, Firehose automatically scales delivery streams up to a certain quota. This automatic scaling behavior reduces throttling without a quota increase.

When Firehose delivery streams scale, the buffer sizes that the BufferingHints API specifies might not be honored as configured.

Note: When you configure Firehose, you can set the buffer size.

Within the Firehose delivery stream, Firehose buffers data in parallel channels and delivers the data simultaneously. For example, Firehose buffers data and creates a single file based on the buffer size quota. If Firehose scales to double the current throughput quota, then two separate channels create files in the same time interval. If Firehose scales to four times the quota, then four channels create four files in Amazon S3 during the same time interval.

When delivery streams scale, Firehose creates smaller files if the scaling factor and incoming traffic volume don't match. For example, if Firehose scales to four times the original capacity and the incoming traffic also increases to four times the original volume, then the file sizes stay consistent. However, when Firehose scales four times but traffic stays the same, Firehose distributes the same volume of data across more channels, and each resulting file is smaller.

Note: The number of files might increase in both scenarios because of the multiple parallel buffering channels.
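The channel arithmetic described above can be sketched as follows. The even split across channels and the function name are illustrative assumptions based on the example in this section:

```python
# Sketch: how scaling affects file size per buffer interval.
# Assumes traffic is distributed evenly across parallel buffering
# channels, with one file created per channel per interval.

def per_channel_file_size_mb(traffic_mb_per_interval: float, channels: int) -> float:
    """Size of each file created in one buffer interval."""
    return traffic_mb_per_interval / channels

# Traffic stays at 128 MiB per interval but Firehose scales to 4 channels:
# four ~32 MiB files instead of one ~128 MiB file.
print(per_channel_file_size_mb(128, 1))  # -> 128.0
print(per_channel_file_size_mb(128, 4))  # -> 32.0

# Traffic also quadruples: file sizes stay consistent, but the file
# count still rises because there are more channels.
print(per_channel_file_size_mb(512, 4))  # -> 128.0
```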

Make sure that the Firehose delivery stream doesn't scale past the default quota. To view the current quota for your Firehose delivery stream, check the following Amazon CloudWatch metrics:

  • BytesPerSecondLimit
  • RecordsPerSecondLimit
  • PutRequestsPerSecondLimit

If the values of these metrics differ from the default quotas, then the Firehose delivery stream scaled.
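One way to apply this check is to compare the observed values of the three *Limit metrics against your stream's default quotas. In practice you would fetch the metric values from CloudWatch (for example, with the boto3 `get_metric_data` API); the quota values below are placeholders, because defaults vary by Region -- check the Firehose quotas documentation for your actual values.

```python
# Sketch: detect whether a Firehose delivery stream has scaled past its
# default quota. The observed_limits dict stands in for values fetched
# from the CloudWatch BytesPerSecondLimit, RecordsPerSecondLimit, and
# PutRequestsPerSecondLimit metrics.

DEFAULT_QUOTAS = {  # assumed example defaults, not authoritative
    "BytesPerSecondLimit": 5 * 1024 * 1024,
    "RecordsPerSecondLimit": 1000,
    "PutRequestsPerSecondLimit": 1000,
}

def stream_has_scaled(observed_limits: dict) -> bool:
    """True if any observed limit metric exceeds its default quota."""
    return any(
        observed_limits.get(name, 0) > default
        for name, default in DEFAULT_QUOTAS.items()
    )

# Observed limit matches the default: the stream has not scaled.
print(stream_has_scaled({"BytesPerSecondLimit": 5 * 1024 * 1024}))   # -> False
# Observed limit is double the default: the stream scaled.
print(stream_has_scaled({"BytesPerSecondLimit": 10 * 1024 * 1024}))  # -> True
```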

You listed Kinesis Data Streams as the data source

When you list a data stream as the data source for Firehose, Firehose scales internally. By default, Firehose scales to meet the volume capacity of the data stream. When Firehose scales, the buffer size changes and can lead to the delivery of smaller files.

Note: Firehose treats buffering hint options as hints. As a result, Firehose might choose to use different values to optimize buffering.

AWS OFFICIAL · Updated 17 days ago
1 Comment

I am experiencing this exact issue. We have a KDS writing files to S3 using Firehose with no compression enabled. Firehose's Buffer Size is configured to its max (128 MiB) and the Buffer Interval is maxed out (900 seconds), yet most output files are ~50 MB. I can't imagine a more basic use case for KDS and Firehose.

AWS, are you saying that there is no solution to increase the file size written? We plan to stream directly to Redshift one day. What guarantee do I have that it won't overwhelm Redshift if it's consistently writing small batches as it is into S3? Please provide guidance. Thank you!

replied 4 months ago