Why is Kinesis Data Firehose creating so many small files in S3?

I'm trying to push data from Amazon Kinesis Data Firehose to Amazon Simple Storage Service (Amazon S3). However, I noticed that Kinesis Data Firehose is creating many small files in my Amazon S3 bucket. Why is this happening?

Short description

Kinesis Data Firehose can deliver files that are smaller than the buffer size specified in the BufferingHints parameter for the following reasons:

  • Compression is enabled.
  • Kinesis Data Firehose delivery stream has scaled.
  • Amazon Kinesis Data Streams is listed as the data source.


Compression is enabled

If compression is enabled on your Kinesis Data Firehose delivery stream, both BufferingHints parameters (SizeInMBs and IntervalInSeconds) are applied before compression. Check the values of both parameters to confirm the buffer size and interval that you expect.

The buffering hints are applied to each batch of uncompressed records. The batch is then compressed, so the files written to Amazon S3 are smaller than the configured buffer size.
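
As a rough illustration of why compression shrinks the delivered files, the following sketch estimates the file size when the size hint (rather than the interval) triggers delivery. The function name and the compression ratio are hypothetical and chosen only for the example. Real ratios depend on the data and the compression format.

```python
def estimated_object_size_mb(size_in_mbs: float, compression_ratio: float) -> float:
    """Estimate the size of a delivered S3 object when the buffer fills up.

    size_in_mbs: the SizeInMBs buffering hint, which applies to uncompressed data.
    compression_ratio: uncompressed size divided by compressed size
        (a hypothetical, data-dependent value).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return size_in_mbs / compression_ratio


# A 5 MB buffer of data that compresses 10:1 lands in S3 as roughly 0.5 MB.
print(estimated_object_size_mb(5, 10))  # 0.5
```

If the interval elapses before the buffer fills, the uncompressed batch is already smaller than SizeInMBs, so the compressed file is smaller still.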

Kinesis Data Firehose delivery stream has scaled

A Kinesis Data Firehose delivery stream can scale either because you requested a limit increase or because Kinesis Data Firehose scaled it automatically. By default, Kinesis Data Firehose automatically scales delivery streams up to a certain limit. This automatic scaling reduces the likelihood of throttling without requiring a limit increase.

When a delivery stream scales, the scaling can affect how the buffering hints are applied.

Note: The buffer size is set when you configure your Kinesis Data Firehose delivery stream.

A scaled delivery stream also uses a proportional number of parallel buffers, and data is delivered simultaneously from all of them. For example, an unscaled delivery stream buffers the data and creates a single file based on the buffer size limit. If Kinesis Data Firehose scales to double the current throughput limit, then two separate channels create files within the same time interval. If it scales to four times the limit, then four channels create four files in Amazon S3 during the same time interval.

Note: Kinesis Data Firehose determines the number of channels internally. In the preceding example, four channels were created.
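
The effect of parallel channels on file count and file size can be sketched with simple arithmetic. This is a simplified model, not the service's actual algorithm: it assumes that incoming data is spread evenly across channels and that each channel delivers when either its size hint or the interval is reached. The function name and the input values are hypothetical.

```python
import math


def files_in_one_interval(ingest_mb_per_s: float, interval_s: int,
                          size_hint_mb: float, channels: int) -> tuple[int, float]:
    """Return (file_count, average_file_size_mb) for one buffering interval.

    Assumes data is split evenly across channels (a simplification).
    """
    total_mb = ingest_mb_per_s * interval_s
    per_channel_mb = total_mb / channels
    # Each channel delivers at least once per interval, and more often
    # if its share of the data exceeds the size hint.
    deliveries_per_channel = max(1, math.ceil(per_channel_mb / size_hint_mb))
    file_count = channels * deliveries_per_channel
    return file_count, total_mb / file_count


# Same ingest rate (1 MB/s), 60-second interval, 128 MB size hint.
print(files_in_one_interval(1.0, 60, 128, channels=1))  # (1, 60.0)
print(files_in_one_interval(1.0, 60, 128, channels=4))  # (4, 15.0)
```

Under these assumptions, the same volume of data produces four files of a quarter the size when four channels are active, even though the buffering hints have not changed.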

Check to make sure that the Kinesis Data Firehose delivery stream hasn't scaled beyond the default limit. To view the current limit of your Kinesis Data Firehose delivery stream, check the following Amazon CloudWatch metrics:

  • BytesPerSecondLimit
  • RecordsPerSecondLimit
  • PutRequestsPerSecondLimit

If the values of these metrics differ from the default quota limits, then the Kinesis Data Firehose delivery stream has scaled.
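
One possible way to script this check is sketched below. The helper names are hypothetical, and the block assumes that the limit metrics are published in the AWS/Firehose CloudWatch namespace with a DeliveryStreamName dimension. Default quotas vary by AWS Region, so the default values in the example are placeholders that you would replace with the quotas for your Region.

```python
from datetime import datetime, timedelta, timezone

LIMIT_METRICS = (
    "BytesPerSecondLimit",
    "RecordsPerSecondLimit",
    "PutRequestsPerSecondLimit",
)


def limit_metric_queries(stream_name, now=None, lookback_minutes=60):
    """Build one CloudWatch GetMetricStatistics request per limit metric."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    return [
        {
            "Namespace": "AWS/Firehose",  # assumed namespace
            "MetricName": name,
            "Dimensions": [{"Name": "DeliveryStreamName", "Value": stream_name}],
            "StartTime": start,
            "EndTime": end,
            "Period": 300,
            "Statistics": ["Maximum"],
        }
        for name in LIMIT_METRICS
    ]


def scaled_metrics(observed, defaults):
    """Return the limit metrics whose observed value differs from the default."""
    return [
        name
        for name in LIMIT_METRICS
        if name in observed and name in defaults and observed[name] != defaults[name]
    ]


# Placeholder numbers only; look up the real default quotas for your Region.
defaults = {"BytesPerSecondLimit": 1_000_000, "RecordsPerSecondLimit": 1_000}
observed = {"BytesPerSecondLimit": 2_000_000, "RecordsPerSecondLimit": 1_000}
print(scaled_metrics(observed, defaults))  # ['BytesPerSecondLimit']
```

Each dictionary returned by limit_metric_queries can be passed as keyword arguments to the get_metric_statistics method of a boto3 CloudWatch client. The observed values would then come from the Maximum field of the returned datapoints.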

Kinesis Data Streams is listed as the data source

When a Kinesis data stream is the data source for a Kinesis Data Firehose delivery stream, Kinesis Data Firehose scales internally. By default, Kinesis Data Firehose tries to match the volume capacity of the Kinesis data stream. This scaling changes the effective buffer size and can lead to the delivery of smaller files.

Note: Buffering hint options are treated as hints. As a result, Kinesis Data Firehose might choose to use different values to optimize the buffering.

AWS OFFICIAL Updated 2 years ago