How do I troubleshoot timeout errors when writing from Flink to Kinesis Data Streams?


When I write data from Flink to Amazon Kinesis Data Streams, I receive a timeout or exception error.

Short Description

Flink applications that use FlinkKinesisProducer can produce one of the following error messages:
"Caused by: Connect to [] failed: connect timed out"


"[AWS Log: ERROR](CurlHttpClient)Curl returned error code 28"

Network problems and a lack of system resources in the environment where the Flink application is running can cause these timeout errors.


Can't connect to Kinesis Data Streams service endpoint

When the Flink application can't connect to the Data Streams service endpoint, you receive an error similar to the following:

"Caused by: Connect to 443 [] failed: connect timed out"

If this error repeatedly occurs, then there might be a problem with your network configuration.

To resolve this issue, complete the following steps:

  1. Verify that the Flink application can connect to the internet.
  2. If your Flink application is running on AWS resources in a virtual private cloud (VPC), then verify that the following VPC features are correctly configured:
    Route table
    Security groups
    Network access control lists (network ACL)
  3. (Optional) Use a VPC endpoint for Kinesis Data Streams to communicate within your VPC.
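The connectivity check in step 1 can be sketched as a plain TCP probe that you run from the same host or subnet as the Flink application. This is a minimal sketch, not an official diagnostic tool: the class name EndpointProbe, the method canConnect, the example endpoint kinesis.us-east-1.amazonaws.com, and the 5-second timeout are assumptions chosen for illustration. Substitute the endpoint for your own AWS Region.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class EndpointProbe {

    // Returns true if a TCP connection to host:port opens within timeoutMillis.
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers connect timeouts, refused connections, and DNS failures.
            System.err.println("Connect to " + host + ":" + port + " failed: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        // Assumed endpoint for illustration; pass your Region's endpoint as the first argument.
        String host = args.length > 0 ? args[0] : "kinesis.us-east-1.amazonaws.com";
        boolean reachable = canConnect(host, 443, 5000);
        System.out.println(host + ":443 reachable = " + reachable);
    }
}
```

If the probe fails from inside the VPC but succeeds from elsewhere, the route table, security groups, or network ACLs listed in step 2 are the likely places to look.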

Response for the submitted request wasn't returned within the configured timeout period

A Curl 28 error (operation timed out) indicates that the response to the submitted request wasn't returned within the configured timeout period. The error looks similar to the following:

"[AWS Log: ERROR](CurlHttpClient)Curl returned error code 28"

The timeout can occur because of a temporary network issue. It can also occur when too many requests to Data Streams are pending in the Kinesis Producer Library (KPL) daemon. FlinkKinesisProducer uses the KPL to send data from a Flink stream into an Amazon Kinesis stream, so every record passes through the KPL daemon before it reaches Data Streams.

To resolve this issue, increase the Request timeout period of the FlinkKinesisProducer object:

Request timeout period: producerConfig.put("RequestTimeout", "****");

It's also a best practice to set the queue limit (setQueueLimit) and the RecordTtl parameter to avoid data loss:

Internal queue size: FlinkKinesisProducer#setQueueLimit(queueLimit)
Time-to-live on records: producerConfig.put("RecordTtl", "*****");
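The settings above can be collected into the Properties object that is passed to FlinkKinesisProducer. The following is a minimal sketch that uses only the Java standard library so that it runs without the Flink connector on the classpath. The class name ProducerConfigSketch, the helper buildProducerConfig, the Region, the stream name, and the timeout values are illustrative assumptions, not recommended settings. Both RequestTimeout and RecordTtl are KPL settings expressed in milliseconds.

```java
import java.util.Properties;

class ProducerConfigSketch {

    // Builds the Properties object that is handed to the FlinkKinesisProducer constructor.
    static Properties buildProducerConfig(String region, long requestTimeoutMillis, long recordTtlMillis) {
        Properties producerConfig = new Properties();
        // "aws.region" is the key behind AWSConfigConstants.AWS_REGION in the Flink connector.
        producerConfig.put("aws.region", region);
        // KPL settings are passed as strings; both values are in milliseconds.
        producerConfig.put("RequestTimeout", Long.toString(requestTimeoutMillis));
        producerConfig.put("RecordTtl", Long.toString(recordTtlMillis));
        return producerConfig;
    }

    public static void main(String[] args) {
        // Illustrative values only; tune them for your workload.
        Properties producerConfig = buildProducerConfig("us-east-1", 10_000L, 60_000L);
        producerConfig.forEach((key, value) -> System.out.println(key + " = " + value));

        // With the Flink Kinesis connector available, the object would be used like this:
        //   FlinkKinesisProducer<String> producer =
        //       new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
        //   producer.setDefaultStream("your-stream-name"); // hypothetical stream name
        //   producer.setQueueLimit(queueLimit);
    }
}
```

A longer RequestTimeout gives slow responses more time before the KPL reports a Curl 28 error, and a longer RecordTtl keeps buffered records eligible for retry instead of expiring them.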

For more information about calculating the value of setQueueLimit, see Monitoring back pressure on the Flink website.
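As a rough illustration of that calculation, one common heuristic divides the total buffer that you allow (number of shards multiplied by a per-shard buffer budget) by the expected record size. The sketch below assumes that heuristic. The class name QueueLimitSketch, the method estimateQueueLimit, and the sample inputs (8 shards, 100 kB of buffer per shard, 200-byte records) are assumptions for illustration, so confirm the current guidance in the Flink documentation before you rely on the result.

```java
class QueueLimitSketch {

    // Estimates a queue limit as (shards * buffer bytes per shard) / average record bytes.
    static int estimateQueueLimit(int shardCount, long bufferBytesPerShard, long averageRecordBytes) {
        if (shardCount <= 0 || bufferBytesPerShard <= 0 || averageRecordBytes <= 0) {
            throw new IllegalArgumentException("All inputs must be positive");
        }
        long limit = (shardCount * bufferBytesPerShard) / averageRecordBytes;
        // Keep the result at 1 or more and within the int range expected by setQueueLimit.
        return (int) Math.max(1L, Math.min(limit, Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // Assumed sample inputs: 8 shards, 100 kB buffered per shard, 200-byte records.
        int queueLimit = estimateQueueLimit(8, 100_000L, 200L);
        System.out.println("Estimated queue limit: " + queueLimit);
    }
}
```

The result is a starting point to pass to setQueueLimit. A lower limit applies back pressure sooner, and a higher limit buffers more records in the KPL and increases the chance of request timeouts.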

AWS OFFICIAL Updated 5 months ago