I'm trying to write data from Flink to Amazon Kinesis Data Streams, but I receive a timeout or exception error. Why is this happening and how do I troubleshoot these errors?
Short Description
Flink applications that use FlinkKinesisProducer can produce one of the following error messages:
Caused by: org.apache.flink.kinesis.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to kinesis.us-east-1.amazonaws.com:443 [kinesis.us-east-1.amazonaws.com/xxx.xxx.xxxx.xxx] failed: connect timed out
[AWS Log: ERROR](CurlHttpClient)Curl returned error code 28
These two timeout errors are typically caused by network problems or by a lack of system resources in the environment where the Flink application is running.
Resolution
Unable to connect to Kinesis Data Streams service endpoint
The following error occurs when the Flink application is unable to connect to the Data Streams service endpoint:
Caused by: org.apache.flink.kinesis.shaded.org.apache.http.conn.ConnectTimeoutException: Connect to kinesis.us-east-1.amazonaws.com:443 [kinesis.us-east-1.amazonaws.com/xxx.xxxx.xxx] failed: connect timed out
If this error repeatedly occurs, then there could be a problem with your network configuration.
To resolve this issue, perform the following steps:
1. Verify that the Flink application can connect to the internet.
2. If your Flink application is running on AWS resources in a virtual private cloud (VPC), verify that the following VPC features are configured correctly:
Route Table
Security Groups
Network Access Control Lists (ACLs)
3. (Optional) Use a Kinesis Data Streams VPC endpoint so that the application can communicate with Data Streams from within your VPC.
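The connectivity check in step 1 can be sketched as a small Java program that you run from the same host or subnet as the Flink application. This is a minimal sketch, not an official diagnostic tool: the class name EndpointCheck, the method canConnect, and the 5-second timeout are hypothetical choices made for illustration. The endpoint is the one shown in the error message above. The program tests only whether a TCP connection to port 443 can be opened, so a successful result doesn't rule out other problems such as missing permissions.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

class EndpointCheck {
    /** Returns true if a TCP connection to host:port opens within timeoutMs. */
    static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers connect timeouts, refused connections, and DNS failures.
            System.err.println("Connect to " + host + ":" + port + " failed: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // Endpoint taken from the error message; pass another host as the first argument.
        String host = args.length > 0 ? args[0] : "kinesis.us-east-1.amazonaws.com";
        // The 5000 ms timeout is an illustrative value, not a recommendation.
        boolean reachable = canConnect(host, 443, 5000);
        System.out.println(host + ":443 reachable = " + reachable);
    }
}
```

If the check fails from inside the VPC but succeeds from a host with internet access, review the route table, security groups, and network ACLs listed in step 2.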
Response for the submitted request wasn't returned within the configured timeout period
The following Curl 28 error indicates that the response for the submitted request was not returned within the configured timeout period. Therefore, a timeout occurred:
[AWS Log: ERROR](CurlHttpClient)Curl returned error code 28
The timeout might be caused by a temporary network issue. It might also be caused by too many pending requests to Data Streams that are waiting in the Kinesis Producer Library (KPL) daemon. Records pass through the KPL because FlinkKinesisProducer uses the KPL to send data from a Flink stream into an Amazon Kinesis stream.
To resolve this issue, change the following configuration parameters of the FlinkKinesisProducer object:
Request timeout period: producerConfig.put("RequestTimeout", "****");
Internal queue size: FlinkKinesisProducer#setQueueLimit(queueLimit)
It's also a best practice to update the following parameters to avoid data loss:
Internal queue size: FlinkKinesisProducer#setQueueLimit(queueLimit)
Time-to-live on records: producerConfig.put("RecordTtl", "*****");
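The parameters above can be set together as shown in the following sketch. It builds only the java.util.Properties object so that it runs without the Flink connector on the classpath. The class name ProducerSettings, the method buildProducerConfig, and the values 10000 and 60000 are hypothetical and must be tuned for your workload. Both values are in milliseconds. As far as the KPL default configuration is concerned, the defaults are 6000 for RequestTimeout and 30000 for RecordTtl, so verify them against the KPL version that you use. AWS Region and credential settings are omitted.

```java
import java.util.Properties;

class ProducerSettings {
    /** Builds the KPL settings discussed above; both arguments are in milliseconds. */
    static Properties buildProducerConfig(long requestTimeoutMs, long recordTtlMs) {
        Properties producerConfig = new Properties();
        // Maximum time to wait for a response after a request is sent.
        producerConfig.put("RequestTimeout", String.valueOf(requestTimeoutMs));
        // How long a buffered record may wait before it is considered failed.
        producerConfig.put("RecordTtl", String.valueOf(recordTtlMs));
        return producerConfig;
    }

    public static void main(String[] args) {
        // Illustrative values only (assumptions, not recommendations).
        Properties producerConfig = buildProducerConfig(10_000L, 60_000L);
        producerConfig.forEach((key, value) -> System.out.println(key + " = " + value));

        // With the Flink Kinesis connector available, the object would be used roughly like this:
        //   FlinkKinesisProducer<String> producer =
        //       new FlinkKinesisProducer<>(new SimpleStringSchema(), producerConfig);
        //   producer.setQueueLimit(queueLimit);
    }
}
```

A longer RequestTimeout gives slow responses more time to arrive, while setQueueLimit bounds the number of buffered records so that the producer applies backpressure instead of letting records expire.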
For more information about calculating the value of setQueueLimit, see Backpressure on the Apache Flink website.
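As a rough starting point, the queue limit can be estimated with the rule of thumb from the Flink Kinesis connector documentation: queue limit = (number of shards * queue size per shard) / record size, with about 100 kB of queue per shard. The following sketch applies that formula. The class name QueueLimitEstimate and the method estimateQueueLimit are hypothetical, and the 100 kB figure is taken to be 100,000 bytes, which is an assumption. Treat the result as an initial value to adjust under load, not as a definitive setting.

```java
class QueueLimitEstimate {
    // Assumed queue size per shard: 100 kB, interpreted here as 100,000 bytes.
    static final long QUEUE_BYTES_PER_SHARD = 100_000L;

    /** Estimates queueLimit = (shards * queue bytes per shard) / record size in bytes. */
    static int estimateQueueLimit(int shardCount, int recordSizeBytes) {
        if (shardCount <= 0 || recordSizeBytes <= 0) {
            throw new IllegalArgumentException("shardCount and recordSizeBytes must be positive");
        }
        long limit = (shardCount * QUEUE_BYTES_PER_SHARD) / recordSizeBytes;
        // Keep at least one slot and stay within the int range.
        return (int) Math.max(1L, Math.min(limit, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        // 8 shards with 200-byte records: (8 * 100000) / 200 = 4000
        System.out.println(estimateQueueLimit(8, 200));
    }
}
```

If throughput stays below what the shards can accept, try increasing the limit slightly and measure again.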