My Amazon Kinesis Data Analytics for Apache Flink application is restarting.
Resolution
When a task fails, the Apache Flink application restarts the failed task and other affected tasks to bring the job to a normal state.
The following are some of the causes and respective troubleshooting steps for this condition:
- Code errors, such as a NullPointerException or a ClassCastException, are generated at the task manager and bubble up to the job manager. The application then restarts from the latest checkpoint. To detect application restarts due to unhandled exceptions in the application, check Amazon CloudWatch metrics such as downtime. This metric shows a non-zero value during restart periods. To identify the cause of this condition, query your application logs for changes in your application's state from RUNNING to FAILED. For more information, see Analyze errors: Application task-related failures.
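For example, you can run that log query with CloudWatch Logs Insights. The following is a minimal sketch using the AWS SDK for Java v2; the log group name is a hypothetical placeholder, and the query matches the "switched from RUNNING to FAILED" messages that Flink writes on task state transitions:

```java
import software.amazon.awssdk.services.cloudwatchlogs.CloudWatchLogsClient;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsRequest;
import software.amazon.awssdk.services.cloudwatchlogs.model.GetQueryResultsResponse;
import software.amazon.awssdk.services.cloudwatchlogs.model.QueryStatus;
import software.amazon.awssdk.services.cloudwatchlogs.model.StartQueryRequest;

import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class FindFailedTransitions {
    public static void main(String[] args) throws InterruptedException {
        CloudWatchLogsClient logs = CloudWatchLogsClient.create();

        // Search the application's log group for task state transitions
        // from RUNNING to FAILED over the last 24 hours.
        String queryId = logs.startQuery(StartQueryRequest.builder()
                .logGroupName("/aws/kinesis-analytics/my-flink-app") // hypothetical log group
                .startTime(Instant.now().minus(24, ChronoUnit.HOURS).getEpochSecond())
                .endTime(Instant.now().getEpochSecond())
                .queryString("fields @timestamp, @message "
                        + "| filter @message like /switched from RUNNING to FAILED/ "
                        + "| sort @timestamp desc | limit 20")
                .build()).queryId();

        // Poll until the query finishes, then print the matching log lines.
        GetQueryResultsResponse results;
        do {
            Thread.sleep(1000);
            results = logs.getQueryResults(
                    GetQueryResultsRequest.builder().queryId(queryId).build());
        } while (results.status() == QueryStatus.RUNNING
                || results.status() == QueryStatus.SCHEDULED);

        results.results().forEach(row ->
                row.forEach(field ->
                        System.out.println(field.field() + ": " + field.value())));
    }
}
```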
- When you get out-of-memory exceptions, the task manager can't send healthy heartbeat signals to the job manager, which leads to a restart of the application. In this case, you might see errors such as TimeoutException, FlinkException, or RemoteTransportException in the application logs. Check whether the application is overloaded due to CPU or memory resource pressure:
- Check whether the CloudWatch metrics fullRestarts and downtime have non-zero values.
- Check the metrics cpuUtilization and heapMemoryUtilization for unusual spikes.
- Check for unhandled exceptions in your application code.
- Check for checkpoint and savepoint failures. Monitor the CloudWatch metrics numberOfFailedCheckpoints, lastCheckpointSize, and lastCheckpointDuration for spikes and steady increases, as in the sketch after this list.
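The following minimal sketch pulls several of these metrics with the AWS SDK for Java v2; the application name is a hypothetical placeholder, and the metrics are read from the AWS/KinesisAnalytics namespace with the Application dimension:

```java
import software.amazon.awssdk.services.cloudwatch.CloudWatchClient;
import software.amazon.awssdk.services.cloudwatch.model.Dimension;
import software.amazon.awssdk.services.cloudwatch.model.GetMetricStatisticsRequest;
import software.amazon.awssdk.services.cloudwatch.model.Statistic;

import java.time.Instant;
import java.time.temporal.ChronoUnit;

public class CheckRestartMetrics {
    public static void main(String[] args) {
        CloudWatchClient cw = CloudWatchClient.create();
        Instant end = Instant.now();
        Instant start = end.minus(6, ChronoUnit.HOURS);

        // Metrics that typically indicate restarts or resource pressure.
        for (String metric : new String[]{"fullRestarts", "downtime",
                "cpuUtilization", "heapMemoryUtilization",
                "numberOfFailedCheckpoints", "lastCheckpointDuration"}) {
            cw.getMetricStatistics(GetMetricStatisticsRequest.builder()
                            .namespace("AWS/KinesisAnalytics")
                            .metricName(metric)
                            .dimensions(Dimension.builder()
                                    .name("Application")
                                    .value("my-flink-app") // hypothetical application name
                                    .build())
                            .startTime(start)
                            .endTime(end)
                            .period(300) // 5-minute buckets
                            .statistics(Statistic.MAXIMUM)
                            .build())
                    .datapoints()
                    .forEach(dp -> System.out.printf("%s %s max=%f%n",
                            metric, dp.timestamp(), dp.maximum()));
        }
    }
}
```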
To resolve this issue, try the following:
- If you've enabled debug logs for the application, the logging itself can consume significant application resources. Reduce the amount of logging by enabling debug logs only temporarily, while you investigate an issue, and turn them off afterward (see the sketch after this list).
- Analyze the TaskManager thread dump in the Apache Flink dashboard. For example, you can identify the CPU-intensive processes from the thread dump.
- Review the flame graphs, which are constructed by repeatedly sampling stack traces. You can use flame graphs to do the following: visualize the overall application health, identify the methods that consume the most CPU resources, and identify the series of calls on the stack that led to the execution of a particular method. Check for blocked calls by using the off-CPU flame graphs.
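For the first step, you can revert a temporarily raised log level through the application's monitoring configuration. The following is a minimal sketch using the AWS SDK for Java v2; the application name is a hypothetical placeholder:

```java
import software.amazon.awssdk.services.kinesisanalyticsv2.KinesisAnalyticsV2Client;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.ApplicationConfigurationUpdate;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.ConfigurationType;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.DescribeApplicationRequest;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.FlinkApplicationConfigurationUpdate;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.LogLevel;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.MonitoringConfigurationUpdate;
import software.amazon.awssdk.services.kinesisanalyticsv2.model.UpdateApplicationRequest;

public class RevertDebugLogging {
    public static void main(String[] args) {
        KinesisAnalyticsV2Client kda = KinesisAnalyticsV2Client.create();
        String appName = "my-flink-app"; // hypothetical application name

        // UpdateApplication requires the current application version ID.
        Long version = kda.describeApplication(DescribeApplicationRequest.builder()
                        .applicationName(appName).build())
                .applicationDetail().applicationVersionId();

        // Turn the log level back down to INFO after the investigation is done.
        kda.updateApplication(UpdateApplicationRequest.builder()
                .applicationName(appName)
                .currentApplicationVersionId(version)
                .applicationConfigurationUpdate(ApplicationConfigurationUpdate.builder()
                        .flinkApplicationConfigurationUpdate(
                                FlinkApplicationConfigurationUpdate.builder()
                                        .monitoringConfigurationUpdate(
                                                MonitoringConfigurationUpdate.builder()
                                                        .configurationTypeUpdate(ConfigurationType.CUSTOM)
                                                        .logLevelUpdate(LogLevel.INFO)
                                                        .build())
                                        .build())
                        .build())
                .build());
    }
}
```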
- If the source or sink of your application is under-provisioned, your application might experience throttling errors when it reads from or writes to streaming services such as Kinesis Data Streams. This condition might eventually cause the application to crash. Check the throughput of the source and sink using CloudWatch metrics such as WriteProvisionedThroughputExceeded and ReadProvisionedThroughputExceeded. To accommodate the data volume, consider scaling up your data streams by increasing the number of shards.
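For a provisioned-mode stream, one way to scale up is the UpdateShardCount API. The following is a minimal sketch using the AWS SDK for Java v2; the stream name and target shard count are hypothetical placeholders:

```java
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.ScalingType;
import software.amazon.awssdk.services.kinesis.model.UpdateShardCountRequest;

public class ScaleUpStream {
    public static void main(String[] args) {
        KinesisClient kinesis = KinesisClient.create();

        // Increase the shard count of a provisioned stream to relieve
        // Read/WriteProvisionedThroughputExceeded throttling.
        kinesis.updateShardCount(UpdateShardCountRequest.builder()
                .streamName("my-output-stream") // hypothetical stream name
                .targetShardCount(8)            // for example, doubling from 4 shards
                .scalingType(ScalingType.UNIFORM_SCALING)
                .build());
    }
}
```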
- The FlinkKinesisProducer uses the Kinesis Producer Library (KPL) to put data from a Flink stream into a Kinesis data stream. Errors such as timeouts cause failures in the KPL that might eventually lead to a restart of the Flink application. In such cases, you might see an increase in buffering time and in the number of retries. You can tune the following KPL configurations so that records don't expire: RecordMaxBufferedTime, RecordTtl, and RequestTimeout. Also, monitor important KPL metrics such as ErrorsByCode, RetriesPerRecord, and UserRecordsPending. When these metrics indicate that the application is restarting, use filters in CloudWatch Logs Insights to find the exact errors that led to the restart.
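The following minimal sketch shows these KPL properties passed to a FlinkKinesisProducer; the Region, stream name, and timeout values are hypothetical placeholders to tune for your workload:

```java
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kinesis.FlinkKinesisProducer;
import org.apache.flink.streaming.connectors.kinesis.config.AWSConfigConstants;

public class TunedProducer {
    public static FlinkKinesisProducer<String> build() {
        Properties config = new Properties();
        config.setProperty(AWSConfigConstants.AWS_REGION, "us-east-1"); // hypothetical Region

        // KPL settings passed through to the producer. Larger values trade
        // latency for fewer record expirations and request timeouts.
        config.setProperty("RecordMaxBufferedTime", "100"); // ms to buffer before sending
        config.setProperty("RecordTtl", "60000");           // ms before a record expires
        config.setProperty("RequestTimeout", "10000");      // ms before a request times out

        FlinkKinesisProducer<String> producer =
                new FlinkKinesisProducer<>(new SimpleStringSchema(), config);
        producer.setDefaultStream("my-output-stream"); // hypothetical stream name
        return producer;
    }
}
```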
- Note that not all errors lead to an immediate restart of the application. For example, errors in your application code might prevent the directed acyclic graph (DAG) for your application from being created. In this case, the application shuts down and doesn't restart immediately. The application also doesn't restart immediately when you get an "access denied" error.
If the issue persists, contact AWS Support and provide the following information:
- Application ARN
- Information about the source and sink of your application
- CloudWatch logs for your application
- Time of issue in UTC
- Relevant thread dumps from the Flink dashboard
Related information
Application is restarting