Why does my Amazon Managed Service for Apache Flink application restart?
My Amazon Managed Service for Apache Flink application continues to restart.
When a task fails, the Apache Flink application restarts the failed task and other affected tasks to bring the job to a normal state.
The following are some of the causes and troubleshooting steps for this issue.
Code errors, such as NullPointerException and ClassCastException, occur at the task manager and propagate to the job manager. The application then restarts from the latest checkpoint. To detect application restarts caused by unhandled exceptions in the application, check Amazon CloudWatch metrics such as downtime. This metric displays a non-zero value during restart periods. To identify the cause, query your application logs for changes of your application's status from RUNNING to FAILED. For more information, see Analyze errors: Application task-related failures.
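The log query above can be sketched with a CloudWatch Logs Insights query. This is a minimal sketch: the log group name is a placeholder, and the exact "switched from RUNNING to FAILED" message text is an assumption about the Flink state-change log format, so adjust both for your application.

```python
"""Sketch: find RUNNING -> FAILED status transitions in an application's
CloudWatch log group with a Logs Insights query."""
from datetime import datetime, timedelta, timezone

# Assumed log message format for a Flink state change; adjust the phrase
# to match what your application actually logs.
INSIGHTS_QUERY = (
    "fields @timestamp, @message"
    " | filter @message like /switched from RUNNING to FAILED/"
    " | sort @timestamp desc"
    " | limit 20"
)

def build_query_params(log_group: str, hours_back: int = 24) -> dict:
    """Build the parameter dict for CloudWatchLogs.Client.start_query."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours_back)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),
        "endTime": int(end.timestamp()),
        "queryString": INSIGHTS_QUERY,
    }

# Usage (requires AWS credentials; log group name is a placeholder):
#   import boto3
#   logs = boto3.client("logs")
#   query_id = logs.start_query(
#       **build_query_params("/aws/kinesis-analytics/my-app"))["queryId"]
```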
When you get out-of-memory exceptions, the task manager can't send healthy heartbeat signals to the job manager, and the application restarts. In this case, you might see errors in the application logs, such as TimeoutException, FlinkException, or RemoteTransportException.
Check if the application is overloaded because of CPU or memory resource pressure:
Check whether the fullRestarts and downtime CloudWatch metrics have non-zero values.
Check the cpuUtilization and heapMemoryUtilization metrics for unusual spikes.
Check for unhandled exceptions in your application code.
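The metric checks above can be sketched with a single GetMetricData call. The AWS/KinesisAnalytics namespace is what the service publishes metrics under, but the Application dimension name and the application name here are assumptions to verify against your own metrics.

```python
"""Sketch: pull the restart- and resource-related CloudWatch metrics
listed above for one application."""

METRICS = ["fullRestarts", "downtime", "cpuUtilization", "heapMemoryUtilization"]

def build_metric_queries(app_name: str, period_s: int = 300) -> list:
    """Build the MetricDataQueries list for CloudWatch.Client.get_metric_data."""
    return [
        {
            "Id": name.lower(),          # query ids must be lowercase
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/KinesisAnalytics",
                    "MetricName": name,
                    # Assumed dimension name; confirm it in the CloudWatch console.
                    "Dimensions": [{"Name": "Application", "Value": app_name}],
                },
                "Period": period_s,
                "Stat": "Maximum",       # spikes matter more than averages here
            },
        }
        for name in METRICS
    ]

# Usage (requires AWS credentials; application name is a placeholder):
#   import boto3
#   from datetime import datetime, timedelta, timezone
#   cw = boto3.client("cloudwatch")
#   end = datetime.now(timezone.utc)
#   result = cw.get_metric_data(
#       MetricDataQueries=build_metric_queries("my-flink-app"),
#       StartTime=end - timedelta(hours=3), EndTime=end)
```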
To resolve spikes and steady increases, complete the following tasks:
If you turned on debug logs for the application, then the application resource utilization might be high. To reduce the amount of logging, temporarily turn on the debug logs only when you investigate issues.
Analyze the TaskManager thread dump in the Apache Flink dashboard. For example, you can identify the CPU-intensive processes from the thread dump.
Flame graphs are constructed by sampling the stack traces several times. To check for blocked calls, use the off-CPU flame graphs. For information about flame graphs, see Flame graphs on the Apache Flink website.
If your application's source or sink is under-provisioned, your application might experience throttling errors when it reads from and writes to streaming services, such as Kinesis Data Streams. This issue might result in an application crash. To check the throughput of the source and sink, use CloudWatch metrics such as WriteProvisionedThroughputExceeded and ReadProvisionedThroughputExceeded. To accommodate the data volume, increase the number of shards to scale up your data streams.
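The shard increase above maps to the Kinesis UpdateShardCount API. In this sketch, the stream name is a placeholder and doubling the shard count is an illustrative choice, not a recommendation; size the target count from your actual throughput.

```python
"""Sketch: scale up a throttled Kinesis data stream with UpdateShardCount."""

def build_scaling_request(stream_name: str, current_shards: int) -> dict:
    """Build the parameter dict for Kinesis.Client.update_shard_count."""
    return {
        "StreamName": stream_name,
        "TargetShardCount": current_shards * 2,  # assumption: double capacity
        "ScalingType": "UNIFORM_SCALING",        # the only supported type
    }

# Usage (requires AWS credentials; stream name is a placeholder):
#   import boto3
#   kinesis = boto3.client("kinesis")
#   summary = kinesis.describe_stream_summary(StreamName="my-stream")
#   current = summary["StreamDescriptionSummary"]["OpenShardCount"]
#   kinesis.update_shard_count(**build_scaling_request("my-stream", current))
```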
The FlinkKinesisProducer uses the Kinesis Producer Library (KPL) to put data from a Flink stream into a Kinesis data stream. A timeout error can cause failures in the KPL that might cause the Flink application to restart. In this case, you might see an increase in the buffering time and number of retries. You can modify the RecordMaxBufferedTime, RecordTtl, and RequestTimeout configurations for the KPL so that the record doesn't expire. For more information, see default_config.properties on the GitHub website. Also, monitor important KPL metrics, such as ErrorsByCode, RetriesPerRecord, and UserRecordsPending. When these metrics show that the application restarted, use the filters in CloudWatch Logs Insights to understand the failures that caused the application to restart.
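The three KPL configurations above are plain key/value producer properties. This sketch expresses them as a dict of the string pairs you would pass to the connector's producer configuration (in Java, via `producerConfig.setProperty(...)`); the values are illustrative starting points, not recommendations.

```python
"""Sketch: KPL producer properties that control buffering, expiry,
and request timeouts for records sent to Kinesis."""

def kpl_timeout_properties() -> dict:
    """Return illustrative values for the three KPL timeout settings."""
    return {
        # Max time (ms) a record may wait in the KPL buffer before sending.
        "RecordMaxBufferedTime": "100",
        # Time-to-live (ms): raise this so buffered records don't expire
        # during temporary backpressure or throttling.
        "RecordTtl": "60000",
        # Per-request timeout (ms) for calls to Kinesis.
        "RequestTimeout": "10000",
    }
```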
Note that not all errors cause the application to restart immediately. For example, errors in the application code might prevent the directed acyclic graph (DAG) for your application from being created. In this case, the application shuts down and doesn't immediately restart. The application also doesn't immediately restart when you get an Access denied error.
If the issue persists, contact AWS Support and provide the following information:
Information about the source and sink of your application
CloudWatch logs for your application
Time of issue in UTC
Relevant thread dumps from the Apache Flink dashboard