Why does the checkpoint in my Amazon Managed Service for Apache Flink application fail?
The checkpoint or savepoint in my Amazon Managed Service for Apache Flink application keeps failing.
Checkpointing is the method that's used to implement fault tolerance in Amazon Managed Service for Apache Flink. If you don't optimize or properly provision your application, then you might experience checkpoint failures.
Some of the major causes for checkpoint failures are the following:
When you use RocksDB as the state backend, Apache Flink reads state files from local storage and writes them to remote persistent storage, which is Amazon Simple Storage Service (Amazon S3). The performance of the local disk and the upload rate might affect checkpointing and result in checkpoint failures.
Savepoint and checkpoint states are stored in a service-owned Amazon S3 bucket that AWS fully manages. These states are accessed whenever an application fails over. Transient server errors or latency in the S3 bucket might lead to checkpoint failures.
A process function that you created that communicates with an external resource, such as Amazon DynamoDB, during checkpointing might cause checkpoint failures.
State serialization issues, such as a serializer mismatch with the incoming data, can cause checkpoint failures.
The number of Kinesis Processing Units (KPUs) that are provisioned for the application might not be sufficient. To find the allocated KPUs, use the following calculation:
Allocated KPUs for the application = Parallelism / ParallelismPerKPU
Larger application state sizes might increase checkpoint latency because the task manager takes more time to save the checkpoint. This can cause an out-of-memory exception.
A skewed state distribution can result in one task manager handling more data than the other task managers. Even if sufficient KPUs (resources) are provisioned, an overloaded task manager can cause an out-of-memory exception.
High cardinality indicates that there's a large number of unique keys in the incoming data. If the job uses the keyBy operator to partition incoming data and the partition key has high cardinality, then slow checkpointing might occur. Slow checkpointing might eventually result in checkpoint failures.
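As a quick arithmetic check of the KPU formula above, here is a minimal sketch with hypothetical values of Parallelism = 16 and ParallelismPerKPU = 4 (these numbers are illustrative, not from the article):

```java
public class KpuCalculation {
    public static void main(String[] args) {
        // Hypothetical application settings
        int parallelism = 16;        // Parallelism
        int parallelismPerKPU = 4;   // ParallelismPerKPU

        // Allocated KPUs for the application = Parallelism / ParallelismPerKPU
        int allocatedKPUs = parallelism / parallelismPerKPU;

        System.out.println("Allocated KPUs: " + allocatedKPUs); // prints "Allocated KPUs: 4"
    }
}
```

If the allocated KPUs can't keep up with the application state and throughput, checkpoints can slow down or fail.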
To troubleshoot your failing Amazon Managed Service for Apache Flink application, take the following actions:
Use the lastCheckpointDuration and lastCheckpointSize Amazon CloudWatch metrics to monitor your checkpoint size and duration. The size of your application state might increase rapidly and cause an increase in the checkpoint size and duration. For more information, see Application metrics.
Increase the parallelism of the operator that processes more data. You can use the setParallelism() method to define the parallelism for an individual operator, data source, or data sink. For more information, see Parallel execution on the Flink website. Note: When you increase the operator parallelism, the total checkpoint time might be affected. You can also use unaligned checkpoints and optimize accordingly. For more information, see Checkpointing under backpressure on the Flink website.
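The setParallelism() approach can be sketched as follows in a Flink Java job. This is a minimal, illustrative example; the source elements, operator names, and parallelism values are assumptions, not taken from the article:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class OperatorParallelismExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Default parallelism for all operators in the job
        env.setParallelism(4);

        DataStream<String> events = env.fromElements("a", "b", "c"); // hypothetical source

        events
            .map(String::toUpperCase)
            .setParallelism(8)   // higher parallelism for the heavier operator
            .print()
            .setParallelism(1);  // single sink task

        env.execute("parallelism-example");
    }
}
```

Setting a higher parallelism only on the operator that processes more data avoids over-provisioning the rest of the pipeline.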
Tune the Parallelism and ParallelismPerKPU values for optimum KPU utilization. Make sure that automatic scaling is turned on for your Amazon Managed Service for Apache Flink application. The value of the maxParallelism parameter allows you to scale the number of KPUs. For more information, see Application Scaling in Amazon Managed Service for Apache Flink.
Define a time to live (TTL) on the state to make sure that the state is periodically cleaned up. For more information, see Class StateTtlConfig.Builder on the Flink website.
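A state TTL can be configured with StateTtlConfig.Builder as sketched below. The one-hour TTL, state name, and cleanup setting are illustrative assumptions; choose values that match your application's retention needs:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

public class StateTtlExample {
    public static void main(String[] args) {
        // Expire state entries one hour after they were last written (hypothetical value)
        StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.hours(1))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            .build();

        // Attach the TTL configuration to a state descriptor
        ValueStateDescriptor<Long> descriptor =
            new ValueStateDescriptor<>("lastSeen", Long.class);
        descriptor.enableTimeToLive(ttlConfig);
    }
}
```

With a TTL in place, expired entries are removed over time, which keeps the state size and therefore the checkpoint size bounded.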
Optimize the code for better partitioning. Use rebalance partitioning to help evenly distribute the data. Rebalance partitioning uses a round-robin method for distribution. For more information, see Rebalance on the Flink website.
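Rebalance partitioning can be sketched as follows; the example elements are illustrative:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RebalanceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(1, 2, 3, 4, 5, 6) // hypothetical skewed input
            .rebalance()                   // round-robin redistribution across downstream subtasks
            .print();

        env.execute("rebalance-example");
    }
}
```

Because rebalance() sends records round-robin, each downstream subtask receives a roughly equal share of the data, which helps prevent one task manager from being overloaded.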
Optimize the code to reduce the window size so that the number of unique keys in each window is reduced.
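A smaller window can be sketched as follows in a Flink Java job. The one-minute tumbling window, key field, and sample records are illustrative assumptions; a shorter window holds fewer distinct keys in state at any time than, for example, a one-hour window:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SmallerWindowExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements(Tuple2.of("user-1", 1), Tuple2.of("user-2", 1)) // hypothetical keyed records
            .keyBy(value -> value.f0)
            // A 1-minute window keeps far fewer keys in state at once than a longer window
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
            .sum(1)
            .print();

        env.execute("window-example");
    }
}
```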