Thank you for describing your issue with the automatic system rollback feature in Amazon Managed Service for Apache Flink. Your observation about the system rollback not triggering is correct, and there are a few important points to consider:
- System rollback is designed to occur before the application transitions to the RUNNING status. If your application has already successfully transitioned to RUNNING and then encounters runtime issues like state incompatibility exceptions, the automatic rollback will not be triggered.
- The behavior you're seeing with indefinite restarts using the exponential delay strategy is expected. As you noted, the job never actually fails; it just keeps restarting, which prevents the automatic rollback from triggering.
- In your situation, where the application is already in the RUNNING state and facing runtime errors, you have the option to manually trigger a rollback using the RollbackApplication API.
- Regarding the maximum number of restart attempts, you're correct that this isn't directly configurable in Managed Service for Apache Flink as it is in open-source Apache Flink.
To address your issue, you have a few options:
- Manually trigger a rollback using the RollbackApplication API if you want to revert to the previous working version.
- Force-stop the application, which will allow you to then update it with a corrected version that resolves the state incompatibility issues.
- Review your application code and configuration to resolve the state incompatibility exceptions, then update the application with the corrected version.
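As a rough illustration of the first two options, here is a minimal Python sketch. It assumes you pass in a boto3 `kinesisanalyticsv2` client; the helper names `rollback_to_previous_version` and `force_stop` are made up for this example, and error handling is left out.

```python
def rollback_to_previous_version(client, application_name):
    """Request a manual rollback of the application to its previous version.

    `client` is assumed to be a boto3 "kinesisanalyticsv2" client.
    Returns the version ID the application had when the rollback was requested.
    """
    # RollbackApplication needs the current version ID, so look it up first.
    detail = client.describe_application(ApplicationName=application_name)
    current_version = detail["ApplicationDetail"]["ApplicationVersionId"]

    client.rollback_application(
        ApplicationName=application_name,
        CurrentApplicationVersionId=current_version,
    )
    return current_version


def force_stop(client, application_name):
    """Force-stop the application so a corrected version can be deployed."""
    client.stop_application(ApplicationName=application_name, Force=True)
```

With boto3 installed and credentials configured, this could be called as `rollback_to_previous_version(boto3.client("kinesisanalyticsv2"), "my-flink-app")`, where `my-flink-app` is a placeholder application name.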
For troubleshooting, you can use the ListApplicationOperations API to view the history of operations on your application, and the DescribeApplicationOperation API to get more details on specific operations. This can help you understand what's happening with your application and why the rollback isn't triggering as expected.
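The two troubleshooting calls can be combined roughly as follows. This is a sketch under assumptions rather than a definitive implementation: `client` is assumed to be a boto3 `kinesisanalyticsv2` client, the helper name `operation_history` is invented for this example, and pagination via `NextToken` is omitted for brevity.

```python
def operation_history(client, application_name, status=None):
    """Return detailed records for the application's operations.

    `client` is assumed to be a boto3 "kinesisanalyticsv2" client.
    If `status` is given (for example "FAILED"), only operations with
    that OperationStatus are described.
    """
    listing = client.list_application_operations(ApplicationName=application_name)

    details = []
    for op in listing.get("ApplicationOperationInfoList", []):
        if status is not None and op.get("OperationStatus") != status:
            continue
        # Fetch the full record for this operation by its ID.
        described = client.describe_application_operation(
            ApplicationName=application_name,
            OperationId=op["OperationId"],
        )
        details.append(described["ApplicationOperationInfoDetails"])
    return details
```

Calling `operation_history(client, "my-flink-app", status="FAILED")` (placeholder application name) would then return only the operations that failed, which is usually the quickest way to see whether a rollback was attempted at all.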
If you continue to face issues, it may be helpful to check the AWS Health Dashboard for any service-related problems or consider opening a support case for more personalized assistance.
Sources
System rollback best practices - Managed Service for Apache Flink
Improve the resilience of Amazon Managed Service for Apache Flink application with system-rollback feature | AWS Big Data Blog
