One critical code component that many people overlook is retry logic in exception handling. A legacy mindset assumes an operation will either succeed or fail, and that if it fails, it will keep failing because some system is hard-down (e.g., the database). In reality there are many sources of transient errors, such as a database lock or a timeout while resources are auto-scaling, and a simple retry often succeeds. A minimal sketch follows.
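Here is a minimal Python sketch of retry logic with exponential backoff and jitter for transient errors. The function name, the default parameters, and the choice of which exceptions count as transient are illustrative assumptions, not something specified in this thread:

```python
import random
import time

def call_with_retries(operation, max_attempts=5, base_delay=0.5):
    """Retry a zero-argument callable on transient errors.

    Which exception types are "transient" is application-specific;
    TimeoutError and ConnectionError are used here as examples.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise  # retries exhausted: surface the error to the caller
            # Exponential backoff with full jitter so many clients
            # retrying at once do not hammer the service in lockstep.
            delay = random.uniform(0, base_delay * 2 ** (attempt - 1))
            time.sleep(delay)
```

Capping the total number of attempts and re-raising at the end matters: a retry loop should mask transient failures, not hide a system that is genuinely hard-down.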
It is critical to assume a non-zero error rate in legacy systems as well as in modern, complex ones.
When transitioning from on-premises to the cloud, the underlying infrastructure gets abstracted away and therefore becomes even more complex. This complexity provides tremendous value, including vastly more scalability and resiliency, but the trade-off is an even greater likelihood of transient errors. Simple yet thorough exception handling, together with observability, is hard to get right but essential.
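Since this thread is in an AWS context, one concrete option is to lean on the retry handling built into the AWS SDK rather than hand-rolling it everywhere. A small sketch using boto3; the client and parameter values are illustrative:

```python
import boto3
from botocore.config import Config

# "standard" retry mode retries throttling and other transient failures
# with exponential backoff; max_attempts caps the total tries.
config = Config(retries={"max_attempts": 5, "mode": "standard"})

dynamodb = boto3.client("dynamodb", config=config)
```

SDK-level retries handle the common transient cases, but you still need application-level exception handling and logging so that persistent failures become visible instead of silently retried.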
Hi, you need to add tangible details to your question (metrics, error logs, etc.) if you want meaningful support from the re:Post community. "Very unstable" can mean many things: describing in more detail exactly what is failing will definitely help. Thanks