Notice history

While the investigation of the recent API unavailability incident continues, we would like to provide an update on our most recent findings. Service stability is our highest priority, which is why we want to be as transparent as possible. A full analysis will be provided as soon as the investigation is complete.

--- Summary ---

We have identified that the downtime was caused by a rare and unfortunate combination of issues in an underlying AWS service (ElastiCache) and in AWS infrastructure components (EBS volumes / EC2 AMI), together with an issue in the AppGrid codebase. We are working closely with AWS Support to find the root cause, and have taken the necessary steps on our side to protect us from similar issues in the future.

Although the API has been stable since the issue was resolved, we will keep extra monitoring in place as a precaution until the final root cause has been established and addressed. We have also updated our emergency procedures to deal with this scenario effectively going forward.

--- Event details ---

1. We have identified an issue in AWS ElastiCache connection handling which caused abnormal connection retention. Specifically, Amazon ElastiCache kept all connections open even though the machines connecting to it had been terminated. This issue has been escalated to AWS Support and we are actively working with them to find its root cause.

2. Our systems do not rely on ElastiCache availability, and we continuously test the ability of AppGrid services to keep functioning without ElastiCache and several other supporting services. However, this particular ElastiCache issue was not accounted for in our tests or in the AppGrid connection handling code. Because this abnormal behavior did not cause ElastiCache to be identified as "unavailable", connections were stuck in a waiting state, which after some time led to failed health checks of AppGrid services.

3. As part of the normal self-healing procedure, AppGrid machines were immediately replaced by new ones. Even though the new API instances used the same underlying AMI (Amazon Machine Image) that has been in use for a long time, these machines appeared to have corrupted filesystems and failed to launch AppGrid API services. This caused an endless loop of bringing up new machines and tearing down faulty ones, which in turn prevented the system from recovering. This issue has also been escalated to AWS Support, as the system had been rotating machines with this exact configuration for several months without any issues.

--- Actions taken ---

We are continuing our investigation with AWS Support around the clock to find the underlying issue(s), as well as ways to ensure we are not affected by similar issues in the future.

We are currently updating AppGrid connection handling to mitigate communication issues with ElastiCache when it is in this unresponsive state with open, faulty connections, which will protect us from this issue in the future.

Finally, AppGrid is currently hosted in multiple availability zones in North America, meaning that AppGrid would still operate normally even in the rare event that a complete AWS data center goes down. In the long term, we are evaluating multi-regional load balancing of AppGrid APIs, which would provide even higher redundancy and a global distribution of service availability.
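
The failure mode described in this notice, a cache that keeps connections open but never answers, can be illustrated with a short sketch. This is not the AppGrid implementation: the function name `fetch_with_fallback`, the request bytes, the fallback value and the timeout are all assumptions made for the example. It only shows the general idea of bounding every cache call with a timeout and treating a stalled connection the same way as an unavailable cache.

```python
import socket
import time


def fetch_with_fallback(host, port, request, fallback, timeout_s=0.5):
    """Return the cache reply, or fallback() if the cache fails or stalls."""
    try:
        # The timeout applies to the connect and to every later send/recv,
        # so a connection that is open but silent cannot block forever.
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(request)
            reply = conn.recv(4096)
            if not reply:
                raise ConnectionError("cache closed the connection")
            return reply
    except OSError:
        # Timeouts and connection errors are both OSError subclasses:
        # treat either one as "cache unavailable".
        return fallback()


# Hypothetical stand-in for the stalled cache: a listening socket that
# never accepts or answers, so clients connect but get no reply.
stalled = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
stalled.bind(("127.0.0.1", 0))
stalled.listen()
port = stalled.getsockname()[1]

started = time.monotonic()
result = fetch_with_fallback("127.0.0.1", port, b"GET session\r\n",
                             fallback=lambda: b"from-database")
elapsed = time.monotonic() - started
stalled.close()

print(result, f"after {elapsed:.2f}s")
```

Without the timeout, the `recv` call in this sketch would wait indefinitely, which corresponds to the "stuck in a waiting state" behavior that eventually failed the health checks.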

On Feb 10th, between 00:52 and 03:49 UTC, Amazon experienced network connectivity issues in the Northern California region that impacted some API calls sent to AppGrid. During the automatic failover between availability zones, which moved traffic away from the faulty network routes, session creation may have taken longer than expected or failed (failovers happened from 01:55 to 01:59 and from 02:52 to 02:54). During this time window, AppGrid Admin console users may also have experienced sporadic connection issues. The issue is now resolved.

Please refer to the Amazon incident report on http://status.aws.amazon.com:

5:28 PM PST We are investigating network connectivity issues in a single Availability Zone in the US-WEST-1 Region.

6:15 PM PST We continue to investigate network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region.

7:09 PM PST We have identified the root cause of the network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region. Connectivity to some instances has been restored and we continue to work on the remaining instances.

8:10 PM PST Between 4:52 PM and 7:49 PM PST we experienced network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region. The issue has been resolved and the service is operating normally.
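
The notice gives the incident window in UTC while the Amazon report uses PST. The following sketch shows how the two line up. The year 2017 is an assumption taken from the notice history listing, and the helper name `pst_to_utc` is hypothetical; PST is modeled as a fixed UTC-8 offset.

```python
from datetime import datetime, timedelta, timezone

# PST as a fixed offset of UTC-8 (no daylight saving in February).
PST = timezone(timedelta(hours=-8), "PST")


def pst_to_utc(year, month, day, hour, minute):
    """Convert a PST wall-clock time to an aware UTC datetime."""
    local = datetime(year, month, day, hour, minute, tzinfo=PST)
    return local.astimezone(timezone.utc)


# Amazon's window: 4:52 PM to 7:49 PM PST on the evening of Feb 9
# (year assumed to be 2017).
start = pst_to_utc(2017, 2, 9, 16, 52)
end = pst_to_utc(2017, 2, 9, 19, 49)

print(start.isoformat(), "to", end.isoformat())
print("duration:", end - start)
```

Under these assumptions, the evening of Feb 9 in PST falls on Feb 10 in UTC, which is why the notice dates the incident Feb 10th, 00:52 to 03:49 UTC.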

All systems operational

Mar 2017

Feb 2017

Jan 2017

Leyra - Notice history
