Leyra - Notice history

All systems operational

Admin Console - Operational

100% uptime · Feb 2017: 100.0% · Mar 2017: 100.0% · Apr 2017: 100.0%

Delivery API - Operational

100% uptime · Feb 2017: 100.0% · Mar 2017: 100.0% · Apr 2017: 100.0%

Web Platform - Operational

100% uptime · Feb 2017: 100.0% · Mar 2017: 100.0% · Apr 2017: 100.0%

Notice history

Apr 2017

Elevated API Errors
  • Update

    # RCA on AppGrid API elevated response times after 2.12.0 upgrade

    # Executive Summary

    As of April 25th 19:17 CEST, AppGrid has been upgraded to 2.12.0, a major release involving a larger infrastructural change: a move to a scalable micro-service architecture relying on AWS ECS (Elastic Container Service). This release was an important milestone for the product, as it allows us to more easily monitor, scale and upgrade micro-services independently. Following the release, we experienced two unforeseen issues that affected AppGrid API response times and availability. Both issues are now resolved. For the full sequence of events and the actions taken to prevent this from happening in the future, please refer to the detailed root cause analysis below.

    # Root Cause Analysis

    The AppGrid upgrade from 2.11.0 to 2.12.0 was completed on April 25th at 19:17 CEST, after a gradual switch from the old environment to the new one starting at about 14:00 CEST. The system was carefully monitored for errors, elevated response times and other signs of trouble, both during the gradual switch and after it had been completed.

    At 22:00 CEST, we noticed slightly increasing response times (250 ms on average) on a couple of database-intense API endpoints as traffic increased. At 22:30 the root cause was identified: the machines operating the databases had been configured with slower hard drives than intended (mechanical hard drives rather than SSD drives). Machines with SSD drives were rotated into the cluster and were in full effect at **approximately 23:30 CEST, at which point API response times were back to normal**. During this incident the API continuously served requests, but a majority of requests experienced increased response times. A graph of the elevated response times during this time window is shown below. Note that times in the charts are presented in UTC.

    ![AWS CloudWatch report on April 25th](https://s3-us-west-1.amazonaws.com/appgrid-test-reports/mongo.png "AWS CloudWatch report on April 25th")

    On April 26th at roughly 10:00 CEST, we were notified that a non-production service integrating with AppGrid had trouble issuing certain CORS requests towards the /asset/file/ and /content/file/ API endpoints. After some investigation, we identified the issue as a change in how the API dealt with a specific type of CORS request when the response had previously been cached on the CloudFront CDN and a subsequent request came from a different domain origin (a minimal sketch of this caching interaction is included after the update list below). This resulted in production domains being the ones most likely to be served files from the CDN, as requests from those domains were the ones being cached. Even though this should have affected only internal, non-production environments, the CloudFront CDN cache for these two endpoints was temporarily disabled at 11:29 CEST to remedy the issue while the fix to the Cross-Origin API behaviour was being prepared. At this point in time, the API was under a level of load that the system was more than capable of handling without this feature in place. As a result of disabling it, these two endpoints experienced slightly higher response times. The system was carefully monitored for service stability during this time.

    At 14:30 CEST, we experienced an increase in traffic on the API at the same time as more resource-heavy requests were being served. With CloudFront caching disabled for the two affected endpoints, the API scaling policy had to be adjusted to cope with the extra traffic and remedy the increase in response times. At 15:01 CEST, the decision was taken to re-enable CloudFront CDN caching for the two endpoints while the fix to the root cause of the CORS issue was still being finalised and deployed. CloudFront changes were propagated globally, **a process that was complete at around 15:20 CEST**. The final solution to the initial CORS issue was deployed into production at 16:08 CEST and **the full effect of this should have been seen at around 16:15 CEST**. A graph of the elevated response times during this time window is shown below.

    ![AWS CloudWatch report on April 26th](https://s3-us-west-1.amazonaws.com/appgrid-test-reports/cf.png "AWS CloudWatch report on April 26th")

    # Preventive actions

    1. While the initial error regarding which hard drive types were allocated to the database machines was caused by human factors, we could have spotted this mistake earlier in our load testing. We have identified two discrepancies between our load test scenarios and live traffic: first, the system had not been subjected to exactly the same conditions for certain cache eviction operations; second, certain API traffic patterns had not been simulated properly when requesting certain API endpoints with a combination of multiple query parameters. While it is very hard to exactly mimic global API traffic patterns, we are currently working on improving our load test scenarios to incorporate these types of conditions as well.

    2. The second issue, relating to the CORS handling we were notified about on April 26th, although limited in severity and scope (due to the intricate set of conditions required and the fact that it mainly affected non-primary web domains), should have been identified during integration testing. The root cause of this issue, namely that the CloudFront cache was storing responses for specific domain origins rather than a wildcard, has been fixed, and the scenario has been added to the integration test suite. In retrospect, it was a bad decision to disable the CloudFront caching mechanism for these endpoints without first modifying the scaling policies for the underlying services.

    3. Finally, we have fine-tuned the auto-scaling policies in the new ECS cluster based on the findings from these two incidents to provide an even more reactive environment (a sketch of what such a scaling-policy adjustment can look like is also included after the update list below).

    We once again want to apologize for the service degradation caused by these incidents and want to emphasize that service availability and stability continues to be our absolute highest priority.

  • Resolved

    The rollout of the solution to the CORS issue has been successfully completed and the issue is resolved as of 16:15 CEST. Again, we are sorry for the inconvenience this has caused our customers and will continue to monitor the service for stability. A full root cause analysis and the preventive steps taken will be provided shortly.

  • Monitoring

    The final solution to the initial CORS issue is currently being deployed into production and the effects should be seen within a couple of minutes. We are closely monitoring the API for service stability and will provide updates as we proceed.

  • Update

    CloudFront changes have now been propagated globally and API response times should be back to normal. We are currently preparing to release the final solution to the initial CORS issue and will report back as soon as this is under way.

  • Identified

    Due to the heavy increase in traffic, we are re-enabling the CloudFront CDN distribution until the proper fix for the root cause of the initial CORS issue is in place. The preliminary deployment schedule of 17:00 CEST today still holds. The effect of enabling CloudFront should be seen within the next 20 minutes, at which point the API timeouts should be gone; as a side effect, the initial CORS issue might arise for some users. For this, we have a temporary fix in place to mitigate the issue until the final solution is live. We are sorry for any inconvenience this is causing, and we assure you that we are working on resolving the issue in the most timely and secure manner possible.

  • Update

    Due to a heavy increase in API traffic, and as a result of the earlier disabling of our CloudFront CDN, some requests might not be processed correctly at the moment. We are adjusting our scaling policy to handle this extra load, and requests should start being processed correctly shortly.

  • Monitoring

    While we continue resolving the root cause of the issue relating to CORS requests for the CDN-cached /asset and /content/file endpoints found earlier today, the API continues to behave as expected. We will continue to monitor the health of these endpoints until the CDN cache is operational again; in the meantime you might experience slightly longer response times on these two endpoints. The solution should be deployed into production by 17:00 CEST today.

  • Identified

    We have identified the elevated errors relating to the API endpoints with CDN caching enabled (specifically /asset and /content/file) and narrowed the issue down to a behaviour change in how the affected endpoints process Cross-Origin requests. This could have resulted in some Cross-Origin requests made towards the CDN failing. As an intermediate solution, we have disabled CDN caching while addressing the underlying issue. As disabling and invalidating the CDN takes a short while to propagate globally, the issue could persist for a short time, but both API endpoints should be working as expected again very soon for all requests. While the CDN is disabled, the affected API endpoints may exhibit slightly longer response times than usual. We will update the incident report as we work on the solution for the underlying issue.

  • Investigating

    We're experiencing an elevated level of API errors relating to downloading files from the AppGrid CDN endpoints and are currently looking into the issue.
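
Technical note on the CORS caching interaction described in the RCA above: when a CDN caches a response that carries a single, fixed Access-Control-Allow-Origin value, a later request from a different origin can be served that cached header and fail the browser's CORS check. The sketch below is a minimal illustration of origin-aware CORS headers; it is not AppGrid's actual code, and the framework, route, and allowed origins are assumptions.

```python
# Minimal sketch of origin-aware CORS headers (illustrative only; the framework,
# route and ALLOWED_ORIGINS below are assumptions, not AppGrid's actual code).
from flask import Flask, request

app = Flask(__name__)

# Origins allowed to fetch files cross-origin (hypothetical values).
ALLOWED_ORIGINS = {"https://app.example.com", "https://staging.example.com"}

@app.after_request
def add_cors_headers(response):
    origin = request.headers.get("Origin")
    if origin in ALLOWED_ORIGINS:
        # Echo the requesting origin instead of one fixed value.
        response.headers["Access-Control-Allow-Origin"] = origin
    # Tell downstream caches that the response differs per Origin, so a copy
    # cached for one origin is not replayed for requests from another.
    response.headers["Vary"] = "Origin"
    return response

@app.route("/content/file/<path:key>")
def content_file(key):
    # Stand-in for the real file-serving logic.
    return f"file contents for {key}\n"
```

On the CDN side, the corresponding change is to include the Origin request header in the CloudFront cache key (by whitelisting it in the cache behaviour), so that a response cached for one origin is never served to another.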
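
Similarly, the scaling-policy adjustment mentioned in the RCA can be pictured along the lines of the sketch below: an ECS service is registered as a scalable target and given a step-scaling policy that adds tasks when load rises. This is a generic illustration under assumed names, capacities and step sizes, not the actual AppGrid configuration.

```python
# Generic sketch of an ECS service step-scaling policy via Application Auto Scaling.
# Cluster/service names, capacities and step sizes are assumptions for illustration.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "service/appgrid-cluster/appgrid-api"  # hypothetical cluster/service

# Allow the service's desired task count to move between 4 and 20.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=4,
    MaxCapacity=20,
)

# Add two tasks whenever the associated CloudWatch alarm (e.g. on CPU or
# response time) fires; a short cooldown keeps the reaction time low.
autoscaling.put_scaling_policy(
    PolicyName="appgrid-api-scale-out",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "Cooldown": 60,
        "MetricAggregationType": "Average",
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0.0, "ScalingAdjustment": 2},
        ],
    },
)
```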

Mar 2017

Elevated API Errors
  • Resolved

    Following the API incident on March 13, we have taken corrective actions to avoid similar issues with Amazon ElastiCache, as detailed in the previous update. All systems have been stable since then.

  • Update

    While the investigation of the recent API unavailability incident continues, we would like to provide an update on our most recent findings. Service stability is our highest priority, which is why we want to be as transparent as possible, and a full analysis will be provided as soon as the investigation is complete.

    --- Summary ---

    We have identified that the downtime was caused by a rare and unfortunate combination of issues in the underlying AWS services (ElastiCache) and AWS infrastructural components (EBS volumes / EC2 AMI), in combination with an issue in the AppGrid codebase. We are working closely with AWS support to find the root cause of the issue, and have taken the necessary steps on our side to protect us from similar issues in the future. Even though the API has been stable since the issue was resolved, we will continue to have extra monitoring in place as a precautionary action until the final root cause has been established and addressed. We have also updated our emergency procedures to deal effectively with this scenario going forward.

    --- Event details ---

    1. We have identified an issue in AWS ElastiCache connection handling which caused abnormal connection retention. Specifically, Amazon ElastiCache kept all of the connections open even though the machines connecting to it had been terminated. This issue has been escalated to AWS Support and we are actively working with them to find its root cause.

    2. While our systems do not rely on ElastiCache availability (and we continuously test the ability of AppGrid services to keep functioning without the ElastiCache mechanism and several other supporting services), we have identified that this particular ElastiCache issue was not accounted for in our tests or in the AppGrid connection handling code. As this abnormal behavior did not effectively identify ElastiCache as "unavailable", connections were stuck in a waiting state, which after some time led to failed health checks of AppGrid services.

    3. As part of the normal self-healing procedure, AppGrid machines were immediately replaced by new ones. Even though the new API instances used an identical underlying AMI (Machine Image) that had been in use for a long time, these machines appeared to have corrupted filesystems and failed to launch the AppGrid API services. This caused an endless loop of bringing up new machines and tearing down faulty ones, which in turn prevented the system from recovering. This issue has also been escalated to AWS Support, as the system had been successfully rotating machines with this exact configuration for several months without any issues.

    --- Actions taken ---

    Moving forward, we are continuing our investigations, working with AWS Support around the clock to find the underlying issue(s) as well as ways to ensure we are not affected by similar issues in the future. We are currently updating the AppGrid connection handling to mitigate communication issues with ElastiCache in this unresponsive state with open, faulty connections, which will protect us from this issue in the future (a minimal sketch of this kind of bounded-timeout connection handling is included after the update list below). Finally, AppGrid is currently hosted in multiple availability zones in North America, meaning that AppGrid would still operate normally even in the rare event that a complete AWS data center goes down. In the long term, we are evaluating multi-regional load-balancing of the AppGrid APIs, which would provide even higher redundancy and a global distribution of service availability.

  • Update

    Yesterday (March 13th) between 14:38:59 and 15:32:34 CET we experienced an unexpected issue with the AppGrid API. During this time interval, the majority of API requests could not be processed. The extent of the impact on client applications varied depending on the client caching strategy in place. In parallel to investigating the root cause, several measures were taken to resume normal operations for our services. Unfortunately, the outage was directly related to an issue with the underlying infrastructure and therefore we were not able to restore or re-create the affected services using our standard procedures. We are working closely with Amazon support to determine the root cause of the problem with the infrastructure. In the meantime, we have extra monitoring in place in order to secure stability. As soon as the root cause has been identified, we will provide a more detailed post mortem analysis as well as an overview on what steps we will be taking to avoid similar issues in the future.

  • Monitoring

    The API has been available again since 15:33 CET. We are still investigating the root cause of the outage and will provide more information shortly.

  • Update

    We are still investigating the major API outage affecting all regions and will provide updates continuously.

  • Investigating

    We're experiencing an elevated level of API errors since 14:38 CET and are currently looking into the issue.
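
To illustrate the kind of connection-handling hardening described in the update above, the sketch below bounds connect and read timeouts on the cache client so that a node which accepts connections but never responds is treated as unavailable instead of leaving requests stuck in a waiting state. The client library, host name, and timeout values are assumptions, not AppGrid's actual configuration.

```python
# Illustrative sketch only: bounded timeouts so an unresponsive cache node is
# detected quickly instead of leaving connections stuck in a waiting state.
# Host name and timeout values are assumptions, not AppGrid's real settings.
import redis

cache = redis.StrictRedis(
    host="appgrid-cache.example.internal",
    port=6379,
    socket_connect_timeout=1.0,  # fail fast if the node does not accept connections
    socket_timeout=2.0,          # fail fast if an established connection stops responding
)

def get_cached(key):
    """Read-through helper: fall back to the primary store if the cache is unhealthy."""
    try:
        return cache.get(key)
    except (redis.exceptions.TimeoutError, redis.exceptions.ConnectionError):
        # Treat the cache as unavailable rather than blocking the request.
        return None
```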

Feb 2017

Elevated AppGrid API error rate and Admin console access issues
  • Resolved

    On Feb 10th, between 00:52 and 03:49 UTC, Amazon experienced network connectivity issues in the Northern California region that impacted some API calls sent to AppGrid. During the automatic failover between the different availability zones to avoid using the faulty network routes, session creations may have taken longer than expected or failed (failovers happened from 01:55 to 01:59 and from 02:52 to 02:54). During this time window, AppGrid Admin console users may have experienced sporadic connection issues. The issue is now resolved.

    Please refer to the Amazon incident report on http://status.aws.amazon.com:

    • 5:28 PM PST: We are investigating network connectivity issues in a single Availability Zone in the US-WEST-1 Region.
    • 6:15 PM PST: We continue to investigate network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region.
    • 7:09 PM PST: We have identified the root cause of the network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region. Connectivity to some instances has been restored and we continue to work on the remaining instances.
    • 8:10 PM PST: Between 4:52 PM and 7:49 PM PST we experienced network connectivity issues for instances and failures of newly launched instances in a single Availability Zone in the US-WEST-1 Region. The issue has been resolved and the service is operating normally.
