# RCA on elevated AppGrid API response times after the 2.12.0 upgrade
# Executive Summary
As of April 25th 19:17 CEST, AppGrid has been upgraded to 2.12.0, a major release involving a larger infrastructural change: a move to a scalable micro-service architecture built on AWS ECS (Elastic Container Service). This release was an important milestone for the product, as it allows us to more easily monitor, scale, and upgrade micro-services independently.
Following the release, we experienced two unforeseen issues that affected AppGrid API response times and availability. Both issues are now resolved. For the full sequence of events and the actions taken to prevent a recurrence, please refer to the detailed root cause analysis below.
# Root Cause Analysis
The AppGrid upgrade from 2.11.0 to 2.12.0 was completed on April 25th at 19:17 CEST, after a gradual switch from the old environment to the new one starting at about 14:00 CEST. The system was monitored closely for errors, elevated response times, and other signs of trouble, both during the gradual switch and after it had been completed. At 22:00 CEST, we noticed slightly increasing response times (250 ms on average) on a couple of database-intensive API endpoints as traffic grew. At 22:30 CEST, the root cause was identified: the machines operating the databases had been configured with slightly slower storage than intended (mechanical hard drives rather than SSDs). Machines with SSDs were rotated into the cluster and were in full effect at **approximately 23:30 CEST, at which point API response times were back to normal**. During this incident, the API continuously served requests, but the majority of requests experienced increased response times. A graph of the elevated response times during this window is shown below. Note that times in the charts are presented in UTC.

On April 26th at roughly 10:00 CEST, we were notified that a non-production service integrating with AppGrid was having trouble issuing certain CORS requests toward the /asset/file/ and /content/file/ API endpoints. After some investigation, we identified the cause as a change in how the API handled a specific type of CORS request when the response had previously been cached on the CloudFront CDN and a subsequent request arrived from a different domain origin. Because requests from production domains were the ones most frequently cached, those domains were the most likely to be served files from the CDN, while requests from other origins could receive cached responses carrying a mismatched CORS header. Even though this should have affected only internal, non-production environments, the CloudFront CDN cache for these two endpoints was temporarily disabled at 11:29 CEST to remedy the issue while a fix for the cross-origin API behaviour was being prepared. At that point, the API was under a level of load that the system could comfortably handle without caching in place, although the two endpoints saw slightly higher response times as a result. The system was carefully monitored for service stability throughout.
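The interaction described above can be illustrated with a toy model. The paths, domains, and handler names below are hypothetical; the sketch only shows the general mechanism: when an API echoes the request's Origin into the Access-Control-Allow-Origin header, but a shared CDN cache does not include the Origin in its cache key, the first caller's origin gets "baked into" the cached response. Later callers from other origins then receive a mismatched header, which their browsers reject. It also sketches the wildcard-based behaviour that avoids the problem.

```python
# Toy model of a shared CDN cache that keys responses by path only,
# ignoring the request's Origin header (hypothetical names throughout).
cdn_cache = {}

def echo_origin_handler(request_origin, path):
    """Problematic behaviour: the cached response carries the first caller's origin."""
    if path in cdn_cache:
        return cdn_cache[path]  # CDN hit: the current Origin is ignored
    response = {"Access-Control-Allow-Origin": request_origin}
    cdn_cache[path] = response
    return response

def wildcard_handler(path):
    """Safe behaviour: a wildcard header is valid for any requesting origin."""
    if path in cdn_cache:
        return cdn_cache[path]
    response = {"Access-Control-Allow-Origin": "*"}
    cdn_cache[path] = response
    return response

# A production domain populates the cache first ...
first = echo_origin_handler("https://prod.example.com", "/asset/file/logo.png")
# ... so a later request from another origin receives the production
# domain's header, and the requesting browser rejects the response.
second = echo_origin_handler("https://staging.example.com", "/asset/file/logo.png")
assert second["Access-Control-Allow-Origin"] == "https://prod.example.com"

# With a wildcard header, the cached response works for every origin.
fixed = wildcard_handler("/content/file/data.json")
assert fixed["Access-Control-Allow-Origin"] == "*"
```

In real deployments the same effect can also be avoided by including the Origin header in the CDN cache key, at the cost of caching one copy per origin.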
At 14:30 CEST, traffic on the API increased at the same time as more resource-heavy requests were being served. With CloudFront caching disabled for the two affected endpoints, the API scaling policy had to be adjusted to cope with the extra traffic and remedy the increase in response times. At 15:01 CEST, the decision was taken to re-enable CloudFront CDN caching for the two endpoints while the fix for the root cause of the CORS issue was still being finalised and deployed. The CloudFront changes were propagated globally, **a process that completed at around 15:20 CEST**. The final fix for the initial CORS issue was deployed to production at 16:08 CEST, and **its full effect should have been visible by around 16:15 CEST**. A graph of the elevated response times during this window is shown below.

# Preventive actions
1. While the initial error relating to the hard drive types allocated to the database machines was caused by human factors, we could have spotted the mistake earlier in our load testing. We have identified two discrepancies between our load test scenarios and live traffic: first, the system had not been subjected to the same conditions for certain cache eviction operations; second, certain API traffic patterns had not been simulated properly, namely requests to certain API endpoints with combinations of multiple query parameters. While it is very hard to exactly mimic global API traffic patterns, we are currently working on improving our load test scenarios to incorporate these cases as well.
2. The second issue, relating to the CORS handling we were notified about on April 26th, was limited in severity and scope (due to the intricate set of conditions required and the fact that it mainly affected non-primary web domains), but it should have been identified during integration testing. The root cause, the CloudFront cache storing responses tied to a specific domain origin rather than a wildcard, has been fixed, and the scenario has been added to the integration test suite. In retrospect, it was a bad decision to disable the CloudFront caching mechanism for these endpoints without first adjusting the scaling policies of the underlying services.
3. Finally, we have fine-tuned the auto-scaling policies in the new ECS cluster based on the findings from these two incidents to provide an even more reactive environment.
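The tuned policies themselves are internal, but the general behaviour of a target-tracking policy (the kind ECS Service Auto Scaling supports) can be sketched as follows: the desired task count is adjusted proportionally so that the tracked per-task metric returns to its target value. The function name, metric, and numbers below are illustrative, not our production configuration.

```python
import math

def desired_task_count(current_count, metric_value, target_value, max_count=20):
    """Toy target-tracking rule: pick the smallest task count that brings the
    per-task metric (e.g. average CPU utilisation) back to the target.
    All parameter names and limits here are illustrative."""
    if metric_value <= 0:
        return 1
    wanted = math.ceil(current_count * metric_value / target_value)
    return max(1, min(max_count, wanted))

# 4 tasks at 90% average CPU with a 60% target -> scale out to 6 tasks.
print(desired_task_count(4, 90.0, 60.0))  # 6
# 4 tasks at 30% average CPU with a 60% target -> scale in to 2 tasks.
print(desired_task_count(4, 30.0, 60.0))  # 2
```

Lowering the target value or the scale-out cooldown makes such a policy react faster to traffic spikes, at the cost of running more headroom, which is the trade-off the fine-tuning above addresses.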
We once again want to apologize for the service degradation caused by these incidents, and to emphasize that service availability and stability remain our absolute highest priority.