Notice history

On February 21st, at approximately 18:25 CET, the Accedo One API received a very large volume of application logs on the Detect API endpoints. This ramp-up, which happened over a very short time period on this specific endpoint, turned the API routing into a bottleneck and increased the error rate on other parts of the API.

To support these traffic patterns, requests were split on the CDN (CloudFront) and a dedicated set of services was deployed to handle this traffic (application logs). Because this type of CDN change takes some time to propagate globally, API response times and error rates did not return to normal levels until 18:40 CET.

With this solution in place, the system was once more subjected to similar loads on February 24th, which it handled without any downtime or increased error rate. This can be seen in the image below, where the traffic spike on the 24th reached the same levels as were seen on the 18th and 21st, while the system operated with normal API response times and error rates. The solution described above has therefore been put into permanent effect.

![error rate](https://s3-us-west-1.amazonaws.com/appgrid-prod-email-assets/statusPage/5XX-over-T20.png "Error rate")

**Traffic spike on February 24th shows no errors reported.**

### Actions Taken

* **Done:** Larger request bodies for Accedo Detect (HTTP POST requests) are now permanently redirected on CloudFront to their own dedicated set of API routers and underlying services. This separates the long-running, non-critical API requests from app-critical requests such as content and configuration fetching.
* **Done:** Several caching improvements have been rolled out to make the system even more stable when high traffic volumes on the API coincide with changes to the underlying data from the Admin UI.
* **Done:** Several additional improvements have been made to the session management layer, which allows us to scale even more rapidly and provide a highly available system even if applications.
* **In progress:** As mentioned in the previous post mortem, we are continuing work on the reworked API output modelling strategy as well as a new session management layer, which will further increase the performance of the API during these extremely high loads.

We want to apologize for the inconvenience caused by these two incidents, and want to emphasize that our full focus has been on addressing the root causes since the first occurrence. Although we managed to roll out several key improvements over the course of a few days, the February 21st traffic spike came just 3 days after the initial one, and we did not have time to roll out all of the short-term solutions identified in the post mortem. We are, however, confident that with the changes that have now been rolled out, and with the improvements coming up in both the short and long term, we will be able to handle this type of traffic spike going forward.
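As an illustration of the traffic split listed under Actions Taken, the routing decision can be sketched as a small function. This is not the actual CloudFront configuration, which the post does not describe: the `/detect` path prefix, the 8 KiB size threshold, and the pool names `detect-logs` and `app-critical` are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical values: the post does not state the real path or size threshold.
DETECT_PATH_PREFIX = "/detect"
LARGE_BODY_BYTES = 8 * 1024


@dataclass(frozen=True)
class Request:
    method: str
    path: str
    content_length: int = 0


def choose_pool(request: Request) -> str:
    """Return the name of the (hypothetical) backend pool for a request."""
    is_detect_post = (
        request.method.upper() == "POST"
        and request.path.startswith(DETECT_PATH_PREFIX)
    )
    # Large log uploads go to their own pool so that they cannot starve
    # content and configuration fetches of router capacity.
    if is_detect_post and request.content_length >= LARGE_BODY_BYTES:
        return "detect-logs"
    return "app-critical"
```

Under these assumptions, a 64 KiB `POST` to `/detect/logs` would be sent to the dedicated pool, while a `GET` for content stays on the app-critical pool regardless of how much log traffic arrives.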

Yesterday (February 18th) between 14:21 and 14:54 CET, the Accedo One API exhibited slower response times and, at a certain point, returned errors for a subset of the API calls issued to the system.

The trigger was a substantial load on the API, lasting from 13:39 to 14:54 CET, in combination with several changes made in rapid succession in the Admin UI by editors in the organisations that generated the high traffic levels. These changes to the underlying datasets evicted the cached data, letting additional traffic through to the underlying system.

At 14:20 CET, one specific micro-service was overwhelmed when the data it was serving was repeatedly changed at the same time as the traffic suddenly surged past its previous high levels. This can be seen in the large anomaly in the purple line in the chart below starting at 14:20 CET, from which point the traffic went from 400 000 requests per minute (rpm) to above 1.1 million rpm in under 10 minutes. Note that timestamps in the graph are presented in **UTC**.

![API request count vs. 2XX API responses](https://s3-us-west-1.amazonaws.com/appgrid-prod-email-assets/statusPage/api-router+RPMvs2XX.png)

**API Request Count vs. API Requests with HTTP 200 OK response**

The high latency that this sudden burst created in the service had a negative cascading effect on the session management component. This component itself then suffered from higher latency, which meant that its health checks would not pass on newly started instances as we scaled up. For approximately 30 minutes from this point, the API service performance was substantially decreased (as can be seen in the orange line in the chart above), albeit still properly serving upwards of 250 000 requests per minute intermittently. At 14:54 CET, the affected service had scaled to a high enough instance count to serve all incoming API traffic in a stable and performant way.
The system has been under close monitoring since then, and no further issues have been identified.

## Actions Taken

Several action points have been identified based on the scenario we experienced yesterday:

* **Immediate:** A new internal procedure has been implemented for configuring health check parameters during high traffic events, to allow services to serve data at a slower (but still reasonable) rate.
* **Immediate:** More standardised status page communication templates are being put in place for quicker feedback during service disruptions occurring outside office hours.
* **Immediate:** Over this week we will be rolling out a smarter way of caching and evicting data, which will make the system less prone to issues when high traffic volumes on the API coincide with changes to the underlying data from the Admin UI (which result in caches being evicted).
* **Short term:** In parallel to the above, a reworked API output modelling strategy that has been in the making for a while is now being fast-tracked. It will further increase the performance of the API during these extremely high loads.
* **Short term:** Another topic for the near term is to enable smarter caching on the session layer when the same device re-connects with the same UUID and Application Key in a short time span, reducing the stress on that part of the system during high load.
* **Short term:** One of the services that experienced issues in the incident yesterday was the component responsible for session creation. A new, completely rebuilt internal mechanism for session creation is in the works and is already showing very promising performance results.
* **Long term:** In the slightly longer term, we will also be deploying the API in multiple regions, on top of the multiple availability zone (HA) distribution we have in place today.
* **Long term:** We will also be putting in place API rate limiting for certain parts of the system, to ensure a fair distribution of capacity across all customers and, in the end, the end users. This will lessen the risk of the API being overwhelmed, and will lessen the impact on end users in case of extremely high loads such as the one experienced on Sunday.

We want to take this opportunity to apologise for this service disruption, and we will continue to treat the above-mentioned action points with the highest priority.
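The session-layer caching mentioned in the short-term action points can be illustrated with a minimal sketch. It assumes a simple in-memory, time-limited cache keyed by device UUID and Application Key; the class name `SessionCache`, the 60-second default, and the stand-in `_create_session` method are hypothetical and do not describe the actual Accedo One implementation.

```python
import time
import uuid
from typing import Callable, Dict, Tuple


class SessionCache:
    """Reuse a session when the same device reconnects within ttl_seconds."""

    def __init__(
        self,
        ttl_seconds: float = 60.0,  # assumed value, not taken from the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # (device_uuid, app_key) -> (session_id, expiry time)
        self._entries: Dict[Tuple[str, str], Tuple[str, float]] = {}
        self.created = 0  # number of sessions actually created

    def get_or_create(self, device_uuid: str, app_key: str) -> str:
        key = (device_uuid, app_key)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and cached[1] > now:
            # Still fresh: skip the expensive creation path.
            return cached[0]
        session_id = self._create_session()
        self._entries[key] = (session_id, now + self._ttl)
        return session_id

    def _create_session(self) -> str:
        # Stand-in for the real (expensive) session creation call.
        self.created += 1
        return uuid.uuid4().hex
```

With a cache like this, a device that reconnects repeatedly during a traffic burst triggers session creation only once per time window, which is the kind of load reduction the action point is aiming for. A production version would also need eviction of expired entries and a shared store across instances, both of which are left out here.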

All systems operational

Mar 2018

Feb 2018

Jan 2018
