Leyra - Elevated API Errors – Incident details


Elevated API Errors

Resolved
Started about 8 years ago · Lasted about 3 hours
Updates
  • Update

    On February 21st, at approximately 18:25 CET, the Accedo One API received a massive volume of application logs on the Detect API endpoints. This ramp-up, occurring over a very short time period on this specific endpoint, caused the API routing to become a bottleneck and increased the error rate on other parts of the API. To support these traffic patterns, requests were split on the CDN (CloudFront) and a dedicated set of services was deployed to handle this traffic (application logs). As this type of change on the CDN takes some time to propagate globally, API response times and error rates only returned to normal levels at 18:40 CET.

    With this solution in place, the system was once more subjected to similar loads on February 24th, which it handled without any downtime or increased error rate. This can be seen in the image below, where the traffic spike on the 24th reached the same levels as were seen on the 18th and 21st, while the system operated with normal API response times and error rates. The solution described above has therefore been put into permanent effect.

    ![error rate](https://s3-us-west-1.amazonaws.com/appgrid-prod-email-assets/statusPage/5XX-over-T20.png "Error rate")

    **Traffic spike on February 24th shows no errors reported.**

    ### Actions Taken

    * **Done:** Larger request bodies for Accedo Detect (HTTP POST requests) are permanently re-directed on CloudFront to their own dedicated set of API routers and underlying services, separating the long-running, non-critical API requests from app-critical requests such as content and configuration fetching.
    * **Done:** Several caching improvements have been rolled out to make the system even more stable when dealing with high traffic volumes on the API at the same time as the underlying data is being changed from the Admin UI.
    * **Done:** Several additional improvements have been made to the session management layer, which allow us to scale even more rapidly and provide a highly available system even under sudden application traffic surges.
    * **In progress:** As was mentioned in the previous post mortem, we are continuing work on the reworked API output modelling strategy as well as a new session management layer, which will further increase the performance of the API during these extremely high loads.

    We want to apologize for the inconvenience caused over these two incidents, and want to emphasize that our full focus has been on addressing the root causes since the first occurrence. Although we managed to roll out several key improvements over the course of a few days, with the February 21st traffic spike coming just 3 days after the initial one, we did not have time to roll out all of the short-term solutions that were identified in the post mortem. We are, however, confident that with the changes that have now been rolled out, and with the improvements coming up in both the short and long term, we will be able to confidently handle this type of traffic spike going forward.
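The core of the permanent fix is a routing rule at the CDN edge: heavy POST log-ingestion requests are sent to a dedicated origin, so they cannot starve the app-critical content and configuration fetches. A minimal sketch of that decision logic is below; the path prefix (`/detect/`) and origin names are illustrative assumptions, not Accedo's actual CloudFront configuration (which would be expressed as cache behaviors with path patterns rather than application code).

```python
# Illustrative sketch of CDN request splitting as described above.
# Assumed names: "/detect/" prefix, "detect-origin", "core-api-origin".

def select_origin(method: str, path: str) -> str:
    """Pick a backend pool for an incoming API request.

    Long-running, non-critical log uploads (Accedo Detect POSTs) go to a
    dedicated set of API routers; everything else stays on the core API.
    """
    if method == "POST" and path.startswith("/detect/"):
        return "detect-origin"   # dedicated services for application logs
    return "core-api-origin"     # content and configuration fetching

# Example: a log upload and a config fetch land on different origins.
print(select_origin("POST", "/detect/applicationlogs"))  # detect-origin
print(select_origin("GET", "/config/profile"))           # core-api-origin
```

Because the split happens at the edge, a surge on the Detect endpoints saturates only its own router pool, which is why the February 24th spike produced no elevated error rate on the rest of the API.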

  • Resolved

    The degraded response times, and in some cases elevated error rate, on the Accedo One API observed between 18:25 and 18:40 CET are now marked as resolved after being closely monitored since the solution was rolled out. We apologise for this service degradation, which was caused by a massive traffic spike over a very narrow time frame to which the infrastructure was not able to scale quickly enough. We will continue our work on the action points specified in the previous post mortem analysis to further strengthen service robustness against these extreme surges in traffic, some of which will be rolled out later this week.

  • Monitoring

    The solution to the elevated API errors has been rolled out as of approximately 18:40 CET, and service health check metrics are back to normal. We will continue to closely monitor the API for stability and provide updates as necessary.

  • Identified

    We are currently experiencing a massive surge in a particular type of API request, which is causing an elevated level of API errors. We are working on a solution that is estimated to be in place in the next 15 minutes.