Elevated API Errors

Update

August 02, 2018 at 4:38 PM

On Friday, July 6 at 16:00 CEST (14:00 UTC), the Accedo One team observed an increasing number of errors and elevated response times on Publish API endpoints.

# Incident Details

During the investigation, the issue was traced to an abnormally large increase in traffic on Publish API endpoints. Observed traffic was approximately ten times the normal overall traffic on the system (including high-load events, such as World Cup 2018), with the majority of the traffic targeting Publish endpoints.

# Resolution Details

As an immediate resolution, an additional short-lived cache tier was placed in front of the Publish endpoints. Additional compute capacity was also deployed to offload the underlying services and allow systems to return to a regular operational state. At 17:09 CEST (15:09 UTC) all Accedo One APIs had fully returned to their operational state.

# Mitigation & Planned Actions

The changes implemented during last night's incident remain in place, and further fixes will be put in place as outlined below.

Accedo One has identified opportunities for scalability improvements to the Publish endpoints and will immediately prioritize improvements to the related services. These changes will provide better availability during abnormal load spikes on Publish endpoints, with the goal of minimizing service degradation times. The upgrades will be rolled out as soon as possible, but not before the end of the World Cup.

Additionally, the Accedo One team is investigating other complementary solutions, such as rate limiting, to improve the resilience of the entire system and increase protection for our customers.

Specifically, during the remaining World Cup games, which are expected to produce significant load spikes, we will maintain high operational capacity, including additional live monitoring, and will deploy the short-lived caching and additional compute capacity that proved to be successful mitigations. This will resolve any issues for the remaining games.

As with all distributed systems, it is recommended to retain a fallback cache in the middleware and to implement an incremental back-off mechanism, so that systems can recover from failures, especially those caused by excessive spikes in load.
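
For integrators who want a starting point, the recommendation above (a fallback cache plus an incremental back-off) could look roughly like the following Python sketch. It is an illustration under stated assumptions, not Accedo One client code: the function name `fetch_with_fallback`, the in-memory `dict` used as the cache, and the default values for `attempts`, `base` and `cap` are all hypothetical, and `fetch` stands for whatever call your middleware makes to a Publish endpoint.

```python
import random
import time
from typing import Any, Callable, Dict


def fetch_with_fallback(
    key: str,
    fetch: Callable[[], Any],
    cache: Dict[str, Any],
    attempts: int = 4,
    base: float = 0.5,
    cap: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    rng: Callable[[], float] = random.random,
) -> Any:
    """Call `fetch`, retrying with exponential back-off and jitter.

    Hypothetical helper: on success the result is stored in `cache`;
    if every attempt fails, the last cached value for `key` is returned
    instead, and the last error is re-raised only when nothing is cached.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")

    last_error: Exception = RuntimeError("fetch was never attempted")
    for attempt in range(attempts):
        try:
            value = fetch()
        except Exception as exc:  # in real code, catch only transient errors
            last_error = exc
            if attempt < attempts - 1:
                # Delay ceiling doubles on each failure, up to `cap` seconds.
                ceiling = min(cap, base * (2 ** attempt))
                # Randomizing the wait spreads retries from many clients
                # so they do not hit the recovering service in lockstep.
                sleep(ceiling * rng())
            continue
        cache[key] = value  # remember the last good response
        return value

    if key in cache:
        return cache[key]  # serve stale data rather than fail
    raise last_error
```

The growing, randomized delay gives an overloaded service room to recover, while the cached copy keeps the application usable in the meantime. A production version would typically also bound the age of cached entries and retry only on errors known to be transient.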

Resolved

July 06, 2018 at 4:15 PM

This incident has been resolved.

Monitoring

July 06, 2018 at 3:37 PM

With the help of the measures that were put in place, Accedo One APIs have returned to an operational state. We are continuing to monitor the situation.

Update

July 06, 2018 at 3:13 PM

We are still working on resolving the issue with the elevated response times and error rates of the Publish endpoints. Several measures are being put in place and we will update the status as soon as more information is available.

Identified

July 06, 2018 at 2:44 PM

We have identified the problem that is causing increased error rates and response times for API requests to Publish endpoints, and we are currently working on a resolution.

Investigating

July 06, 2018 at 2:17 PM

We are currently experiencing an elevated level of API errors and are looking into the issue. We will provide more updates as soon as possible.
