On Friday, July 6 at 16:00 CEST (14:00 UTC), the Accedo One team observed an increasing number of errors on Publish API endpoints, along with elevated response times.
#Incident Details
During the investigation, the issue was traced to an abnormally large increase in traffic on Publish API endpoints. Observed traffic was approximately ten times the normal overall traffic on the system (including high-load events such as World Cup 2018), with the majority of the traffic targeting Publish endpoints.
#Resolution Details
As an immediate resolution, an additional short-lived cache tier was placed in front of Publish endpoints. Additional compute capacity was also deployed to offload the underlying services and allow systems to return to a regular operational state.
At 17:09 CEST (15:09 UTC) all Accedo One APIs returned fully to their operational state.
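For readers unfamiliar with the technique, the idea behind a short-lived cache can be sketched as follows. This is a minimal illustration only, not the Accedo One implementation: the class name `ShortLivedCache`, the 5-second TTL, and the `load` callback are all assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class ShortLivedCache:
    """Hypothetical in-memory cache that keeps responses for a few seconds."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds          # assumed TTL, chosen for illustration
        self._clock = clock              # injectable clock, useful in tests
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get_or_load(self, key: str, load: Callable[[str], Any]) -> Any:
        """Return a cached value if still fresh, otherwise call `load`."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]              # fresh hit: the backend is not touched
        value = load(key)                # miss or expired: one backend call
        self._entries[key] = (now + self._ttl, value)
        return value
```

Even a TTL of a few seconds means that repeated requests for the same resource during a spike reach the underlying service at most once per TTL window per cache instance, at the cost of serving slightly stale data.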
#Mitigation & Planned Actions
The changes implemented during the incident remain in place, and further fixes are planned, as outlined below.
Accedo One has identified opportunities for scalability improvements for Publish endpoints and will immediately prioritize improvements to the related services. These changes will provide better availability during abnormal load spikes for Publish endpoints, with the intended goal of minimizing service degradation times. These upgrades will be rolled out as soon as possible, but not before the end of the World Cup.
Additionally, the Accedo One team is investigating other complementary solutions, such as rate limiting, to improve the resilience of the entire system and increase protection of our customers.
Specifically, during the remaining World Cup games, which are expected to produce significant load spikes, we will continue to maintain high operational capacity, including additional live monitoring capabilities, and will deploy short-lived caching and additional compute capacity, both of which proved to be successful mitigation measures. This is expected to prevent a recurrence during the remaining games.
As with all distributed systems, it is recommended to retain a fallback cache in the middleware and to implement an incremental back-off mechanism so that systems can recover after failures, especially those caused by excessive spikes in load.
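As an illustration of this recommendation, the sketch below combines a fallback cache with exponential back-off and random jitter. It is a minimal example under stated assumptions, not an official Accedo One client: `FallbackCache`, `fetch_with_backoff`, the `fetch` callback, and all default values (4 attempts, 0.5 s base delay, 8 s cap) are hypothetical.

```python
import random
import time
from typing import Any, Callable, Dict


class FallbackCache:
    """Hypothetical store for the last successful response per key."""

    def __init__(self) -> None:
        self._store: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._store[key] = value

    def has(self, key: str) -> bool:
        return key in self._store

    def get(self, key: str) -> Any:
        return self._store[key]


def fetch_with_backoff(key: str,
                       fetch: Callable[[str], Any],
                       cache: FallbackCache,
                       max_attempts: int = 4,
                       base_delay: float = 0.5,
                       max_delay: float = 8.0,
                       sleep: Callable[[float], None] = time.sleep) -> Any:
    """Call `fetch`, retrying with growing delays; serve stale data if all fail."""
    for attempt in range(max_attempts):
        try:
            value = fetch(key)
        except Exception:
            if attempt < max_attempts - 1:
                # The delay ceiling doubles per attempt, capped at max_delay.
                ceiling = min(max_delay, base_delay * (2 ** attempt))
                # Random jitter spreads retries from many clients over time.
                sleep(random.uniform(0, ceiling))
            continue
        cache.put(key, value)            # remember the last good response
        return value
    if cache.has(key):
        return cache.get(key)            # upstream is down: serve stale data
    raise RuntimeError("no response and no cached fallback for %r" % key)
```

The growing, randomized delays keep clients from retrying in lockstep while a service is recovering, and the fallback cache lets an application keep serving the last known good content in the meantime.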