- Update
## **Summary**

Between Nov 16 21:00 UTC and Nov 17 09:45 UTC, the Accedo One Control Delivery API, Management API and Admin Console sporadically returned errors for otherwise valid requests. The errors returned to clients were predominantly 404s, with a small number of additional 401s and 403s. We estimate that up to 5% of the total requests received by the Control Delivery API were affected over the incident window, causing degraded or sporadically interrupted service to applications relying on it.

The problem was first reported to our service management team on Nov 17 at 06:07 UTC, and the engineering team began investigating at 06:22 UTC. The severity of the issue led to a Statuspage update to all customers at 07:18 UTC. A solution was found at 08:57 UTC and progressively rolled out, leading to complete resolution by Nov 17, 09:45 UTC.

This incident only affected customers using our US-based services. Customer services relying on our EU deployment of Accedo One were unaffected.

## **Root cause**

Accedo One Control is deployed, like many other cloud solutions, in a containerized environment: hundreds of containers, fulfilling various features of the product, talk to one another to generate the responses to requests sent by client applications. As traffic increases and decreases, more or fewer containers are put in service so that incoming requests are handled promptly. The containers are hosted on virtual machines (EC2 instances) within an AWS cloud environment; these machines run the operating system and the container orchestration software that schedules and terminates containers as needed to follow the load.

On Nov 16, around 20:55 UTC, a specific EC2 machine out of a fleet of about a hundred instances started experiencing abnormally high memory usage. Unable to free memory, applications running both in containers and on the underlying host began encountering errors that led to out-of-memory termination, including the container management system on the virtual machine.
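The load-following behaviour described above can be sketched with a minimal capacity-based scaling function. This is purely illustrative (the function name and numbers are hypothetical, not Accedo One's actual autoscaling logic): the orchestrator keeps enough containers running to absorb the current request rate.

```python
import math

def desired_containers(request_rate, per_container_capacity, minimum=1):
    """Illustrative only: how many containers are needed for the current load.

    request_rate           -- incoming requests per second (hypothetical)
    per_container_capacity -- requests per second one container can absorb
    minimum                -- never scale below this many containers
    """
    return max(minimum, math.ceil(request_rate / per_container_capacity))

# As traffic grows, more containers are scheduled; as it falls, fewer remain.
print(desired_containers(950, 200))  # 5 containers at peak
print(desired_containers(120, 200))  # scales back down to the minimum, 1
```

In a real orchestrator this decision also accounts for CPU and memory pressure on the hosts, which is precisely where this incident started.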
The container management application restarted on the instance immediately, scheduling new containers to run alongside one old container that had survived the memory duress episode. During the investigation we discovered that, while the container scheduling system requested new containers on the instance normally, the fulfillment of these requests by the container management service led to unexpected IP address reuse on the local network bridge shared by the containers: the new container scheduled to start after the memory duress episode was assigned the same IP address as the surviving container.

The effects of this undesired behavior are not immediately obvious: both containers host HTTP servers, but they expose different, partially overlapping sets of endpoints. In this scenario, packets delivered over the network bridge to the conflicting IP address may be routed to either container at random, leading to a mix of correct and erroneous responses, including 404s whenever the wrong container is hit.

## **Actions taken**

While the application container that ran out of memory is considered the actual root cause of the issue, it triggered an unfortunate bug in the container management system. We consider this bug critical, as it leads to serious, silent, and therefore very hard to detect and troubleshoot issues. The following actions are being taken to address the problems identified above:

* Prevent the restart of the container management system should it be killed for any reason (including container or system memory issues). As a result, the underlying EC2 instance would be terminated and replaced with a new one. IP addresses can therefore not overlap, as no containers would exist on the new machine when the container management system starts.
* Tighten the memory reservations on several of the containers deployed in our fleet, to prevent machines from experiencing such low memory availability.
* Identify why the specific container that caused the problem uses that much memory, and address it.
* Report the issue to the community maintaining the container management system we use; our cloud provider has confirmed our engineers' analysis of the issue.

It should be noted that 404, 401 and 403 errors are expected as part of normal Accedo One Control API operation, and that the rate at which these errors are returned varies greatly over the course of a day and by day of the week. The change in error rate during this incident was not significant enough to trigger automatic alerts to on-call engineers, which delayed investigation and resolution. The overall impact on customers was also relatively low, as we estimate that fewer than 5% of all requests received an unexpected error. Because the errors were client errors (HTTP 4xx) rather than server errors (HTTP 5xx), no change to our alerting policy was implemented at this time, as it would lead to too many false positives.

We apologize for any inconvenience this may have caused, and are continuing to improve our infrastructure's reliability and fault tolerance.
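The root-cause behaviour described above, where two containers share one IP address but serve only partially overlapping endpoints, can be illustrated with a small simulation. The endpoint names and route sets below are hypothetical, not the actual Accedo One services:

```python
import random

# Hypothetical, partially overlapping endpoint sets for the two containers
# that ended up sharing one IP address on the network bridge.
OLD_CONTAINER_ROUTES = {"/status", "/metadata", "/entries"}
NEW_CONTAINER_ROUTES = {"/status", "/sessions", "/profile"}

def handle(container_routes, path):
    """Return the HTTP status a container would send for a given path."""
    return 200 if path in container_routes else 404

def request(path, rng=random):
    """A packet sent to the conflicting IP may reach either container."""
    routes = rng.choice([OLD_CONTAINER_ROUTES, NEW_CONTAINER_ROUTES])
    return handle(routes, path)

# A path served only by the new container intermittently 404s whenever the
# surviving old container answers instead -- a mix of correct and erroneous
# responses from the clients' point of view.
results = {request("/sessions") for _ in range(1000)}
print(sorted(results))  # almost always both 200 and 404 appear
```

The intermittent, valid-looking 4xx responses produced by this failure mode are exactly why the incident was hard to detect: each individual response is well-formed, and only the aggregate error rate shifts.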
- Resolved
This incident has been resolved.
- Monitoring
We have implemented a solution for the elevated error rates and are now seeing the system return to normal operation. We will continue monitoring the situation to ensure full resolution has been reached.
- Identified
We have identified the source of the disturbance causing increased 404 error rates on API calls and are working on a solution.
- Update
We are continuing to investigate the issues that result in some customers seeing sporadic 404 errors on API requests.
- Investigating
We are investigating the reason for the 404 errors.