Leyra - Notice history

All systems operational

Admin Console - Operational

100% - uptime
Oct 2020 · 100.0% | Nov 2020 · 100.0% | Dec 2020 · 100.0%

Delivery API - Operational

100% - uptime
Oct 2020 · 100.0% | Nov 2020 · 100.0% | Dec 2020 · 100.0%

Web Platform - Operational

100% - uptime
Oct 2020 · 100.0% | Nov 2020 · 100.0% | Dec 2020 · 100.0%

Notice history

Dec 2020

No notices reported this month

Nov 2020

Accedo One Control - random 404 errors for Admin Console and API
  • Update

## **Summary**

Between Nov 16, 21:00 UTC and Nov 17, 09:45 UTC, the Accedo One Control Delivery API, Management API and Admin Console sporadically returned errors to requests that were otherwise valid. Most of the errors returned to clients were 404s, along with a small number of additional 401s and 403s. We estimate that up to 5% of the requests received by the Control Delivery API were affected over the incident window, causing degraded or sporadically interrupted service for applications relying on it.

The problem was first reported to our service management team on Nov 17 at 06:07 UTC, and the engineering team started investigating at 06:22 UTC. Given the severity of the issue, a Statuspage update was sent to all customers at 07:18 UTC. A solution was found at 08:57 UTC and progressively led to complete resolution by Nov 17, 09:45 UTC.

This incident only affected customers using our US-based services. Customer services relying on our EU deployment of Accedo One were unaffected.

## **Root cause**

Like many other cloud solutions, Accedo One Control is deployed in a containerized environment: hundreds of containers, fulfilling various features of the product, talk to one another to generate the responses to requests sent by client applications. As traffic increases and decreases, more or fewer containers are put in service to handle incoming requests promptly. Containers are hosted on virtual machines (EC2) within an AWS cloud environment. These machines run the operating system and the container orchestration software that schedules and terminates containers as needed to follow the load.

On Nov 16, around 20:55 UTC, one EC2 machine out of a fleet of a hundred instances started experiencing abnormally high memory usage. Unable to free memory, applications running both in containers and on the underlying host started to encounter errors that led to out-of-memory termination. The container management system on the virtual machine was among the terminated applications.

The container management application restarted on the instance immediately and scheduled new containers to run alongside one old container that had survived the memory pressure episode. The investigation found that, while the container scheduling system started new containers on the instance normally, the way the container management service fulfilled these requests led to unexpected IP address reuse locally on the network bridge shared by the containers. A new container scheduled on the instance after the memory pressure episode was assigned the same IP address as the surviving container.

The effects of this unwanted behavior are not immediately obvious: both containers host HTTP servers but expose different, partially overlapping sets of endpoints. In this situation, packets delivered over the network bridge to the conflicting IP address may be routed to either container at random, producing a mix of correct and erroneous responses, including 404s whenever the wrong container is hit.

## **Actions taken**

While the application container that ran out of memory is considered the actual root cause of the issue, it triggered an unfortunate bug in the container management system. We consider this bug critical, as it leads to serious, silent issues that are therefore very hard to detect and troubleshoot. The following actions are being taken to address the problems identified above:

* Prevent the container management system from restarting should it be killed for any reason (including container or system memory issues). The underlying EC2 instance would instead be terminated and replaced with a new one. IP addresses can therefore not overlap, as no containers would exist on the new machine when the container management system starts up.
* Tighten the memory reservations on several of the containers deployed in our fleet to prevent machines from running this low on available memory.
* Identify why the specific container that caused the problem uses that much memory, and address it.
* Report the issue to the community maintaining the container management system we use, after our cloud provider confirmed our engineers’ analysis of the issue.

It should be noted that 404, 401 and 403 errors are expected as part of normal Accedo One Control API operation, and that the rate at which they are returned varies greatly over the course of a day and by day of the week. The change in error rate during this incident was not significant enough to trigger automatic alerts to on-call engineers, which delayed the investigation and resolution. The overall impact on customers was also relatively low, as we estimate that less than 5% of all requests received an unexpected error. Because the errors were client errors (HTTP 4xx) and not server errors (HTTP 5xx), no change to our alerting policy was implemented at this time, as it would lead to too many false positives.

We apologize for any inconvenience this may have caused, and are continuing to improve our infrastructure’s reliability and fault tolerance.
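
The IP address reuse described in the root cause is silent, so a periodic check over the addresses assigned on a host's network bridge is one way such a conflict could be surfaced. The following is a minimal sketch under stated assumptions, not the tooling used during this incident: the function name `find_ip_conflicts`, the container names and the addresses are all hypothetical, and the container-to-address mapping is assumed to have been collected from the container runtime beforehand.

```python
from collections import defaultdict


def find_ip_conflicts(container_ips):
    """Return {ip: [container names]} for every IP assigned to more than one container.

    `container_ips` is a hypothetical mapping of container name -> bridge IP,
    assumed to have been gathered from the container runtime on a single host.
    """
    by_ip = defaultdict(list)
    for name, ip in container_ips.items():
        by_ip[ip].append(name)
    # Keep only addresses shared by two or more containers.
    return {ip: sorted(names) for ip, names in by_ip.items() if len(names) > 1}


# Made-up example: a surviving container and a newly scheduled one share an IP.
example = {
    "surviving-api": "172.17.0.2",
    "new-admin": "172.17.0.2",
    "new-worker": "172.17.0.3",
}
conflicts = find_ip_conflicts(example)
print(conflicts)  # {'172.17.0.2': ['new-admin', 'surviving-api']}
```

A non-empty result would indicate that requests sent to the shared address may reach either container, which matches the mix of correct and erroneous responses described above.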

  • Resolved

    This incident has been resolved.

  • Monitoring

    We have implemented a solution for the elevated error rates and are seeing the system return to normal operation. We will continue monitoring the situation to ensure full resolution has been reached.

  • Identified

    We have identified the source of the disturbance affecting API calls and of the increased 404 error rates, and are working on a solution.

  • Update

    We are continuing to investigate the issues that result in some customers seeing sporadic 404 errors on API requests.

  • Investigating

    We are investigating the cause of the 404 errors.

Oct 2020

Accedo One - Elevated API Errors
  • Update

## Summary

On October 23, at 11:54 UTC, an instance failure affected one of the AWS EC2 instances hosting our internal messaging system in the US cluster. The instance became almost completely unresponsive due to underlying hardware issues. By design, the messaging cluster had several other nodes available to take over connections should such a failure happen.

At 12:01 UTC, the failure started causing disruptions in both the API and the admin console of Accedo One Control and, as a result, also affected other Accedo products relying on it, such as OTT Flow. The root cause was identified within a few minutes, and at 12:18 UTC large parts of the API were healthy again. Complete recovery was achieved at 12:32 UTC. Analytics data during the incident window may be unreliable.

This incident only affected customers using our US-based services. Customer services relying on our EU deployment of Accedo One were unaffected.

## Root Cause

Almost all services within Accedo One communicate with the internal messaging system, although it is not in the critical path for several of these services. To avoid disruptions when messaging cluster nodes degrade or the network fails, all services keep local buffers that handle temporary disconnects and network partitioning, and that prevent data loss while preserving overall API availability and performance. Each service is seeded with at least 3 messaging nodes to choose from, distributed across multiple data centers, and is instructed to fail over from one node to another should one become unreachable. Such fail-overs happen regularly during maintenance windows, tests or unexpected issues in our infrastructure. Although rare, they are not unusual and are completely transparent to our customers.
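
The fail-over behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the messaging driver's actual logic: `connect_with_failover`, the `connect` callable and the node names are hypothetical.

```python
def connect_with_failover(seed_nodes, connect):
    """Try each seed node in order; return (node, connection) for the first that works.

    `connect` is a hypothetical callable that takes a node address and either
    returns a connection object or raises ConnectionError.
    """
    errors = {}
    for node in seed_nodes:
        try:
            return node, connect(node)
        except ConnectionError as exc:
            # Remember why this node was skipped, then fail over to the next one.
            errors[node] = str(exc)
    raise ConnectionError(f"all seed nodes unreachable: {errors}")


# Hypothetical usage with a fake connector in which the first node is down.
def fake_connect(node):
    if node == "mq-1.internal":
        raise ConnectionError("connection refused")
    return f"connection to {node}"


node, conn = connect_with_failover(
    ["mq-1.internal", "mq-2.internal", "mq-3.internal"], fake_connect
)
print(node)  # mq-2.internal
```

Note that this simple form only reacts to outright connection failures. A node that still accepts connections but handles traffic unreliably would not trigger a fail-over.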

In the incident of October 23, services connecting to the messaging system did not detect the faulty messaging cluster node as unhealthy because it was still handling some network traffic, even though our engineers and cloud provider had no control over, or response from, the underlying instance. Once the problem was identified, the messaging instance was taken completely out of service, and recovery began as services progressively reconnected to the other healthy nodes.

## Actions Taken

Our systems are designed from the ground up with fault tolerance in mind, and such failures are part of our test scenarios. However, these tests failed to reveal that the messaging driver used in our services is unable to switch to a different node when a connection becomes severely unreliable without being completely unavailable. To avoid this and similar problems in the future, several actions were taken:

* All nodes in the messaging system were replaced with fresh instances.
* The messaging nodes were following the maintenance track of the major version they were on. They have all been upgraded to a new major version that includes fixes for fail-over switching in adverse network conditions.
* Instead of relying solely on the messaging driver's ability to detect and shift load away from unhealthy nodes, DNS-based randomization of the messaging node picked by each service when connecting or reconnecting was added. It will help prevent individual node failures from impacting many deployed services and containers. While service performance may still degrade in events like the one observed on October 23, this will prevent widespread disruptions such as the one experienced that day.

We apologize for any inconvenience this may have caused. Reliability and stability of our services is our highest priority, and we are constantly improving our infrastructure and routines.
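
To illustrate why randomizing the node picked on each connect or reconnect limits the impact of a single bad node, here is a minimal sketch. It assumes the DNS lookup has already returned the list of node addresses; `pick_messaging_node`, the addresses and the instance count are hypothetical and do not describe the deployed configuration.

```python
import random
from collections import Counter


def pick_messaging_node(addresses, rng=random):
    """Pick one node uniformly at random from the resolved addresses.

    `addresses` stands in for the records returned by a DNS lookup of the
    messaging cluster's hostname (the lookup itself is not shown).
    """
    if not addresses:
        raise ValueError("no messaging nodes resolved")
    return rng.choice(addresses)


# Hypothetical simulation: 3,000 service instances each pick a node on (re)connect.
nodes = ["10.0.1.10", "10.0.2.10", "10.0.3.10"]
rng = random.Random(42)  # fixed seed so the run is repeatable
picks = Counter(pick_messaging_node(nodes, rng) for _ in range(3000))
share_on_first_node = picks[nodes[0]] / 3000
print(f"{share_on_first_node:.0%} of instances picked {nodes[0]}")
```

With uniform random selection across three nodes, roughly one third of instances would be expected to land on any given node, so a single failing node should affect only that share instead of every instance that happened to prefer it.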

  • Resolved

    The incident is resolved. Further information and the steps taken to prevent it from recurring will be posted later on this incident entry.

  • Update

    We are continuing to monitor for any further issues.

  • Monitoring

    A fix has been implemented and we are monitoring the results.

  • Update

    We are continuing to investigate this issue.

  • Investigating

    We are currently experiencing an elevated level of API errors and are looking into the issue. We will provide more updates as soon as possible.
