## Summary
On October 23, at 11:54 UTC, an instance failure affected one of the AWS EC2 instances hosting our internal messaging system in the US cluster. That instance became almost completely unresponsive due to an underlying hardware issue. By design, several other nodes exist in the messaging cluster to take over connections should such a failure happen. At 12:01 UTC, the failure started causing disruptions in both the API and the admin console of Accedo One Control, and as a result also affected other Accedo products relying on it, such as OTT Flow. The root cause was identified within a few minutes, and at 12:18 UTC large parts of the API were healthy again. Complete recovery was achieved at 12:32 UTC. Analytics data collected during the incident window may be unreliable.
This incident only affected customers using our US-based services. Customer services relying on our EU deployment of Accedo One were unaffected.
## Root Cause
Almost all services within Accedo One communicate with the internal messaging system, although it is not in the critical path for several of them. To avoid disruptions when messaging cluster nodes degrade or the network fails, every service keeps a local buffer that handles temporary disconnects and network partitions, preventing data loss while preserving overall API availability and performance. Each service is seeded with at least three messaging nodes, distributed across multiple data centers, and instructed to fail over from one node to another should a node become unreachable. Such fail-overs happen during maintenance windows, tests, and occasional unexpected infrastructure issues; they are not unusual and are completely transparent to our customers.
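The messaging system, driver, and node names are not specified in this post; the following is a minimal TypeScript sketch of the seeded fail-over and local-buffer pattern described above, with hypothetical names throughout.

```typescript
// Minimal sketch only: `connectTo`, the node list, and all names are hypothetical,
// illustrating the seeded fail-over and local-buffer behavior described above.
type Connection = { send: (msg: string) => Promise<void>; close: () => void };

// Each service is seeded with several messaging nodes across data centers.
const SEED_NODES = [
  "msg-node-1.us.internal:5672",
  "msg-node-2.us.internal:5672",
  "msg-node-3.us.internal:5672",
];

// Stand-in for the real messaging driver's connect call.
async function connectTo(node: string): Promise<Connection> {
  return { send: async (msg: string) => console.log(`[${node}] ${msg}`), close: () => {} };
}

class MessagingClient {
  private buffer: string[] = [];   // local buffer bridges temporary disconnects
  private conn: Connection | null = null;

  async connect(): Promise<void> {
    for (const node of SEED_NODES) {
      try {
        this.conn = await connectTo(node);
        await this.flushBuffer();
        return;
      } catch {
        // Node unreachable: fail over to the next seeded node.
      }
    }
    throw new Error("no messaging node reachable");
  }

  async publish(msg: string): Promise<void> {
    if (!this.conn) {
      this.buffer.push(msg);       // keep the message locally rather than failing the caller
      return;
    }
    try {
      await this.conn.send(msg);
    } catch {
      this.buffer.push(msg);
      this.conn = null;
      void this.connect();         // reconnect to another node in the background
    }
  }

  private async flushBuffer(): Promise<void> {
    while (this.conn && this.buffer.length > 0) {
      await this.conn.send(this.buffer.shift()!);
    }
  }
}
```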
In the incident of October 23, services connecting to the messaging system did not detect the faulty messaging cluster node as unhealthy because it was still handling some network traffic, even though neither our engineers nor our cloud provider could control or get any response from the underlying instance. Once the problem was identified, the messaging instance was taken out of service completely and recovery began, with services progressively reconnecting to the other, healthy nodes.
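The driver's actual health-check logic is not described here; as an illustration of the failure mode, a check that only asks whether a node is handling any traffic keeps a degraded node in rotation, while a check based on the failure rate over a recent window would not. The names and thresholds below are assumptions, not our production logic.

```typescript
// Illustrative only: a naive "did we get any response?" check versus a
// failure-rate check, applied to a node that is degraded but not fully down.
interface NodeStats {
  successes: number;
  failures: number;   // timeouts and errors over the last observation window
}

// Naive check: the node handled *some* traffic, so it looks alive.
function naiveIsHealthy(stats: NodeStats): boolean {
  return stats.successes > 0;
}

// Stricter check: treat the node as unhealthy once most requests fail or time out.
function failureRateIsHealthy(stats: NodeStats, maxFailureRate = 0.5): boolean {
  const total = stats.successes + stats.failures;
  if (total === 0) return true;   // no data yet, assume healthy
  return stats.failures / total <= maxFailureRate;
}

// A node like the one in this incident: still answering occasionally, mostly failing.
const degradedNode: NodeStats = { successes: 3, failures: 97 };
console.log(naiveIsHealthy(degradedNode));        // true  -> node stays in rotation
console.log(failureRateIsHealthy(degradedNode));  // false -> node would be failed over
```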
## Actions Taken
Our systems are designed from the ground up with fault tolerance in mind, and such failures are part of our test scenarios. However, those tests failed to identify that the messaging driver used in our services is unable to switch to a different node when a connection becomes severely unreliable without becoming completely unavailable. To avoid this and similar problems in the future, several actions were taken:
* All nodes in the messaging system were replaced with fresh instances
* The messaging nodes had been following the maintenance track of the major version they were on. They have all been upgraded to a new major version that includes fixes for fail-over switching under adverse network conditions
* Instead of relying solely on the messaging driver's ability to detect and shift load away from unhealthy nodes, we added DNS-based randomization of the messaging node each service picks when connecting or reconnecting. This prevents an individual node failure from impacting a large share of deployed services and containers. While it may degrade service performance during events such as the one observed on October 23, it prevents the kind of widespread disruption experienced in this incident (a sketch follows below)
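As a rough illustration of the DNS-based randomization (the hostname, port, and resolution scheme are assumptions, not our actual configuration), each service could resolve a single DNS name that returns all messaging node addresses, shuffle the result, and connect to the first entry:

```typescript
// Sketch under assumptions: the DNS name and port are hypothetical. A single
// hostname resolves to all messaging nodes; shuffling the addresses means a
// single bad node is only picked by a fraction of services on (re)connect.
import { promises as dns } from "dns";

async function pickMessagingNode(hostname: string): Promise<string> {
  // One DNS name returns the addresses of all messaging nodes (multiple A records).
  const addresses = await dns.resolve4(hostname);

  // Fisher-Yates shuffle so retries walk the nodes in a random order.
  for (let i = addresses.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [addresses[i], addresses[j]] = [addresses[j], addresses[i]];
  }
  return addresses[0];
}

// Example usage (hostname is illustrative, not a real internal name):
// const node = await pickMessagingNode("messaging.us.internal");
// connect to `${node}:5672` with the messaging driver
```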
We apologize for any inconvenience this may have caused. The reliability and stability of our services are our highest priority, and we are constantly improving our infrastructure and routines.