Leyra - Notice history

All systems operational

Admin Console - Operational

100.0% uptime (Aug 2017 to Oct 2017)

Delivery API - Operational

100.0% uptime (Aug 2017 to Oct 2017)

Web Platform - Operational

100.0% uptime (Aug 2017 to Oct 2017)

Notice history

Aug 2017

AppGrid: Elevated API response time
  • Update

    # Root cause analysis of elevated API response times on August 31st

    On August 31st, AppGrid received a sustained, massive increase in API requests from 10:00 to 19:00 CEST. This increase was due to multiple simultaneous live events with a very large number of concurrent users. With a large pool of users connecting over slow networks (most frequently mobile connections), the system also had to sustain a high number of simultaneous connections held open for an extended period of time. This caused elevated response times between 13:31 and 14:40 CEST.

    To sustain the load over this long period, our services automatically scaled up. However, the caching layer in our cloud infrastructure could not cope with this extraordinary load for this amount of time. We therefore began work on a secondary cache cluster to handle the load, which was deployed at 14:40 CEST and returned API response times to normal.

    Following this change, at 14:57 CEST, a small number of API requests started experiencing errors because one of our 18 API routers was misconfigured. This misconfiguration was corrected at 15:27 CEST.

    # Preventive Actions

    We are currently optimizing our caching mechanism to handle higher throughput. We are also making additional changes to absorb sustained bursts of traffic more efficiently, using in-memory caches for certain entities (a generic sketch of such a cache appears at the end of this notice). Both changes are scheduled for **Tuesday, September 5th**.

    Furthermore, we have isolated non-critical, asynchronous API endpoints subject to slower data transfers (specifically, application logs) onto dedicated API routers. We will also investigate API rate limiting, as well as mechanisms for handling long-lived POST HTTP requests, to further improve service robustness during extreme traffic patterns over extended periods of time.

    We would like to apologize for the service disruption and want to emphasize that service security, stability, and scalability are our highest priorities.

  • Resolved

    The solutions deployed to address the elevated response times have been actively monitored since their deployment at approximately 14:40 and 15:27 CEST respectively, and all metrics are stable. We will share a postmortem shortly.

  • Update

    We have resolved the issue on the supporting cluster which affected a small number of API requests. We will continue to work on a root cause analysis, as well as on ensuring API service stability. Note that changes made in the Admin UI may currently take a little longer to propagate to the API.

  • Update

    We are investigating an issue on the supporting cluster deployed to resolve the elevated API response times; it could cause a limited number of API requests to fail. We will provide continuous updates.

  • Update

    At approximately 14:45 CEST, the solution to the elevated response times took effect. Since then, response times have been back to normal, but we will continue to actively pursue the root cause of the issue, as well as to make sure that we can handle bursts of the kind we experienced today. We will provide more information if and when we have more to share.

  • Update

    The resolution to the elevated response times is currently being deployed, with an ETA of 30 minutes. Until it is in full effect, some API requests will continue to experience longer-than-normal response times. We will continue to provide updates.

  • Update

    We are currently working on a solution that will remedy the elevated response times on the API. We estimate that the solution will take approximately 1 hour to take full effect globally, and we will post an update as soon as we have more information.

  • Update

    We are still investigating the issue causing the elevated response times on the API.

  • Monitoring

    Due to a massive spike in requests over a very short period, we are experiencing elevated response times on the API. Our infrastructure is currently scaling to absorb this burst of traffic. We are actively working to resolve the elevated response times and their root cause, and will post an update when we have more information to share.

  • Identified

    We are currently experiencing elevated response times on the API and are working to resolve the issue.
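
The preventive actions in the postmortem above mention adding in-memory caches for certain entities. Purely as an illustration, here is a minimal sketch of such a TTL-based local cache; the class name, TTL values, and cache keys are hypothetical and do not reflect AppGrid's actual implementation.

```python
import time
import threading

class TTLCache:
    """Minimal in-memory cache with a per-entry time-to-live.

    Illustrative only: a hypothetical stand-in for the per-entity
    in-memory caches mentioned in the preventive actions above,
    not AppGrid's actual implementation.
    """

    def __init__(self, ttl_seconds=60):
        self._ttl = ttl_seconds
        self._store = {}  # key -> (expires_at, value)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._store.get(key)
            if entry is None:
                return None
            expires_at, value = entry
            if time.monotonic() >= expires_at:
                # Expired: drop the entry so the caller falls back
                # to the origin (e.g. the shared cache cluster).
                del self._store[key]
                return None
            return value

    def set(self, key, value):
        with self._lock:
            self._store[key] = (time.monotonic() + self._ttl, value)

# Usage: consult the local cache before hitting the shared layer.
cache = TTLCache(ttl_seconds=30)
entity = cache.get("entity:123")
if entity is None:
    entity = {"id": "entity:123"}  # stand-in for a backend fetch
    cache.set("entity:123", entity)
```

A short, bounded TTL keeps entries fresh enough for API reads while absorbing repeated requests for hot entities during traffic bursts, so the shared caching layer sees far fewer hits.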
