# Root cause analysis of elevated API response times on August 31st
On August 31st, AppGrid received a sustained, massive increase in API requests from 10:00 to 19:00 CEST. This increase was caused by multiple simultaneous live events with a very large number of concurrent users. Because a large share of those users connected over slow networks (most frequently mobile connections), the system also had to sustain a high number of connections held open for extended periods. This caused elevated response times between 13:31 and 14:40 CEST.
To sustain the load over this long period, our services automatically scaled up. However, the caching layer in our cloud infrastructure could not cope with this extraordinary load for such an extended duration. We therefore began provisioning a secondary cache cluster to handle the load; it was deployed at 14:40 CEST and returned API response times to normal.
Following this change, at 14:57 CEST, a small number of API requests began returning errors because one of the 18 API routers was misconfigured. The misconfiguration was corrected at 15:27 CEST.
# Preventive Actions
We are currently optimizing our caching mechanism to handle higher throughput. We are also making additional changes to absorb sustained bursts of traffic more efficiently, using in-memory caches for certain entities.
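For readers curious about the general technique, an in-memory entity cache typically pairs each value with an expiry so stale data is evicted automatically. The sketch below is purely illustrative (the class name, TTL value, and storage layout are assumptions, not our production design):

```python
import time


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (illustrative sketch only)."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: evict it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```

A cache like this keeps hot entities out of the shared caching layer entirely, which is what helps absorb sustained bursts.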
Both changes are scheduled for **Tuesday, September 5th**. Furthermore, we have isolated non-critical, asynchronous API endpoints subject to slower data transfers (specifically, application logs) to dedicated API routers.
We will also investigate API rate limiting, as well as mechanisms for handling long-lived HTTP POST requests, to further improve service robustness during extreme traffic patterns sustained over extended periods.
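As background on the rate-limiting approach under investigation: a common scheme is the token bucket, which permits short bursts up to a fixed size while capping the long-run request rate. The following is a generic sketch of the idea, not a description of our eventual implementation:

```python
import time


class TokenBucket:
    """Illustrative token-bucket rate limiter (parameters are hypothetical)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # tokens replenished per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)     # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True  # request admitted
        return False     # request rejected (rate limit exceeded)
```

The appeal of this scheme for event-driven traffic is that legitimate short spikes pass through unthrottled, while sustained overload is shed early instead of exhausting backend capacity.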
We apologize for the service disruption and want to emphasize that service security, stability, and scalability are our highest priorities.