# Root cause analysis of elevated API response times on August 31st
On August 31st, AppGrid received a sustained, massive increase in API requests from 10:00 to 19:00 CEST. This increase was caused by multiple simultaneous live events with a very large number of concurrent users. Because a large share of those users connected over slow networks (most frequently mobile connections), the system also had to sustain a high number of connections held open for extended periods. This caused elevated response times between 13:31 and 14:40 CEST.
To sustain the load over this long period, our services automatically scaled up. However, the caching layer in our cloud infrastructure could not cope with this extraordinary load for such an extended duration. We therefore began provisioning a secondary cache cluster to handle the load; it was deployed at 14:40 CEST and returned API response times to normal.
Following this change, at 14:57 CEST, a small number of API requests began returning errors because one of the 18 API routers was misconfigured. The misconfiguration was corrected at 15:27 CEST.
# Preventive Actions
We are currently optimizing our caching mechanism to handle higher throughput. We are also making additional changes to absorb sustained bursts of traffic more efficiently, using in-memory caches for certain entities.
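For readers curious about the general technique, an in-memory entity cache typically pairs each value with an expiry so stale data is evicted automatically. The sketch below is purely illustrative (the class name, TTL value, and storage layout are assumptions, not our production design):

```python
import time


class TTLCache:
    """Minimal in-memory cache with per-entry expiry (illustrative sketch only)."""

    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            # Entry has expired: evict it and report a miss.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl)
```

A cache like this keeps hot entities out of the shared caching layer entirely, which is what helps absorb sustained bursts.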
Both changes are scheduled for **Tuesday, September 5th**. Furthermore, we have isolated non-critical, asynchronous API endpoints subject to slower data transfers (specifically, application logs) to dedicated API routers.
We will also investigate API rate limiting, as well as mechanisms for handling long-lived HTTP POST requests, to further improve service robustness during extreme traffic patterns sustained over extended periods.
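As background on the rate-limiting approach under investigation: a common scheme is the token bucket, which permits short bursts up to a fixed size while capping the long-run request rate. The following is a generic sketch of the idea, not a description of our eventual implementation:

```python
import time


class TokenBucket:
    """Illustrative token-bucket rate limiter (parameters are hypothetical)."""

    def __init__(self, rate_per_sec, burst):
        self.rate = rate_per_sec       # tokens replenished per second
        self.capacity = burst          # maximum burst size
        self.tokens = float(burst)     # start with a full bucket
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True  # request admitted
        return False     # request rejected (rate limit exceeded)
```

The appeal of this scheme for event-driven traffic is that legitimate short spikes pass through unthrottled, while sustained overload is shed early instead of exhausting backend capacity.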
We apologize for the service disruption and want to emphasize that service security, stability, and scalability are our highest priorities.