We want to share more details with our customers and readers about the API outage that occurred on August 20th, 2021, and what we are doing to prevent it from happening again.
Incident
On August 20th, at 8:12 UTC, our systems detected increased request latency and a total outage of one of our major nodes responsible for processing the validation, redemption, and publication API methods. The node manages the customer, order, voucher, and redemption entities and all operations related to them. This incident affected only tenants using the AS1 cluster (Singapore, Asia).
Impact on our customers
We saw a significant increase in 50X errors, specifically 503 HTTP errors, which indicate that our servers are unavailable. In this case, replicas of one of our services went down and did not recover properly after an automatic restart. As a result, the API hosted on the AS1 cluster ended up in a loop of restarting pods. That caused increased latency and possible request timeouts, which ultimately reduced availability for a portion of API calls.
A number of redemption API requests did not go through because of this issue.
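If your integration was affected, a general safeguard against transient 503s and timeouts is to retry failed calls with exponential backoff. The Go sketch below is purely illustrative; the URL and retry parameters are placeholders, not part of our API contract.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
	"time"
)

// getWithRetry retries a GET request on transient failures (network errors, 5xx)
// using exponential backoff. The URL passed in below is a placeholder, not a real endpoint.
func getWithRetry(client *http.Client, url string, maxAttempts int) (*http.Response, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		resp, err := client.Get(url)
		if err == nil && resp.StatusCode < 500 {
			return resp, nil // success or a non-retryable client error
		}
		if err != nil {
			lastErr = err
		} else {
			lastErr = fmt.Errorf("server returned %d", resp.StatusCode)
			resp.Body.Close()
		}
		// Exponential backoff: 0.5s, 1s, 2s, 4s, ...
		time.Sleep(time.Duration(500*math.Pow(2, float64(attempt))) * time.Millisecond)
	}
	return nil, fmt.Errorf("request failed after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := getWithRetry(client, "https://api.example.com/v1/redemptions", 5)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```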
Source of the problem
One of our customers with an account hosted on the AS1 cluster suddenly started sending a massive number of API calls. Unfortunately, the volume of calls far exceeded the limit allocated to their account, and the rate limiter implemented in our API gateway did not act properly.
The customer was repeatedly invoking the same API method. Unfortunately, the path used by that customer hit a bug in the API gateway (the service responsible, among other things, for authentication and rate limiting), which caused a memory leak in the gateway.
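For readers unfamiliar with this layer: a gateway rate limiter of the kind described above is commonly built as a per-account token bucket. The Go sketch below is an illustration of that pattern, not our actual implementation, and the account header is a made-up placeholder. It keeps limiter state in a single map keyed by account rather than by request path, which is one way to keep that state bounded instead of letting it grow into a memory leak.

```go
package main

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// accountLimiter keeps one token-bucket limiter per account. State is keyed
// by account ID only (not by request path), so its size stays bounded by the
// number of active accounts.
type accountLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit // allowed requests per second per account
	burst    int
}

func newAccountLimiter(rps rate.Limit, burst int) *accountLimiter {
	return &accountLimiter{limiters: make(map[string]*rate.Limiter), rps: rps, burst: burst}
}

func (a *accountLimiter) allow(account string) bool {
	a.mu.Lock()
	lim, ok := a.limiters[account]
	if !ok {
		lim = rate.NewLimiter(a.rps, a.burst)
		a.limiters[account] = lim
	}
	a.mu.Unlock()
	return lim.Allow()
}

// middleware rejects requests over the per-account limit with 429 instead of
// letting them pile up and exhaust gateway resources.
func (a *accountLimiter) middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		account := r.Header.Get("X-Account-Id") // placeholder header, not our real API contract
		if !a.allow(account) {
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	limiter := newAccountLimiter(100, 200) // e.g. 100 req/s with a burst of 200 per account
	api := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("ok"))
	})
	http.ListenAndServe(":8080", limiter.middleware(api))
}
```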
At the same time, our auto-scaling mechanism reacted too late and did not provision additional resources as quickly as required.
Improvements
First of all, we manually rescaled the resources to improve the responsiveness of our API gateway, which mitigated the main problem. However, this was only a temporary solution, and our primary goal was to fix the memory leak as soon as possible. We released the final fix within 4 hours of the problem occurring and fully resolved the issue at 12:00 UTC.
In the meantime, we reached out to the customer whose integration was misusing the API to let them know about the incorrect integration.
As a final improvement, we reconfigured the autoscaling mechanism on the AS1 cluster. After several tests, we identified a set of parameters that should help us react in similar cases in the future and keep the API gateway alive for a reasonable time window after a potential issue (such as a memory leak) is detected.
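Since the incident involved restarting pods, the cluster runs on Kubernetes; assuming a standard Horizontal Pod Autoscaler is used, the Go sketch below illustrates the kind of parameters such tuning touches: the scaling metric, the utilization target, and how aggressively scale-up is allowed to react. All names, thresholds, and replica counts here are placeholders rather than our actual configuration.

```go
package main

import (
	"fmt"

	autoscalingv2 "k8s.io/api/autoscaling/v2"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// Illustrative HPA spec: scale the gateway on memory utilization and
	// allow fast scale-up so additional replicas arrive before a leaking
	// process takes the whole deployment down. All values are placeholders.
	hpa := autoscalingv2.HorizontalPodAutoscaler{
		ObjectMeta: metav1.ObjectMeta{Name: "api-gateway"},
		Spec: autoscalingv2.HorizontalPodAutoscalerSpec{
			ScaleTargetRef: autoscalingv2.CrossVersionObjectReference{
				APIVersion: "apps/v1",
				Kind:       "Deployment",
				Name:       "api-gateway",
			},
			MinReplicas: int32Ptr(3),
			MaxReplicas: 20,
			Metrics: []autoscalingv2.MetricSpec{{
				Type: autoscalingv2.ResourceMetricSourceType,
				Resource: &autoscalingv2.ResourceMetricSource{
					Name: corev1.ResourceMemory,
					Target: autoscalingv2.MetricTarget{
						Type:               autoscalingv2.UtilizationMetricType,
						AverageUtilization: int32Ptr(70),
					},
				},
			}},
			Behavior: &autoscalingv2.HorizontalPodAutoscalerBehavior{
				// React immediately on scale-up instead of waiting out a
				// long stabilization window.
				ScaleUp: &autoscalingv2.HPAScalingRules{
					StabilizationWindowSeconds: int32Ptr(0),
					Policies: []autoscalingv2.HPAScalingPolicy{{
						Type:          autoscalingv2.PodsScalingPolicy,
						Value:         4,
						PeriodSeconds: 60,
					}},
				},
			},
		},
	}
	fmt.Printf("HPA sketch for %s: min=%d max=%d\n", hpa.Name, *hpa.Spec.MinReplicas, hpa.Spec.MaxReplicas)
}
```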
Summary
We understand how critical our infrastructure is for our customers’ businesses, so we will continue moving toward fully automated systems for handling this type of incident. Our goal is to minimize disruptions and outages for our customers, regardless of the origin of the issue.