We would like to share more details with our customers and readers on the internet outages that occurred yesterday morning, and what we are doing now to prevent these from happening again.
Yesterday, March 15th, at 12:15 UTC, our systems detected increased request latency and a total outage of one of our major nodes responsible for processing customer data. The node manages the customer entity and all operations related to it. The entity is where you store properties about a specific customer: things like an email address, where they came from, and their age. These properties allow you to filter out specific segments of your customers later on. The customer entity is closely tied to the redemption API method.
We saw a big increase in 503 errors. A 503 HTTP error indicates that our servers are unavailable. In this case, one of our services went down and did not recover after an automatic restart. At the same time, we identified an issue with routing connections to Heroku Dynos within the EU region. These problems caused increased latency and possible request timeouts, which ultimately resulted in reduced availability for some API calls.
As a result, a number of redemption API requests did not go through. Moreover, the customer API was unavailable during that time.
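For readers integrating with an API, transient 503 responses like those described above are commonly handled on the client side with retries and exponential backoff. The following is only a minimal sketch of that general pattern, not our SDK or an official recommendation: `call_with_backoff`, its parameters, and the `send` callable (assumed to return a `(status_code, body)` tuple) are hypothetical names introduced for illustration.

```python
import random
import time
from typing import Callable, Tuple

# Status codes treated as transient in this sketch (an assumption).
RETRYABLE_STATUS = {503}


def call_with_backoff(
    send: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str]:
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a hypothetical callable that performs one request and
    returns (status_code, body).
    """
    status, body = send()
    for attempt in range(1, max_attempts):
        if status not in RETRYABLE_STATUS:
            break
        # Exponential backoff capped at max_delay, with full jitter so that
        # many clients do not retry at the same instant.
        delay = min(max_delay, base_delay * 2 ** (attempt - 1))
        sleep(random.uniform(0, delay))
        status, body = send()
    return status, body
```

Retrying is only appropriate for requests that are safe to repeat. A request that changes state, such as a redemption, would need some form of idempotency guarantee before being retried automatically.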
Communication during this kind of incident is both crucial and difficult. Our customers understandably expect prompt and accurate information and want the impact to be stopped as soon as possible. In this incident, we identified weaknesses in our communication: the incident was identified 4 hours after the first occurrence, and our response time was not adequate. The delayed response was due to our business hours, which run from 6 AM to 10 PM CET.
We want to reassure you that we are taking all the steps needed to improve our communication, including the implementation of automated detection and mitigation systems that can react much more quickly than any human operator. We already have such systems in place, but until now they operated only for crucial verticals. We had been actively testing their accuracy and efficacy before turning them on for the whole system yesterday evening.
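To give a rough idea of what automated detection can look like, here is a minimal sketch of a sliding-window error-rate check. It is an illustration under stated assumptions, not a description of our production system: the `ErrorRateMonitor` class, its thresholds, and the choice to count any 5xx status as an error are all hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses over a sliding time window."""

    def __init__(self, window_seconds=60.0, threshold=0.2, min_requests=20):
        self.window_seconds = window_seconds
        self.threshold = threshold        # alert when error share >= threshold
        self.min_requests = min_requests  # avoid alerting on tiny samples
        self._events = deque()            # (timestamp, is_error) pairs
        self._errors = 0

    def record(self, timestamp, status_code):
        """Register one response observed at the given time (in seconds)."""
        is_error = status_code >= 500
        self._events.append((timestamp, is_error))
        self._errors += int(is_error)
        self._evict(timestamp)

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] <= cutoff:
            _, was_error = self._events.popleft()
            self._errors -= int(was_error)

    def should_alert(self, now):
        """Return True when the error share in the window crosses the threshold."""
        self._evict(now)
        total = len(self._events)
        if total < self.min_requests:
            return False
        return self._errors / total >= self.threshold
```

A check like this runs continuously, so it can raise an alert or trigger mitigation within a window's length of an outage starting, regardless of whether anyone is on shift.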
We know how important it is to communicate on our status page. We heard from our customers and took the necessary steps to improve our communication. Our support team is working on improvements in how we update our status page and how we review the content for accuracy as well as transparency.
We understand how critical our infrastructure is for our customers’ businesses, so we will continue to move towards completely automated systems to deal with this type of incident. Our goal is to minimise disruptions and outages for our customers regardless of the origin of the issue.