Increased latency and platform errors
Incident Report for Voucherify
Postmortem

A post-mortem on yesterday morning's incident

We would like to share more details with our customers and readers about the outages that occurred yesterday morning, and what we are doing to prevent them from happening again.

Incident

Yesterday, March 15th, at 12:15 UTC, our systems detected increased request latency and a total outage of one of our major nodes responsible for processing customer data. The node manages the customer entity and all operations related to it. The entity is where you store properties about a specific customer - things like an email address, where they came from, or their age. These properties allow you to filter out specific segments of your customers later on. It is also closely related to the redemption API method.
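To illustrate the idea, here is a minimal sketch of a customer entity and segment filtering as described above. The field names (`email`, `source`, `age`) and the `segment` helper are illustrative assumptions, not the exact Voucherify schema or API.

```python
# Hypothetical customer records with a few stored properties.
# These field names are assumptions for illustration only.
customers = [
    {"email": "anna@example.com", "source": "newsletter", "age": 29},
    {"email": "ben@example.com", "source": "ads", "age": 41},
    {"email": "carla@example.com", "source": "newsletter", "age": 35},
]

def segment(customers, **criteria):
    """Return customers whose properties match all given criteria."""
    return [c for c in customers
            if all(c.get(key) == value for key, value in criteria.items())]

# Filter out a specific segment, e.g. everyone acquired via the newsletter.
newsletter_users = segment(customers, source="newsletter")
```

Segments like `newsletter_users` are what later power targeted campaigns and redemptions, which is why an outage of this node affected the redemption API as well.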

Impact on our customers

We saw a large increase in 503 errors. A 503 HTTP error indicates that our servers are unavailable. In this case, one of our services went down and did not recover after an automatic restart. At the same time, we identified an issue with routing connections to Heroku Dynos within the EU region. These problems caused increased latency and possible request timeouts, which ultimately resulted in reduced availability for a portion of API calls.
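Because 503s and timeouts of this kind are usually transient, API clients can soften their impact with retries and exponential backoff. The sketch below is a generic illustration of that pattern, not an official client or recommendation; `call_api` is a stand-in for a real HTTP request, and the retry counts and delays are assumptions.

```python
import time

def with_retries(call_api, max_attempts=3, base_delay=0.01):
    """Retry a failing call with exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return call_api()
        except RuntimeError:
            # Re-raise if this was the last allowed attempt.
            if attempt == max_attempts - 1:
                raise
            # Back off: delay doubles after each failed attempt.
            time.sleep(base_delay * (2 ** attempt))

# Example: a stand-in endpoint that fails twice (as if returning 503),
# then succeeds on the third attempt.
calls = {"count": 0}

def flaky_endpoint():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("503 Service Unavailable")
    return "ok"

result = with_retries(flaky_endpoint)
```

A client wrapped this way would have ridden out brief windows of unavailability during the incident instead of surfacing every 503 to the end user.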

A number of redemption API requests failed because of this issue. Moreover, the customer API was unavailable during that time.

Communication improvement

Communication in this kind of incident is crucial and, at the same time, difficult. Our customers understandably expect prompt and accurate information and want the impact to be contained as soon as possible. During this incident, we identified weaknesses in our communication: the incident was identified 4 hours after it first occurred, and our response time was not adequate. The delayed response was due to the incident starting outside our business hours, which are 6 AM to 10 PM CET.

We want to reassure you that we are taking steps to improve our communication, including the implementation of automated detection and mitigation systems that can react much more quickly than any human operator. We already have such systems in place, but they were operating only for crucial verticals. We had been actively testing their accuracy and efficacy before enabling them for the whole system yesterday evening.

We know how important it is to communicate on our status page. We heard from our customers and have taken the necessary steps to improve our communication. Our support team is working on improvements to how we update our status page and how we review its content for accuracy and transparency.

Summary

We understand how critical our infrastructure is to our customers’ businesses, so we will continue to move towards completely automated systems to deal with this type of incident. Our goal is to minimise disruptions and outages for our customers, regardless of the origin of the issue.

Posted Mar 16, 2017 - 16:11 CET

Resolved
This issue is now resolved.
Posted Mar 15, 2017 - 12:42 CET
Update
We have identified an issue with routing connections to servers within the EU region. Affected applications will experience increased latency on API requests. We are working to resolve this as soon as possible. No data has been lost.
Posted Mar 15, 2017 - 11:05 CET
Monitoring
We've been noticing DB connection issues for a couple of hours. The problem is resolved now and we're monitoring the platform.
Posted Mar 15, 2017 - 07:02 CET