On Friday the 7th December, Moltin experienced an incident that resulted in around 2.5 hours of outage and an additional 2 hours of degraded service.
One of our 3rd party database providers suffered an outage, and several small but critical parts of our API were tied to that provider. Because the incident affected some key parts of our platform, every request to and from the API was affected.
We ultimately lost no user data, and when requests did make it through our system, our internal integrity meant there were no discrepancies in things like order and transactional data.
We take performance and uptime of our APIs very seriously, and this incident is not reflective of the standard we strive to hit every day.
Thank you again for your patience during this incident. Moltin greatly values our customers and we appreciate your business. Please don't hesitate to reach out with any questions.
Witness (our uptime monitoring service) started alerting that healthcheck requests to multiple services in the API were failing. These alerts were forwarded to the monitoring channel in Slack and to the on-call engineer (Alex) via PagerDuty. An additional monitor was also triggered because the API error rate increased sharply.
The monitors were repeatedly triggered and then resolved. This was caused by the healthcheck failing and then passing, which showed that the failures were intermittent.
At 7.48 Alex noted that currencies & settings were marked as unhealthy. This was caused by the service-specific healthchecks failing and causing the apps to restart. At least some of the failed responses at this point would have been caused by the applications being unavailable during restarts rather than by the DB requests failing.
James created a StatusPage entry:
Investigating - We're investigating reports of increased error rates across the API. - Dec 7, 19:49 UTC
Alex noted that he'd tried to restart settings in order to get it back into the healthy state but this hadn't worked.
James redeployed settings & currencies with the healthchecks disabled in an attempt to get the services to stay up.
James updated monitoring to say we had identified the underlying issue as the database nodes managed by our 3rd party provider and that we had opened a support ticket. At this point we believe one node was experiencing issues. Generally the API can tolerate this, but we were getting lots of timeouts from database queries.
The StatusPage incident was updated to Identified with a severity of Major.
James redeployed the settings & currencies applications with a consistency level setting of ONE. This would allow the applications to work with only one working node. At this point, triggered alerts started to resolve automatically and we believed we were serving around 90% of requests successfully.
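To illustrate why dropping the consistency level to ONE let requests succeed again, here is a minimal sketch of the usual replica arithmetic. It assumes a Cassandra-style database with a replication factor of 3 and assumes the previous level was QUORUM; neither detail is stated in this write-up, and the function names are hypothetical.

```python
from enum import Enum


class Consistency(Enum):
    ONE = "ONE"
    QUORUM = "QUORUM"
    ALL = "ALL"


def replicas_required(level: Consistency, replication_factor: int) -> int:
    """Number of replicas that must respond for a request to succeed."""
    if level is Consistency.ONE:
        return 1
    if level is Consistency.QUORUM:
        # A quorum is a strict majority of the replicas.
        return replication_factor // 2 + 1
    return replication_factor  # ALL: every replica must respond


def can_serve(level: Consistency, replication_factor: int, healthy_replicas: int) -> bool:
    """True if enough replicas are healthy to satisfy the consistency level."""
    return healthy_replicas >= replicas_required(level, replication_factor)


if __name__ == "__main__":
    rf = 3  # assumed replication factor, not taken from the incident
    for healthy in (3, 2, 1):
        print(
            f"healthy={healthy} "
            f"QUORUM ok={can_serve(Consistency.QUORUM, rf, healthy)} "
            f"ONE ok={can_serve(Consistency.ONE, rf, healthy)}"
        )
```

Under these assumptions, a quorum read stops working once two of three replicas are unhealthy, while a read at ONE keeps working as long as a single replica responds. The trade-off is that a read at ONE may return data that has not yet reached the replica being queried.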
StatusPage incident updated:
We have mitigated the issue and most requests are being served successfully now. We will continue to work on resolving the root cause.
Multiple Witness & Insanitarium alerts started triggering again. We identified that a second node was experiencing issues. At this point we were still waiting for a response from our 3rd party provider for the original ticket (opened at 8pm).
StatusPage incident was upgraded to major:
This issue is still ongoing and is affecting multiple stores and endpoints. We are working to resolve as quickly as possible.
At this point we were essentially helpless until our 3rd party provider responded to our ticket.
The currencies service had stopped and could not restart as it was unable to connect to the database at all.
James responded to his original support ticket with our 3rd party provider.
James sent a support email directly to our 3rd party provider rather than using their ticketing system.
James noted in the engineering-war-room channel that the node had started showing signs of attempting to come back online.
Our 3rd party provider's support team responded to the original ticket (created at 8pm):
Hi James - We're looking into it. Will update when we have more information.
James noted that judging by the memory usage on the nodes, they were being restarted.
Alex noted that he was getting successful responses from the orders service. James noted that catalogue was also working.
Our 3rd party provider responded with their explanation for the issue:
SSTable write was failing on the primary. We did a rolling restart of all nodes, which resolved the issue.
We will be working to implement cache changes to reduce the impact if our 3rd party provider fails again.
We will be working into Q1 2019 to remove our 3rd party provider from our stack.
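The cache changes mentioned above are not specified in detail. One possible shape, sketched here purely as an illustration, is a read-through cache that falls back to the last known value when the database call fails. The class name, the `loader` callable and the TTL are all hypothetical, and this approach only suits slowly changing data such as currencies and settings.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Read-through cache that serves stale values if the loader fails."""

    def __init__(
        self,
        loader: Callable[[str], Any],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader  # hypothetical function that queries the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        # Fresh entry: answer from the cache without touching the database.
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        try:
            value = self._loader(key)
        except Exception:
            # Database unavailable: prefer a stale answer over an error.
            if entry is not None:
                return entry[1]
            raise  # nothing cached for this key, so the failure surfaces
        self._entries[key] = (now, value)
        return value
```

With something like this in front of the database, a provider outage would only surface as errors for keys that had never been read before, at the cost of possibly serving outdated values for the duration of the outage.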
It took 20 mins to communicate externally that we knew what the issue was, and I know we had identified it faster than that. So there's definitely an item to improve on.
Redeploying with a lower consistency level was a good, quick-thinking fix. We should get to that quicker in future.
Apart from that, there was a huge amount of frustration at the issue being outside of our control.