US Platform Outage
Incident Report for Blitline
Postmortem

Postmortem (Severely degraded service - 1/25/2020 5:45am ~ 11:45am)

Background:

We run a federated system of queues from Docker images on Ubuntu machines.

These queues are used by the front end (web) and backend (workers) to handle our asynchronous workflow.

Problem:

Although the true underlying culprit is not entirely clear, there is a series of events and reactions we can trace to isolate the problem.

It appears there was a recent Ubuntu security update (which we automatically apply) to the machines that hold our queues. This update appeared to cause our Docker containers to slowly use more and more IO bandwidth. Although our queues are never under even medium load, this ever-increasing IO started to slow down our queue boxes, which led to slower and slower network responses from the queues. Unfortunately, they never reached "unreachable" or "down" status; they were transferring data at about 1-8K per second. This is a pretty poisonous situation, particularly when all the queues are behaving similarly: the connections wouldn't fail to connect, they would just take a very long time.
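For reference, here is a minimal sketch of the kind of watchdog that would flag this sort of gradual IO creep on a queue box. It assumes psutil is available; the threshold and interval are illustrative values, not numbers from our production setup.

```python
# Minimal sketch: watch for creeping disk IO on a queue box.
# Assumes psutil is installed; threshold and window are illustrative.
import time
import psutil

WINDOW_SECONDS = 60
ALERT_BYTES_PER_SEC = 50 * 1024 * 1024  # hypothetical 50 MB/s ceiling

def io_bytes():
    # Total bytes read and written across all disks since boot.
    counters = psutil.disk_io_counters()
    return counters.read_bytes + counters.write_bytes

def watch():
    previous = io_bytes()
    while True:
        time.sleep(WINDOW_SECONDS)
        current = io_bytes()
        rate = (current - previous) / WINDOW_SECONDS
        if rate > ALERT_BYTES_PER_SEC:
            print(f"IO rate {rate / 1e6:.1f} MB/s exceeds threshold; investigate this box")
        previous = current

if __name__ == "__main__":
    watch()
```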

Our code allows an individual connection to a queue to time out after a reasonable amount of time (10 seconds or so) and try a different queue. Unfortunately, this failover simply had the same outcome. As a result, every job coming in was taking up a single connection for at least 30 seconds, and thus at high load exhausting all the connections available for processing API events (and web site connections as well).
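To illustrate why slow-but-reachable queues are worse than dead ones, here is a minimal sketch of the failover pattern described above. The host names, port, and timeout are hypothetical, and this is not our actual client code.

```python
# Minimal sketch of timeout-plus-failover across queue hosts.
# Host names, port, and timeout are hypothetical.
import socket

QUEUE_HOSTS = ["queue-1.internal", "queue-2.internal", "queue-3.internal"]
CONNECT_TIMEOUT = 10  # seconds per attempt, as described above

def send_job(payload: bytes) -> bool:
    # Try each queue in turn. The timeout applies to both the connect and
    # the send. If every queue is slow rather than down, this call can hold
    # a worker connection for len(QUEUE_HOSTS) * 10 seconds, which is how
    # the connection pool gets exhausted under load.
    for host in QUEUE_HOSTS:
        try:
            with socket.create_connection((host, 5672), timeout=CONNECT_TIMEOUT) as sock:
                sock.sendall(payload)
                return True
        except OSError:
            continue  # fail over to the next queue
    return False
```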

The assumption that was at fault was that, with multiple queue machines available and ready for failover, at least one would work as expected. The slow IO creep slowed all queue machines to a near standstill. Even when we apply queue code updates, we stagger them to make sure they don't cause a universal problem. We assumed the Ubuntu updates were safe to install universally (whenever). This appears not to be the case.

Solution:

The immediate solution (which ultimately brought everything back online) was a simple reboot of the queue boxes. In hindsight, this probably should have been one of our immediate responses, but we were busy looking for underlying causes, and it took a while to recognize the IO problems, as we were initially chasing DNS or network problems as the culprit, which they ultimately were not.

Prevention:

Currently, our best course of action to prevent similar problems in the future is to schedule Ubuntu updates at different time periods during the day (or days), allowing problems to manifest on one machine rather than all machines. A single failing queue will be taken out of the rotation and can be dealt with individually, without any failures, as in the sketch below.
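As a sketch of what "taken out of the rotation" could look like, here is a simple latency-based health check. The probe port, latency limit, and host handling are illustrative assumptions rather than our production implementation.

```python
# Minimal sketch of pulling a slow queue out of rotation via a TCP probe.
# Port, latency limit, and host list are illustrative.
import socket
import time

LATENCY_LIMIT = 2.0   # seconds; a queue slower than this is pulled from rotation
PROBE_PORT = 5672     # hypothetical queue port

def healthy(host: str) -> bool:
    # A queue that is down or too slow to accept a connection fails the probe.
    start = time.monotonic()
    try:
        with socket.create_connection((host, PROBE_PORT), timeout=LATENCY_LIMIT):
            pass
    except OSError:
        return False
    return (time.monotonic() - start) < LATENCY_LIMIT

def active_rotation(all_hosts: list[str]) -> list[str]:
    # Because updates are staggered, a bad update should only trip this
    # check on one box at a time, leaving the rest of the pool serving traffic.
    return [host for host in all_hosts if healthy(host)]
```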

Unknowns:

We still have no idea why the Ubuntu update caused the IO problems in the first place, or why the reboot seemed to fix them. We will continue to watch this, but the effort spent trying to identify the underlying cause may be better spent mitigating the issue instead.

Apologies:

We offer this postmortem as both an explanation and a promise to always try to do better and improve the stability and performance of Blitline.com. This is not an excuse: we are sorry this happened, and we were all hands on deck to get Blitline back online and working as expected. Unfortunately, it's not always possible to maintain 100% uptime (although we try really hard to be a bunch of decimal places near that).

  • Blitline Team
Posted Jan 30, 2020 - 22:35 UTC

Resolved
US Platform outage (Degraded performance)
Posted Jan 18, 2020 - 05:30 UTC