Outage
Incident Report for Blitline
Postmortem

On the morning of 9/12/18 we experienced an outage between 9:09 and 10:00 am PST.

Around 9:14 we began receiving outage notifications from our monitoring services and started diagnosis and mitigation.
At the time it appeared that our webservices (both EU and US) were not responding to requests. Our automated restarts were not restarting
the services, and other supporting services (queues and some datastores) were not responding either. Manual attempts at restarting services and machines
all failed or hung. We isolated the issue to our logging platform, restarted that service, and brought everything
back online at around 10:00 am.

The underlying cause was our dependency on Docker containers for deployments. We lean heavily on container technology, and in doing so had
overlooked a failure point built into Docker containers: their built-in logging support. It is a somewhat obscure behavior, but
if Docker's logging to a remote system loses its connection, Docker hangs without outputting an error. Attempts to restart the container hang as well, with no error message
output to syslog or anywhere else. Even our attempts to stop containers failed for the same reason. So, as an important community heads-up: either do not
use Docker's built-in remote logging, or make sure that your logging platform will NEVER become unavailable, because otherwise you can lose the ability to run a Docker ecosystem.
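
For illustration, here is roughly what the at-risk setup looks like, along with the non-blocking delivery mode that Docker offers as an alternative. This is a minimal sketch, assuming the syslog driver and a hypothetical logs.example.com endpoint; it is not our exact configuration.

    # Remote logging wired directly into Docker (the failure mode described
    # above: if logs.example.com stops answering, docker commands against
    # this container can hang):
    docker run --log-driver syslog \
      --log-opt syslog-address=tcp://logs.example.com:514 \
      my-image

    # Docker's non-blocking delivery mode buffers log lines in memory and
    # drops them if the buffer fills, rather than blocking the container
    # when the endpoint is slow or down:
    docker run --log-driver syslog \
      --log-opt syslog-address=tcp://logs.example.com:514 \
      --log-opt mode=non-blocking \
      --log-opt max-buffer-size=8m \
      my-image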

Mitigations:

  1. We have moved away from Docker's built-in logger to a separate service that ships our logs to our logging platform, thus bypassing the 'block'
    Docker imposes when logging becomes unavailable.
  2. We have added a 'log ping' that notifies us, ahead of time, when our logging platform appears to be unavailable or slow to respond
    to network calls (a sketch of the idea follows this list).
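
For illustration, a minimal sketch of the 'log ping' idea in Python, assuming a TCP logging endpoint and a hypothetical alert() hook standing in for a real notification integration; this is not our production code.

    # log_ping.py -- periodically check that the logging platform is
    # reachable and fast, and raise an alert before it can take Docker down.
    import socket
    import time

    LOG_HOST = "logs.example.com"   # hypothetical logging endpoint
    LOG_PORT = 514
    TIMEOUT_SECS = 2                # treat anything slower as degraded
    INTERVAL_SECS = 30

    def alert(message):
        # Stand-in for a real pager / notification integration.
        print("ALERT:", message)

    def ping_log_platform():
        start = time.monotonic()
        try:
            # Open (and immediately close) a TCP connection to the endpoint.
            with socket.create_connection((LOG_HOST, LOG_PORT), timeout=TIMEOUT_SECS):
                pass
        except OSError as exc:
            alert("logging platform unreachable: %s" % exc)
            return
        elapsed = time.monotonic() - start
        if elapsed > TIMEOUT_SECS / 2:
            alert("logging platform slow: %.2fs to connect" % elapsed)

    if __name__ == "__main__":
        while True:
            ping_log_platform()
            time.sleep(INTERVAL_SECS)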

Apologies:

To those who lost data or were unable to process data on Blitline's service: we apologize. We take outages very seriously and work not only to resolve them quickly but
to put in place mitigations that prevent similar issues, and processes that identify and resolve problems even faster.

Posted Sep 19, 2018 - 17:13 UTC

Resolved
Systems have stabilized.
Posted Sep 12, 2018 - 20:41 UTC
Monitoring
We have mitigated the issue and are investigating the underlying cause and further mitigations.
Posted Sep 12, 2018 - 17:30 UTC
Investigating
We experienced downtime between approximately 9:09 and 10:00 am this morning. We have restored service and are investigating the cause.
Posted Sep 12, 2018 - 17:06 UTC
This incident affected: API, Blitline Website, and Backend Cloud Servers.