On the morning of 9/12/18 we experienced an outage between 9:09 and 10:00 am PST.
Around 9:14 we noticed that we were getting outage notifications from our monitoring services and began diagnosis and mitigations.
It appeared at the time that our web services (both EU and US) were not responding to requests. Our automated restarts were not restarting
the services, and other support services (queues and some datastores) were not responding either. Manual attempts at restarting services and machines
all failed or hung. We were able to isolate the issue as being related to our logging platform and were able to restart that service and bring everything
back online at or around 10:00am.
The underlying cause was our dependency on Docker containers for deployments. We lean heavily on container technology, and in doing so had
overlooked an underlying failure point for Docker containers: their built-in logging support. As a somewhat obscure feature(?) of Docker,
if logging to a remote system loses its connection, Docker hangs without outputting an error. Attempts to restart the container also hang, with no error message written to
syslog or elsewhere. Even our attempts to stop containers failed for the same reason. So as an important community 'heads up', it's probably best
not to use Docker's built-in logging, OR to make sure that your logging platform will NEVER become unavailable; otherwise you will lose the ability to run a Docker ecosystem.
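Because the hang described above produces no error output, one way to notice it is to run a cheap command with a hard timeout and treat "no answer in time" as its own failure state. The Python sketch below is only an illustration under assumptions, not what we run in production: the `probe` helper, the 5-second timeout, and the choice of `docker ps` as the probe command are all hypothetical, and a hung logging driver may not stall every Docker command in the same way.

```python
import subprocess


def probe(cmd, timeout_s=5.0):
    """Run cmd and classify the outcome as 'ok', 'failed', or 'hung'.

    'hung' means the command produced no exit status within timeout_s,
    which is the silent failure mode described above.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # subprocess.run kills the child on timeout
        )
    except subprocess.TimeoutExpired:
        return "hung"
    except OSError:
        # The command could not be started at all (e.g. binary not found).
        return "failed"
    return "ok" if result.returncode == 0 else "failed"


# Hypothetical usage (assumes the docker CLI is on PATH):
#   status = probe(["docker", "ps"], timeout_s=5.0)
#   if status == "hung":
#       ...  # alert a human; automated restarts may hang as well
```

Reporting "hung" separately from "failed" matters here, since automated restarts that themselves hang will never report an error on their own.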
Mitigation:
Apologies:
To those who lost data or were unable to process data on Blitline's service, we apologize. We always take outages very seriously and work not only to resolve them quickly but also
to put in place mitigations that prevent similar issues and processes that identify and resolve problems even more quickly.