Photo by Daniel Watson

Automation tools make our lives easier.  They provide templates and provisioning instructions to build and maintain systems with ease, and they can save hours or even days' worth of repetitive work.  This is a story of how I fucked up and automation backfired on me.

Friday night was much like any other for me: drinking a couple of whiskeys while I steadily smash away at my keyboard, making improvements and discoveries in my homelab.  As I was poking around, I noticed it had been a few weeks since I last updated Ubuntu across the fleet of 8 servers that make up my Docker Swarm cluster.  Here's a basic diagram to help visualize the cluster.

Because doing this by hand takes a while, which is how I've been handling system upgrades so far, I decided to whip up a quick Ansible playbook to handle it for me.  It's a simple thing that looks like this:


- hosts: all
  become: true

  tasks:
    - name: Pass options to dpkg on run
      apt:
        upgrade: dist
        autoremove: yes
        update_cache: yes
        cache_valid_time: 3600

    - name: Reboot the server and wait for it to come back up.
      reboot:

First, I make sure I can connect to one of my Odroid HC-1s using the ping module.

ansible venus -m ping                                                                                    
venus | SUCCESS => {
    "changed": false,
    "ping": "pong"
}
Alright, let's test this thing out and see what happens!

ansible-playbook -l venus playbook.yml                                                                   

PLAY [all] ********************************************************************************************************************

TASK [Gathering Facts] ********************************************************************************************************
ok: [venus]

TASK [Pass options to dpkg on run] ********************************************************************************************
ok: [venus]

TASK [Reboot the server and wait for it to come back up.] *********************************************************************
changed: [venus]

PLAY RECAP ********************************************************************************************************************
venus                      : ok=3    changed=1    unreachable=0    failed=0

Excellent! Looks like it worked just as intended.  The reboot module is pretty fantastic.  With a verified test, I decide it's ready to run against 7 of the 8 machines at once.  I'm running the command from one of the Renegades (rocks), which is omitted from the run list with lab:!rocks.  Some would consider this to be the fuck up, and in a production environment they would be right.  But since this is a lab environment where I purposely like to break things in new ways, I weigh my odds and go forth with the upgrade.  I have all my important Docker containers mounted to a Gluster volume distributed across the 3 Renegade machines, so HA will save my ass if anything goes wrong, right?  So, here it goes!

ansible-playbook -l 'lab:!rocks' playbook.yml
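For reference, the lab:!rocks limit pattern works against an inventory shaped roughly like this (hostnames other than venus, rocks, and shred are hypothetical placeholders, not my actual inventory):

```ini
[renegades]
rocks
shred
renegade3

[odroids]
venus
odroid2
odroid3

[lab:children]
renegades
odroids
```

The :! syntax tells Ansible to take everything in the lab group and subtract the rocks host, so the machine driving the run doesn't reboot itself mid-play.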

And I wait for it to do its thing...

I see it chugging along for a good 10 minutes without exiting, so I start to get concerned.  Some pings and checks later, I see that the Plex/NFS/(so many other things) server is back up and running fine, and the Elasticsearch master server is powered back up and working.  I verify that the 3 Odroids have rebooted and are functioning as Elasticsearch data nodes again.  Most of the Docker services that were running came back up without issue.

However, nothing that had been running on the 3 Renegade boards was coming back up.  That included Traefik, Prometheus, Grafana, Kibana, this Ghost blog, a MySQL container used as the back-end for this blog, another site where I've been learning Phaser3 game development, and the monitoring services that run as global (Node Exporter and cAdvisor).  The really shitty part of 2 of the 3 boards being down is that those 3 boards are the only managers in the swarm cluster, and 1 manager out of 3 is not enough for a functioning quorum (Swarm's Raft consensus needs a majority, so 2 of 3).  On top of that, they're the 3 boards that provide the Gluster volume, and without at least 2 of the 3 it won't function until I get one more of them working again.  This failed Gluster volume is why none of the Docker services I expected to be running were up.

First things first: I promote two of the Docker workers to managers.  But in order to do so, I first have to reinitialize the swarm by forcing a new cluster from the one functioning manager.

docker swarm init --force-new-cluster	
docker node promote worker_1
docker node promote worker_2

Great!  Now I have a functioning manager quorum again.  On to recovering the Renegades.  I start by backing up the only still-alive Renegade board (rocks) to my NFS server.  I have Jenkins installed directly on the board, so if I lose it, I have to start over from scratch with provisioning most everything in my lab.  I copy /var/lib/jenkins and /etc and tar them up safely on the storage server.  Everything else I'm confident I can restore with Ansible playbooks after a fresh install.  Then I start an rsync -av of /bricks/brick1 to my storage server, which is the entire contents of the Gluster partition from rocks and will give me a way of refreshing the Gluster volume if the disks don't mount for some reason after a reinstall of the OS.  About 3 minutes into this rsync, my SSH connection to the board locks up... OH CRAP!
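The rocks backup boiled down to tarring the critical directories into a dated archive on the NFS mount.  A minimal sketch (the function name, destination path, and archive naming are my illustration here, not the exact commands I ran):

```shell
#!/usr/bin/env bash
# Sketch: tar the critical dirs from rocks into one dated archive.
# The destination path and naming scheme are assumptions for illustration.
backup_dirs() {
  local dest="$1"; shift
  tar -czf "$dest/rocks-$(date +%Y-%m-%d).tar.gz" "$@"
}

# e.g. backup_dirs /storage/backups /var/lib/jenkins /etc
```

One dated tarball per run means yesterday's copy survives even if today's backup is interrupted, which matters when the box you're backing up might die mid-transfer.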

All three of the Renegades are now dead and not responding after multiple reboots.  I'm too lazy to undo the case they're mounted in and hook them up to a monitor via HDMI to see what is going on, so I decide it's just as fast to do a clean install, because I can easily have the boards back up and running in no time.  I assume the eMMC module used as the data drive will mount to the newly installed OS just fine and shouldn't be an issue.  So, I do exactly that.  I flash all three SD cards with the latest version of Armbian Ubuntu 18.04, bootstrap them, and bring them back into the Swarm cluster without failure.

Because Jenkins was running standalone on rocks, I decided this time around I would not install it on a single board and would instead run it as a service in Docker Swarm.  That way it will come back up if anything like this happens again.  That was pretty easy.  I create a docker-compose.yml file to define what I want.  I tell it to open up port 8080 so I can reach it from my laptop, and I also tell it how to connect to Traefik.  I mount the Jenkins home directory so I can use my backup as the connected volume, and it should come right back up (it didn't, and I had to start from scratch anyway, but it was worth a shot...).  I place a constraint on the service to have it only run on shred for now.


version: '3.5'

services:
  jenkins:
    image: jahrik/jenkins:x86_64
    ports:
      - 8080:8080
      - 50000:50000
    networks:
      - traefik
    volumes:
      - /nfs_share/jenkins:/var/jenkins_home
    deploy:
      labels:
        traefik.enable: "true"
        traefik.port: "8080"
        traefik.backend: jenkins
      placement:
        constraints:
          - node.hostname == shred

networks:
  traefik:
    name: traefik
    external: true

I also need Ansible installed on this image, so I start with the jenkins/jenkins:2.164 image in a Dockerfile and do just that.


FROM jenkins/jenkins:2.164

USER root

RUN apt-get update \
  && apt-get -y install software-properties-common
RUN echo "deb http://ppa.launchpad.net/ansible/ansible/ubuntu trusty main" | tee -a /etc/apt/sources.list
RUN apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 93C4A3FD7BB9C367
RUN apt -y update
RUN apt -y install ansible

USER jenkins

Finally, I put together a quick Makefile to drive it.

IMAGE = "jahrik/jenkins"
TAG := $(shell uname -m)
STACK = jenkins

all: build

build:
	@docker build -t $(IMAGE):$(TAG) -f Dockerfile_$(TAG) .

push:
	@docker push $(IMAGE):$(TAG)

deploy:
	@docker stack deploy -c docker-compose.yml $(STACK)

.PHONY: all build push deploy

Building and deploying my new Jenkins service.

make build
make push
make deploy

The next step is to recover the Gluster volume and remount all the service containers that use it.  This is where the real fuck up happened!  I handle the Gluster setup with an Ansible playbook designed to take a group of 3 or more servers, provision them with the right partitions and directories, install the software needed, edit /etc/fstab, and mount the volume on all the clients.  After this playbook runs, it's safe to run it again, because it will see that these things already exist and, in Ansible's idempotent fashion, will not overwrite them.  I have run this playbook countless times across this cluster to ensure that mounts come back up on the clients, and I felt good about it.  What I forgot to take into account after re-installing the OS on the three Renegade boards was that I had not remounted the eMMC modules manually, so Ansible saw them as brand-new drives that needed to be partitioned and formatted as ext4, and that's exactly what happened when I ran this playbook...

#     - name: "Create an ext4 filesystem"
#       filesystem:
#         dev: "{{ item }}"
#         fstype: ext4
#         # force: yes
#       with_items:
#         - "{{ part }}"
#       tags:
#         - setup

Since then, I have commented out the offending tasks and intend to keep them separate from the rest of my Gluster playbook.  Lesson learned.
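If I ever fold it back in, one way to guard the task is to check for an existing filesystem first and only format a truly blank device.  A sketch (the blkid_out register name is my own; part matches the variable from the task above):

```yaml
- name: Check for an existing filesystem on the brick partition
  command: blkid -o value -s TYPE {{ part }}
  register: blkid_out
  failed_when: false
  changed_when: false

- name: Create an ext4 filesystem only if the partition is blank
  filesystem:
    dev: "{{ part }}"
    fstype: ext4
  when: blkid_out.stdout == ""
```

blkid prints the filesystem type when one exists and nothing when the device is blank, so the format task is skipped on any drive that already carries data.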

So, what did I just lose, and how do I recover?  I lost Prometheus data, which I don't really care that much about; it will regenerate.  I lost all my Grafana data, including user info and dashboards.  Not a huge deal, because I was using mostly pre-configured stuff, but still quite a few hours of tweaking.  Same goes for Kibana and the countless hours I spent creating graphs.  I lost my MySQL volume, which was mostly lab databases for testing and learning, but the weekend prior I had just migrated this Ghost blog to use MySQL as the back-end, and that's one of the things that hurt the most.  I had worked on updating it for a good couple of days afterwards, updating pictures and posts along the way.  I hadn't bothered to back up or export any of that work since then, and all those changes were lost.  Traefik's SSL certs were lost too, but Let's Encrypt made it easy to renew those.

Luckily, I've been making it a habit to solidify most of my configuration setups in source control, and it was as simple as rebuilding Jenkins jobs to redeploy these services to the swarm and get them back online.  The main lesson I'm taking away from this experience is that I can't trust an HA setup the same way I can trust a backup.  I need to start making regular backups of this stuff or this will happen again.  So that's just what I did.  I'm starting with a basic bash script to run as a cron job on my NFS box, backing up my Gluster volume nightly from /mnt/g1.

#!/usr/bin/env bash
# backup /mnt/g1 to NFS share
SRC="/mnt/g1/"
SYNC="/backups/g1"          # staging copy (path assumed)
DEST="/backups/archives/"   # dated tarballs land here (path assumed)
FILE="g1-$(date +%Y-%m-%d).tar.gz"

# Ensure directories exist
mkdir -p "$SYNC" "$DEST"

# Record start time by epoch second
START=$(date '+%s')

if ! rsync -av --delete "$SRC" "$SYNC"; then
  STATUS="rsync failed"
elif ! tar -czf "$FILE" "$SYNC"; then
  STATUS="tar failed"
elif ! mv "$FILE" "$DEST"; then
  STATUS="mv failed"
else
  STATUS="success: size=$(stat -c%s "$DEST$FILE") duration=$(($(date '+%s') - START))"
fi

# Log to system log; handle this using syslog(8)
logger -t backup "$STATUS"
echo "$STATUS"
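To make it nightly, a crontab entry on the NFS box along these lines does the job (the script path, schedule, and log file are my choices here, not fixed values):

```
# Run the Gluster backup script at 02:30 every night
30 2 * * * /usr/local/bin/backup-g1.sh >> /var/log/backup-g1.log 2>&1
```

Redirecting stdout and stderr to a log file means a failed run leaves evidence behind instead of disappearing silently, which is the whole point after this weekend.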

I am also working on an offsite backup solution and considering Backblaze, but haven't quite gotten it working from the CLI yet.

In conclusion: make regular backups, and don't trust an HA setup like you would a backup, or you'll spend your weekend working instead of playing like you should be.