Earlier this month my Home Assistant installation broke: during an auto-update something went wrong and the system was left in a state from which I could not bring it back. I am running Home Assistant Supervised on a Debian VM, and there are some warnings about this kind of setup.

Well, after some attempts to bring the service back to life I decided to take this as an opportunity to learn a bit about some automation tools. In the next section I’ll describe which tools and why, and then the steps I used to provision my instance of Home Assistant.

I’ll show several iterations of the scripts I used; if you are only interested in the final scripts, scroll to the end ;)

The right tool for the right job

At first I intended to use Terraform both to create the VMs and to install Home Assistant, but after reading a little about the tool it became clear that was not a good idea: the installation part did not fit well with what I had just read. That is where Ansible comes into play.

Both tools are used for automation, and although there are comparisons between them and even ways to do the same thing with either (I found tutorials on creating a K8s cluster using both), for my use case I decided to create the VM using Terraform (setting VM specs, attached network interfaces, etc) and to set up Home Assistant using Ansible (installing dependencies and the software itself, and restoring my backup).

Provisioning the VMs with Terraform

There is no official libvirt provider for Terraform at the time of this writing, but there is a community one! Using this provider I took my first steps, and with very little effort I was able to spin up a Debian VM. Following this example, I ended up with this tf file:

terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

provider "libvirt" {
  uri = "qemu+ssh://[email protected]/system"
}

resource "libvirt_volume" "ubuntu-qcow2" {
  name   = "homeassistant-qcow2"
  source = "http://cloud.debian.org/images/cloud/bullseye/latest/debian-11-genericcloud-amd64.qcow2"
  format = "qcow2"
}

data "template_file" "user_data" {
  template = file("${path.module}/cloud_init.cfg")
}

data "template_file" "network_config" {
  template = file("${path.module}/network_config.cfg")
}

resource "libvirt_cloudinit_disk" "commoninit" {
  name           = "commoninit.iso"
  user_data      = data.template_file.user_data.rendered
  network_config = data.template_file.network_config.rendered
}

resource "libvirt_domain" "homeassistant" {
  name      = "test-ha"
  vcpu      = 1
  memory    = "1024"

  cloudinit = libvirt_cloudinit_disk.commoninit.id

  network_interface {
    network_name = "default"
  }

  console {
    type        = "pty"
    target_port = "0"
    target_type = "serial"
  }

  console {
    type        = "pty"
    target_type = "virtio"
    target_port = "1"
  }

  disk {
    volume_id = libvirt_volume.ubuntu-qcow2.id
  }
}

I’ll cover the cloud_init.cfg and network_config.cfg files we just saw above in a while. Turning this file into a running VM, however, hit a few problems.

First, I was using Debian 10 to run terraform, and it did not have a symlink to mkisofs, which is necessary to generate the cloud-init ISO. There is a discussion on the libvirt provider's GitHub about this issue, but it was also marked as a bug on Debian, and all I needed to do was create a symlink from xorriso to mkisofs. From what I tested, this is already fixed on Debian 11; I do not know if a fix is on the way for 10.
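
On my Debian 10 host the workaround was a single symlink, something along these lines (exact binary names and paths may differ on your system):

# xorrisofs is xorriso's mkisofs-compatible frontend; expose it under the name the provider expects
sudo ln -s "$(command -v xorrisofs)" /usr/local/bin/mkisofs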

Once the VM was successfully created, the next problem was booting it up. For some reason AppArmor was blocking the hypervisor from using an image from a pool, but not when a file was referenced directly. The quick and dirty fix is described here, and it basically amounts to turning AppArmor off for libvirt. I'm curious whether there is a proper configuration to fix this, but I could not find one in the short time I searched; I was running libvirt 5.0.0 the first time I tested, and on libvirt 7.0.0 the issue is still there.
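
The "turn it off" workaround boils down to telling libvirt not to use a security driver for QEMU, roughly like this (fine for a homelab, not something I'd recommend elsewhere):

# /etc/libvirt/qemu.conf
security_driver = "none"

# then restart the daemon so the change takes effect
sudo systemctl restart libvirtd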

For a bit more detail, in the logs we could see:

Sep 12 19:03:51 homelab kernel: audit: type=1400 audit(1631484231.392:101): apparmor="DENIED" operation="open" profile="libvirt-a2a6e813-e14e-4d5c-9981-47adc7b5dca6" name="/home/carlos/volumes/homeassistant-qcow2" pid=8178 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=64055 ouid=64055
Sep 12 19:03:51 homelab kernel: audit: type=1400 audit(1631484231.392:102): apparmor="DENIED" operation="open" profile="libvirt-a2a6e813-e14e-4d5c-9981-47adc7b5dca6" name="/home/carlos/volumes/homeassistant-qcow2" pid=8178 comm="qemu-system-x86" requested_mask="wr" denied_mask="wr" fsuid=64055 ouid=64055
Sep 12 19:03:51 homelab kernel: audit: type=1400 audit(1631484231.392:103): apparmor="DENIED" operation="open" profile="libvirt-a2a6e813-e14e-4d5c-9981-47adc7b5dca6" name="/home/carlos/volumes/homeassistant-qcow2" pid=8178 comm="qemu-system-x86" requested_mask="r" denied_mask="r" fsuid=64055 ouid=64055

Setting up the basics of the VM

The script above defined resource "libvirt_volume" "ubuntu-qcow2" directly from the base image, so the created VM has the exact same disk size as the cloud image. I tried to change the size property of the volume, but that rendered an error when applying the tf script.

Reading the docs a bit more, I found this:

If size is specified to be bigger than base_volume_id or base_volume_name size, you can use cloudinit if your OS supports it, with libvirt_cloudinit_disk and the growpart module to resize the partition.

With that information, I isolated the base image in another terraform file:

terraform {
  required_providers {
    libvirt = {
      source = "dmacvicar/libvirt"
    }
  }
}

provider "libvirt" {
  uri = "qemu+ssh://[email protected]/system"
}

resource "libvirt_pool" "base_images" {
  name = "base_images"
  type = "dir"
  path = "/home/carlos/base"
}

resource "libvirt_volume" "debian-cloud" {
  name   = "debian-cloud-qcow2"
  pool   = libvirt_pool.base_images.name
  source = "http://cloud.debian.org/images/cloud/bullseye/latest/debian-11-genericcloud-amd64.qcow2"
  format = "qcow2"
}

resource "libvirt_volume" "ubuntu-cloud" {
  name   = "ubuntu-cloud-qcow2"
  pool   = libvirt_pool.base_images.name
  source = "https://cloud-images.ubuntu.com/focal/current/focal-server-cloudimg-amd64-disk-kvm.img"
  format = "qcow2"
}

Then, using base_volume_name, I defined a new volume on top of that base image, so my homeassistant volume would use it as a base:

resource "libvirt_pool" "home_assistant" {
  name = "home_assistant"
  type = "dir"
  path = "/home/carlos/volumes"
}

resource "libvirt_volume" "home_assistant-qcow2" {
  name   = "homeassistant-qcow2"
  pool   = libvirt_pool.home_assistant.name
  base_volume_name = "debian-cloud-qcow2"
  base_volume_pool = "base_images"
  size = 21474836480
  format = "qcow2"
}

According to Growpart docs:

Growpart is enabled by default on the root partition.
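
For completeness, the explicit equivalent in cloud_init.cfg would be something like the snippet below; I did not need it, since the default already targets the root partition:

growpart:
  mode: auto
  devices: ['/']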

After removing all the previously created resources and running terraform apply again, I could log into the newly created VM and see that the disk had indeed grown without any further configuration!

$ df
Filesystem     1K-blocks   Used Available Use% Mounted on
udev              234220      0    234220   0% /dev
tmpfs              48700    444     48256   1% /run
/dev/vda1       20480580 646100  18964844   4% /
tmpfs             243492      0    243492   0% /dev/shm
tmpfs               5120      0      5120   0% /run/lock
/dev/vda15        126678   6016    120662   5% /boot/efi
tmpfs              48696      0     48696   0% /run/user/1000

Now I only needed to be able to reach this VM from my whole LAN and to SSH into it. Until this point I was letting the hypervisor assign an IP, which was only reachable from within the host, and I manually set up a password inside my cloud_init.cfg using the chpasswd module.
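
The password part of cloud_init.cfg was something along these lines (placeholder password, obviously):

chpasswd:
  list: |
    carlos:changeme
  expire: false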

For the SSH part I modified the template a little so the id_rsa.pub from my machine could be read inside cloud_init.cfg. The documentation around these tools is really helpful: searching for terraform template led me to this page, and with it I could also replace the template_file data sources, following the recommendation.

The changes to the terraform file looked like:

I’ll talk about the network_config in a while!

variable "ipv4_address" {
  type        = string
  default     = "192.168.1.116"
  description = "IPV4 Address to be assigned to the new VM"
}

variable "public_key_path" {
  type        = string
  default     = "/home/carlos/.ssh/id_rsa.pub"
  description = "Public key to be installed to guests"
}

resource "libvirt_cloudinit_disk" "commoninit" {
  name           = "commoninit.iso"
  pool           = libvirt_pool.home_assistant.name
  user_data      = templatefile("${path.module}/cloud_init.cfg", { 
    SSH_PUB_KEY: file(var.public_key_path)
  })
  network_config = templatefile("${path.module}/network_config.cfg", {
    IPV4_ADDR: var.ipv4_address
  })
}

And cloud init:

hostname: homeassistant

users:
  - name: carlos
    sudo: ALL=(ALL) NOPASSWD:ALL
    ssh-authorized-keys:
      - ${SSH_PUB_KEY}

Well, this is one point where the documentation tricked me. Searching the examples I could find several setups with an ssh_authorized_keys parameter, but it simply did not work. Later on I discovered that ssh-authorized-keys (note hyphens instead of underscores) was the one I was looking for. While debugging this, virsh console was my friend, since SSH itself was what I was having trouble with.
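
Getting to the console is a one-liner using the same connection URI as the provider (test-ha being the domain name defined in the tf file):

virsh -c qemu+ssh://[email protected]/system console test-ha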

I was seeing something like this:

2021-10-11 18:35:47,565 - util.py[WARNING]: Failed to run command to import carlos SSH ids
2021-10-11 18:35:47,565 - util.py[DEBUG]: Failed to run command to import carlos SSH ids

Fortunately, I found the hyphen difference (by accident) before resorting to the write_files route, which would have meant writing the authorized_keys file into place myself.
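
For reference, that fallback would have been something along these lines in cloud_init.cfg (a sketch I never ended up needing; write_files normally runs before users are created, hence the defer, which needs a recent cloud-init):

write_files:
  - path: /home/carlos/.ssh/authorized_keys
    owner: carlos:carlos
    permissions: '0600'
    defer: true
    content: |
      ${SSH_PUB_KEY}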

Another configuration tackled with cloud-init was setting a fixed IP. Currently I bridge the host network to the guests, and I wanted a predefined IP configured. This required changes both to cloud-init's network configuration and to libvirt_domain's network_interface.

Thankfully, in the terraform script I just had to change the network_interface to:

network_interface {
  bridge = "kvm_br0"
}

where “kvm_br0” is the name of the bridge on the host. On the cloud-init side, I configured the network_config:

version: 2
ethernets:
  ens3:
    renderer: NetworkManager
    addresses: [${IPV4_ADDR}]
    gateway4: 192.168.1.1
    nameservers: 
      search: [lab, home]
      addresses: [192.168.1.242, 192.168.1.1]

The IP (${IPV4_ADDR}) is templated, and I just hardcoded the DNS servers. One important part was setting renderer: NetworkManager, since Home Assistant's Supervisor manages NetworkManager and my cloud image came without this service. This also required an additional entry in cloud_init.cfg:

packages:
  - network-manager

With the VM up and running with the desired configurations, let’s install Home Assistant with Ansible.

Bootstrapping Home Assistant with Ansible

First of all, I installed Ansible using pip (with apt on my Debian 10 I was getting a very old version and ran into a problem, so I opted for pip), then I configured my host in /etc/ansible/hosts, adding:

192.168.1.116 ansible_python_interpreter=/usr/bin/python3

After that, with my VM on, I could try it:

$ ansible all -m ping
192.168.1.116 | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

The ansible_python_interpreter parameter was needed in /etc/ansible/hosts since Ansible was trying to use python, and only python3 was available on the guest (ok, later I discovered that this is not an issue on newer Ansible versions). Following the Home Assistant Supervised installation instructions, I started defining the playbook.

During the process I had to programmatically retrieve the URL of the latest release of a GitHub package; this blog post helped me come up with a one-liner for it:

curl -s https://api.github.com/repos/home-assistant/os-agent/releases/latest | jq -r ".assets[] | select(.name | contains(\"x86_64\")) | .browser_download_url"

And this other post helped me find better ways to download and install the deb files used during the process.

Installing all the needed packages was as simple as:

---
- name: Test
  become: true
  hosts: all
  tasks:
    - name: Install Docker Dependencies
      apt: 
        name: "{{ item }}"
        state: latest
        update_cache: yes
      loop: ['apt-transport-https', 'ca-certificates', 'curl', 'gnupg', 'lsb-release']
    - name: Add Docker GPG Key
      apt_key: 
        url: https://download.docker.com/linux/debian/gpg
        state: present
    - name: Add Docker Repository
      apt_repository:
        repo: deb https://download.docker.com/linux/debian bullseye stable
        state: present
    - name: Install Docker
      apt: 
        name: "{{ item }}"
        state: latest
        update_cache: yes
      loop: ['docker-ce', 'docker-ce-cli', 'containerd.io']
    - name: Install Additional Dependencies
      apt: 
        name: "{{ item }}"
        state: latest
        update_cache: yes
      loop: ['jq','wget','curl','udisks2','libglib2.0-bin','network-manager','dbus', 'rsync']
    # Prepares folder for downloads
    - name: Create a directory if it does not exist
      file:
        path: /opt/haosagent
        state: directory
        mode: '0755'
      register: folder
    # Installs OS_Agent
    - name: Get os_agent URL
      shell: 
        cmd: curl -s https://api.github.com/repos/home-assistant/os-agent/releases/latest | jq -r ".assets[] | select(.name | contains(\"x86_64\")) | .browser_download_url"
        warn: False # Only want output name, we are not downloading anything
      register: os_agent_url
    - name: Download os_agent
      get_url:
        url: "{{ os_agent_url.stdout }}"
        dest: "{{ folder.path }}" # Download to a folder so we can use the "changed" status
      register: os_agent_path
    - name: Install os_agent
      apt:
        deb: "{{ os_agent_path.dest }}"
      when: os_agent_path.changed
    # Installs Supervisor
    - name: Download supervisor
      get_url:
        url: "https://github.com/home-assistant/supervised-installer/releases/latest/download/homeassistant-supervised.deb"
        dest: "{{ folder.path }}" # Download to a folder so we can use the "changed" status
      register: supervisor_path
    - name: Install supervisor
      apt:
        deb: "{{ supervisor_path.dest }}"
      when: supervisor_path.changed

Some important notes:

  • HA Supervised has a list of requirements, but all of those were the defaults at the time of writing, so I did not bother verifying them (if this were a critical production setup, that would be a good idea)
  • when a folder is given as dest, get_url will always download the file and compare hashes to determine whether it changed. Adding a register to this task saves the resulting file name and a changed flag, so the installation task can run conditionally; running the playbook again should theoretically (I still need to test this with a future release) update the package.
  • before any download I call the file task to make sure there is a folder where I can save the packages

All other tasks are simply installing dependencies.

Restoring the backup

I do not have any backup system (bad, yeah :/), but I was able to retrieve some basic configuration files from the broken server, which saved me a lot of trouble. I ran the Ansible playbook defined above, copied the configuration files into the new server and manually finished the setup; with that I had a working Home Assistant again.
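
The manual copy was nothing elaborate, roughly this (recovered-config is just a placeholder for wherever I had stashed the old files, and I'm assuming the Supervised default data directory):

# --rsync-path lets the passwordless-sudo user write into the root-owned hassio directory
rsync -av --rsync-path="sudo rsync" ./recovered-config/ carlos@192.168.1.116:/usr/share/hassio/homeassistant/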

There is a CLI tool that handles backups on Home Assistant, so I ran ha backups new to generate a backup from my freshly created and restored server. With that, I could use ha backups restore to restore from this backup, and I added the following to my playbook to copy the backup over and restore the server (for now, I have to manually place the backup tar file on the machine where Ansible runs):

- name: Synchronization of src on the control machine to dest on the remote hosts
  synchronize:
    src: /home/carlos/backups/111cfc64.tar
    dest: /usr/share/hassio/backup
- name: Restore backup
  ansible.builtin.shell:
    cmd: ha backups reload && ha backups restore 111cfc64

Bringing Terraform and Ansible together

At the end of the Terraform run I want to execute my playbook against the target host; this can be done with a provisioner "local-exec" inside my resource "libvirt_domain". But hey:

fatal: [192.168.1.116]: UNREACHABLE! => {"changed": false, "msg": "Failed to connect to the host via ssh: ssh: connect to host 192.168.1.116 port 22: No route to host", "unreachable": true}

I needed to wait until the VM was up and running before executing Ansible. The libvirt provider does offer a wait_for_lease parameter to wait for a DHCP lease (see the snippet after the playbook excerpt below), but since I am hardcoding the network configuration (eventually I want to set up a better network at home, but for now…), I had to improvise on the Ansible side. Setting gather_facts: no and adding a wait_for_connection task did the trick, moving the fact gathering to right after the connection is established:

- name: Test
  become: true
  hosts: all
  gather_facts: no
  tasks:
    - name: Wait 600 seconds for target connection to become reachable/usable
      wait_for_connection:
    - name: Gathering facts
      setup:
...
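
For reference, if the VM were getting its address via DHCP, the provider-side wait mentioned above would look something like this on the network_interface (not applicable here, since I hardcode the IP):

network_interface {
  network_name   = "default"
  wait_for_lease = true
}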

And this provisioner was added to the terraform script:

provisioner "local-exec" {
  command = "ANSIBLE_HOST_KEY_CHECKING=False ansible-playbook -i '${var.ipv4_address},' --private-key ${var.private_key_path} -e 'pub_key=${var.public_key_path}' test.yml"
}

When trying to install apt packages right after the first boot I kept stumbling upon a locked apt. There is a lock_timeout parameter created for this, but it seems not to cover /var/lib/dpkg/lock-frontend, which is what is held when the task runs. There are workarounds that suggest watching the journal for cloud-init messages (too specific for my taste), and it looks like a bug in lock_timeout: the code checks for a LockFailedException from python-apt, but for some reason that does not seem to trigger. I looked around a bit in the package's code and Debian's open bugs, but found nothing relevant, so for now I'll work around it. Maybe my next blog post will be a little more about apt's lock mechanism :)
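
For reference, the parameter is used like this in a task (hypothetical values; in my tests it still tripped over lock-frontend):

- name: Install Additional Dependencies
  apt:
    name: network-manager
    state: latest
    update_cache: yes
    lock_timeout: 300  # seconds to wait for the apt/dpkg locks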

So this other task was added:

- name: Wait for dpkg locks
  shell: while fuser /var/lib/dpkg/{{ item }} >/dev/null 2>&1; do sleep 1; done;
  with_items:
    - lock
    - lock-frontend

It did not work. New attempt, following this example:

- name: Wait for cloud init to finish
  community.general.cloud_init_data_facts:
    filter: status
  register: res
  until: "res.cloud_init_data_facts.status.v1.stage is defined and not res.cloud_init_data_facts.status.v1.stage"
  retries: 50
  delay: 5

Also not reliable: it worked once and failed another time. I could not find documentation for the output of cloud_init_data_facts, so I tried yet another route: cloud-init has a CLI with a status command whose wait flag, according to the docs, blocks until completion. That, combined with a hardcoded check on the output (not ideal, but good enough for a first version of this script), gave a more reliable wait (as far as I could tell):

- name: Wait cloud init success
  ansible.builtin.shell:
    cmd: cloud-init status --wait
  register: cinitres
  failed_when: '"status: done" not in cinitres.stdout'

Well, packages are installing!

Finally, restoring

Well, everything from creating the VM to having a running service was now automated with Terraform and Ansible; the last step to complete the whole deal was restoring the backup. Running the ha backups tasks I defined above immediately gave me an error: the Home Assistant services were not up yet. My simple solution was to run the same command until it returned status code 0, and only then restore. The following was added:

Important: I needed this wait before copying the backup too, or else the Supervisor would detect the backup folder and fail to start up.

- name: Wait Hassio Setup
  ansible.builtin.shell:
    cmd: ha backups
  register: ha_cli_status
  until: ha_cli_status is success
  delay: 10
  retries: 300

And done! The whole thing took 7 minutes to run, and now I can wipe the VM clean and start again anytime. More importantly, this is the base for expanding IaC to other services in my homelab.

Conclusion

20+ hours to save 30 minutes, yay!

All scripts are available here. Mind that this link points to a fixed commit so that it makes sense with this text, but there may be newer things on main as I evolve my automations.

But I also built up knowledge that will help me automate the setup of other services I intend to run.

Summing Up

With Terraform I could set up my infrastructure using QEMU/KVM and, with the help of cloud-init, configure SSH and networking, ending up with a working VM. Ansible is called at the end of the tf script, and with the proper waits in place it installs and configures everything, restoring the backup at the end.

The biggest lesson for me during this experiment was learning to think about the waits needed: cloud-init and Home Assistant keep running in the background for a while, and checking whether they had finished was different from what I was used to. There were no mutexes or semaphores, or even higher-level promises or asyncs: some other process started a service that triggered a series of changes, and each one had its own way of signaling when it was ready. I found it more challenging than simply handling multiple threads in the same program, and it was a fun new way to think :)