Background

When a Wireguard tunnel is enabled, the configuration system parses the configuration file and picks an IP address endpoint using DNS resolution for any hostnames it finds. This only happens once, when the tunnel is first enabled.

Let’s say we have the following Wireguard interface configuration file.

File: /etc/wireguard/wg0.conf

[Interface]
  PrivateKey = 6KlCNMKAkKcfT+iQxfl7ABTz4yiso5iwGjJgdGVu9VQ=
  Address = 172.25.1.2/24,fd08:5771:2371::2/64


[Peer]
  PublicKey = fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI=
  AllowedIPs = 172.25.1.1/24,fd08:5771:2371::1/64
  PersistentKeepalive = 20
  Endpoint = my-wireguard-peer.mydomain.tld:51820

When the interface is brought up, my-wireguard-peer.mydomain.tld is resolved once, to whatever IP address seems appropriate at that time. The lookup is not repeated for as long as the interface stays up.

If the peer endpoint hostname has both an A (IPv4) and an AAAA (IPv6) record, Wireguard will pick an IPv6 address for the peer if the local host has a global IPv6 address. If the local host does not have a global IPv6 address, it will pick an IPv4 address for the peer.
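As far as I can tell, this preference comes from the system resolver's address ordering rather than from anything Wireguard-specific. Below is a rough Python sketch of the same idea, assuming the tool simply asks getaddrinfo for both address families and takes the first result. The function name first_resolved_address is hypothetical, and this is an approximation for illustration, not Wireguard's actual code.

```python
import socket


def first_resolved_address(host, port):
    """Return the first (address, port) pair the system resolver offers."""
    # AF_UNSPEC asks for both IPv4 and IPv6 results. The resolver orders
    # them by its own preference rules, so the first entry decides which
    # protocol ends up being used.
    results = socket.getaddrinfo(
        host, port, socket.AF_UNSPEC, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sockaddr = results[0][4]
    return sockaddr[0], sockaddr[1]
```

Calling first_resolved_address("my-wireguard-peer.mydomain.tld", 51820) on a host with working global IPv6 would typically return the AAAA address, and the A address otherwise.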

You’ll end up with a Wireguard configuration that looks something like this.

Run sudo wg show wg0:

interface: wg0
  public key: etGn5y6izVL8im6ZDEERzChKdfzMUiscao0QRTDmGXA=
  private key: (hidden)
  listening port: 33125

peer: fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI=
  endpoint: [2001:db8::1]:51820
  allowed ips: 172.25.1.1/24,fd08:5771:2371::1/64
  latest handshake: 16 seconds ago
  transfer: 4.03 MiB received, 405.21 KiB sent
  persistent keepalive: every 20 seconds

So far so good, as long as [2001:db8::1]:51820 is (and remains) a valid endpoint for this peer.
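Note that wg writes IPv6 endpoints with the address in square brackets, so that the port separator is unambiguous, while IPv4 endpoints are plain address:port. A minimal sketch of building either form from an address string, using only the standard ipaddress module (format_endpoint is a hypothetical helper name):

```python
import ipaddress


def format_endpoint(address, port):
    """Format an IP address and port the way wg expects an endpoint."""
    ip = ipaddress.ip_address(address)
    if ip.version == 6:
        # IPv6 addresses contain colons, so they are wrapped in brackets.
        return "[{}]:{}".format(ip.compressed, port)
    return "{}:{}".format(ip.compressed, port)
```

Using the compressed form also means that two spellings of the same IPv6 address produce the same endpoint string, which helps when comparing endpoints as text.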

Problem

There are several situations where the correct / valid endpoint may change:

  1. The peer has disappeared, come back under a different IP address, and updated its DNS entry.
  2. There are multiple A or AAAA records for the endpoint hostname, and only some of them work.
  3. The IPv6 connectivity on the local host used to work but is now broken (i.e. it still has a global address but can no longer reach the peer endpoint).

In any of these cases, you’ll be stuck with a broken Wireguard setup, since Wireguard does not perform self-diagnostics or attempt to recover. This is a deliberate design decision; it is currently the responsibility of the user to perform endpoint selection (see here).

In my specific case, I use Wireguard on field-deployed hosts to tunnel out of a wide variety of networks, some with flaky or nonfunctional IPv6 implementations that I cannot control. It is important to me that these remote hosts have a robust mechanism to tunnel out, even under shifting network conditions, so they don’t become stranded.

I found that if a remote peer was able to tunnel out successfully via IPv6, but then IPv6 later became flaky or broken, the peer would be unreachable, since Wireguard would continue trying to use the IPv6 endpoint.

Wireguard does have a reresolve-dns.sh script available that can be called to re-resolve DNS (see here), but this script does not perform any reachability checks. If you run it periodically using cron, you’ll solve problem 1 above (a changed DNS entry), but not problems 2 or 3 (multiple addresses, or a broken protocol).

Other applications implement what is known as “happy eyeballs” (see here) to fall back to IPv4 if IPv6 fails. They typically will also try alternate IP addresses from a DNS entry if one fails. Because Wireguard does not do these things, it is our responsibility to do them ourselves.

Goals

For a given hostname (e.g. my-wireguard-peer.mydomain.tld), Wireguard should be configured with a “reachable” peer IP address at all times, even when the best address changes because network conditions shift or the peer’s DNS records change.

  • If IPv6 breaks, Wireguard should continue to work by using an IPv4 address for the peer (i.e. “happy eyeballs”).
  • If a given IP address stops working, Wireguard should continue to work if any of the other IP addresses still work.
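These two goals boil down to a simple selection policy: walk the candidates in preference order (IPv6 first, then IPv4) and take the first one that passes a reachability check. A minimal sketch, in which choose_address and the is_reachable callback are hypothetical names and the reachability test is deliberately left to the caller:

```python
def choose_address(addrs_v6, addrs_v4, is_reachable):
    """Return the first reachable address, preferring IPv6, or None."""
    for addr in list(addrs_v6) + list(addrs_v4):
        if is_reachable(addr):
            return addr
    # No candidate passed the check.
    return None
```

Passing the check in as a function keeps the policy separate from how reachability is measured, so the policy can be exercised without touching the network.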

Solution Overview

The solution I use is a Python script, run every minute via cron, which attempts to determine “reachability” for every IP address that a given hostname resolves to. Once a suitable candidate is determined, it runs a wg set command to update the peer endpoint for the given peer.

“Reachability” in this case means that the address answers ICMP echo requests (ping). This doesn’t actually indicate whether Wireguard works, since it’s a completely different protocol, but it can be a reasonable hint, provided your Wireguard peer host is configured to respond to ICMP.
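Because ping is only a hint, one possible complement (not part of the script below) is to look at the handshake timestamps that wg show wg0 latest-handshakes prints, one tab-separated "public key, Unix timestamp" line per peer. A minimal sketch, assuming that output has already been captured as a string; handshake_age and its parameters are hypothetical names:

```python
import time


def handshake_age(latest_handshakes_output, peer, now=None):
    """Return seconds since the peer's last handshake, or None if unknown."""
    if now is None:
        now = time.time()
    for line in latest_handshakes_output.splitlines():
        fields = line.split("\t")
        if len(fields) == 2 and fields[0] == peer:
            timestamp = int(fields[1])
            # wg reports 0 when no handshake has happened yet.
            if timestamp == 0:
                return None
            return now - timestamp
    # The peer was not listed at all.
    return None
```

With PersistentKeepalive set, a handshake age of more than a few minutes would suggest that the tunnel is not actually passing traffic, even if the endpoint answers ping.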

Solution Details

The script runs every minute from cron on these remotely-deployed Wireguard peers. It requires icmplib, and it needs root privileges, because both wg set and icmplib’s default raw-socket ping require them. The script is invoked as follows:

./wireguard_endpoint_manager.py \
--interface wg0 \
--peer fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI= \
--host my-wireguard-peer.mydomain.tld \
--port 51820

When fixing the specific case I outlined above, the script will produce output that looks something like this:

Candidate addresses: ['2001:db8::1', '203.0.113.1']
2001:db8::1 is not reachable
203.0.113.1 is reachable
All reachable addresses: ['203.0.113.1']
Chosen endpoint is 203.0.113.1:51820
Existing endpoint is [2001:db8::1]:51820
Executing wg set wg0 peer fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI= endpoint 203.0.113.1:51820
Done executing wg set wg0 peer fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI= endpoint 203.0.113.1:51820

This should allow the Wireguard tunnel to re-establish whenever network conditions change.

The script contents (wireguard_endpoint_manager.py) are as follows:

#!/usr/bin/env python3

import argparse
import os
import shlex
import socket
import subprocess
from contextlib import suppress

import icmplib


def main():
    parser = argparse.ArgumentParser(description="Manage Wireguard Endpoint")
    parser.add_argument("--dryrun", action='store_true')
    parser.add_argument("--interface", required=True)
    parser.add_argument("--peer", required=True)
    parser.add_argument("--host", required=True)
    parser.add_argument("--port", required=True, type=int)

    args = parser.parse_args()

    dryrun = args.dryrun
    interface = args.interface
    peer = args.peer
    host = args.host
    port = args.port

    chosen_endpoint = pick_endpoint(host, port)

    update_endpoint(dryrun, interface, peer, chosen_endpoint)


def update_endpoint(dryrun, interface, peer, chosen_endpoint):
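    wg_test_command = "wg show {} endpoints".format(interface)
    ret = subprocess.check_output(shlex.split(wg_test_command), shell=False)
    # "wg show <interface> endpoints" prints one "<public key>\t<endpoint>"
    # line per peer, so look up the requested peer instead of assuming it
    # is the first line.
    existing_endpoint = None
    for line in os.fsdecode(ret).splitlines():
        fields = line.split('\t')
        if len(fields) == 2 and fields[0] == peer:
            existing_endpoint = fields[1]
    if existing_endpoint is None:
        print("Peer {} not found on interface {}".format(peer, interface))
        exit(-3)
    print("Existing endpoint is {}".format(existing_endpoint))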
    if existing_endpoint == chosen_endpoint:
        print("Nothing to do: endpoint is already {}".format(
            existing_endpoint))
        exit(0)

    wg_update_command = "wg set {} peer {} endpoint {}".format(
        interface, peer, chosen_endpoint)

    if dryrun:
        print("DRY RUN: Would have executed {}".format(wg_update_command))
    else:
        print("Executing {}".format(wg_update_command))
        ret = subprocess.call(shlex.split(wg_update_command), shell=False)
        if ret == 0:
            print("Done executing {}".format(wg_update_command))
        else:
            print("Error executing {}: {}".format(wg_update_command, ret))
            raise Exception("Error executing {}: {}".format(
                wg_update_command, ret))
    exit(0)


def pick_endpoint(host, port):
    addrs_v6 = []
    addrs_v4 = []

    with suppress(socket.gaierror):
        v6_results = socket.getaddrinfo(host, None, socket.AF_INET6)
        for v6_result in v6_results:
            addrs_v6.append(v6_result[4][0])
        # De-dupe
        addrs_v6 = list(dict.fromkeys(addrs_v6))
        # filter ipv6-mapped addresses
        addrs_v6 = [x for x in addrs_v6 if not x.startswith("::ffff:")]

    with suppress(socket.gaierror):
        v4_results = socket.getaddrinfo(host, None, socket.AF_INET)
        for v4_result in v4_results:
            addrs_v4.append(v4_result[4][0])
        # De-dupe
        addrs_v4 = list(dict.fromkeys(addrs_v4))

    if len(addrs_v6) == 0 and len(addrs_v4) == 0:
        print("No addresses resolved for {}.".format(host))
        exit(-1)

    print("Candidate addresses: {}".format(addrs_v6 + addrs_v4))

    healthy_addrs_v6 = []
    for addr_v6 in addrs_v6:
        result = icmplib.ping(
            addr_v6, count=3, interval=0.5, timeout=2)
        if result.is_alive:
            print("{} is reachable".format(addr_v6))
            healthy_addrs_v6.append(addr_v6)
        else:
            print("{} is not reachable".format(addr_v6))

    healthy_addrs_v4 = []
    for addr_v4 in addrs_v4:
        result = icmplib.ping(
            addr_v4, count=3, interval=0.5, timeout=2)
        if result.is_alive:
            print("{} is reachable".format(addr_v4))
            healthy_addrs_v4.append(addr_v4)
        else:
            print("{} is not reachable".format(addr_v4))

    print("All reachable addresses: {}".format(
        healthy_addrs_v6 + healthy_addrs_v4))

    chosen_endpoint = ""
    if len(healthy_addrs_v6) > 0:
        chosen_endpoint = "[{}]:{}".format(healthy_addrs_v6[0], port)
        print("Chosen endpoint is {}".format(chosen_endpoint))
    elif len(healthy_addrs_v4) > 0:
        chosen_endpoint = "{}:{}".format(healthy_addrs_v4[0], port)
        print("Chosen endpoint is {}".format(chosen_endpoint))
    else:
        print("None of the tested endpoints ({}) were reachable.".format(
            addrs_v6 + addrs_v4))
        exit(-2)

    return chosen_endpoint


if __name__ == "__main__":
    main()

Conclusion

This solution is more robust than using the provided reresolve-dns.sh script alone, because it also implements a reachability check. This protects against situations where a tunnel may be viable via a different endpoint than the one that used to work, for example when IPv6 breaks or when a hostname has multiple IP addresses.

I hope this is helpful for others who may be facing the same challenge!