Automated Wireguard endpoint updates with reachability checks
Background
When a Wireguard tunnel is enabled, the configuration system parses the configuration file and picks an IP address endpoint using DNS resolution for any hostnames it finds. This only happens once, when the tunnel is first enabled.
Let’s say we have the following Wireguard interface configuration file.
File: /etc/wireguard/wg0.conf
[Interface]
PrivateKey = 6KlCNMKAkKcfT+iQxfl7ABTz4yiso5iwGjJgdGVu9VQ=
Address = 172.25.1.2/24,fd08:5771:2371::2/64
[Peer]
PublicKey = fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI=
AllowedIPs = 172.25.1.1/24,fd08:5771:2371::1/64
PersistentKeepalive = 20
Endpoint = my-wireguard-peer.mydomain.tld:51820
When the interface is brought up, my-wireguard-peer.mydomain.tld will be resolved to whatever IP seems appropriate at that time, and this resolution will never be repeated unless Wireguard is restarted.
If the peer endpoint hostname has both an A (IPv4) and an AAAA (IPv6) record, Wireguard will select an IPv6 address for the peer if the local host has a known route to that IPv6 address. It will fall back to an IPv4 address for the peer if the local host only has a known route to that IPv4 address.
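To see which addresses a hostname currently offers, the lookup can be reproduced outside Wireguard. The following is a minimal sketch, not Wireguard's own resolver logic: split_by_family and resolve are hypothetical helper names, and the input is assumed to be shaped like the tuples returned by Python's socket.getaddrinfo.

```python
import socket


def split_by_family(addrinfo_results):
    """Group unique addresses from getaddrinfo-style tuples by address family.

    Hypothetical helper for illustration. Each item is expected to look like
    (family, type, proto, canonname, sockaddr), as socket.getaddrinfo returns.
    """
    v6, v4 = [], []
    for family, _type, _proto, _canon, sockaddr in addrinfo_results:
        if family == socket.AF_INET6:
            target = v6
        elif family == socket.AF_INET:
            target = v4
        else:
            continue  # ignore any other address family
        address = sockaddr[0]
        if address not in target:  # the same address repeats per socket type
            target.append(address)
    return v6, v4


def resolve(host):
    """Resolve a hostname to (IPv6 list, IPv4 list); empty lists on failure."""
    try:
        return split_by_family(socket.getaddrinfo(host, None))
    except socket.gaierror:
        return [], []
```

Calling resolve("my-wireguard-peer.mydomain.tld") would return every AAAA and A address the resolver currently knows about, which is the candidate set that Wireguard itself only looks at once.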
Assuming that my-wireguard-peer.mydomain.tld resolves to 2001:db8::1 and your local host thinks it has IPv6 connectivity, you’ll end up with a Wireguard configuration that looks something like this.
Run sudo wg show wg0:
interface: wg0
public key: etGn5y6izVL8im6ZDEERzChKdfzMUiscao0QRTDmGXA=
private key: (hidden)
listening port: 33125
peer: fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI=
endpoint: [2001:db8::1]:51820
allowed ips: 172.25.1.1/24,fd08:5771:2371::1/64
latest handshake: 16 seconds ago
transfer: 4.03 MiB received, 405.21 KiB sent
persistent keepalive: every 20 seconds
So far so good, as long as [2001:db8::1]:51820 is (and remains) a valid, reachable endpoint for this peer.
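Note that an IPv6 endpoint is written with the address in square brackets, so that the port can be told apart from the address itself, while an IPv4 endpoint is not. As a small sketch of that convention (format_endpoint is a hypothetical helper name, not part of Wireguard), using Python's standard ipaddress module:

```python
import ipaddress


def format_endpoint(address, port):
    """Return an endpoint string, bracketing IPv6 addresses.

    Hypothetical helper for illustration. Raises ValueError if `address`
    is not a literal IPv4 or IPv6 address (for example, a hostname).
    """
    ip = ipaddress.ip_address(address)
    if ip.version == 6:
        return "[{}]:{}".format(ip, port)
    return "{}:{}".format(ip, port)
```

For example, format_endpoint("2001:db8::1", 51820) gives "[2001:db8::1]:51820", and format_endpoint("203.0.113.1", 51820) gives "203.0.113.1:51820".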
Problem
There are several situations where the correct / valid endpoint may change:
1. The peer has disappeared, come back under a different IP address, and updated its DNS entry.
2. There are multiple A or AAAA records for the endpoint hostname, and only some of them are actually reachable.
3. The IPv6 connectivity on the local host may be broken (i.e. it supposedly has a route, but the route doesn’t actually work or is flaky).
In any of these cases, you’ll be stuck with a broken Wireguard setup, since Wireguard does not perform self-diagnostics or attempt to recover. This is a deliberate design decision; it is currently the responsibility of the user to perform endpoint selection (see here).
In my specific case, I use Wireguard on field-deployed hosts to tunnel out of a wide variety of networks, some with flaky or nonfunctional IPv6 implementations that I cannot control. It is important to me that these remote hosts have a robust mechanism to tunnel out, even under shifting network conditions, so they don’t become stranded.
I found that if a remote peer was able to tunnel out successfully via IPv6, but then IPv6 later became flaky or broken, the peer would be unreachable, since Wireguard would continue trying to use the IPv6 endpoint.
Wireguard does have a reresolve-dns.sh script available that can be called to re-resolve DNS (see here), but this script does not perform any reachability checks. If you run it periodically using cron, you’ll solve problem 1 from above (changing DNS), but you won’t solve problems 2 or 3 (multiple addresses or protocol/routing issues).
Other applications implement what is known as “happy eyeballs” (see here) to fall back to IPv4 if IPv6 fails. They typically will also try alternate IP addresses from a DNS entry if one is found to be unreachable. Because Wireguard does not do these things, it is our responsibility to do them ourselves.
Goals
For a given hostname (e.g. my-wireguard-peer.mydomain.tld), Wireguard should be configured with a “reachable” peer IP address at all times, even if the right address changes because of shifting network conditions or DNS updates.
- If IPv6 breaks, Wireguard should continue to work by using an IPv4 address for the peer (i.e. “happy eyeballs”).
- If a given IP address stops working, Wireguard should continue to work if any of the other IP addresses still work.
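These two goals boil down to a simple preference order: try each IPv6 address, then each IPv4 address, and take the first one that passes a reachability check. A minimal sketch of that selection follows. The name choose_address and the is_reachable callback are assumptions made for illustration, and the addresses are tried one after another, which is simpler than the racing approach that full “happy eyeballs” implementations use.

```python
def choose_address(addrs_v6, addrs_v4, is_reachable):
    """Return the first reachable address, trying IPv6 before IPv4.

    `is_reachable` is a caller-supplied predicate taking an address string
    (an assumption of this sketch). Returns None when nothing is reachable.
    """
    for address in list(addrs_v6) + list(addrs_v4):
        if is_reachable(address):
            return address
    return None
```

With a broken IPv6 path, for instance, choose_address(["2001:db8::1"], ["203.0.113.1"], check) falls through to "203.0.113.1" as long as check reports the IPv6 address as unreachable and the IPv4 address as reachable.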
Solution Overview
The solution I use is a Python script, run every minute via cron, which attempts to determine “reachability” for
every IP address that a given hostname resolves to. Once a suitable candidate is determined, it runs a wg set
command to update the peer endpoint for the given peer.
“Reachability” in this case is determined via ICMP echo requests (ping). This doesn’t actually indicate whether Wireguard works, since it’s a completely different protocol, but it can be a reasonable hint, if you have your Wireguard peer host configured to respond to ICMP.
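Because ICMP is only a hint, a complementary signal is how long ago the tunnel last completed a handshake, which wg can report via wg show with the latest-handshakes field. This is not something the script in this post does. The sketch below only parses that output, assuming one tab-separated line per peer with the public key followed by a Unix timestamp (0 meaning no handshake yet). handshake_age is a hypothetical helper name, and deciding what age counts as “stale” is left to the reader.

```python
import time


def handshake_age(latest_handshakes_output, peer, now=None):
    """Return seconds since the peer's last handshake, or None if unknown.

    Expects text shaped like the output of `wg show <interface>
    latest-handshakes`: one "<public key><TAB><unix timestamp>" line per
    peer, where a timestamp of 0 means no handshake has happened yet.
    """
    now = time.time() if now is None else now
    for line in latest_handshakes_output.splitlines():
        fields = line.split("\t")
        if len(fields) == 2 and fields[0] == peer and fields[1].isdigit():
            timestamp = int(fields[1])
            return None if timestamp == 0 else now - timestamp
    return None
```

A cron job could combine this with the ICMP check, for example by only switching endpoints when the handshake is old as well, but that policy is a design choice rather than something Wireguard prescribes.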
Solution Details
I have a Python script configured to run every minute using cron on these remotely-deployed Wireguard peers. It requires icmplib. The script is invoked as follows:
./wireguard_endpoint_manager.py \
    --interface wg0 \
    --peer fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI= \
    --host my-wireguard-peer.mydomain.tld \
    --port 51820
When fixing the specific case I outlined above, the script will produce output that looks something like this:
Candidate addresses: ['2001:db8::1', '203.0.113.1']
2001:db8::1 is not reachable
203.0.113.1 is reachable
All reachable addresses: ['203.0.113.1']
Chosen endpoint is 203.0.113.1:51820
Existing endpoint is [2001:db8::1]:51820
Executing wg set wg0 peer fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI= endpoint 203.0.113.1:51820
Done executing wg set wg0 peer fFZZy7EIFsMwoOLXJgeJOV3S0XAB9VaW7Ig7Dq12zCI= endpoint 203.0.113.1:51820
This should allow the Wireguard tunnel to re-establish whenever network conditions change.
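The “Existing endpoint” line in that output comes from comparing what wg currently reports against the chosen endpoint, so that wg set only runs when something actually changed. A minimal sketch of that comparison is shown here, assuming wg show wg0 endpoints prints one tab-separated line per peer (public key, then endpoint). find_endpoint and needs_update are hypothetical helper names.

```python
def find_endpoint(endpoints_output, peer):
    """Return the endpoint currently configured for `peer`, or None.

    Expects text shaped like the output of `wg show <interface> endpoints`:
    one "<public key><TAB><endpoint>" line per peer.
    """
    for line in endpoints_output.splitlines():
        fields = line.split("\t")
        if len(fields) == 2 and fields[0] == peer:
            return fields[1]
    return None


def needs_update(endpoints_output, peer, chosen_endpoint):
    """True if the peer's current endpoint differs from the chosen one."""
    return find_endpoint(endpoints_output, peer) != chosen_endpoint
```

Matching on the public key rather than taking the first line keeps the check correct on interfaces that have more than one peer.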
The script contents (wireguard_endpoint_manager.py) are as follows:
#!/usr/bin/env python3
import argparse
import os
import shlex
import socket
import subprocess
import sys
import traceback
from contextlib import suppress

import icmplib


def main():
    parser = argparse.ArgumentParser(description="Manage Wireguard Endpoint")
    parser.add_argument("--dryrun", action='store_true')
    parser.add_argument("--interface", required=True)
    parser.add_argument("--peer", required=True)
    parser.add_argument("--host", required=True)
    parser.add_argument("--port", required=True)
    args = parser.parse_args()

    dryrun = args.dryrun
    interface = args.interface
    peer = args.peer
    host = args.host
    port = args.port

    chosen_endpoint = pick_endpoint(host, port)
    update_endpoint(dryrun, interface, peer, chosen_endpoint)


def update_endpoint(dryrun, interface, peer, chosen_endpoint):
    wg_test_command = "wg show {} endpoints".format(interface)
    ret = subprocess.check_output(shlex.split(wg_test_command), shell=False)
    # Output is one "<public key>\t<endpoint>" line per peer; find our peer
    # so that interfaces with several peers are handled correctly.
    existing_endpoint = None
    for line in os.fsdecode(ret).splitlines():
        fields = line.split('\t')
        if len(fields) == 2 and fields[0] == peer:
            existing_endpoint = fields[1]
    print("Existing endpoint is {}".format(existing_endpoint))
    if existing_endpoint == chosen_endpoint:
        print("Nothing to do: endpoint is already {}".format(
            existing_endpoint))
        sys.exit(0)

    wg_update_command = "wg set {} peer {} endpoint {}".format(
        interface, peer, chosen_endpoint)
    if dryrun:
        print("DRY RUN: Would have executed {}".format(wg_update_command))
    else:
        print("Executing {}".format(wg_update_command))
        ret = subprocess.call(shlex.split(wg_update_command), shell=False)
        if ret == 0:
            print("Done executing {}".format(wg_update_command))
        else:
            print("Error executing {}: {}".format(wg_update_command, ret))
            raise Exception("Error executing {}: {}".format(
                wg_update_command, ret))
    sys.exit(0)


def pick_endpoint(host, port):
    addrs_v6 = []
    addrs_v4 = []

    # A failed lookup for one address family is not fatal; try the other.
    with suppress(socket.gaierror):
        v6_results = socket.getaddrinfo(host, None, socket.AF_INET6)
        for v6_result in v6_results:
            addrs_v6.append(v6_result[4][0])
    # De-dupe
    addrs_v6 = list(dict.fromkeys(addrs_v6))
    # filter ipv6-mapped addresses
    addrs_v6 = [x for x in addrs_v6 if not x.startswith("::ffff:")]

    with suppress(socket.gaierror):
        v4_results = socket.getaddrinfo(host, None, socket.AF_INET)
        for v4_result in v4_results:
            addrs_v4.append(v4_result[4][0])
    # De-dupe
    addrs_v4 = list(dict.fromkeys(addrs_v4))

    if len(addrs_v6) == 0 and len(addrs_v4) == 0:
        print("No addresses resolved for {}.".format(host))
        sys.exit(-1)

    print("Candidate addresses: {}".format(addrs_v6 + addrs_v4))

    # Ping every candidate; an error on one address must not stop the rest.
    healthy_addrs_v6 = []
    for addr_v6 in addrs_v6:
        try:
            result = icmplib.ping(
                addr_v6, count=3, interval=0.5, timeout=2)
            if result.is_alive:
                print("{} is reachable".format(addr_v6))
                healthy_addrs_v6.append(addr_v6)
            else:
                print("{} is not reachable".format(addr_v6))
        except Exception:
            print("Encountered error testing {} for reachability: {}".format(
                addr_v6, traceback.format_exc()))

    healthy_addrs_v4 = []
    for addr_v4 in addrs_v4:
        try:
            result = icmplib.ping(
                addr_v4, count=3, interval=0.5, timeout=2)
            if result.is_alive:
                print("{} is reachable".format(addr_v4))
                healthy_addrs_v4.append(addr_v4)
            else:
                print("{} is not reachable".format(addr_v4))
        except Exception:
            print("Encountered error testing {} for reachability: {}".format(
                addr_v4, traceback.format_exc()))

    print("All reachable addresses: {}".format(
        healthy_addrs_v6 + healthy_addrs_v4))

    # Prefer a reachable IPv6 address, then fall back to IPv4.
    chosen_endpoint = ""
    if len(healthy_addrs_v6) > 0:
        chosen_endpoint = "[{}]:{}".format(healthy_addrs_v6[0], port)
        print("Chosen endpoint is {}".format(chosen_endpoint))
    elif len(healthy_addrs_v4) > 0:
        chosen_endpoint = "{}:{}".format(healthy_addrs_v4[0], port)
        print("Chosen endpoint is {}".format(chosen_endpoint))
    else:
        print("None of the tested endpoints ({}) were reachable.".format(
            addrs_v6 + addrs_v4))
        sys.exit(-2)
    return chosen_endpoint


if __name__ == "__main__":
    main()
Conclusion
This solution is more robust than using the provided reresolve-dns.sh script, because this solution also implements a reachability check. This protects against situations where a tunnel may be viable via a different endpoint than the one that used to work, for example if IPv6 breaks or if only some of the IP addresses for a given hostname are reachable.
I hope this is helpful for others who may be facing the same challenge!