Testing a routing protocol using network namespaces

By akash_rawal, created: 2024-03-10, last modified: 2024-06-04

How do you test your own routing protocol without a real network?

Background

For a homelab, all I have is an Intel NUC acting as a single server. It works as well as it can, but the single-server setup leaves much to be desired. Whenever I take it down for maintenance, I lose access to the services running on it, and to the internet, until it is back up. I wish to upgrade to a cluster of servers so that additional servers could pick up the slack automatically. I could set up a Pacemaker cluster to ensure that only one active server handles my default route and all my services. But then I would be paying for N servers and getting the benefit of one. If I have N servers, I want the capacity of N servers, or at least a decent fraction of N. After some searching I figured out that VyOS and other router software can achieve this, but I don't want the complexity of managing N different routers. I want 1 router. One, distributed router.

Baby steps first

The goal

First, let's figure out a way to reliably discover each node's immediate neighbors, and via how many ways we can connect to them.

Each node sends its information, such as its name and public keys, via IPv6 link-local multicast packets. I'll call these packets advertisements. I chose IPv6 link-local addressing because no manual IP address configuration is required: the kernel automatically assigns an IPv6 link-local address to each network interface.
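As a rough sketch of where such an advertisement would be addressed: a link-local multicast destination only makes sense together with an interface, which a socket address carries as its scope ID. The group ff02::1 (all nodes on the link), the port 4000 and the function names are assumptions made for illustration, not necessarily what the actual implementation uses.

```rust
use std::net::{Ipv6Addr, SocketAddrV6};

// Hypothetical values for illustration; the real protocol may use others.
const ADVERT_GROUP: Ipv6Addr = Ipv6Addr::new(0xff02, 0, 0, 0, 0, 0, 0, 1); // all nodes, link-local scope
const ADVERT_PORT: u16 = 4000;

/// Build the destination for an advertisement sent out of one interface.
/// `if_index` is the kernel's interface index (the number `ip link show` prints).
fn advert_destination(if_index: u32) -> SocketAddrV6 {
    // flowinfo is 0; scope_id selects the interface the packet leaves from.
    SocketAddrV6::new(ADVERT_GROUP, ADVERT_PORT, 0, if_index)
}

/// True for addresses in fe80::/10, the link-local unicast range.
fn is_link_local(addr: &Ipv6Addr) -> bool {
    (addr.segments()[0] & 0xffc0) == 0xfe80
}

fn main() {
    let dest = advert_destination(3);
    println!("advertisement destination: {}", dest);

    let a0 = "fe80::a0".parse::<Ipv6Addr>().unwrap();
    println!("{} is link-local: {}", a0, is_link_local(&a0));
}
```

Sending one advertisement per interface, each with that interface's index as the scope ID, is what lets the receiver tell the links apart.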

When another node (B) receives the multicast packets, it knows that it can receive packets from the former node (A), but what about the other way round? For that, node B then establishes a TCP connection to node A. If node B can connect to node A, we can conclude that a bidirectional communication is possible between node A and node B.

We also need to deal with a potential race condition. What if both node A and node B connect to each other simultaneously? We need a tie breaker to decide which connection to keep. All we need to do is assign a random number, called the rank, to each connection and keep the connection with the highest rank.
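The tie breaker can be sketched in a few lines. The struct, its fields and the function name are hypothetical, and the handling of equal ranks is an assumption of this sketch (the text above does not say what happens then); a real implementation would need a rule that both nodes apply identically.

```rust
/// One TCP connection between a pair of nodes. `rank` is the random number
/// described above; `label` only exists to make the demo output readable.
#[derive(Debug, Clone, PartialEq)]
struct Conn {
    rank: u64,
    label: &'static str,
}

/// Given the connection currently held and a newly established one,
/// return (keep, close). The higher rank wins.
/// Assumption: on equal ranks this sketch keeps the current connection.
fn resolve(current: Conn, incoming: Conn) -> (Conn, Conn) {
    if incoming.rank > current.rank {
        (incoming, current)
    } else {
        (current, incoming)
    }
}

fn main() {
    // In practice the ranks would be drawn at random; fixed here for clarity.
    let a_to_b = Conn { rank: 17, label: "A -> B" };
    let b_to_a = Conn { rank: 42, label: "B -> A" };

    // Both nodes see the same two ranks, so both reach the same decision
    // regardless of which connection each of them saw first.
    let (keep, close) = resolve(a_to_b, b_to_a);
    println!("keep {:?}, close {:?}", keep, close);
}
```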

But what if the two nodes are connected by more than one link? What if there is a switch in between? The number of TCP connections will increase quadratically, and before long we'd be eating file descriptors for breakfast. So, instead of establishing a new TCP connection for each unique advertisement, the two nodes maintain only a single TCP connection and share the advertisements they have received from each other over it. Thus, for each pair of nodes A and B, node A has two sets of advertisements:

Advertisements received from node B over multicast UDP
Node A's advertisements reported as received by node B over TCP connection

The intersection of these two sets will provide all working bidirectional communication paths between node A and node B (without routing, that is.)
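A minimal sketch of that intersection, from node A's point of view. Representing each entry as a (local address, remote address) pair is an assumption of this sketch, as are the type and function names; the addresses are the ones used in the example that follows.

```rust
use std::collections::HashSet;
use std::net::Ipv6Addr;

/// A candidate path seen from node A: (A's local address, B's remote address).
type Link = (Ipv6Addr, Ipv6Addr);

fn addr(s: &str) -> Ipv6Addr {
    s.parse().expect("valid IPv6 address")
}

/// Paths confirmed in both directions are the ones present in both sets.
fn bidirectional(heard_from_b: &HashSet<Link>, b_heard_from_a: &HashSet<Link>) -> HashSet<Link> {
    heard_from_b.intersection(b_heard_from_a).copied().collect()
}

fn main() {
    // Set 1: B's advertisements that A received over multicast UDP.
    let heard_from_b: HashSet<Link> = HashSet::from([
        (addr("fe80::a0"), addr("fe80::b0")),
        (addr("fe80::a1"), addr("fe80::b1")),
    ]);

    // Set 2: A's advertisements that B reports, over TCP, as received.
    // Here B never heard A on the second link, e.g. because of a one-way fault.
    let b_heard_from_a: HashSet<Link> = HashSet::from([(addr("fe80::a0"), addr("fe80::b0"))]);

    // Only the first link survives the intersection.
    let usable = bidirectional(&heard_from_b, &b_heard_from_a);
    println!("usable paths: {:?}", usable);
}
```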

An example

Consider the following scenario:

 -----------------                                            -----------------
 |               |  eth0(fe80::a0)           eth0(fe80::b0)   |               |
 |      Node     |--------------------------------------------|      Node     |
 |               |                                            |               |
 |       A       |  eth1(fe80::a1)           eth1(fe80::b1)   |       B       |
 |               |--------------------------------------------|               |
 |               |                                            |               |
 -----------------                                            -----------------

There are two network interfaces attached to each node, namely eth0 and eth1. The network interfaces are shown above with their respective link local addresses.

What happens if Node A sends an advertisement via eth0?

  ----------                                   ----------
  | Node A |                                   | Node B |
  ----------                                   ----------
      |                                            |
      | Advertisement(name: A)                     |
      |------------------------------------------->|
      | eth0                                  eth0 |
      |                                            |
      |                                TCP connect |
      |<-------------------------------------------|
      | eth0                                  eth0 |
      |                                            |

Once node B receives the advertisement, it establishes a TCP connection back to node A. Now that the TCP connection is established, both nodes A and B are aware that bidirectional communication is possible between them via the eth0 network interface.

Discovery of the second link (eth1 -- eth1) happens without creating another TCP connection.

  ----------                                                    ----------
  | Node A |                                                    | Node B |
  ----------                                                    ----------
      |                                                             |
      | Advertisement(name: A)                                      |
      |------------------------------------------------------------>|
      | eth0                                                   eth0 |
      |                                                             |
      |                                                 TCP connect |
      |<------------------------------------------------------------|
      | eth0                                                   eth0 |
      |                                                             |
      | Advertisement(name: A)                                      |
      |------------------------------------------------------------>|
      | eth1                                                   eth1 |
      |                                                             |
      |         Heartbeat(received_adverts: [fe80::a1 -> fe80::b1]) |
      |<------------------------------------------------------------|
      | eth0                                                   eth0 |
      |                                                             |
      |                                      Advertisement(name: B) |
      |<------------------------------------------------------------|
      | eth1                                                   eth1 |
      |                                                             |
      | Heartbeat(received_adverts: [fe80::b1 -> fe80::a1])         |
      |------------------------------------------------------------>|
      | eth0                                                   eth0 |
      |                                                             |

The nodes share the received advertisements over the already established TCP connection. They aren't sent on demand, but rather as part of periodic heartbeat messages sent over the TCP connection.

Testing time

Network namespaces, without root

Linux has network namespaces. Basically, you divide the Linux network stack into partitions; each partition behaves as a separate computer from a networking point of view. You can easily do this using the unshare --net command, which gives you a shell running in a new, isolated network namespace.

# ip link show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if2: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
    link/ether 2a:66:b1:f2:e8:80 brd ff:ff:ff:ff:ff:ff link-netnsid 0
# unshare --net
# ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
#

You can make these isolated namespaces reachable by either moving real network interfaces into them, or by creating veth pairs and moving one of them into the network namespace.
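For the veth route, the two steps can be sketched as a small helper that builds the ip commands to run. The function name is hypothetical; the ip link invocations themselves are the same ones used in the shell sessions later in this article, with "temp" as an arbitrary temporary name for the far end.

```rust
/// Build the two `ip` commands that create a veth pair and move one end
/// into the network namespace of the process `peer_pid`.
/// The far end is created as "temp" and renamed while it is being moved,
/// so both ends can end up with the same name in their own namespaces.
fn veth_commands(local_if: &str, peer_if: &str, peer_pid: u32) -> Vec<String> {
    vec![
        format!("ip link add {} type veth peer name temp", local_if),
        format!("ip link set temp netns {} name {}", peer_pid, peer_if),
    ]
}

fn main() {
    // The PID is a placeholder for a process living in the target namespace.
    for cmd in veth_commands("eth0", "eth0", 211458) {
        println!("{}", cmd);
    }
}
```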

One small problem: you need to be root to unshare a network namespace, and you don't want to run your test suite as root.

$ unshare --net
unshare: unshare failed: Operation not permitted
$

Fortunately, we have user namespaces. For this article, all you need to understand is that by unsharing a user namespace, you can create a sandbox where your process has root access within the box, but not outside it.

$ unshare --user --map-root-user
# ls -l .bashrc /usr/bin/bash
-rw-r--r-- 1 root   root       933 Feb  6  2023 .bashrc
-rwxr-xr-x 1 nobody nobody 1112880 Jan 16 16:18 /usr/bin/bash
#

In a nutshell, my UID has been mapped to root and all other UIDs have been made inaccessible. This does not give me any special privileges.

# : > /usr/bin/bash
-bash: /usr/bin/bash: Permission denied
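The mapping itself can be inspected in /proc/self/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside it, and the length of the range. Below is a small parsing sketch; the type and function names are made up for illustration, and the UID 1000 in the comment is an assumed example value.

```rust
/// One line of /proc/<pid>/uid_map: `count` IDs starting at `inside`
/// (within the namespace) map to IDs starting at `outside` (in the parent).
#[derive(Debug, PartialEq)]
struct IdMap {
    inside: u32,
    outside: u32,
    count: u32,
}

/// Parse the contents of a uid_map file, skipping lines that do not
/// consist of three unsigned numbers.
fn parse_uid_map(text: &str) -> Vec<IdMap> {
    text.lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace().map(|f| f.parse::<u32>().ok());
            Some(IdMap {
                inside: fields.next()??,
                outside: fields.next()??,
                count: fields.next()??,
            })
        })
        .collect()
}

fn main() {
    // Inside `unshare --user --map-root-user`, a user whose UID is 1000 would
    // be expected to see a single entry mapping UID 0 to 1000 with count 1.
    match std::fs::read_to_string("/proc/self/uid_map") {
        Ok(text) => println!("{:?}", parse_uid_map(&text)),
        Err(e) => println!("could not read uid_map: {}", e),
    }
}
```

A range of length one is why every other owner shows up as nobody in the ls output above: those UIDs have no mapping inside the namespace.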

But the important bit, can I unshare a network namespace now?

$ unshare --user --map-root-user
# ls -l .bashrc /usr/bin/bash
-rw-r--r-- 1 root   root       933 Feb  6  2023 .bashrc
-rwxr-xr-x 1 nobody nobody 1112880 Jan 16 16:18 /usr/bin/bash
# unshare --net
# ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
#

Yes!

Remote controlled worker process

User namespaces solve the problem of emulating a network scenario (like the one shown in the first figure) without any virtualization or container runtime. But how would you set the namespaces up? There is nsenter, but used as below, joining only the network namespace, it needs the actual root user to work.

$ unshare --user --map-root-user --net sleep infinity &
[1] 208920
$ nsenter --net=/proc/208920/ns/net -- ip link show
nsenter: reassociate to namespace 'ns/net' failed: Operation not permitted
$ sudo nsenter --net=/proc/208920/ns/net -- ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
$

By the way, the /proc/<pid>/ns/net file represents the network namespace of a given process.
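That file is a symbolic link whose target names the namespace, in the form net:[inode]. The sketch below reads it for the current process; the function names are hypothetical. Comparing these identifiers is a common way to check whether two processes share a network namespace.

```rust
use std::fs;
use std::io;

/// Read the target of /proc/<pid>/ns/net, e.g. "net:[4026531840]".
fn netns_link(pid: u32) -> io::Result<String> {
    let target = fs::read_link(format!("/proc/{}/ns/net", pid))?;
    Ok(target.to_string_lossy().into_owned())
}

/// Extract the inode number from a link target such as "net:[4026531840]".
/// Returns None if the text does not have that shape.
fn netns_inode(link: &str) -> Option<u64> {
    link.strip_prefix("net:[")?.strip_suffix(']')?.parse().ok()
}

fn main() {
    match netns_link(std::process::id()) {
        Ok(link) => println!("my network namespace: {} (inode {:?})", link, netns_inode(&link)),
        Err(e) => println!("could not read namespace link: {}", e),
    }
}
```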

So I made a remote-controlled worker process that executes commands from within a network namespace. It connects to the test suite, which listens on a Unix domain socket.

The first worker process is run in a user namespace via unshare; I call it the hub process. Once the hub process connects to the test suite, it can be made to spawn additional worker processes, each running in a separate network namespace via unshare. All the worker processes connect to the same Unix domain socket and can be controlled individually, in the order determined by the test suite.
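The control channel can be sketched with the standard library alone. This is not the code from the repository: threads stand in for the worker processes, nothing is actually executed, and the line-based protocol (a worker announces its name, then answers each command line with one reply line) is invented for illustration.

```rust
use std::collections::HashMap;
use std::io::{BufRead, BufReader, Write};
use std::os::unix::net::{UnixListener, UnixStream};
use std::path::Path;
use std::thread;

/// Stand-in for a worker process: connect, announce a name, then answer
/// every command line with a reply line. A real worker would run the command.
fn worker(socket: &Path, name: &str) -> std::io::Result<()> {
    let mut stream = UnixStream::connect(socket)?;
    writeln!(stream, "{}", name)?;
    let reader = BufReader::new(stream.try_clone()?);
    for line in reader.lines() {
        writeln!(stream, "{} ran: {}", name, line?)?;
    }
    Ok(())
}

/// Stand-in for the test suite: accept the workers, then drive them one by one.
fn run_demo() -> std::io::Result<Vec<String>> {
    let path = std::env::temp_dir().join(format!("worker-sketch-{}.sock", std::process::id()));
    let _ = std::fs::remove_file(&path); // clear a stale socket, if any
    let listener = UnixListener::bind(&path)?;

    let handles: Vec<_> = ["a", "b"]
        .iter()
        .map(|name| {
            let path = path.clone();
            let name = name.to_string();
            thread::spawn(move || worker(&path, &name))
        })
        .collect();

    // Workers may connect in any order, so index them by announced name.
    let mut workers: HashMap<String, (BufReader<UnixStream>, UnixStream)> = HashMap::new();
    for _ in 0..2 {
        let (stream, _) = listener.accept()?;
        let mut reader = BufReader::new(stream.try_clone()?);
        let mut name = String::new();
        reader.read_line(&mut name)?;
        workers.insert(name.trim().to_string(), (reader, stream));
    }

    // The test suite alone decides which worker runs what, and when.
    let mut replies = Vec::new();
    for (name, cmd) in [("b", "ip link set eth0 up"), ("a", "ping -c 1 192.168.0.2")] {
        let (reader, stream) = workers.get_mut(name).expect("worker connected");
        writeln!(stream, "{}", cmd)?;
        let mut reply = String::new();
        reader.read_line(&mut reply)?;
        replies.push(reply.trim().to_string());
    }

    drop(workers); // closing the sockets ends the workers' read loops
    for handle in handles {
        let _ = handle.join().expect("worker thread panicked");
    }
    std::fs::remove_file(&path)?;
    Ok(replies)
}

fn main() -> std::io::Result<()> {
    for reply in run_demo()? {
        println!("{}", reply);
    }
    Ok(())
}
```

Because each command waits for its reply before the next one is sent, the steps of a scenario run in a fixed sequence even though the workers themselves run concurrently.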

Why do I need a hub process? That is because each network namespace needs to be under a common user namespace so that network interfaces can be moved between them.

$ unshare --user --map-root-user --net sleep infinity &
[1] 211416
$ unshare --user --map-root-user --net
# ip link add eth0 type veth peer name temp
# ip link set temp netns 211416 name eth0
RTNETLINK answers: Operation not permitted
#

Having a hub process makes it easy to have a common user namespace.

$ unshare --user --map-root-user
# unshare --net sleep infinity &
[1] 211458
# ip link add eth0 type veth peer name temp
# ip link set temp netns 211458 name eth0
# ip link show
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
3: eth0@if2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 66:e8:e1:8a:99:c0 brd ff:ff:ff:ff:ff:ff link-netnsid 0

Which means I finally have a decent way to set up a test scenario while keeping the test cases readable.

#[tokio::test]
async fn ping() {
    let mut hub = Hub::new().await.unwrap();

    let mut a = hub.namespace().await.unwrap();
    let mut b = hub.namespace().await.unwrap();

    //Connect two namespaces
    a.connect("eth0", "eth0", &mut b).await.unwrap();

    //Assign IP addresses
    a.sh("ip addr add 192.168.0.1/24 dev eth0").await.unwrap();
    a.sh("ip link set eth0 up").await.unwrap();
    b.sh("ip addr add 192.168.0.2/24 dev eth0").await.unwrap();
    b.sh("ip link set eth0 up").await.unwrap();

    //Ping test
    a.sh("ping -c 1 192.168.0.2").await.unwrap();
}

You can check out the code behind it at https://gitlab.com/akash_rawal/nwlab2/-/tree/master/src/worker?ref_type=heads.

If you browse the rest of the repository, you can also find tests and implementation of the basic neighbor discovery algorithm that I was talking about.

Conclusion

My first attempt at writing tests consisted of using separate code blocks to initialize each network namespace, and then making them callable from main. That grew ugly very quickly. I mean, just look at it.

pub fn routine() -> Routine {
    Routine::from(|| async move {
        cmd::run(&mut cmd::new_ns(".a")).await.unwrap();
    }).push(".a", Routine::from(|| async move {
        cmd::init_netns().await.unwrap();

        //Spawn a child in new network namespace and send it the other end
        let mut child = cmd::fork_ns(".c").spawn().unwrap();

        //Connect to it
        let pid = child.id().unwrap();
        cmd::new_veth("eth0", "eth0", pid).await.unwrap();

        //Assign IP address
        cmd::bash("ip addr add 172.16.0.1/16 dev eth0").await.unwrap();
        cmd::bash("ip link set eth0 up").await.unwrap();

        //Run echo client
        crate::svc::echo_client("172.16.0.2:7").await.unwrap();

        //Wait for child
        child.wait().await.unwrap();

        log::info!("Client ended");
    }).push(".c", Routine::from(|| async {
        cmd::init_netns().await.unwrap();
        
        //Wait for the network link
        cmd::wait_for_link("eth0").await.unwrap();
        log::info!("From child:");
        cmd::bash("ip link show").await.unwrap();

        //Assign IP address
        cmd::bash("ip addr add 172.16.0.2/16 dev eth0").await.unwrap();
        cmd::bash("ip link set eth0 up").await.unwrap();

        //Run echo server
        crate::svc::echo_server_1("172.16.0.2:7").await;

        log::info!("Server ended");
    })))
}

It is a very similar test case, but you see that cmd::wait_for_link("eth0") in there? That is a busy wait that repeatedly checks whether the given network interface exists. With the worker-based approach, I am quite happy with how easily I can set up test scenarios. The neighbor discovery itself is not the star of this show; it takes ~5 seconds to converge and needs a lot more work.