Introduction
This is a story of ARP Proxy going rogue. Writing the story down took longer than we expected, so it’s split into two posts.
In the first part we explained what proxy ARP is and how it’s used in GRNET Ganeti clusters to provide public IPv4 addresses to guest VMs. We also described the incident of a certain host hijacking all IPv4 addresses within a VLAN.
In this second part we track down this particular behavior by reading the Linux source code, setting up a Debian Buster testbed environment with network namespaces, and playing around with Python scapy, the BPF Compiler Collection (BCC) toolkit and Linux kernel static tracepoints.
We assume the reader is familiar with basic Linux networking. Even if not, do read on if you fancy Linux kernel and low-level networking stuff.
ARP Proxy going rogue, part 2: tracing the kernel
Simulating the routed setup environment within a virtual machine
So we wanted to experiment with Proxy ARP and we needed a more flexible environment where it would be safe to mess with the kernel. That’s why we decided to reproduce the networking setup of a Ganeti bare-metal node within a virtual machine. We chose Debian Buster (currently considered testing) to get a fresh kernel, 4.17.8.
Remember that, as depicted in the initial schema, each node has a bond0 logical interface and a bond0.90 VLAN interface; routing then takes place towards the tap interface, and finally packets arrive at eth0 within the guest. One can imagine these as two pairs of interfaces, both of which can be implemented with veth pairs inside the testbed VM.
ip link add bond0 type veth peer name vlan0
for i in bond0 vlan0 ; do ip link set dev $i up ; done
This creates a veth pair of interfaces, bond0 and vlan0, which form a pipe simulating the bond0 and bond0.90 interfaces. No VLAN (802.1Q) tagging or untagging takes place, as it doesn’t actually affect the issue we’re investigating.
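For reference, if one did want actual 802.1Q tagging in the testbed, a VLAN subinterface could be stacked on top of bond0 instead; a hedged sketch, not needed for what follows:
# optional: a real tagged subinterface on top of bond0 (not used here)
ip link add link bond0 name bond0.90 type vlan id 90
ip link set dev bond0.90 up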
Then we simulate the tap-eth0 pair with another veth pair:
ip link add tap0 type veth peer name guest0
The guest0 interface would normally reside within a KVM guest instance, but let’s not resort to nested virtualization. Instead we can leverage Linux namespaces and put it within a network namespace:
ip netns add guest
ip link set guest0 netns guest
Let’s assume that the guest uses an IPv4 address (which we’ll proxy-ARP later on) from a 10.0.0.0/29 subnet:
ip netns exec guest ip a add 10.0.0.2/29 dev guest0
ip netns exec guest ip link set dev guest0 up
ip link set dev tap0 up
Let’s also finalize the “routed” network setup by creating and populating the separate routing table:
ip r add 10.0.0.0/29 dev vlan0
echo "999 vlan0" >> /etc/iproute2/rt_tables
ip r add 10.0.0.0/29 dev vlan0 table vlan0
ip r add default via 10.0.0.1 dev vlan0 table vlan0
ip r add 10.0.0.2 dev tap0 table vlan0
Enable proxy_arp and IP forwarding on both interfaces:
echo "1" > /proc/sys/net/ipv4/conf/vlan0/proxy_arp
echo "1" > /proc/sys/net/ipv4/conf/tap0/proxy_arp
echo "1" > /proc/sys/net/ipv4/conf/vlan0/forwarding
echo "1" > /proc/sys/net/ipv4/conf/tap0/forwarding
and put the rules in place:
ip rule add iif vlan0 lookup vlan0
ip rule add iif tap0 lookup vlan0
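As a quick sanity check, the rules and the per-table routes should now show up with:
ip rule show
ip route show table vlan0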
Apart from a few minor details (mangling ARP requests with arptables), the setup can be considered sufficient.
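For completeness, such mangling might look roughly like the following hypothetical rule (the exact rule in a real setup may differ), rewriting the sender IP of ARP requests leaving vlan0 to the subnet’s gateway address:
# hypothetical sketch: ARP requests (opcode 1) sent out vlan0 get the
# gateway’s IP as their sender address
arptables -A OUTPUT -o vlan0 --opcode 1 -j mangle --mangle-ip-s 10.0.0.1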
Injecting ARP request packets with python scapy
Using tcpdump to observe ARP traffic on the vlan0 interface (the other end of our simulated bond0 pipe) is straightforward:
➜ alexaf-arp-buster ~ tcpdump -evni vlan0 arp
18:17:35.507324 ARP, Request who-has 10.0.0.2 tell 10.0.0.1, length 28
18:17:36.146665 ARP, Reply 10.0.0.2 is-at fa:2e:d3:ca:37:42, length 28
Then one needs a way to craft and inject ARP packets. scapy comes in handy when troubleshooting issues with your network, and at GRNET we had already used it back in the days when the shiny-new IP-Clos installation was giving us a hard time :)
With scapy one is able to forge a wide variety of packets and protocols, manipulating every aspect of them.
Our case essentially requires injecting ARP “who-has” packets with different target IPv4 addresses. Remember, we’d like to verify under which circumstances a proxy_arp-enabled host will reply to ARP requests it receives from the network.
I came up with this simple python script:
#!/usr/bin/env python
from sys import argv
from scapy.sendrecv import sendp
from scapy.layers.l2 import Ether, ARP

sendp(Ether(dst='ff:ff:ff:ff:ff:ff', src='de:ad:be:ef:de:ad') /
      ARP(op=ARP.who_has, hwsrc='de:ad:be:ef:de:ad',
          psrc='10.0.0.1', pdst=argv[1]),
      iface='bond0')
executed like this:
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.2
.
Sent 1 packets.
scapy constructs an Ethernet frame with a broadcast destination MAC address, containing an ARP “who-has” request with a variable destination (or target) IP, and this packet is injected into bond0. Since bond0 is one end of the veth pair, the packet will necessarily appear on the vlan0 interface. This simulates the real situation where the physical host receives a packet on bond0, removes the VLAN tag and moves it down to the bond0.90 interface.
Test cases
For the rest of this post I’m going to probe and trace the kernel while playing with the host’s routes. Namely, consider these cases:
- case A: the normal condition, where the ip rules and routing tables are intact
➜ alexaf-arp-buster ~ ip r show table vlan0
default via 10.0.0.1 dev vlan0
10.0.0.0/29 dev vlan0 scope link
10.0.0.2 dev tap0 scope link
- case B: the problematic condition, where the routing tables have no entry for the vlan0 interface (remember, such entries were removed during the incident by bringing the interface down and up)
➜ alexaf-arp-buster ~ ip r
default via 83.212.173.1 dev eth0
83.212.173.0/24 dev eth0 proto kernel scope link src 83.212.173.108
➜ alexaf-arp-buster ~ ip r show table vlan0
10.0.0.2 dev tap0 scope link
So case A is the all-good scenario; case B is the problematic one.
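In the testbed, case B can be reproduced much like it presumably happened during the incident, by bouncing the vlan0 interface: on link-down the kernel flushes the routes through vlan0 from every table, and on link-up nothing re-adds them, so table vlan0 is left with only the tap0 host route:
# reproduce case B: flush vlan0 routes from all tables via link down/up
ip link set dev vlan0 down
ip link set dev vlan0 up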
asciinema break
If you enjoy screencasts, you can watch the aforementioned setup and ARP injections in:
Dive into the kernel source
Enough with the foreplay though! The essence of this post is diving into the Linux kernel source code (mainly because I spent several hours reading it, I suppose), and joyfully playing with kernel probing and tracing.
Our goal is to examine under which circumstances the proxy_arp host replies to ARP requests on behalf of other hosts’ IPs.
Given that our testbed is a Debian VM, it makes sense to fetch the source of the Debian linux image package, instead of fetching the upstream tarball or repo directly. That’s useful for two reasons: a) the Debian source package also includes any Debian patches to the kernel, b) having the Debian source package makes it pretty easy to recompile the kernel.
Although one may use ‘apt-get source’, I generally prefer dget to fetch the Debian source of a specific package and version:
dget http://deb.debian.org/debian/pool/main/l/linux/linux_4.17.8-1.dsc
dget will fetch both the upstream and the Debian tarballs, GPG-verify them, extract them and apply the Debian patches. Afterwards it’s pretty easy to apply your own patches and rebuild the package with, say, pbuilder.
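A hedged sketch of that rebuild workflow, assuming an already configured pbuilder environment:
cd linux-4.17.8
# ...apply your own patches to the tree...
pdebuild    # builds the package inside the pbuilder chroot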
Now, although this is handy, one should also have the linux-stable git repository on the side just to be able to scroll through the commit history of the code they are interested in.
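For instance, something along these lines (the clone URL is the kernel.org stable tree):
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git linux-stable
cd linux-stable
git log --oneline v4.17.8 -- net/ipv4/arp.c    # history of the file we care about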
The kernel source, where do we start?
The Linux source land is vast and one needs some hints to start wandering. net/ipv4/arp.c is where most of the ARP functionality is implemented, so it makes sense to start looking there.
The first reference to proxy_arp is in the arp_fwd_proxy function. But this is a helper function, called by the arp_process [1] function. The latter is a function we’ll pay close attention to, as it’s in fact where the kernel decides if and how to react to an ARP request.
static int arp_process(struct net *net, struct sock *sk, struct sk_buff *skb)
The ARP packet’s fields are extracted from the socket buffer structure: source hardware address (sha), target hardware address (tha), source IP (sip) and target IP (tip). These are needed for the kernel to decide on further processing.
arp = arp_hdr(skb);
[...]
arp_ptr = (unsigned char *)(arp + 1);
sha = arp_ptr;
arp_ptr += dev->addr_len;
memcpy(&sip, arp_ptr, 4);
arp_ptr += 4;
tha = arp_ptr;
arp_ptr += dev->addr_len;
memcpy(&tip, arp_ptr, 4);
Skipping some (now irrelevant) checks, we find the first important if clause [2]:
if (arp->ar_op == htons(ARPOP_REQUEST) &&
    ip_route_input_noref(skb, tip, sip, 0, dev) == 0) {
This is crucial for further processing and replying to the request. The first condition holds (we know we inject an ARP request, “who-has”). The second, the ip_route_input_noref call, needs further investigation.
ip_route_input_noref / Routing function calls
ip_route_input_noref is called with the parameters seen above, the target IP (the content of the ARP “who-has”) among them. As the name suggests, this function has to do with routing; it is in fact defined in net/ipv4/route.c and is called from various non-ARP-related places. This is an important crossroads: it marks our passage into routing territory, and it’s one reason why ARP is so closely connected to routing.
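For reference, its signature; note that the tip and sip extracted in arp_process arrive here as daddr and saddr respectively, with tos set to 0:
int ip_route_input_noref(struct sk_buff *skb, __be32 daddr, __be32 saddr,
                         u8 tos, struct net_device *dev)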
Dynamically tracing function calls with bpfcc
Wouldn’t it be great if, along with reading the source, we had real-time feedback of what’s going on in the kernel? For example, shouldn’t we verify the return value of ip_route_input_noref for various (injected) ARP requests?
Meet the BPF Compiler Collection (BCC) tools! In this post I just scratch the surface. BCC provides a really wonderful toolset for low-level Linux troubleshooting and exploration. Read http://www.brendangregg.com/ebpf.html for more.
After apt-installing ‘bpfcc-tools’ I was able to simply run:
➜ alexaf-arp-buster ~ trace-bpfcc 'r::ip_route_input_noref "ret: %d", retval'
PID TID COMM FUNC
5234 5234 python ip_route_input_noref ret: 0
5247 5247 python ip_route_input_noref ret: -22
for these injected ARP requests respectively:
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.2 &
[1] 5234
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.6 &
[1] 5247
This is the expected behavior; we’re in case A.
On the other hand, in case B, the problematic situation, ip_route_input_noref returns 0 for 10.0.0.6 as well. This means we’re on the right path.
Further into linux routing, fib_table_lookup
So ip_route_input_noref calls ip_route_input_rcu [3], which in turn calls ip_route_input_slow [4].
ip_route_input_slow is a rich function with multiple checks, for example checking for martians based on the destination and source IP addresses. Further down, fib_lookup [5] is called and, in turn, fib_table_lookup [6], i.e. the destination IP is looked up in the routing tables. Although this seems relevant, fib_table_lookup will in fact return the same values in both cases A and B for every destination IP. That’s because, while traversing the routing tables, there will always be a match, even if that’s the catch-all default gateway.
Return values don’t help here, but what exactly happens within ‘fib_table_lookup’? Now this is a slightly more complex function to walk through. At the end of the function one may spot this line:
trace_fib_table_lookup_nh(nh);
This is not another kernel function. What is it then?
Verifying routing tables lookups with kernel static tracepoints
Lines like the above are static tracepoints: scarce within the source, but carefully placed by the kernel developers at locations chosen to help both developers and system administrators track down common problems in various subsystems.
Static tracepoints have been around for quite some time. trace-bpfcc can actually handle static tracepoints as well, but I decided to follow the traditional way, just for the sake of it.
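Just for the record, attaching to the same static tracepoint with trace-bpfcc would look something like this hedged one-liner (oif being one of the tracepoint’s fields, as we’ll see right below):
trace-bpfcc 't:fib:fib_table_lookup_nh "oif: %d", args->oif'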
Under /sys/kernel/debug/tracing/events we’ll find the registered trace events, while looking at fib/fib_table_lookup_nh/format we can peek into the specific tracepoint’s message format and variables:
print fmt: "nexthop dev %s oif %d src %pI4", __get_str(name), REC->oif, REC->src
Let’s put it into action (from within /sys/kernel/debug/tracing):
echo 1 > events/fib/fib_table_lookup_nh/enable
and inject two ARP requests under case A:
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.2
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.6
these commands give us the following trace lines respectively (read from the tracing ring buffer, e.g. with cat trace_pipe):
python-22575 [000] ..s1 51274.868742: fib_table_lookup_nh: nexthop dev tap0 oif 6 src 83.212.173.108
python-22588 [000] ..s1 51283.168723: fib_table_lookup_nh: nexthop dev vlan0 oif 3 src 83.212.173.108
Notice that 10.0.0.2 will exit tap0 while 10.0.0.6 will exit vlan0. That’s normal: there is no host-specific route for 10.0.0.6, so the lookup in table vlan0 falls back to the broader vlan0 entries, which point right back out the interface the request came from.
Now let’s move to case B:
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.6
then:
python-22675 [000] ..s1 51410.344435: fib_table_lookup_nh: nexthop dev eth0 oif 2 src 83.212.173.108
The packet would be routed out the eth0 interface, since no route matched in table vlan0 and the lookup continued into the main table:
python-22703 [000] ..s1 51497.300093: fib_table_lookup: table 255 oif 0 iif 3 src 10.0.0.1 dst 10.1.0.55 tos 0 scope 0 flags 0
python-22703 [000] ..s1 51497.300095: fib_table_lookup: table 999 oif 0 iif 3 src 10.0.0.1 dst 10.1.0.55 tos 0 scope 0 flags 0
python-22703 [000] ..s1 51497.300096: fib_table_lookup: table 254 oif 0 iif 3 src 10.0.0.1 dst 10.1.0.55 tos 0 scope 0 flags 0
So in any case, whether it’s a to-be-proxy-ARP-ed IP or not, fib_table_lookup will match a route and will return something.
The important bit here is this: fib_table_lookup also populates the res variable, which is basically a pointer to a fib_result structure:
int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
struct fib_result *res, int fib_flags)
So, tracing back: ip_route_input_slow calls fib_lookup, which calls fib_table_lookup; after traversing the routing tables for a given destination IPv4 address, it fills the “res” variable with the lookup findings.
__mkroute_input
Continuing our journey, ip_route_input_slow finally ends up calling ip_mkroute_input [7], which in turn calls __mkroute_input [8]:
__mkroute_input(skb, res, in_dev, daddr, saddr, tos);
Mind that res is passed to the call along with the other parameters.
__mkroute_input then gets a “working reference to the output device”:
out_dev = __in_dev_get_rcu(FIB_RES_DEV(*res));
and finally(!!) there is this code:
if (skb->protocol != htons(ETH_P_IP)) {
    if (out_dev == in_dev &&
        IN_DEV_PROXY_ARP_PVLAN(in_dev) == 0) {
        err = -EINVAL;
        goto cleanup;
    }
}
This checks the socket buffer’s protocol: if it’s not IP, it must be ARP. It then checks whether out_dev (the outgoing interface), as discovered earlier by the fib lookup, equals in_dev (the incoming interface); the extra IN_DEV_PROXY_ARP_PVLAN condition caters for Private VLAN setups and is off by default. If the devices match, the function returns -EINVAL, or -22, which is exactly what we were seeing here:
➜ alexaf-arp-buster ~ trace-bpfcc 'r::ip_route_input_noref "ret: %d", retval'
PID TID COMM FUNC
5234 5234 python ip_route_input_noref ret: 0
5247 5247 python ip_route_input_noref ret: -22
pr_debug, the last resort
At this point I wanted to be sure about the out_dev == in_dev code path. I had already recompiled the kernel a couple of times to add “EXPORT_SYMBOL” for some functions that wouldn’t otherwise get kprobed. So I just went ahead and patched:
--- linux-4.17.8.orig/net/ipv4/route.c
+++ linux-4.17.8/net/ipv4/route.c
@@ -1719,7 +1719,8 @@ static int __mkroute_input(struct sk_buf
 	 */
 	if (out_dev == in_dev &&
 	    IN_DEV_PROXY_ARP_PVLAN(in_dev) == 0) {
+		pr_debug("ARP_DEBUG: %pI4 out_dev == in_dev \n", &daddr);
 		err = -EINVAL;
 		goto cleanup;
 	}
then:
➜ alexaf-arp-buster ~ echo 8 > /proc/sys/kernel/printk
➜ alexaf-arp-buster ~ echo 'file net/ipv4/route.c +p' > /sys/kernel/debug/dynamic_debug/control
➜ alexaf-arp-buster ~ ./my_arp.py 10.0.0.6
.
Sent 1 packets.
➜ alexaf-arp-buster ~ dmesg | grep ARP_DEBUG
[ 585.324901] IPv4: ARP_DEBUG: 10.0.0.6 out_dev == in_dev
Where does this leave us?
So after this, probably tiring, excursion into the source code, and after multiple snippets, where do we stand?
What have we learned?
For every ARP request, Linux will look up the target IP in the routing tables. Most probably a route will be found, even if it’s just the catch-all default gateway. Then, provided proxy_arp is enabled, an ARP reply will be produced if and only if the host would route traffic to that target IP through a device different from the one the ARP request arrived on.
This makes sense for all cases where the host essentially interconnects two different broadcast domains that share the same IP subnet addressing. This is in fact the case for our Ganeti “routed networks” setup.
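To condense the above into a few lines, here is a minimal Python sketch — my own paraphrase, not kernel code — of the decision path we traced. route_lookup stands in for fib_lookup/fib_table_lookup and is assumed to return the egress device for a given target IP, or None on failure.

def should_proxy_arp_reply(in_dev, target_ip, proxy_arp_enabled, route_lookup):
    """Paraphrase of the arp_process / __mkroute_input path traced above."""
    if not proxy_arp_enabled(in_dev):
        return False
    # fib_table_lookup practically always matches something, even if it is
    # only the catch-all default gateway entry
    out_dev = route_lookup(target_ip)
    if out_dev is None:
        return False            # ip_route_input_noref failed: no reply
    # the __mkroute_input check: replying only makes sense when the route
    # for the target points out a *different* interface than the one the
    # ARP request arrived on
    return out_dev != in_dev

In case B, route_lookup("10.0.0.6") matches the default route and returns eth0; eth0 != vlan0, so the host happily proxy-replies.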
Pondering over the default gateway
The problematic behavior arises when, while traversing the host’s routing tables, an “out_dev != in_dev” entry is found, and that “out_dev” basically corresponds to the host’s default gateway.
The big question is this: should the default gateway routing entry be considered in this process at all? The default gateway is a catch-all entry, so even an ARP “who-has” for 8.8.8.8 would be answered:
➜ alexaf-arp-buster ~ ip r
default via 83.212.173.1 dev eth0
10.0.0.0/29 dev vlan0 scope link
83.212.173.0/24 dev eth0 proto kernel scope link src 83.212.173.108
➜ alexaf-arp-buster ~ ip r show table vlan0
10.0.0.0/29 dev vlan0 scope link
10.0.0.2 dev tap0 scope link
➜ alexaf-arp-buster ~ ./my_arp.py 8.8.8.8
.
Sent 1 packets.
➜ alexaf-arp-buster ~ tcpdump -eni vlan0 arp
16:32:41.593373 de:ad:be:ef:de:ad > ff:ff:ff:ff:ff:ff, ethertype ARP (0x0806), length 42: Request who-has 8.8.8.8 tell 10.0.0.1, length 28
16:32:41.844640 32:6b:33:03:b2:c2 > de:ad:be:ef:de:ad, ethertype ARP (0x0806), length 42: Reply 8.8.8.8 is-at 32:6b:33:03:b2:c2, length 28
Of course, legitimate/valid ARP requests only refer to IPv4 addresses within the same subnet. But this illustrates the point that the default gateway should probably not be taken into consideration for proxy ARP.
And to get things crazy as this post ends: how does BSD implement proxy ARP? :P
Links
[1] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/arp.c#L842
[2] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/arp.c#L817
[3] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/route.c#L2103
[4] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/route.c#L2157
[5] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/route.c#L1972
[6] https://elixir.bootlin.com/linux/v4.17.8/source/include/net/ip_fib.h#L330
[7] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/route.c#L1997
[8] https://elixir.bootlin.com/linux/v4.17.8/source/net/ipv4/route.c#L1878