Comparison of Networking Solutions for Kubernetes

Note

Below is a reprint of an article for the MZ Platform research repository. The original research and text were done by me in December 2015; Constantine Molchanov did the editing and layout. To my knowledge, MZ Platform is long gone, so I've decided to post a backup version here.

Kubernetes requires that each container in a cluster has a unique, routable IP. Kubernetes doesn't assign IPs itself, leaving the task to third-party solutions.

In this study, our goal was to find the solution with the lowest latency, the highest throughput, and the lowest setup cost. Since our load is latency-sensitive, our intent was to measure high-percentile latencies at relatively high network utilization. We particularly focused on performance at 30–50% of the maximum load, because we think this range best represents the most common use cases of a non-overloaded system.

Competitors

Docker with --net=host

This was our reference setup; all other competitors are compared against it.

The --net=host option means that containers inherit the IPs of their host machines, i.e. no network containerization is involved.

A priori, a setup without network containerization should perform better than any setup with it; this is why we used this configuration as the reference.
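
For illustration, this is all it takes to run a container in this mode (the nginx image is just an example):

# The container shares the host's network stack: no veth pair, no bridge, no NAT.
$ docker run -d --net=host nginx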

Flannel

Flannel is a virtual network solution maintained by the CoreOS project. It's a well-tested, production-ready solution, so it has the lowest setup cost.

When you add a machine with flannel to the cluster, flannel does three things:

  1. Allocates a subnet for the new machine using etcd

  2. Creates a virtual bridge interface on the machine (called docker0 bridge)

  3. Sets up a packet forwarding backend (see the configuration example after this list):

    aws-vpc

    Registers the machine's subnet in the Amazon VPC route table. The number of records in this table is limited to 50, i.e. you can't have more than 50 machines in a cluster if you use flannel with aws-vpc. Also, this backend works only with Amazon AWS.

    host-gw

    Creates IP routes to subnets via remote machine IPs. Requires direct layer 2 connectivity between hosts running flannel.

    vxlan

    Creates a virtual VXLAN interface.
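
The backend is selected in flannel's network configuration stored in etcd. A minimal example of that configuration (the key is flannel's default; the subnet value is illustrative):

# flannel reads its config from etcd; Backend.Type picks aws-vpc, host-gw or vxlan.
$ etcdctl set /coreos.com/network/config \
    '{"Network": "10.50.0.0/16", "Backend": {"Type": "host-gw"}}'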

Because flannel uses a bridge interface to forward packets, each packet goes through two network stacks when travelling from one container to another.

IPvlan

IPvlan is a driver in the Linux kernel that lets you create virtual interfaces with unique IPs without having to use a bridge interface.

To assign an IP to a container with IPvlan, you have to do three things by hand (sketched in the commands after this list):

  1. Create a container without a network interface at all
  2. Create an ipvlan interface in the default network namespace
  3. Move the interface to the container's network namespace
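
Done by hand, these steps look roughly like the following sketch (the interface name, address, and docker commands are illustrative):

# 1. Start a container without any network interfaces.
$ docker run -d --net=none --name=web nginx
$ PID=$(docker inspect -f '{{.State.Pid}}' web)

# 2. Create an ipvlan interface on top of the host NIC in the default namespace.
$ ip link add link eth0 name ipvl0 type ipvlan mode l2

# 3. Move it into the container's network namespace and configure it there.
$ ip link set ipvl0 netns $PID
$ nsenter -t $PID -n ip addr add 10.50.1.10/16 dev ipvl0
$ nsenter -t $PID -n ip link set ipvl0 up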

IPvlan is a relatively new solution, so there are no ready-to-use tools to automate this process. This makes it difficult to deploy IPvlan with many machines and containers, i.e. the setup cost is high.

However, IPvlan doesn't require a bridge interface and forwards packets directly from the NIC to the virtual interface, so we expected it to perform better than flannel.

Load Testing Scenario

For each competitor, we ran these steps:

  1. Set up networking on two physical machines
  2. Run tcpkali in a container on one machine and let it send requests at a constant rate
  3. Run Nginx in a container on the other machine, let it respond with a fixed-size file
  4. Capture system metrics and tcpkali results

We ran the benchmark with the request rate varying from 50,000 to 450,000 requests per second (RPS).

On each request, Nginx responded with a static file of a fixed size: 350 B (100 B content, 250 B headers) or 4 KB.
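
As a rough illustration of this scenario (file names, addresses, connection counts, and rates below are placeholders, not our exact parameters), the server and client sides can be set up like this:

# Server side: generate fixed-size response bodies for nginx to serve.
$ head -c 100  /dev/zero | tr '\0' 'x' > /usr/share/nginx/html/small.txt
$ head -c 4096 /dev/zero | tr '\0' 'x' > /usr/share/nginx/html/big.txt

# Client side: tcpkali repeatedly sends the same plain HTTP request over a
# pool of connections at a limited message rate.
$ tcpkali --connections 360 --duration 60 --message-rate 140 \
    --message "$(printf 'GET /small.txt HTTP/1.1\r\nHost: bench\r\n\r\n')" \
    10.0.0.2:80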

Results

  1. IPvlan shows the lowest latency and the highest maximum throughput. Flannel with host-gw and aws-vpc follow closely behind; however, host-gw shows better results under maximum load.
  2. Flannel with vxlan shows the worst results in all tests. However, we suspect that its exceptionally poor 99.999 percentile is due to a bug.
  3. The results for a 4 KB response are similar to those for a 350 B response, with two noticeable differences:
    • the maximum RPS point is much lower, because with 4 KB responses it takes only ≈270k RPS to fully load a 10 Gbps NIC
    • IPvlan is much closer to --net=host near the throughput limit

Our current choice is flannel with host-gw. It doesn't have many dependencies (e.g. no AWS or new Linux version required), it's easy to set up compared to IPvlan, and it has sufficient performance characteristics. IPvlan is our backup solution. If at some point flannel adds IPvlan support, we'll switch to it.

Even though aws-vpc performed slightly better than host-gw, its 50-machine limit and the fact that it's hardwired to Amazon AWS are dealbreakers for us.

50 kRPS, 350 B

350 byte request at 50 kRPS, latency graph

At 50,000 requests per second, all candidates show acceptable performance. You can already see the main trend: IPvlan performs the best, host-gw and aws-vpc follow closely behind, and vxlan is the worst.

150 kRPS, 350 B

350 byte request at 150 kRPS, latency graph
Latency percentiles at 150,000 RPS (≈30% of maximum RPS), ms
Setup        95 %ile   99 %ile   99.5 %ile   99.99 %ile   99.999 %ile   Max Latency
IPvlan       0.7       0.9       1           6.7          9.9           18
aws-vpc      0.9       1.1       1.2         6.5          9.8           15.7
host-gw      0.9       1.1       1.2         5.9          9.6           24.3
vxlan        1.2       1.5       1.6         6.6          201.9         405.3
--net=host   0.5       0.6       0.6         4.8          8.9           11.8

IPvlan is slightly better than host-gw and aws-vpc, but it has the worst 99.99 percentile. host-gw performs slightly better than aws-vpc.

250 kRPS, 350 B

350 byte request at 250 kRPS, latency graph

This load is also expected to be common in production, so these results are particularly important.

Latency percentiles at 250,000 RPS (≈50% of maximum RPS), ms
Setup        95 %ile   99 %ile   99.5 %ile   99.99 %ile   99.999 %ile   Max Latency
IPvlan       1         1.2       1.4         6.3          10.1          24.3
aws-vpc      1.2       1.5       1.6         5.6          9.4           27.3
host-gw      1.1       1.4       1.6         8.6          11.2          40.1
vxlan        1.5       1.9       2.1         16.6         202.4         245.5
--net=host   0.7       0.8       0.9         3.7          7.7           16.8

IPvlan again shows the best performance, but aws-vpc has the best 99.99 and 99.999 percentiles. host-gw outperforms aws-vpc in 95 and 99 percentiles.

350 kRPS, 350 B

350 byte request at 350 kRPS, latency graph
350 byte request at 350 kRPS, latency graph without vxlan

In most cases, the latency is close to the 250,000 RPS, 350 B case, but it grows rapidly above the 99.5 percentile, which means that we are getting close to the maximum RPS.

450 kRPS, 350 B

This is the maximum RPS that produced sensible results.

IPvlan leads again with latency ≈30% worse than that of --net=host:

350 byte request at 450 kRPS, latency graph for --net=host
350 byte request at 450 kRPS, latency graph for IPvlan

Interestingly, host-gw performs much better than aws-vpc:

350 byte request at 450 kRPS, latency graph

500 kRPS, 350 B

At 500,000 RPS, only IPvlan still works and even outperforms --net=host, but the latency is so high that we think it would be of no use to latency-sensitive applications.

350 byte request at 500 kRPS, latency graph for net host

50 kRPS, 4 KB

4 kilobyte request at 50 kRPS, latency graph

A bigger response results in higher network usage, but the leaderboard looks pretty much the same as with the smaller response:

Latency percentiles at 50,000 RPS (≈20% of maximum RPS), ms
Setup        95 %ile   99 %ile   99.5 %ile   99.99 %ile   99.999 %ile   Max Latency
IPvlan       0.6       0.8       0.9         5.7          9.6           15.8
aws-vpc      0.7       0.9       1           5.6          9.8           403.1
host-gw      0.7       0.9       1           7.4          12            202.5
vxlan        0.8       1.1       1.2         5.7          201.5         402.5
--net=host   0.5       0.7       0.7         6.4          9.9           14.8

150 kRPS, 4 KB

4 kilobyte request at 150 kRPS, latency graph

Host-gw has a surprisingly poor 99.999 percentile, but it still shows good results for lower percentiles.

Latency percentiles at 150,000 RPS (≈60% of maximum RPS), ms
Setup        95 %ile   99 %ile   99.5 %ile   99.99 %ile   99.999 %ile   Max Latency
IPvlan       1         1.3       1.5         5.3          201.3         405.7
aws-vpc      1.2       1.5       1.7         6            11.1          405.1
host-gw      1.2       1.5       1.7         7            211           405.3
vxlan        1.4       1.7       1.9         6            202.5         1406
--net=host   0.9       1.2       1.3         4.2          9.5           404.7

250 kRPS, 4 KB

4 kilobyte request at 250 kRPS, latency graph without vxlan

This is the maximum RPS with a big response. Unlike in the small-response case, aws-vpc performs much better than host-gw.

vxlan was excluded from the graph once again.

Test Environment

Background

To understand this article and reproduce our test environment, you should be familiar with the basics of high-performance networking.

These articles provide useful insights on the topic:

Machines

  • We use two c4.8xlarge instances on Amazon EC2 with CentOS 7.
  • Both machines have enhanced networking enabled.
  • Each machine is a NUMA system with 2 processors; each processor has 9 cores, and each core has 2 hyperthreads, which effectively allows running 36 threads on each machine.
  • Each machine has a 10 Gbps network interface card (NIC) and 60 GB of memory.
  • To support enhanced networking and IPvlan, we've installed Linux kernel 4.3.0 with Intel's ixgbevf driver.
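
A quick way to confirm which driver actually backs the NIC (assuming the interface is eth0):

# Show the kernel driver bound to the interface; with enhanced networking
# on c4 instances this reports ixgbevf.
$ ethtool -i eth0 | grep ^driver
driver: ixgbevf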

Setup

Modern NICs offer Receive Side Scaling (RSS) via multiple interrupt request (IRQ) lines. EC2 provides only two interrupt lines in a virtualized environment, so we tested several RSS and Receive Packet Steering (RPS) configurations and ended up with the following configuration, partly suggested by the Linux kernel documentation:

IRQ

The first core on each of the two NUMA nodes is configured to receive interrupts from the NIC.

To match a CPU to a NUMA node, use lscpu:

$ lscpu | grep NUMA
NUMA node(s):          2
NUMA node0 CPU(s):     0-8,18-26
NUMA node1 CPU(s):     9-17,27-35

This is done by writing 0 and 9 to /proc/irq/<num>/smp_affinity_list, where IRQ numbers are obtained with grep eth0 /proc/interrupts:

$ echo 0 > /proc/irq/265/smp_affinity_list
$ echo 9 > /proc/irq/266/smp_affinity_list
RPS

We tested several RPS combinations. To improve latency, we offloaded the IRQ-handling CPUs by using only CPUs 1–8 and 10–17 for packet processing. Unlike the IRQ's smp_affinity, the rps_cpus sysfs entry doesn't have a _list counterpart, so we use bitmasks to list the CPUs to which RPS can forward traffic[1]:

$ echo "00000000,0003fdfe" \
    > /sys/class/net/eth0/queues/rx-0/rps_cpus
$ echo "00000000,0003fdfe" \
    > /sys/class/net/eth0/queues/rx-1/rps_cpus
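
As a sanity check, the bitmask for CPUs 1–8 and 10–17 can be derived right in the shell:

# 0xff << 1 sets bits 1-8, 0xff << 10 sets bits 10-17.
$ printf '%x\n' $(( (0xff << 1) | (0xff << 10) ))
3fdfe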
Transmit Packet Steering (XPS)

All NUMA 0 processors (including HyperThreading, i.e. CPUs 0-8, 18-26) were assigned to tx-0 and NUMA 1 processors (CPUs 9-17, 27-35) to tx-1[2]:

$ echo "00000000,07fc01ff" \
    > /sys/class/net/eth0/queues/tx-0/xps_cpus
$ echo "0000000f,f803fe00" \
    > /sys/class/net/eth0/queues/tx-1/xps_cpus
Receive Flow Steering (RFS)

We're planning to use 60k permanent connections; the official documentation suggests rounding this up to the nearest power of two:

$ echo 65536 \
    > /proc/sys/net/core/rps_sock_flow_entries
$ echo 32768 \
    > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt
$ echo 32768 \
    > /sys/class/net/eth0/queues/rx-1/rps_flow_cnt
nginx

nginx uses 18 workers; each worker is pinned to its own CPU (0-17). This is set by the worker_cpu_affinity option:

worker_processes 18;
worker_cpu_affinity 1 10 100 1000 10000 ...;
tcpkali

tcpkali doesn't have built-in CPU affinity support. To make use of RFS, we run tcpkali under taskset and tune the scheduler so that thread migrations happen rarely:

$ echo 10000000 \
    > /proc/sys/kernel/sched_migration_cost_ns
$ taskset -ac 0-17 \
    tcpkali --threads 18 ...

This setup allows us to spread interrupt load across the CPU cores more uniformly and achieve better throughput with the same latency compared to the other setups we have tried.

CPUs 0 and 9 deal exclusively with NIC interrupts and don't serve packets, but they are still the busiest ones:

top(1) CPU usage screenshot; CPUs 0 and 9 have high SI values

Red Hat's tuned was also used with the network-latency profile enabled.

To minimize the influence of nf_conntrack, NOTRACK rules were added.
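
These rules live in iptables' raw table; a minimal sketch (not necessarily our exact rules) looks like this:

# Skip connection tracking for TCP traffic in both directions.
$ iptables -t raw -A PREROUTING -p tcp -j NOTRACK
$ iptables -t raw -A OUTPUT -p tcp -j NOTRACK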

Sysctls were tuned to support a large number of TCP connections:

fs.file-max = 1024000
net.ipv4.ip_local_port_range = "2000 65535"
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_slow_start_after_idle = 0
net.ipv4.tcp_low_latency = 1
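
These settings can be applied at runtime with sysctl; for example (the file name is just an illustration):

# Apply a single setting immediately...
$ sysctl -w net.ipv4.tcp_low_latency=1
# ...or keep the whole block in a file and load it in one go.
$ sysctl -p /etc/sysctl.d/90-benchmark.conf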

Footnotes

[1] Linux kernel documentation: RPS Configuration
[2] Linux kernel documentation: XPS Configuration


—pk,  #linux #network #containers