Selecting Performance Monitoring Tools

Linux Performance Observability Tools by Brendan Gregg (CC BY-SA 4.0)

System monitoring is a helpful approach to provide the user with data regarding the actual timing behavior of the system. Users can perform further analysis using the data that these monitors provide. One of the goals of system monitoring is to determine whether the current execution meets the specified technical requirements.

These monitoring tools retrieve commonly viewed information, and can be used by way of the command line or a graphical user interface, as determined by the system administrator. These tools display information about the Linux system, such as free disk space, the temperature of the CPU, and other essential components, as well as networking information, such as the system IP address and current rates of upload and download.

Monitoring Tools

The Linux kernel maintains counter structures that increment when an event occurs. For example, disk reads and writes and process system calls are events that increment counters, with values stored as unsigned integers. Monitoring tools read these counter values. These tools provide either per-process statistics maintained in process structures, or system-wide statistics maintained in the kernel. Monitoring tools are typically usable by non-privileged users. The ps and top commands provide process statistics, including CPU and memory usage.
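As a rough illustration of how a tool might read such counters, the following Python sketch parses the per-process I/O counters that the kernel exposes in /proc/<pid>/io. The helper names parse_counters and read_io_counters are hypothetical, the sample values are made up, and the file is only present when the kernel provides task I/O accounting, so the sketch falls back to the sample text.

```python
import os

# Sample text in the "name: value" layout of /proc/<pid>/io (values are made up).
SAMPLE_IO = """rchar: 4292
wchar: 0
syscr: 13
syscw: 0
read_bytes: 0
write_bytes: 0
"""


def parse_counters(text):
    """Parse 'name: value' lines into a dict of integer counters."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def read_io_counters(pid="self"):
    """Read the I/O counters of a process, or the sample if unavailable."""
    path = os.path.join("/proc", str(pid), "io")
    try:
        with open(path) as f:
            return parse_counters(f.read())
    except OSError:
        return parse_counters(SAMPLE_IO)


if __name__ == "__main__":
    for name, value in sorted(read_io_counters().items()):
        print("{0:<24}{1}".format(name, value))
```

Reading the file twice and subtracting the values gives the number of events in the interval, which is how counter-based tools typically derive rates.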

Monitoring Processes Using the ps Command

Troubleshooting a system requires understanding how the kernel communicates with processes, and how processes communicate with each other. At process creation, the system assigns a state to the process.

Use the ps aux command to list the processes of all users with extended user-oriented details; the resulting list includes processes that were started from a terminal as well as processes without a terminal. A question mark (?) in the TTY column indicates that the process did not start from a terminal.

[user@host]$ ps aux
USER   PID %CPU %MEM    VSZ   RSS TTY      STAT START TIME COMMAND
user  1350  0.0  0.2 233916  4808 pts/0    Ss   10:00   0:00 -bash
root  1387  0.0  0.1 244904  2808 ?        Ss   10:01   0:00 /usr/sbin/anacron -s
root  1410  0.0  0.0      0     0 ?        I    10:08   0:00 [kworker/0:2...
root  1435  0.0  0.0      0     0 ?        I    10:31   0:00 [kworker/1:1...
user  1436  0.0  0.2 266920  3816 pts/0    R+   10:48   0:00 ps aux

The Linux version of ps supports three option formats:

  • UNIX (POSIX) options, which may be grouped and must be preceded by a dash.
  • BSD options, which may be grouped and must not include a dash.
  • GNU long options, which are preceded by two dashes.

The output below uses the UNIX options to list every process with full details:

[user@host]$ ps -ef
UID        PID  PPID  C STIME TTY          TIME CMD
root         2     0  0 09:57 ?        00:00:00 [kthreadd]
root         3     2  0 09:57 ?        00:00:00 [rcu_gp]
root         4     2  0 09:57 ?        00:00:00 [rcu_par_gp]
...output omitted...

Key Columns in ps Output

PID

This column shows the unique process ID.

TIME

This column shows the total CPU time that the process has consumed since it started, in hours:minutes:seconds format.

%CPU

This column shows the CPU time that the process has used divided by the time that the process has been running, expressed as a percentage.

RSS

This column shows the resident set size: the non-swapped physical memory that the process uses, in kilobytes.

%MEM

This column shows the ratio of the process's resident set size to the physical memory on the machine, expressed as a percentage.
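The %MEM ratio can be reproduced by hand. The sketch below assumes a hypothetical helper named percent_mem and uses the RSS of the -bash process from the ps aux listing together with the 1829 MiB total that the top and free examples in this section report.

```python
def percent_mem(rss_kib, mem_total_kib):
    """Return the resident set size as a percentage of physical memory."""
    if mem_total_kib <= 0:
        raise ValueError("mem_total_kib must be positive")
    return 100.0 * rss_kib / mem_total_kib


# RSS of the -bash process is 4808 KiB; the total is 1829 MiB (rounded).
mem_total_kib = 1829 * 1024
print("%MEM = {0:.2f}".format(percent_mem(4808, mem_total_kib)))
```

The result is close to the 0.2 that ps reports for that process; ps prints a single decimal place, and the total used here is itself a rounded value.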

Use the -p option together with the pidof command to list the sshd processes that are running.

[user@host ~]$ ps -p $(pidof sshd)
  PID TTY      STAT   TIME COMMAND
  756 ?        Ss     0:00 /usr/sbin/sshd -D [email protected]...
 1335 ?        Ss     0:00 sshd: user [priv]
 1349 ?        S      0:00 sshd: user@pts/0

Use the following command to list all processes sorted by memory usage in descending order:

[user@host ~]$ ps ax --format pid,%mem,cmd --sort -%mem
  PID %MEM CMD
  713  1.8 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --logger=files
  715  1.8 /usr/libexec/platform-python -s /usr/sbin/firewalld --nofork --nopid
  753  1.5 /usr/libexec/platform-python -Es /usr/sbin/tuned -l -P
  687  1.2 /usr/lib/polkit-1/polkitd --no-debug
  731  0.9 /usr/sbin/NetworkManager --no-daemon
...output omitted...

Various other options are available for ps, including the -o option to customize the output and the columns shown.

Monitoring Processes Using top

The top command provides a real-time report of process activity, with an interface for the user to filter and manipulate the monitored data. The command output shows a system-wide summary at the top and a process listing at the bottom, sorted by CPU usage by default. The -n 1 option terminates the program after a single display of the process list. The following is an example output of the command:

[user@host ~]$ top -n 1
Tasks: 115 total,   1 running, 114 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.0 us,  3.2 sy,  0.0 ni, 96.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   1829.0 total,   1426.5 free,    173.6 used,    228.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1495.8 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
    1 root      20   0  243968  13276   8908 S   0.0   0.7   0:01.86 systemd
    2 root      20   0       0      0      0 S   0.0   0.0   0:00.00 kthreadd
    3 root       0 -20       0      0      0 I   0.0   0.0   0:00.00 rcu_gp
...output omitted...
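The %Cpu(s) summary line is derived from tick counters that the kernel keeps in the first line of /proc/stat. The following Python sketch shows roughly how two samples of that line can be turned into percentages. The function names and the two sample lines are made up for illustration, and the sketch assumes a kernel that reports at least eight counters on that line.

```python
# Order of the counters on the "cpu" line, labelled as in the top summary.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Return the first eight tick counters of the aggregate 'cpu' line."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:9])))


def cpu_percentages(before, after):
    """Convert two tick samples into percentages of the elapsed ticks."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * v / total for k, v in deltas.items()}


# Two made-up samples of the first line of /proc/stat, taken some time apart.
first = parse_cpu_line("cpu  1000 0 500 8000 100 0 0 0 0 0")
second = parse_cpu_line("cpu  1000 0 532 8968 100 0 0 0 0 0")
for name, pct in cpu_percentages(first, second).items():
    print("{0} {1:.1f}".format(name, pct))
```

With these sample values, the sy and id figures come out as 3.2 and 96.8, matching the summary line in the example output above.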

Useful Key Combinations to Sort Fields

RES

Use Shift+M to sort the processes based on resident memory.

PID

Use Shift+N to sort the processes based on process ID.

TIME+

Use Shift+T to sort the processes based on CPU time.

Press F and select a field from the list to use any other field for sorting.

IMPORTANT

The top command imposes a significant overhead on the system due to various system calls. While running the top command, the process running the top command is often the top CPU-consuming process.

Monitoring Memory Usage

The free command lists both free and used physical memory and swap memory. The -b, -k, -m, and -g options show the output in bytes, KiB, MiB, or GiB, respectively. The -s option takes an argument that specifies the number of seconds between refreshes. For example, free -s 1 produces an update every second.

[user@host ~]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           1829         172        1427          16         228        1496
Swap:             0           0           0

Near-zero values in the buff/cache and available columns would indicate a low-memory situation. If the available memory is more than 20% of the total, and the used memory is close to the total memory, then these values indicate a healthy system.
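The 20% rule of thumb can be checked with a few lines of Python. This is a minimal sketch: the helper names are hypothetical, and the sample text mimics the layout of /proc/meminfo with values derived from the free -m output above (MiB converted to kB).

```python
# Sample in the layout of /proc/meminfo; values correspond to the free -m output.
SAMPLE_MEMINFO = """MemTotal:        1872896 kB
MemFree:         1461248 kB
MemAvailable:    1531904 kB
"""


def parse_meminfo(text):
    """Parse 'Name:   value kB' lines into a dict of integers (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name] = int(fields[0])
    return values


def available_fraction(meminfo):
    """Return MemAvailable as a fraction of MemTotal."""
    return meminfo["MemAvailable"] / float(meminfo["MemTotal"])


# To inspect the running system, pass open("/proc/meminfo").read() instead.
info = parse_meminfo(SAMPLE_MEMINFO)
fraction = available_fraction(info)
status = "above" if fraction > 0.20 else "below"
print("available: {0:.1%} of total ({1} the 20% mark)".format(fraction, status))
```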

Monitoring File System Usage

One stable identifier that is associated with a file system is its UUID, a very long hexadecimal number that acts as a universally unique identifier. This UUID is part of the file system and remains the same as long as the file system is not recreated. The lsblk -fp command lists the full path of the device, along with the UUIDs and mount points, as well as the type of file system in the partition. If the file system is not mounted, the mount point displays as blank.

[user@host ~]$ lsblk -fp
NAME        FSTYPE LABEL UUID                                 MOUNTPOINT
/dev/vda
├─/dev/vda1 xfs          23ea8803-a396-494a-8e95-1538a53b821c /boot
├─/dev/vda2 swap         cdf61ded-534c-4bd6-b458-cab18b1a72ea [SWAP]
└─/dev/vda3 xfs          44330f15-2f9d-4745-ae2e-20844f22762d /
/dev/vdb
└─/dev/vdb1 xfs          46f543fd-78c9-4526-a857-244811be2d88

The findmnt command gives a quick view of what is mounted where, and with which options. Executing the findmnt command without any options lists all the mounted file systems in a tree layout. Use the -s option to read the file systems from the /etc/fstab file. Use the -S option to search the file systems by source device.

[user@host ~]$ findmnt -S /dev/vda1
TARGET SOURCE    FSTYPE OPTIONS
/      /dev/vda1 xfs    rw,relatime,seclabel,attr2,inode64,noquota
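A source-based lookup like the one above can be approximated in Python by filtering the kernel's mount table. The sketch below is an illustration only: the Mount tuple and both functions are hypothetical names, the sample text follows the layout of /proc/self/mounts, and escaped characters in mount paths are not handled.

```python
from collections import namedtuple

Mount = namedtuple("Mount", "source target fstype options")

# Sample lines in the layout of /proc/self/mounts.
SAMPLE_MOUNTS = """devtmpfs /dev devtmpfs rw,nosuid 0 0
/dev/vda1 / xfs rw,relatime,seclabel,attr2,inode64,noquota 0 0
"""


def parse_mounts(text):
    """Parse mount table lines into Mount tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 4:
            mounts.append(
                Mount(fields[0], fields[1], fields[2], fields[3].split(","))
            )
    return mounts


def find_by_source(mounts, source):
    """Return the mounts whose source device matches exactly."""
    return [m for m in mounts if m.source == source]


# To query the running system, pass open("/proc/self/mounts").read() instead.
for m in find_by_source(parse_mounts(SAMPLE_MOUNTS), "/dev/vda1"):
    print(m.target, m.source, m.fstype, ",".join(m.options))
```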

The df command provides information about the total usage of the file systems. The -h option transforms the output into a human-readable form.

[user@host ~]$ df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        892M     0  892M   0% /dev
tmpfs           915M     0  915M   0% /dev/shm
tmpfs           915M   17M  899M   2% /run
tmpfs           915M     0  915M   0% /sys/fs/cgroup
/dev/vda1        10G  1.5G  8.6G  15% /
tmpfs           183M     0  183M   0% /run/user/1000

The du command displays the total size of all the files in a given directory and its subdirectories. The -s option suppresses the output of detailed information and displays only the total. As with the df -h command, the -h option displays the output in a human-readable form.

[user@host ~]$ du -sh /home/user
16K /home/user
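The same kind of information is available from the Python standard library. The sketch below uses shutil.disk_usage for a df-like summary and walks a directory tree for a du-like total. The helper names dir_size and human are hypothetical, and dir_size sums apparent file sizes, so its result can differ from du, which reports the disk blocks in use.

```python
import os
import shutil
import tempfile


def dir_size(path):
    """Sum the apparent sizes (in bytes) of regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):
                try:
                    total += os.path.getsize(full)
                except OSError:
                    pass  # the file vanished or is unreadable
    return total


def human(n):
    """Format a byte count with binary unit suffixes, similar to -h."""
    for unit in ("B", "K", "M", "G", "T"):
        if n < 1024 or unit == "T":
            return "{0:.1f}{1}".format(n, unit)
        n /= 1024.0


usage = shutil.disk_usage("/")  # named tuple with total, used, and free bytes
print("/", "size", human(usage.total), "used", human(usage.used),
      "avail", human(usage.free))

# Demonstrate dir_size on a throwaway directory.
demo = tempfile.mkdtemp()
with open(os.path.join(demo, "data.bin"), "wb") as f:
    f.write(b"x" * 2048)
print(human(dir_size(demo)), demo)
shutil.rmtree(demo)
```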

Using GNOME System Monitor

The System Monitor available on the GNOME desktop provides statistical data about the system status, load, and processes, as well as the ability to manipulate those processes. Similar to other monitoring tools, such as the top, ps, and free commands, the System Monitor provides both system-wide and per-process data. Use the gnome-system-monitor command to access the application from a command terminal.

To view the CPU usage, go to the Resources tab and look at the CPU History chart.

Figure 2.2: CPU usage history in System Monitor

The virtual memory is the sum of the physical memory and the swap space in a system. A running process maps the location in physical memory to files on disk. The memory map displays the total virtual memory consumed by a running process, which determines the memory cost of running that process instance. The memory map also displays the shared libraries used by the process.

Figure 2.3: Memory map of a process in System Monitor

To display the memory map of a process in System Monitor, locate a process in the Processes tab, right-click a process in the list, and select Memory Maps.

Deploying Kubernetes on bare metal with Rancher 2.0

Contents

  • Install Rancher server
  • Create a Kubernetes cluster
  • Add Kubernetes nodes
  • Install StorageOS as the Kubernetes storage class
  • Understand Nginx Ingress in Rancher

Install Rancher

Create a VM with Docker and Docker Compose installed, then install Rancher 2.0 with Docker Compose:

  • Rancher docker-compose file: docker-compose.yaml
  • Run these commands to install Rancher with docker compose:
    • git clone https://github.com/polinchw/rancher-docker-compose
    • cd rancher-docker-compose
    • docker-compose up -d

Create your Kubernetes cluster with Rancher

Create a custom Kubernetes cluster with Rancher by selecting the ‘Custom’ cluster option.

Add Kubernetes nodes and join the Kubernetes cluster

Run the following commands on all the VMs that your Kubernetes cluster will run on. The final docker command will have the VM join the new Kubernetes cluster.

Replace the --server and --token values with your Rancher server and cluster token.

#!/bin/bash

#sudo apt update
#sudo apt -y dist-upgrade

#Ubuntu (Docker install)
#sudo apt -y install docker.io

sudo apt -y install linux-image-extra-$(uname -r)

#Debian 9 (Docker install)
#sudo apt -y install apt-transport-https ca-certificates curl gnupg2 software-properties-common
#curl -fsSL https://download.docker.com/linux/debian/gpg | sudo apt-key add -
#sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
#sudo apt update
#sudo apt -y install docker-ce

sudo mkdir -p /etc/systemd/system/docker.service.d/
sudo tee /etc/systemd/system/docker.service.d/mount_propagation_flags.conf > /dev/null <<EOF
[Service]
MountFlags=shared
EOF

sudo systemctl daemon-reload
sudo systemctl restart docker.service

#This is dependent on your Rancher server
sudo docker run -d --privileged --restart=unless-stopped --net=host -v /etc/kubernetes:/etc/kubernetes -v /var/run:/var/run rancher/rancher-agent:v2.1.0-rc9 --server https://75.77.159.159 --token rb8k8kkqw55jqnqbbf4ssdjqtw6hndhfxxcghgv8257kx4p6qsqq55 --ca-checksum 641b2888ce3f1091d20149a495d10457154428f440475b42291b6af1b6c0dd06 --etcd --controlplane --worker

Download the kubeconfig file for the cluster

After you download the kubeconfig file, you can use it by running this command:

export KUBECONFIG=$HOME/.kube/rancher-config

Install Helm on the cluster

git clone https://github.com/polinchw/set-up-tiller

cd set-up-tiller

chmod u+x set-up-tiller.sh

./set-up-tiller.sh

helm init --service-account tiller


Install StorageOS Helm Chart

helm repo add storageos https://charts.storageos.com
helm install --name storageos --namespace storageos-operator --version 1.1.3 storageos/storageoscluster-operator

Add the StorageOS Secret

apiVersion: v1
kind: Secret
metadata:
  name: storageos-api
  namespace: default
  labels:
    app: storageos
type: kubernetes.io/storageos
data:
  # echo -n '<secret>' | base64
  apiUsername: c3RvcmFnZW9z
  apiPassword: c3RvcmFnZW9z


Add the StorageOSCluster

apiVersion: storageos.com/v1
kind: StorageOSCluster
metadata:
  name: example-storageos
  namespace: default
spec:
  secretRefName: storageos-api
  secretRefNamespace: default
  csi:
    enable: true


Set StorageOS as the default storage class

kubectl patch storageclass fast -p '{"metadata": {"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'

Using the default Nginx Ingress

Rancher automatically installs the nginx ingress controller on all the nodes in the cluster.
If you can expose one of the VMs in the cluster to the outside world with a public IP, you can connect to the ingress-based services on ports 80 and 443.

Any app you want to be accessible through the default nginx ingress must be added to the ‘default’ project in Rancher.

Linux Namespaces

Namespaces in Linux are heavily used by many applications, e.g. LXC, Docker, and OpenStack.
Question: How do you find all existing namespaces in a Linux system?

The answer is not straightforward, because it is easy to hide a namespace or, more precisely, to make it difficult to find.

Exploring the system

In the default setup, Ubuntu 12.04 and higher provide namespaces for:

  • ipc for IPC objects and POSIX message queues
  • mnt for filesystem mountpoints
  • net for network abstraction (VRF)
  • pid to provide a separated, isolated process ID number space
  • uts to isolate two system identifiers, nodename and domainname, which are used by uname

These namespaces are shown for every process in the system. Execute the following command as root:

ls -lai /proc/1/ns
 
60073292 dr-x--x--x 2 root root 0 Dec 15 18:23 .
   10395 dr-xr-xr-x 9 root root 0 Dec  4 11:07 ..
60073293 lrwxrwxrwx 1 root root 0 Dec 15 18:23 ipc -> ipc:[4026531839]
60073294 lrwxrwxrwx 1 root root 0 Dec 15 18:23 mnt -> mnt:[4026531840]
60073295 lrwxrwxrwx 1 root root 0 Dec 15 18:23 net -> net:[4026531968]
60073296 lrwxrwxrwx 1 root root 0 Dec 15 18:23 pid -> pid:[4026531836]
60073297 lrwxrwxrwx 1 root root 0 Dec 15 18:23 uts -> uts:[4026531838]

This lists the namespaces attached to the init process (PID=1). Even this process has attached namespaces. These are the default namespaces for ipc, mnt, net, pid and uts. For example, the default net namespace has the ID net:[4026531968]. The number in the brackets is an inode number.

To find other namespaces with attached processes in the system, we use these entries of PID=1 as a reference. Any process or thread in the system that does not have the same namespace ID as PID=1 does not belong to the DEFAULT namespace.
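The comparison against PID=1 can be sketched in a few lines of Python. This is a minimal illustration, not the full script: the function names are hypothetical, the reference IDs are taken from the listing above, and the second process is made up.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link):
    """Split a link target such as 'net:[4026531968]' into (name, inode)."""
    match = NS_LINK.match(link)
    if match is None:
        raise ValueError("not a namespace link: " + repr(link))
    return match.group(1), int(match.group(2))


def namespaces_of(pid):
    """Map namespace entry -> inode number for one process ({} on error)."""
    result = {}
    ns_dir = os.path.join("/proc", str(pid), "ns")
    try:
        for name in os.listdir(ns_dir):
            try:
                target = os.readlink(os.path.join(ns_dir, name))
                result[name] = parse_ns_link(target)[1]
            except (OSError, ValueError):
                continue
    except OSError:
        pass
    return result


def differing(reference, other):
    """Return the namespace names whose inode differs from the reference."""
    return sorted(n for n, ino in other.items() if reference.get(n) != ino)


init_ns = {"net": 4026531968, "pid": 4026531836}    # from the listing above
other_ns = {"net": 4026532585, "pid": 4026531836}   # made-up process
print(differing(init_ns, other_ns))  # prints ['net']
```

On a live system, differing(namespaces_of(1), namespaces_of(pid)) performs the same check for a real process, subject to the permissions of the calling user.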

Additionally, namespaces created by "ip netns add <NAME>" are found in /var/run/netns/ by default.

The Python code

The Python code below lists all non-default namespaces in a system. The program flow is:

  • Get the reference namespaces from the init process (PID=1). Assumption: PID=1 is assigned to the default namespaces supported by the system
  • Loop through /var/run/netns/ and add the entries to the list
  • Loop through /proc/ over all PIDs, look for entries in /proc/<PID>/ns/ that are not the same as for PID=1, and add them to the list
  • Print the result

List all non-default namespaces in a system

#!/usr/bin/env python3
#
# List all Namespaces (works for Ubuntu 12.04 and higher)
#
# (C) Ralf Trezeciak    2013-2014
#
#
#    This program is free software: you can redistribute it and/or modify
#    it under the terms of the GNU General Public License as published by
#    the Free Software Foundation, either version 3 of the License, or
#    (at your option) any later version.
#
#    This program is distributed in the hope that it will be useful,
#    but WITHOUT ANY WARRANTY; without even the implied warranty of
#    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#    GNU General Public License for more details.
#
#    You should have received a copy of the GNU General Public License
#    along with this program.  If not, see <http://www.gnu.org/licenses/>.
#
import os
import sys

if os.geteuid() != 0:
    print("This script must be run as root\nBye")
    sys.exit(1)


def getinode(pid, nstype):
    """Return the namespace ID (e.g. 'net:[4026531968]'), or '' on error."""
    try:
        return os.readlink('/proc/' + pid + '/ns/' + nstype)
    except OSError:
        return ''


def readproc(pid, name):
    """Read /proc/<pid>/<name> and return its content as text."""
    with open(os.path.join('/proc', pid, name), 'rb') as f:
        return f.read().decode('utf-8', 'replace')


#
# get the running command
def getcmd(pid):
    try:
        cmd = readproc(pid, 'cmdline')
        if cmd == '':
            # kernel threads have an empty cmdline, use the short name
            cmd = readproc(pid, 'comm')
        return cmd.replace('\x00', ' ').replace('\n', ' ')
    except OSError:
        return ''


#
# look for docker parents
def getpcmd(pid):
    try:
        # The comm field may contain spaces, so split after the closing ')'.
        # The fields that follow are: state, ppid, ...
        ppid = readproc(pid, 'stat').rsplit(')', 1)[1].split()[1]
        if getcmd(ppid).startswith('/usr/bin/docker'):
            return 'docker'
    except (OSError, IndexError):
        pass
    return ''


#
# get the namespaces of PID=1
# assumption: these are the namespaces supported by the system
#
nslist = os.listdir('/proc/1/ns/')
if len(nslist) == 0:
    print('No Namespaces found for PID=1')
    sys.exit(1)
#
# get the inodes used for PID=1
#
baseinode = [getinode('1', x) for x in nslist]
ns = []        # rows of (pid, namespace, command)
ipnlist = []   # namespace IDs that have an entry in /var/run/netns/
#
# loop over the network namespaces created using "ip"
#
try:
    for name in os.listdir('/var/run/netns/'):
        fd = os.open('/var/run/netns/' + name, os.O_RDONLY)
        try:
            info = os.fstat(fd)
        finally:
            os.close(fd)
        nsid = 'net:[' + str(info.st_ino) + ']'
        ns.append(('--', nsid, 'created by ip netns add ' + name))
        ipnlist.append(nsid)
except OSError:
    # might fail if no network namespaces are existing
    pass
#
# walk through all pids and list diffs
#
pidlist = [p for p in os.listdir('/proc/') if p.isdigit()]
for p in pidlist:
    try:
        for x in os.listdir('/proc/' + p + '/ns/'):
            i = getinode(p, x)
            if i != '' and i not in baseinode:
                cmd = getcmd(p)
                pcmd = getpcmd(p)
                if pcmd != '':
                    cmd = '[' + pcmd + '] ' + cmd
                tag = '**' if i in ipnlist else ''
                ns.append((p, i + tag, cmd))
    except OSError:
        # might happen if a pid is destroyed during list processing
        pass
#
# print the stuff
#
fmt = '{0:>10}  {1:20}  {2}'
print(fmt.format('PID', 'Namespace', 'Thread/Command'))
for pid, nsid, cmd in ns:
    print(fmt.format(pid, nsid, cmd[:60]))
#

Copy the script to your system as listns.py, and run it as root using python3 listns.py

       PID  Namespace             Thread/Command
        --  net:[4026533172]      created by ip netns add qrouter-c33ffc14-dbc2-4730-b787-4747
        --  net:[4026533112]      created by ip netns add qrouter-5a691ed3-f6d3-4346-891a-3b59
        --  net:[4026533050]      created by ip netns add qdhcp-02e848cb-72d0-49df-8592-2f7a03
        --  net:[4026532992]      created by ip netns add qdhcp-47cfcdef-2b34-43b8-a504-6720e5
       297  mnt:[4026531856]      kdevtmpfs 
      3429  net:[4026533050]**    dnsmasq --no-hosts --no-resolv --strict-order --bind-interfa
      3429  mnt:[4026533108]      dnsmasq --no-hosts --no-resolv --strict-order --bind-interfa
      3446  net:[4026532992]**    dnsmasq --no-hosts --no-resolv --strict-order --bind-interfa
      3446  mnt:[4026533109]      dnsmasq --no-hosts --no-resolv --strict-order --bind-interfa
      3486  net:[4026533050]**    /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
      3486  mnt:[4026533107]      /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
      3499  net:[4026532992]**    /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
      3499  mnt:[4026533110]      /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
      4117  net:[4026533112]**    /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
      4117  mnt:[4026533169]      /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
     41998  net:[4026533172]**    /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil
     41998  mnt:[4026533229]      /usr/bin/python /usr/bin/neutron-ns-metadata-proxy --pid_fil

The example above is from an OpenStack network node. The first four entries were created using the ip command. The entry with PID=297 is a kernel thread, not a user process. All other listed processes were started by OpenStack agents. These processes use network and mount namespaces. PID entries marked with '**' have a corresponding entry created with the ip command.

When a docker command is started, the output is:

       PID  Namespace             Thread/Command
        --  net:[4026532676]      created by ip netns add test
        35  mnt:[4026531856]      kdevtmpfs 
      6189  net:[4026532585]      [docker] /bin/bash 
      6189  uts:[4026532581]      [docker] /bin/bash 
      6189  ipc:[4026532582]      [docker] /bin/bash 
      6189  pid:[4026532583]      [docker] /bin/bash 
      6189  mnt:[4026532580]      [docker] /bin/bash 

The docker child running in the namespaces is marked using [docker].

On a node running Mininet with a simple network setup, the output looks like:

       PID  Namespace             Thread/Command
        14  mnt:[4026531856]      kdevtmpfs 
      1198  net:[4026532150]      bash -ms mininet:h1 
      1199  net:[4026532201]      bash -ms mininet:h2 
      1202  net:[4026532252]      bash -ms mininet:h3 
      1203  net:[4026532303]      bash -ms mininet:h4

Google's Chrome Browser

Google's Chrome browser makes extensive use of Linux namespaces. Start Chrome and run the Python script. The output looks like:

       PID  Namespace             Thread/Command
        63  mnt:[4026531856]      kdevtmpfs 
     30747  net:[4026532344]      /opt/google/chrome/chrome --type=zygote 
     30747  pid:[4026532337]      /opt/google/chrome/chrome --type=zygote 
     30753  net:[4026532344]      /opt/google/chrome/nacl_helper 
     30753  pid:[4026532337]      /opt/google/chrome/nacl_helper 
     30754  net:[4026532344]      /opt/google/chrome/chrome --type=zygote 
     30754  pid:[4026532337]      /opt/google/chrome/chrome --type=zygote 
     30801  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30801  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30807  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30807  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30813  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30813  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30820  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30820  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30829  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30829  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30835  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30835  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30841  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30841  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30887  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30887  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30893  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30893  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30901  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30901  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30910  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30910  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30915  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30915  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30923  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30923  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30933  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30933  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30938  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30938  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30944  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     30944  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     31271  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     31271  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     31538  net:[4026532344]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for
     31538  pid:[4026532337]      /opt/google/chrome/chrome --type=renderer --lang=en-US --for

Chrome makes use of pid and network namespaces to restrict the access of subcomponents. The network namespace does not have a link in /var/run/netns/.

Conclusion

It is quite hard to explore Linux namespaces. There is a lot of documentation floating around. I did not find any simple program to look for namespaces in the system, so I wrote one.

The script cannot find a network namespace that has no process attached to it and no reference in /var/run/netns/. If root creates the reference inode somewhere else in the filesystem, you may only detect network ports (an OVS port, or one side of a veth pair) that are not attached to a known network namespace. In that case, an unknown guest might be on your system using a "hidden" (not so easy to find) network namespace.

Also, Linux namespaces can be stacked.

Demystifying Containers – Part II: Container Runtimes

This second blog post (and talk) is primarily scoped to container runtimes, where we will start with their historic origins before digging deeper into two dedicated projects: runc and CRI-O. We will initially build up a solid foundation about how container runtimes work under the hood by starting with the lower-level runtime runc. Afterwards, we will utilize the more advanced runtime CRI-O to run Kubernetes native workloads, but without even running Kubernetes at all.

Introduction

In the previous part of this series we discussed Linux kernel namespaces and everything around them to build up a foundation about containers and their basic isolation techniques. Now we want to dive deeper into answering the question: “How do we actually run containers?”. We will do so without being overwhelmed by the details of Kubernetes’ features or security-related topics, which will be part of further blog posts and talks.

What is a Container Runtime?

Applications and the use cases they should or should not cover are contentiously discussed topics in the UNIX world. The main UNIX philosophy promotes minimalism and modular software parts that fit well together in a complete system. Great examples that follow this philosophy are features like the UNIX pipe or text editors like vim. These tools solve one dedicated task as well as they can and are tremendously successful at it. On the other side, there are projects like systemd or cmake, which do not follow the same approach and implement a richer feature set over time. In the end we have multiple views and opinions about answers to questions like “What should an initialization system look like?” or “What should a build system do?”. If these multi-opinionated views mix with historical events, then answering a simple question might need more explanation than it should.

Now, welcome to the world of containers!

Lots of applications can run containers, and each of them has a slightly different opinion about what a container runtime should do and support. For example, systemd is able to run containers via systemd-nspawn, and NixOS has integrated container management as well. Not to mention all the other existing container runtimes like CRI-O, Kata Containers, Firecracker, gVisor, containerd, LXC, runc, Nabla Containers and many more. A lot of them are now part of the Cloud Native Computing Foundation (CNCF) and its huge landscape, so someone might ask: “Why do so many container runtimes exist?”.

Per usual for our series of blog posts, we should start from the historical beginning.

A Brief History

After the invention of cgroups back in 2008, a project called Linux Containers (LXC) started to pop up in the wild, which would revolutionize the container world. LXC combined cgroup and namespace technologies to provide an isolated environment for running applications. You may know that we sometimes live in a parallel world. This means that Google started their own containerization project in 2007 called Let Me Contain That For You (LMCTFY), which works mainly at the same level as LXC does. With LMCTFY, Google tried to provide a stable and API-driven configuration without users having to understand the details of cgroups and its internals.

If we now look back to 2013, we see that a tool called Docker was written, built on top of the already existing LXC stack. One invention of Docker was that the user is now able to package containers into images to move them between machines. Docker was the first to try to make containers a standard software unit, as stated in their “Standard Container Manifesto”.

Some years later they began to work on libcontainer, a Go native way to spawn and manage containers. LMCTFY was abandoned during that time too, whereas the core concepts and major benefits of LMCTFY were ported into libcontainer and Docker.

We are now back in 2015, where projects like Kubernetes hit version 1.0. A lot of stuff was ongoing during that time: The CNCF was founded as part of the Linux Foundation with the target to promote containers. The Open Container Initiative (OCI) was founded in 2015 as well, as an open governance structure around the container ecosystem.

The OCI’s main target is to create open industry standards around container formats and runtimes. We were now in a state where containers were used, in terms of their popularity, side by side with classic Virtual Machines (VMs). There was a need for a specification of how containers should run, which resulted in the OCI Runtime Specification. Runtime developers should now be able to have a well-defined API to develop their container runtime. The libcontainer project was donated to the OCI during that time, and a new tool called runc was born as part of that. With runc it was now possible to directly interact with libcontainer, interpret the OCI Runtime Specification and run containers from it.

As of today, runc is one of the most popular projects in the container ecosystem and is used in a lot of other projects like containerd (used by Docker), CRI-O and podman. Other projects adopted the OCI Runtime Specification as well. For example Kata Containers makes it possible to build and run secure containers including lightweight virtual machines that feel and perform like containers, but provide stronger workload isolation using hardware virtualization technology as a second layer of defense.

Let’s dig more into the OCI Runtime Specification to get a better understanding about how a container runtime works under the hood.

Running Containers

runc

The OCI Runtime Specification provides information about the configuration, execution environment and overall life cycle of a container. A configuration is mainly a JSON file that contains all necessary information to enable the creation of a container on different target platforms like Linux, Windows or Virtual Machines (VMs).

An example specification can be easily generated with runc:

> runc spec
> cat config.json
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": true,
    "user": { "uid": 0, "gid": 0 },
    "args": ["sh"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm"
    ],
    "cwd": "/",
    "capabilities": {
      "bounding": ["CAP_AUDIT_WRITE", "CAP_KILL", "CAP_NET_BIND_SERVICE"],
      [...]
    },
    "rlimits": [ { "type": "RLIMIT_NOFILE", "hard": 1024, "soft": 1024 } ],
    "noNewPrivileges": true
  },
  "root": { "path": "rootfs", "readonly": true },
  "hostname": "runc",
  "mounts": [
    {
      "destination": "/proc",
      "type": "proc",
      "source": "proc"
    },
    [...]
  ],
  "linux": {
    "resources": { "devices": [ { "allow": false, "access": "rwm" } ] },
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "ipc" },
      { "type": "uts" },
      { "type": "mount" }
    ],
    "maskedPaths": [
      "/proc/kcore",
      [...]
    ],
    "readonlyPaths": [
      "/proc/asound",
      [...]
    ]
  }
}

This file mainly contains all necessary information for runc to get started with running containers. For example, we have attributes about the running process, the defined environment variables, the user and group IDs, needed mount points and the Linux namespaces to be set up. One thing is still missing to get started running containers: We need an appropriate root file-system (rootfs). We already discovered in the previous blog post how to obtain it from an already existing container image:

> skopeo copy docker://opensuse/tumbleweed:latest oci:tumbleweed:latest
[output removed]
> sudo umoci unpack --image tumbleweed:latest bundle
[output removed]

Interestingly, the unpacked container image already includes the Runtime Specification we need to run the bundle:

> sudo chown -R $(id -u) bundle
> cat bundle/config.json
{
  "ociVersion": "1.0.0",
  "process": {
    "terminal": true,
    "user": { "uid": 0, "gid": 0 },
    "args": ["/bin/bash"],
    "env": [
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "TERM=xterm",
      "HOME=/root"
    ],
    "cwd": "/",
    "capabilities": { [...] },
    "rlimits": [...]
  },
  "root": { "path": "rootfs" },
  "hostname": "mrsdalloway",
  "mounts": [...],
  "annotations": {
    "org.opencontainers.image.title": "openSUSE Tumbleweed Base Container",
    "org.opencontainers.image.url": "https://www.opensuse.org/",
    "org.opencontainers.image.vendor": "openSUSE Project",
    "org.opencontainers.image.version": "20190517.6.190",
    [...]
  },
  "linux": {
    "resources": { "devices": [ { "allow": false, "access": "rwm" } ] },
    "namespaces": [
      { "type": "pid" },
      { "type": "network" },
      { "type": "ipc" },
      { "type": "uts" },
      { "type": "mount" }
    ]
  }
}

There are now some annotations included besides the usual fields we already know from running runc spec. These can be used to add arbitrary metadata to the container, which higher level runtimes can use to attach additional information to the specification.
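To get a quick overview of such a file without reading all of the JSON, a few lines of Python are enough. The following is only a minimal sketch: the sample document is a trimmed stand-in for bundle/config.json with values copied from the listing above, and the helper name summarize is made up for this example.

```python
import json

# Trimmed stand-in for bundle/config.json; the values are copied from the
# listing above, everything elided there is simply left out.
SAMPLE_CONFIG = """
{
  "ociVersion": "1.0.0",
  "process": {"args": ["/bin/bash"], "cwd": "/", "user": {"uid": 0, "gid": 0}},
  "root": {"path": "rootfs"},
  "hostname": "mrsdalloway",
  "annotations": {
    "org.opencontainers.image.vendor": "openSUSE Project"
  },
  "linux": {
    "namespaces": [
      {"type": "pid"}, {"type": "network"}, {"type": "ipc"},
      {"type": "uts"}, {"type": "mount"}
    ]
  }
}
"""


def summarize(config):
    """Pull out the fields discussed in the text; missing keys become empty."""
    linux = config.get("linux", {})
    return {
        "args": config.get("process", {}).get("args", []),
        "rootfs": config.get("root", {}).get("path"),
        "namespaces": [ns["type"] for ns in linux.get("namespaces", [])],
        "annotations": config.get("annotations", {}),
    }


summary = summarize(json.loads(SAMPLE_CONFIG))
print(summary["args"], summary["namespaces"])
```

For a real bundle, the sample string would be replaced by the contents of bundle/config.json, for example via json.load on the opened file.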

Let’s create a new container from the bundle with runc. Before actually calling out to runc, we have to set up a receiver terminal to be able to interact with the container. For this, we can use the recvtty tool included in the runc repository:

> go get github.com/opencontainers/runc/contrib/cmd/recvtty
> recvtty tty.sock

In another terminal, we now call runc create, specifying the bundle and terminal socket:

> sudo runc create -b bundle --console-socket $(pwd)/tty.sock container

No further output, so what happened now? It seems like we have created a new container in created state:

> sudo runc list
ID          PID         STATUS      BUNDLE      CREATED                          OWNER
container   29772       created     /bundle     2019-05-21T08:35:51.382141418Z   root

The container does not seem to be running, but what is running inside?

> sudo runc ps container
UID        PID  PPID  C STIME TTY          TIME CMD
root     29772     1  0 10:35 ?        00:00:00 runc init

The runc init command sets up a fresh environment with all necessary namespaces and launches a new initial process. The main process /bin/bash does not run yet inside the container, but we are still able to execute further processes within the container:

> sudo runc exec -t container echo "Hello, world!"
> Hello, world!

The created state of a container provides a nice environment to set up networking, for example. To actually do something within the container, we have to bring it into the running state. This can be done via runc start:

> sudo runc start container

In the terminal where the recvtty process is running, a new bash shell session should now pop up:

mrsdalloway:/ $
mrsdalloway:/ $ ps aux
ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   5156  4504 pts/0    Ss   10:28   0:00 /bin/bash
root        29  0.0  0.0   6528  3372 pts/0    R+   10:32   0:00 ps aux

Nice, the container seems to be running. We can now utilize runc to inspect the container’s state:

> sudo runc list
ID          PID         STATUS      BUNDLE      CREATED                          OWNER
container   4985        running     /bundle     2019-05-20T12:14:14.232015447Z   root
> sudo runc ps container
UID        PID  PPID  C STIME TTY          TIME CMD
root      6521  6511  0 14:25 pts/0    00:00:00 /bin/bash

The runc init process has gone and now only the actual /bin/bash process exists within the container. We can also do some basic life cycle management with the container:

> sudo runc pause container

It should now be impossible to get any output from the running container in the recvtty session. To resume the container, simply call:

> sudo runc resume container

Everything we tried to type before should now pop up in the resumed container terminal. If we need more information about the container, like the CPU and memory usage, then we can retrieve them via the runc events API:

> sudo runc events container
{...}

The output is a bit hard to read, so let’s reformat it and strip some fields:

{
  "type": "stats",
  "id": "container",
  "data": {
    "cpu": {
      "usage": {
        "total": 31442016,
        "percpu": [ 5133429, 5848165, 827530, ... ],
        "kernel": 20000000,
        "user": 0
      },
      "throttling": {}
    },
    "memory": {
      "usage": {
        "limit": 9223372036854771712,
        "usage": 1875968,
        "max": 6500352,
        "failcnt": 0
      },
      "swap": { "limit": 0, "failcnt": 0 },
      "kernel": {
        "limit": 9223372036854771712,
        "usage": 311296,
        "max": 901120,
        "failcnt": 0
      },
      "kernelTCP": { "limit": 9223372036854771712, "failcnt": 0 },
      "raw": {
        "active_anon": 1564672,
        [...]
      }
    },
    "pids": { "current": 1 },
    "blkio": {},
    "hugetlb": { "1GB": { "failcnt": 0 }, "2MB": { "failcnt": 0 } },
    "intel_rdt": {}
  }
}

We can see that we are able to retrieve detailed runtime information about the container.
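If we only care about a handful of these numbers, we can post-process the JSON ourselves. The sketch below works on an excerpt of the output shown above and assumes that the memory counters are reported in bytes; the function name memory_summary is made up for this example.

```python
import json

# Excerpt of the reformatted `runc events` output shown above.
SAMPLE_EVENT = """
{
  "type": "stats",
  "id": "container",
  "data": {
    "cpu": {"usage": {"total": 31442016, "kernel": 20000000, "user": 0}},
    "memory": {"usage": {"limit": 9223372036854771712, "usage": 1875968,
                         "max": 6500352, "failcnt": 0}},
    "pids": {"current": 1}
  }
}
"""


def memory_summary(event):
    """Return current and peak memory usage of a stats event in MiB."""
    usage = event["data"]["memory"]["usage"]
    mib = 1024 * 1024
    return {
        "current_mib": round(usage["usage"] / mib, 2),
        "peak_mib": round(usage["max"] / mib, 2),
        "pids": event["data"]["pids"]["current"],
    }


stats = memory_summary(json.loads(SAMPLE_EVENT))
print(stats)
```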

To stop the container, we simply exit the recvtty session. Afterwards the container can be removed with runc delete:

> sudo runc list
ID          PID         STATUS      BUNDLE      CREATED                         OWNER
container   0           stopped     /bundle     2019-05-21T10:28:32.765888075Z  root
> sudo runc delete container
> sudo runc list
ID          PID         STATUS      BUNDLE      CREATED     OWNER
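The life cycle we just walked through can be written down as a small state machine. This is a simplified sketch that only covers the steps shown above, not the complete set of transitions runc supports; the labels none, exit and deleted are made up for this example.

```python
# Transitions exercised in the walkthrough: (state, command) -> new state.
# "exit" stands for the container process ending, "deleted" for the entry
# disappearing from `runc list`; both names are made up for this sketch.
TRANSITIONS = {
    ("none", "create"): "created",
    ("created", "start"): "running",
    ("running", "pause"): "paused",
    ("paused", "resume"): "running",
    ("running", "exit"): "stopped",
    ("stopped", "delete"): "deleted",
}


def next_state(state, command):
    """Return the next state, or raise if the walkthrough never showed this step."""
    try:
        return TRANSITIONS[(state, command)]
    except KeyError:
        raise ValueError(
            f"cannot {command!r} a container in state {state!r}"
        ) from None


state = "none"
for command in ["create", "start", "pause", "resume", "exit", "delete"]:
    state = next_state(state, command)
    print(f"{command:>7} -> {state}")
```

Note that there is no entry leading from stopped back to running, which is the limitation described next.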

Containers in the stopped state cannot run again, so they have to be recreated from a fresh state. As already mentioned, the extracted bundle contains the necessary config.json file besides the rootfs, which will be used by runc to set up the container. We could for example modify the initial run command of the container by executing:

> cd bundle
> jq '.process.args = ["echo", "Hello, world!"]' config.json | sponge config.json
> sudo runc run container
> Hello, world!

We have nearly complete freedom when editing the rootfs or the config.json. So we could tear down the PID namespace isolation between the container and the host:

> jq '.process.args = ["ps", "a"] | del(.linux.namespaces[0])' config.json | sponge config.json
> sudo runc run container
16583 ?        S+     0:00 sudo runc run container
16584 ?        Sl+    0:00 runc run container
16594 pts/0    Rs+    0:00 ps a
[output truncated]
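The jq filter above removes the namespace at index 0, which only works because the pid namespace happens to come first in the list. As a hedged alternative, the same two edits can be sketched in Python by selecting the namespace by its type. The helper names set_args and drop_namespace are made up for this example, and the config dictionary is a shortened stand-in for the real file.

```python
import copy


def set_args(config, args):
    """Return a copy of the config with process.args replaced."""
    updated = copy.deepcopy(config)
    updated.setdefault("process", {})["args"] = list(args)
    return updated


def drop_namespace(config, ns_type):
    """Return a copy of the config without the namespace of the given type."""
    updated = copy.deepcopy(config)
    linux = updated.setdefault("linux", {})
    linux["namespaces"] = [
        ns for ns in linux.get("namespaces", []) if ns.get("type") != ns_type
    ]
    return updated


# Shortened stand-in for the contents of config.json.
config = {
    "process": {"args": ["/bin/bash"]},
    "linux": {
        "namespaces": [{"type": "pid"}, {"type": "network"}, {"type": "mount"}]
    },
}
edited = drop_namespace(set_args(config, ["ps", "a"]), "pid")
print(edited)
```

Both helpers return modified copies, so the original dictionary stays untouched until we decide to write the result back with json.dump.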

In the end runc is a pretty low level runtime, and improper configuration and usage can lead to serious security concerns. It is true that runc has native support for security enhancements like seccomp, Security-Enhanced Linux (SELinux) and AppArmor, but these features should be used by higher level runtimes to ensure correct usage in production. It is also worth mentioning that it is possible to run containers in rootless mode via runc to harden the deployment even further. We will cover these topics in future blog posts as well, but for now that should suffice on that level.

Another drawback in running containers only with runc would be that we have to manually set up the networking to the host to reach out to the internet or other containers. In order to do that we could use the Runtime Specification Hooks feature to set up a default bridge before actually starting the container.
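To give an idea of what that would look like, the sketch below adds an entry under hooks.prestart to a configuration. It is only an illustration under assumptions: the script path /usr/local/bin/setup-bridge and the bridge name br0 are hypothetical, and newer revisions of the specification define additional hook points, so it is worth checking the version you target.

```python
import copy


def add_prestart_hook(config, path, args=None):
    """Return a copy of the config with one more entry under hooks.prestart."""
    updated = copy.deepcopy(config)
    hook = {"path": path}
    if args:
        # By convention args[0] is the program name, as with execv.
        hook["args"] = list(args)
    updated.setdefault("hooks", {}).setdefault("prestart", []).append(hook)
    return updated


# /usr/local/bin/setup-bridge is a hypothetical script, not something runc ships.
config = {"ociVersion": "1.0.0", "process": {"args": ["/bin/bash"]}}
with_hook = add_prestart_hook(
    config, "/usr/local/bin/setup-bridge", ["setup-bridge", "br0"]
)
print(with_hook["hooks"])
```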

But why don’t we leave this job to a higher level runtime as well? Let’s go for that and move on.

The Kubernetes Container Runtime Interface (CRI)

Back in 2016, the Kubernetes project announced the implementation of the Container Runtime Interface (CRI), which provides a standard API for container runtimes to work with Kubernetes. This interface enables users to exchange the runtime in a cluster with ease.

How does the API work? At the bottom of every Kubernetes cluster runs a piece of software called the kubelet, which has the main job of keeping container workloads running and healthy. The kubelet connects to a gRPC server on startup and expects a predefined API there. For example, some service definitions of the API look like this:

// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
    rpc CreateContainer (...) returns (...) {}
    rpc ListContainers  (...) returns (...) {}
    rpc RemoveContainer (...) returns (...) {}
    rpc StartContainer  (...) returns (...) {}
    rpc StopContainer   (...) returns (...) {}

That seems to be pretty much what we already did with runc, managing the container life cycle. If we look further at the API, we see this:

    rpc ListPodSandbox  (...) returns (...) {}
    rpc RemovePodSandbox(...) returns (...) {}
    rpc RunPodSandbox   (...) returns (...) {}
    rpc StopPodSandbox  (...) returns (...) {}
}

What does “sandbox” mean? Containers should already be some kind of sandbox, right? Yes, but in the Kubernetes world, Pods can consist of multiple containers, and this abstract hierarchy has to be mapped into a simple list of containers. Because of that, every creation of a Kubernetes Pod starts with the setup of a so-called PodSandbox. Every container running inside the Pod is attached to this sandbox, so the containers inside can share common resources, like their network interfaces for example. runc alone does not provide such features out of the box, so we have to use a higher level runtime to achieve our goal.
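The relationship between a sandbox and its containers can be modelled in a few lines. The class below is an in-memory toy, not a CRI implementation: it does not speak gRPC, the method names only loosely mirror the RPCs listed above, and the busybox image is just a placeholder for a second container.

```python
import itertools


class FakeRuntime:
    """In-memory stand-in for a CRI runtime; nothing here talks to gRPC."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.sandboxes = {}   # sandbox id -> shared resources
        self.containers = {}  # container id -> sandbox, image and state

    def run_pod_sandbox(self, name):
        sandbox_id = f"sandbox-{next(self._ids)}"
        # One network namespace per sandbox, shared by all its containers.
        self.sandboxes[sandbox_id] = {
            "name": name,
            "netns": f"netns-{sandbox_id}",
        }
        return sandbox_id

    def create_container(self, sandbox_id, image):
        if sandbox_id not in self.sandboxes:
            raise KeyError(f"unknown sandbox {sandbox_id}")
        container_id = f"container-{next(self._ids)}"
        self.containers[container_id] = {
            "sandbox": sandbox_id,
            "image": image,
            "state": "created",
        }
        return container_id

    def start_container(self, container_id):
        self.containers[container_id]["state"] = "running"

    def netns_of(self, container_id):
        sandbox_id = self.containers[container_id]["sandbox"]
        return self.sandboxes[sandbox_id]["netns"]


runtime = FakeRuntime()
pod = runtime.run_pod_sandbox("sandbox")
web = runtime.create_container(pod, "nginx:alpine")
sidecar = runtime.create_container(pod, "busybox")  # placeholder image
runtime.start_container(web)
print(runtime.netns_of(web) == runtime.netns_of(sidecar))
```

Both containers resolve to the same network namespace because it belongs to the sandbox rather than to either container.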

CRI-O

CRI-O is a higher level container runtime which has been purpose-built for the Kubernetes CRI. The name originates from the combination of the Container Runtime Interface and the Open Container Initiative. Isn’t that simple? CRI-O’s journey started as a Kubernetes incubator project back in 2016 under the name Open Container Initiative Daemon (OCID). Version 1.0.0 was released one year later in 2017 and has followed the Kubernetes release cycles from that day on. This means, for example, that Kubernetes version 1.15 can be safely used together with CRI-O 1.15 and so on.

The implementation of CRI-O follows the main UNIX philosophy and aims to be a lightweight alternative to Docker or containerd when it comes to running production-ready workloads inside of Kubernetes. It is not meant to be a developer-facing tool which can be used from the command line. CRI-O has only one major task: Fulfilling the Kubernetes CRI. To achieve that, it utilizes runc for basic container management in the back, while the gRPC server provides the API in the front end. Everything in between is done either by CRI-O itself or by core libraries like containers/storage or containers/image. But that doesn’t mean that we cannot play around with it, so let’s give it a try.

I prepared a container image called “crio-playground” to get started with CRI-O in an efficient manner. This image contains all necessary tools, example files and a working CRI-O instance running in the background. To start a privileged container running the crio-playground, simply execute:

> sudo podman run --privileged -h crio-playground -it saschagrunert/crio-playground
crio-playground:~ $

From now on we will use a tool called crictl to interface with CRI-O and its Container Runtime Interface implementation. crictl allows us to use YAML representations of the CRI API requests to send them to CRI-O. For example, we can create a new PodSandbox with the sandbox.yml lying around in the current working directory of the playground:

metadata:
  name: sandbox
  namespace: default
dns_config:
  servers:
    - 8.8.8.8

To create the sandbox in the running crio-playground container, we now execute:

crio-playground:~ $ crictl runp sandbox.yml
5f2b94f74b28c092021ad8eeae4903ada4b1ef306adf5eaa0e985672363d6336

Let’s store the identifier of the sandbox as the $POD_ID environment variable for later usage as well:

crio-playground:~ $ export POD_ID=5f2b94f74b28c092021ad8eeae4903ada4b1ef306adf5eaa0e985672363d6336

If we now run crictl pods we can see that we finally have one PodSandbox up and running:

crio-playground:~ $ crictl pods
POD ID              CREATED             STATE               NAME                NAMESPACE           ATTEMPT
5f2b94f74b28c       43 seconds ago      Ready               sandbox             default             0

But what’s inside our sandbox? We surely can examine the sandbox further by using runc:

crio-playground:~ $ runc list
ID                                                                 PID         STATUS      BUNDLE                                                                                                             CREATED                          OWNER
5f2b94f74b28c092021ad8eeae4903ada4b1ef306adf5eaa0e985672363d6336   80          running     /run/containers/storage/vfs-containers/5f2b94f74b28c092021ad8eeae4903ada4b1ef306adf5eaa0e985672363d6336/userdata   2019-05-23T13:43:38.798531426Z   root

The sandbox seems to run in a dedicated bundle under /run/containers.

crio-playground:~ $ runc ps $POD_ID
UID        PID  PPID  C STIME TTY          TIME CMD
root        80    68  0 13:43 ?        00:00:00 /pause

Interestingly, there is only one process running inside the sandbox, called pause. As the source code of pause indicates, the main task of this process is to keep the environment running and react to incoming signals. Before we actually create our workload within that sandbox, we have to pre-pull the image we want to run. A trivial example would be to run a web server, so let’s retrieve an nginx image by calling:

crio-playground:~ $ crictl pull nginx:alpine
Image is up to date for docker.io/library/nginx@sha256:0fd68ec4b64b8dbb2bef1f1a5de9d47b658afd3635dc9c45bf0cbeac46e72101

Now let’s create a very simple container definition in YAML, like we did for the sandbox:

metadata:
  name: container
image:
  image: nginx:alpine

And now, let’s kick off the container. For that we have to provide the hash of the sandbox as well as the YAML definitions of the sandbox and container:

crio-playground:~ $ crictl create $POD_ID container.yml sandbox.yml
b205eb2c6abec3e7ade72e0cea09d827968a4c1089483cab06bdf0f4ee82ff0c

Seems to work! Let’s store the container identifier as $CONTAINER_ID for later reuse as well:

crio-playground:~ $ export CONTAINER_ID=b205eb2c6abec3e7ade72e0cea09d827968a4c1089483cab06bdf0f4ee82ff0c

What would you expect if we now check out the status of our two running containers while keeping the CRI API in mind? Correct, the container should be in the created state:

crio-playground:~ $ runc list
ID                                                                 PID         STATUS      BUNDLE                                                                                                             CREATED                          OWNER
5f2b94f74b28c092021ad8eeae4903ada4b1ef306adf5eaa0e985672363d6336   80          running     /run/containers/storage/vfs-containers/5f2b94f74b28c092021ad8eeae4903ada4b1ef306adf5eaa0e985672363d6336/userdata   2019-05-23T13:43:38.798531426Z   root
b205eb2c6abec3e7ade72e0cea09d827968a4c1089483cab06bdf0f4ee82ff0c   343         created     /run/containers/storage/vfs-containers/b205eb2c6abec3e7ade72e0cea09d827968a4c1089483cab06bdf0f4ee82ff0c/userdata   2019-05-23T14:08:53.701174406Z   root

And, like in our previous runc example, the container waits in runc init:

crio-playground:~ $ runc ps $CONTAINER_ID
UID        PID  PPID  C STIME TTY          TIME CMD
root       343   331  0 14:08 ?        00:00:00 /usr/sbin/runc init

crictl shows the container in created as well:

crio-playground:~ $ crictl ps -a
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
b205eb2c6abec       nginx:alpine        13 minutes ago      Created             container           0                   5f2b94f74b28c

Now we have to start the workload to get it into the running state:

crio-playground:~ $ crictl start $CONTAINER_ID
b205eb2c6abec3e7ade72e0cea09d827968a4c1089483cab06bdf0f4ee82ff0c

This should be successful, too. Let’s verify if all processes are running correctly:

crio-playground:~ $ crictl ps
CONTAINER ID        IMAGE               CREATED             STATE               NAME                ATTEMPT             POD ID
b205eb2c6abec       nginx:alpine        15 minutes ago      Running             container           0                   5f2b94f74b28c

An nginx web server should now be running inside the container:

crio-playground:~ $ runc ps $CONTAINER_ID
UID        PID  PPID  C STIME TTY          TIME CMD
root       343   331  0 14:08 ?        00:00:00 nginx: master process nginx -g daemon off;
100        466   343  0 14:24 ?        00:00:00 nginx: worker process

But how do we reach the web server’s content now? We did not expose any ports or apply other advanced configuration to the container, so it should be fairly isolated from the host. The solution lies in the container networking. Because we use a bridged network configuration in the crio-playground, we can simply access the container’s network address. To get it, we can exec into the container and list the network interfaces:

crio-playground:~ $ crictl exec $CONTAINER_ID ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 16:04:8c:44:00:59 brd ff:ff:ff:ff:ff:ff
    inet 172.0.0.2/16 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::1404:8cff:fe44:59/64 scope link
       valid_lft forever preferred_lft forever
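Reading the address off the screen is fine for a single container. If we wanted to script this step, a rough sketch could parse the output like this. The sample text is a shortened copy of the listing above, and the function name ipv4_addresses is made up for this example.

```python
import re

# Shortened copy of the `ip addr` output shown above.
IP_ADDR_OUTPUT = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
3: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP
    link/ether 16:04:8c:44:00:59 brd ff:ff:ff:ff:ff:ff
    inet 172.0.0.2/16 scope global eth0
    inet6 fe80::1404:8cff:fe44:59/64 scope link
"""


def ipv4_addresses(output):
    """Map interface name -> list of IPv4 addresses (without prefix length)."""
    addresses = {}
    current = None
    for line in output.splitlines():
        # Interface headers start in column 0, e.g. "3: eth0: <...>".
        header = re.match(r"^\d+:\s+([^:@\s]+)", line)
        if header:
            current = header.group(1)
            addresses[current] = []
            continue
        # Address lines are indented; "inet6" does not match "inet" + space.
        inet = re.match(r"^\s+inet\s+(\d+\.\d+\.\d+\.\d+)/\d+", line)
        if inet and current is not None:
            addresses[current].append(inet.group(1))
    return addresses


print(ipv4_addresses(IP_ADDR_OUTPUT)["eth0"])
```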

And now just query the inet address for eth0:

crio-playground:~ $ curl 172.0.0.2
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
[output truncated]

Hooray, it works! We successfully ran a Kubernetes workload without running Kubernetes!

The overall Kubernetes story about Network Plugins or the Container Network Interface (CNI) is worth another blog post, but that’s a different story and we stop right here with all the magic.

Conclusion

And that’s a wrap for this part of the blog series about the demystification of containers. We discovered the brief history of container runtimes and had the chance to run containers with the low level runtime runc as well as the higher level runtime CRI-O. I can really recommend having a closer look at the OCI runtime specification and testing different configurations within the crio-playground environment. For sure we will see CRI-O in the future again when we talk about container-related topics like security or networking. Besides that, we will have the chance to explore different tools like podman, buildah or skopeo, which provide more advanced container management solutions. I really hope you enjoyed the read and will continue following my journey into future parts of this series. Feel free to drop me a line anywhere you can find me on the internet. Stay tuned!

You can find all necessary resources about this series on GitHub.

Demystifying Containers – Part I: Kernel Space

This series of blog posts and corresponding talks aims to provide you with a pragmatic view on containers from a historic perspective. Together we will discover modern cloud architectures layer by layer, which means we will start at the Linux Kernel level and end up at writing our own secure cloud native applications.

Simple examples paired with the historic background will guide you from the beginning with a minimal Linux environment up to crafting secure containers, which fit perfectly into today’s and the future’s orchestration world. In the end it should be much easier to understand how features within the Linux kernel, container tools, runtimes, software defined networks and orchestration software like Kubernetes are designed and how they work under the hood.


Part I: Kernel Space

This first blog post (and talk) is scoped to Linux kernel related topics, which will provide you with the necessary foundation to build up a deep understanding about containers. We will gain an insight about the history of UNIX, Linux and talk about solutions like chroot, namespaces and cgroups combined with hacking our own examples. Besides this we will peel some containers to get a feeling about future topics we will talk about.

Introduction

If we are talking about containers nowadays, most people tend to think of the big blue whale or the white steering wheel on the blue background.

Let’s put these thoughts aside and ask ourselves: What are containers in detail? If we look at the corresponding documentation of Kubernetes we only find explanations about “Why to use containers?” and lots of references to Docker. Docker itself explains containers as “a standard unit of software”. Their explanations provide a general overview but do not reveal much of the underlying “magic”. In the end, people tend to imagine containers as cheap virtual machines (VMs), which technically does not come close to the real world. One reason may be that the word “container” does not mean anything precise at all. The same applies to the word “pod” in the container orchestration ecosystem.

If we strip it down then containers are only isolated groups of processes running on a single host, which fulfill a set of “common” features. Some of these fancy features are built directly into the Linux kernel and almost all of them have different historical origins.

So containers have to fulfill four major requirements to be acceptable as such:

  1. Not negotiable: They have to run on a single host. Okay, so two computers cannot run a single container.
  2. Clearly: They are groups of processes. You might know that Linux processes live inside a tree structure, so we can say containers must have a root process.
  3. Okay: They need to be isolated, whatever this means in detail.
  4. Not so clear: They have to fulfill common features. Features in general seem to change over time, so we have to point out what the most common features are.

These requirements alone can lead to confusion and the picture is not clear yet. So let’s start from the historical beginning to keep things simple.

chroot

Almost every UNIX operating system has the possibility to change the root directory of the current running process (and its children). This originates from the first occurrence of chroot in UNIX Version 7 (released 1979), from where it continued the journey into the awesome Berkeley Software Distribution (BSD). In Linux you can nowadays use chroot(2) as a system call (a kernel API function call) or via the corresponding standalone wrapper program. Chroot is also referred to as “jail”, because some person used it as a honeypot to monitor a security hacker back in 1991. So chroot is much older than Linux and it has been (mis)used in the early 2000s for the first approaches in running applications as what we would call today “microservices”. Chroot is currently used by a wide range of applications, for example within build services for different distributions. Nowadays the BSD implementation differs a lot from the Linux one, and we will focus on the latter for now.

What is needed to run an own chroot environment? Not that much, since something like this already works:

> mkdir -p new-root/{bin,lib64}
> cp /bin/bash new-root/bin
> cp /lib64/{ld-linux-x86-64.so*,libc.so*,libdl.so.2,libreadline.so*,libtinfo.so*} new-root/lib64
> sudo chroot new-root

We create a new root directory, copy a bash shell and its dependencies in and run chroot. This jail is pretty useless: All we have at hand is bash and its builtin functions like cd and pwd.
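Finding out which libraries have to be copied is usually done with ldd. As a rough sketch of how this step could be automated, the snippet below extracts the library paths from ldd-style output. The sample lines are made-up illustration data, since addresses and library versions differ between systems, and the function name library_paths is an assumption of this example.

```python
import re

# Illustrative `ldd /bin/bash` output; addresses and exact library versions
# differ between systems, so treat these lines as made-up sample data.
SAMPLE_LDD = """\
\tlinux-vdso.so.1 (0x00007ffd0a5f2000)
\tlibreadline.so.8 => /lib64/libreadline.so.8 (0x00007f1c2a000000)
\tlibdl.so.2 => /lib64/libdl.so.2 (0x00007f1c29ff0000)
\tlibc.so.6 => /lib64/libc.so.6 (0x00007f1c29e00000)
\tlibtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f1c29dc0000)
\t/lib64/ld-linux-x86-64.so.2 (0x00007f1c2a060000)
"""


def library_paths(ldd_output):
    """Return the absolute paths of the libraries listed in ldd output."""
    paths = []
    for line in ldd_output.splitlines():
        line = line.strip()
        # "name => /path (0x...)" for regular shared libraries.
        arrow = re.match(r"^\S+ => (/\S+) \(0x[0-9a-f]+\)$", line)
        # "/path (0x...)" for the dynamic loader itself.
        direct = re.match(r"^(/\S+) \(0x[0-9a-f]+\)$", line)
        if arrow:
            paths.append(arrow.group(1))
        elif direct:
            paths.append(direct.group(1))
        # Entries without a path, such as linux-vdso, are skipped.
    return paths


print(library_paths(SAMPLE_LDD))
```

On a real system the input would come from running ldd on the binary, for example through subprocess.run with capture_output=True and text=True.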

One might think it could be worth running a statically linked binary in a jail and that would be the same as running a container image. It’s absolutely not, and a jail is not really a standalone security feature but more a good addition to our container world.

The current working directory is left unchanged when calling chroot via a syscall, so relative paths can still refer to files outside of the new root. This call changes only the root path and nothing else. Besides this, further calls to chroot do not stack and they will override the current jail. Only privileged processes with the capability CAP_SYS_CHROOT are able to call chroot. At the end of the day the root user can easily escape from a jail by running a program like this:

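#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    /* Create a directory inside the jail and make it the new root. */
    mkdir(".out", 0755);
    chroot(".out");
    /* The working directory was not changed, so it now lies outside the root. */
    chdir("../../../../../");
    chroot(".");
    /* argv[0] is the program name; "-i" requests an interactive shell. */
    return execl("/bin/bash", "bash", "-i", (char *)NULL);
}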

We create a new jail by overwriting the current one and change the working directory to some relative path outside of the chroot environment. Another call to chroot then brings us outside of the jail, which can be verified by spawning a new interactive bash shell.

Nowadays chroot is not used by container runtimes any more and was replaced by pivot_root(2), which has the benefit of putting the old mounts into a separate directory on calling. These old mounts can be unmounted afterwards to make the old filesystem completely invisible to processes that have broken out.

To continue with a more useful jail we need an appropriate root filesystem (rootfs). This contains all binaries, libraries and the necessary file structure. But where to get one? What about peeling it from an already existing Open Container Initiative (OCI) container, which can be easily done with the two tools skopeo and umoci:

> skopeo copy docker://opensuse/tumbleweed:latest oci:tumbleweed:latest

[output removed]

> sudo umoci unpack --image tumbleweed:latest bundle

[output removed]

Now with our freshly downloaded and extracted rootfs we can chroot into the jail via:

> sudo chroot bundle/rootfs
#

It looks like we’re running inside a fully working environment, right? But what did we achieve? We can see that we may sneak a peek outside the jail from a process perspective:

> mkdir /proc
> mount -t proc proc /proc
> ps aux

[output removed]

There is no process isolation available at all. We can even kill programs running outside of the jail, what a metaphor! Let’s peek into the network devices:

> mkdir /sys
> mount -t sysfs sys /sys
> ls /sys/class/net
eth0 lo

There is no network isolation either. This missing isolation paired with the ability to leave the jail leads to lots of security related concerns, because jails are sometimes used for wrong (security related) purposes. How to solve this? This is where the Linux namespaces join the party.

Linux Namespaces

Namespaces are a Linux kernel feature which was introduced back in 2002 with Linux 2.4.19. The idea behind a namespace is to wrap certain global system resources in an abstraction layer. This makes it appear like the processes within a namespace have their own isolated instance of the resource. The kernel’s namespace abstraction allows different groups of processes to have different views of the system.

Not all available namespaces were implemented from the beginning. Full support for what we now understand as “container ready” was finished in kernel version 3.8 back in 2013 with the introduction of the user namespace. We currently end up with seven distinct namespaces implemented: mnt, pid, net, ipc, uts, user and cgroup. No worries, we will discuss them in detail. In September 2016 two additional namespaces were proposed (time and syslog) which are not fully implemented yet. Let’s have a look into the namespace API before digging into certain namespaces.

API

The namespace API of the Linux kernel consists of three main system calls:

clone

The clone(2) API function creates a new child process, in a manner similar to fork(2). Unlike fork(2), the clone(2) API allows the child process to share parts of its execution context with the calling process, such as the memory space, the table of file descriptors, and the table of signal handlers. You can pass different namespace flags to clone(2) to create new namespaces for the child process.

unshare

The function unshare(2) allows a process to disassociate parts of the execution context which are currently being shared with others.

setns

The function setns(2) reassociates the calling thread with the provided namespace file descriptor. This function can be used to join an existing namespace.

proc

Besides the available syscalls, the proc filesystem exposes additional namespace-related files. Since Linux 3.8, each file in /proc/$PID/ns is a “magic” link which can be used as a handle for performing operations (like setns(2)) on the referenced namespace.

> ls -Gg /proc/self/ns/
total 0
lrwxrwxrwx 1 0 Feb  6 18:32 cgroup -> 'cgroup:[4026531835]'
lrwxrwxrwx 1 0 Feb  6 18:32 ipc -> 'ipc:[4026531839]'
lrwxrwxrwx 1 0 Feb  6 18:32 mnt -> 'mnt:[4026531840]'
lrwxrwxrwx 1 0 Feb  6 18:32 net -> 'net:[4026532008]'
lrwxrwxrwx 1 0 Feb  6 18:32 pid -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb  6 18:32 pid_for_children -> 'pid:[4026531836]'
lrwxrwxrwx 1 0 Feb  6 18:32 user -> 'user:[4026531837]'
lrwxrwxrwx 1 0 Feb  6 18:32 uts -> 'uts:[4026531838]'
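
These links are also handy for comparing processes: two processes share a namespace exactly when the corresponding links resolve to the same identifier. Here is a minimal shell sketch of that idea. The helper names ns_id and same_ns are made up for illustration, and reading the links of processes owned by other users usually requires root:

```shell
# Print the namespace identifier (e.g. "pid:[4026531836]") of the given
# PID ($1) for the given namespace type ($2).
ns_id() {
    readlink "/proc/$1/ns/$2" 2>/dev/null
}

# Succeed when the PIDs $1 and $2 are members of the same namespace of
# type $3. Fails if either link cannot be read.
same_ns() {
    a=$(ns_id "$1" "$3") || return 1
    b=$(ns_id "$2" "$3") || return 1
    [ "$a" = "$b" ]
}

# The current shell trivially shares a UTS namespace with itself.
if same_ns "$$" "$$" uts; then
    echo "same uts namespace: $(ns_id "$$" uts)"
fi
```

Replacing one of the two PIDs with the PID of a process started via unshare -u should make the check fail for uts while still succeeding for, say, net.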

This allows us, for example, to track in which namespaces certain processes reside. Apart from the programmatic approach, another way to play around with namespaces is using tools from the util-linux package, which contains dedicated wrapper programs for the mentioned syscalls. One handy namespace-related tool within this package is lsns. It lists useful information about all currently accessible namespaces or about a single given one. But now let’s finally get our hands dirty.

Available Namespaces

Mount (mnt)

The first namespace we want to try out is the mnt namespace, which was the first one implemented back in 2002. At that time (mostly) no one thought that multiple namespaces would ever be needed, so they decided to call the namespace clone flag CLONE_NEWNS. This leads to a small inconsistency with other namespace clone flags (I see you suffering!). With the mnt namespace Linux is able to isolate a set of mount points for a group of processes.

A great use case of the mnt namespace is to create environments similar to jails, but in a more secure fashion. How to create such a namespace? This can be easily done via an API function call or the unshare command line tool. So we can do this:

> sudo unshare -m
# mkdir mount-dir
# mount -n -o size=10m -t tmpfs tmpfs mount-dir
# df mount-dir
Filesystem     1K-blocks  Used Available Use% Mounted on
tmpfs              10240     0     10240   0% <PATH>/mount-dir
# touch mount-dir/{0,1,2}

Looks like we have a successfully mounted tmpfs, which is not available on the host system level:

> ls mount-dir
> grep mount-dir /proc/mounts
>

The actual memory being used for the mount point lies in an abstraction layer called Virtual File System (VFS), which is part of the kernel and on which every other filesystem is based. If the namespace gets destroyed, the mount memory is unrecoverably lost. Combined with the user namespace covered later, the mount namespace abstraction gives us the possibility to create entire virtual environments in which we are the root user even without root permissions.

On the host system we are able to see the mount point via the mountinfo file inside of the proc filesystem:

> grep mount-dir /proc/$(pgrep -u root bash)/mountinfo
349 399 0:84 / /mount-dir rw,relatime - tmpfs tmpfs rw,size=1024k

How to work with these mount points on a source code level? Well, programs tend to keep a file handle on the corresponding /proc/$PID/ns/mnt file, which refers to the used namespace. In the end, mount namespace related implementation scenarios can be really complex, but they give us the power to create flexible container filesystem trees. The last thing I want to mention is that mounts can have different flavors (shared, slave, private, unbindable), which are best explained within the shared subtree documentation of the Linux kernel.

UNIX Time-sharing System (uts)

The UTS namespace was introduced in Linux 2.6.19 (2006) and allows us to unshare the domain- and hostname from the current host system. Let’s give it a try:

> sudo unshare -u
# hostname
nb
# hostname new-hostname
# hostname
new-hostname

And if we look at the system level nothing has changed, hooray:

> hostname
nb

The UTS namespace is yet another nice addition in containerization, especially when it comes to container networking related topics.

Interprocess Communication (ipc)

IPC namespaces came with Linux 2.6.19 (2006) too and isolate interprocess communication (IPC) resources. In particular, these are System V IPC objects and POSIX message queues. One use case of this namespace would be to separate the shared memory (SHM) between two processes to avoid misuse. Instead, each process will be able to use the same identifiers for a shared memory segment and produce two distinct regions. When an IPC namespace is destroyed, all IPC objects in the namespace are automatically destroyed, too.

Process ID (pid)

The PID namespace was introduced in Linux 2.6.24 (2008) and gives processes an independent set of process identifiers (PIDs). This means that processes which reside in different namespaces can own the same PID. In the end a process has two PIDs: the PID inside the namespace, and the PID outside the namespace on the host system. The PID namespaces can be nested, so if a new process is created it will have a PID for each namespace from its current namespace up to the initial PID namespace.

The first process created in a PID namespace gets the number 1 and receives the same special treatment as the usual init process. For example, orphaned processes within the namespace will be re-parented to the namespace’s PID 1 rather than the host PID 1. In addition, the termination of this process will immediately terminate all processes in its PID namespace and any descendants. Let’s create a new PID namespace:

> sudo unshare -fp --mount-proc
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.4  0.6  18688  6608 pts/0    S    23:15   0:00 -bash
root        39  0.0  0.1  35480  1768 pts/0    R+   23:15   0:00 ps aux

Looks isolated, doesn’t it? The --mount-proc flag is needed to re-mount the proc filesystem from the new namespace. Otherwise we would not see the PID subtree corresponding with the namespace. Another option would be to manually mount the proc filesystem via mount -t proc proc /proc, but this also overrides the mount from the host where it has to be remounted afterwards.

Network (net)

Network namespaces were completed in Linux 2.6.29 (2009) and can be used to virtualize the network stack. Each network namespace contains its own resource properties within /proc/net. Furthermore, a network namespace contains only a loopback interface on initial creation. Let’s create one:

> sudo unshare -n
# ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Every network interface (physical or virtual) is present in exactly one namespace at a time, but an interface can be moved between namespaces. Each namespace contains a private set of IP addresses, its own routing table, socket listing, connection tracking table, firewall, and other network-related resources.

Destroying a network namespace destroys any virtual interfaces within it and moves any physical interfaces within it back to the initial network namespace.

A possible use case for the network namespace is creating Software Defined Networks (SDN) via virtual Ethernet (veth) interface pairs. One end of the network pair will be plugged into a bridged interface whereas the other end will be assigned to the target container. This is how pod networks like flannel work in general.

Let’s see how it works. First, we need to create a new network namespace, which can be done via ip, too:

> sudo ip netns add mynet
> sudo ip netns list
mynet

So we created a new network namespace called mynet. When ip creates a network namespace, it will create a bind mount for it under /var/run/netns too. This allows the namespace to persist even when no processes are running within it.

With ip netns exec we can inspect and manipulate our network namespace even further:

> sudo ip netns exec mynet ip l
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> sudo ip netns exec mynet ping 127.0.0.1
connect: Network is unreachable

The network seems down, let’s bring it up:

> sudo ip netns exec mynet ip link set dev lo up
> sudo ip netns exec mynet ping 127.0.0.1
PING 127.0.0.1 (127.0.0.1) 56(84) bytes of data.
64 bytes from 127.0.0.1: icmp_seq=1 ttl=64 time=0.016 ms

Hooray! Now let’s create a veth pair which should allow communication later on:

> sudo ip link add veth0 type veth peer name veth1
> sudo ip link show type veth
11: veth1@veth0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether b2:d1:fc:31:9c:d3 brd ff:ff:ff:ff:ff:ff
12: veth0@veth1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:0f:37:18:76:52 brd ff:ff:ff:ff:ff:ff

Both interfaces are automatically connected, which means that packets sent to veth0 will be received by veth1 and vice versa. Now we associate one end of the veth pair to our network namespace:

> sudo ip link set veth1 netns mynet
> ip link show type veth
12: veth0@if11: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether ca:0f:37:18:76:52 brd ff:ff:ff:ff:ff:ff link-netns mynet

Our network interfaces need some addresses for sure:

> sudo ip netns exec mynet ip addr add 172.2.0.1/24 dev veth1
> sudo ip netns exec mynet ip link set dev veth1 up
> sudo ip addr add 172.2.0.2/24 dev veth0
> sudo ip link set dev veth0 up

Communicating in both directions should now be possible:

> ping -c1 172.2.0.1
PING 172.2.0.1 (172.2.0.1) 56(84) bytes of data.
64 bytes from 172.2.0.1: icmp_seq=1 ttl=64 time=0.036 ms

--- 172.2.0.1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.036/0.036/0.036/0.000 ms
> sudo ip netns exec mynet ping -c1 172.2.0.2
PING 172.2.0.2 (172.2.0.2) 56(84) bytes of data.
64 bytes from 172.2.0.2: icmp_seq=1 ttl=64 time=0.020 ms

--- 172.2.0.2 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.020/0.020/0.020/0.000 ms

It works, but we wouldn’t have any internet access from the network namespace. We would need a network bridge or something similar for that and a default route from the namespace. I leave this task up to you, for now let’s go on to the next namespace.

User ID (user)

With Linux 3.5 (2012) the isolation of user and group IDs was finally possible via namespaces. Linux 3.8 (2013) made it possible to create user namespaces even without being actually privileged. The user namespace allows the user and group IDs of a process to differ inside and outside of the namespace. An interesting use case is that a process can have a normal unprivileged user ID outside a user namespace while being fully privileged inside.

Let’s give it a try:

> id -u
1000
> unshare -U
> whoami
nobody

After the namespace creation, the files /proc/$PID/{u,g}id_map expose the mappings for user and group IDs for the PID. These files can be written only once to define the mappings.

In general, each line within these files contains a one-to-one mapping of a range of contiguous user IDs between two user namespaces and could look like this:

> cat /proc/$PID/uid_map
0 1000 1

The example above translates to: starting at user ID 0 inside the namespace, the mapping covers a range starting at ID 1000 outside of it. Since the defined length is 1, this applies only to the user with the ID 1000, who becomes user 0 inside the namespace.
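
To make the three columns more tangible, here is a small sketch that translates an ID inside the namespace to the ID outside of it for a single mapping line. The helper map_id is hypothetical and only handles one line, whereas real map files may contain several:

```shell
# Translate an ID inside the namespace ($2) to the ID outside of it, using
# one uid_map/gid_map line ($1) of the form
# "<inside-start> <outside-start> <length>".
# Prints nothing and fails if the ID is not covered by the mapping.
map_id() {
    echo "$1" | awk -v id="$2" '{
        if (id >= $1 && id < $1 + $3) { print $2 + (id - $1); found = 1 }
    } END { exit !found }'
}

map_id "0 1000 1" 0                           # prints 1000
map_id "0 1000 1" 1 || echo "ID 1 is unmapped"
```

With a longer range such as "0 100000 65536", ID 5 inside the namespace would translate to 100005 outside.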

If a process now tries to access a file, its user and group IDs are mapped into the initial user namespace for the purpose of permission checking. When a process retrieves file user and group IDs (via stat(2)), the IDs are mapped in the opposite direction.

In the unshare example above, we implicitly call getuid(2) before writing an appropriate user mapping, which results in an unmapped ID. This unmapped ID is automatically converted to the overflow user ID (65534 by default, or the value in /proc/sys/kernel/overflow{g,u}id).

The file /proc/$PID/setgroups contains either allow or deny to enable or disable the permission to call the setgroups(2) syscall within the user namespace. The file was added to address a security issue introduced with the user namespace: it would be possible for an unprivileged process to create a new namespace in which the user had all privileges. This formerly unprivileged user would then be able to drop groups via setgroups(2) to gain access to files they previously could not access.

In the end the user namespace enables great security additions to the container world, which are essential for running rootless containers.

Control Group (cgroup)

Cgroups started their journey in 2008 with Linux 2.6.24 as a dedicated Linux kernel feature. The main goal of cgroups is to support resource limiting, prioritization, accounting and controlling. A major redesign started with version 2 in 2013, whereas the cgroup namespace was added with Linux 4.6 (2016) to prevent leaking host information into a namespace. The second version of cgroups was released around that time too, and major features have been added since then. One recent example is an Out-of-Memory (OOM) killer feature which adds the ability to kill a cgroup as a single unit to guarantee the overall integrity of the workload.

Let’s play around with cgroups and create a new one. By default, the kernel exposes cgroups in /sys/fs/cgroup. To create a new cgroup, we simply create a new sub-directory at that location:

> sudo mkdir /sys/fs/cgroup/memory/demo
> ls /sys/fs/cgroup/memory/demo
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks

You can see that there are already some default files exposed there. Now we are able to set the memory limit for that cgroup. We are also turning off swap to make our example implementation work.

> sudo su
# echo 100000000 > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
# echo 0 > /sys/fs/cgroup/memory/demo/memory.swappiness
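
Note that the paths above follow the cgroup version 1 layout. On systems that mount the version 2 unified hierarchy, there is no memory sub-directory and the hard limit lives in a file called memory.max. A rough sketch for telling the two apart, assuming the presence of cgroup.controllers at the hierarchy root is a good enough indicator (the helper names are made up for this example):

```shell
# Report which cgroup hierarchy is mounted at the given root (default:
# /sys/fs/cgroup). Version 2 exposes a cgroup.controllers file at its root.
cgroup_version() {
    root=${1:-/sys/fs/cgroup}
    if [ -f "$root/cgroup.controllers" ]; then
        echo 2
    else
        echo 1
    fi
}

# Print the name of the hard memory limit file for the detected version.
memory_limit_file() {
    if [ "$(cgroup_version "$1")" = 2 ]; then
        echo memory.max
    else
        echo memory.limit_in_bytes
    fi
}

echo "cgroup v$(cgroup_version), memory limit file: $(memory_limit_file)"
```

The rest of this section sticks to the version 1 file names used above.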

To assign a process to a cgroup we can write the corresponding PID to the cgroup.procs file:

# echo $$ > /sys/fs/cgroup/memory/demo/cgroup.procs

Now we can execute a sample application to consume more than the allowed 100 megabytes of memory. The application I used is written in Rust and looks like this:

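pub fn main() {
    // Buffer that keeps growing until the cgroup memory limit is hit
    let mut vec = vec![];
    loop {
        // Append another 10 MB per iteration and report the total size
        vec.extend_from_slice(&[1u8; 10_000_000]);
        println!("{}0 MB", vec.len() / 10_000_000);
    }
}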

If we run the program, we see that the process will be killed because of the memory constraints we set. So our host system is still usable.

# rustc memory.rs
# ./memory
10 MB
20 MB
30 MB
40 MB
50 MB
60 MB
70 MB
80 MB
90 MB
Killed

Composing Namespaces

Namespaces are composable, too! This reveals their true power and makes it possible to have isolated pid namespaces which share the same network interface, like it is done in Kubernetes Pods.

To demonstrate this, let’s create a new namespace with an isolated PID:

> sudo unshare -fp --mount-proc
# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.6  18688  6904 pts/0    S    23:36   0:00 -bash
root        39  0.0  0.1  35480  1836 pts/0    R+   23:36   0:00 ps aux

The setns(2) syscall with its appropriate wrapper program nsenter can now be used to join the namespace. For this we have to find out which namespace we want to join:

> export PID=$(pgrep -u root bash)
> sudo ls -l /proc/$PID/ns

Now, it is easily possible to join the namespace via nsenter:

> sudo nsenter --pid=/proc/$PID/ns/pid unshare --mount-proc
# ps aux
root         1  0.1  0.0  10804  8840 pts/1    S+   14:25   0:00 -bash
root        48  3.9  0.0  10804  8796 pts/3    S    14:26   0:00 -bash
root        88  0.0  0.0   7700  3760 pts/3    R+   14:26   0:00 ps aux

We can now see that we are a member of the same PID namespace! It is also possible to enter already running containers via nsenter, but this topic will be covered later on.

Demo Application

A small demo application can be used to create a simple isolated environment via the namespace API:

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/mount.h>
#include <sys/msg.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACKSIZE (1024 * 1024)
static char stack[STACKSIZE];

void print_err(char const * const reason)
{
    fprintf(stderr, "Error %s: %s\n", reason, strerror(errno));
}

int exec(void * args)
{
    // Remount proc
    if (mount("proc", "/proc", "proc", 0, "") != 0) {
        print_err("mounting proc");
        return 1;
    }

    // Set a new hostname
    char const * const hostname = "new-hostname";
    if (sethostname(hostname, strlen(hostname)) != 0) {
        print_err("setting hostname");
        return 1;
    }

    // Create a message queue
    key_t key = {0};
    if (msgget(key, IPC_CREAT) == -1) {
        print_err("creating message queue");
        return 1;
    }

    // Execute the given command
    char ** const argv = args;
    if (execvp(argv[0], argv) != 0) {
        print_err("executing command");
        return 1;
    }

    return 0;
}

int main(int argc, char ** argv)
{
    // Provide some feedback about the usage
    if (argc < 2) {
        fprintf(stderr, "No command specified\n");
        return 1;
    }

    // Namespace flags
    const int flags = CLONE_NEWNET | CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWIPC |
                      CLONE_NEWPID | CLONE_NEWUSER | SIGCHLD;

    // Create a new child process
    pid_t pid = clone(exec, stack + STACKSIZE, flags, &argv[1]);

    if (pid < 0) {
        print_err("calling clone");
        return 1;
    }

    // Wait for the process to finish
    int status = 0;
    if (waitpid(pid, &status, 0) == -1) {
        print_err("waiting for pid");
        return 1;
    }

    // Return the exit code, or 1 if the child did not exit normally
    return WIFEXITED(status) ? WEXITSTATUS(status) : 1;
}

The purpose of the application is to spawn a new child process in a set of new namespaces. Any command provided to the executable will be forwarded to the new child process. The application terminates when the command execution is done. You can test and verify the implementation via:

> gcc -o namespaces namespaces.c
> ./namespaces ip a
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> ./namespaces ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nobody       1  0.0  0.1  36524  1828 pts/0    R+   23:46   0:00 ps aux
> ./namespaces whoami
nobody

This is certainly not a fully working container, but it should give you a feeling for how container runtimes leverage namespaces to manage containers. Feel free to use this example as a starting point for your own little experiments with the namespace API.

Putting it all Together

Do you remember the rootfs we extracted from the image within the chroot section? We can use a low level container runtime like runc to easily run a container from the rootfs:

> sudo runc run -b bundle container

If we now inspect the system namespaces, we see that runc already created mnt, uts, ipc, pid and net for us:

> sudo lsns | grep bash
4026532499 mnt         1  6409 root   /bin/bash
4026532500 uts         1  6409 root   /bin/bash
4026532504 ipc         1  6409 root   /bin/bash
4026532505 pid         1  6409 root   /bin/bash
4026532511 net         1  6409 root   /bin/bash

I will stop here and we will learn more about container runtimes and what they do, in upcoming blog posts and talks.

Conclusion

I really hope you enjoyed the read and that the mysteries about containers are now a little bit more fathomable. If you run Linux it is easy to play around with different isolation techniques from scratch. In the end a container runtime nicely uses all these isolation features on different abstraction levels to provide a stable and robust development and production platform for containers.

There are lots of topics which were not covered here because I wanted to stay at a consistent level of detail. A great resource for digging deeper into the topic of Linux namespaces is the Linux Programmer’s Manual: NAMESPACES(7).

Feel free to drop me a line or get in contact with me for any questions or feedback. The next blog posts will cover container runtimes, security and the overall ecosystem around latest container technologies. Stay tuned!

You can find all necessary resources about this series on GitHub.

Apache server-status

To the uninitiated, the mod_status output can look like so much gobbledegook, but it’s really quite straightforward. Let’s take a look at some sample output.

Apache Server Status for somedomain.com
Server Version: Apache/1.3.9 (Unix) PHP/4.0b3 
Server Built: Mar 4 2000 17:01:01

The first few lines identify and provide a brief description of your server. The server version information includes an incomplete list of some of the modules compiled into your server. Our example server is running on a Unix system and has been compiled with support for the PHP scripting language. (The level of detail provided by the server version line may be limited by the ServerTokens configuration directive.)

Current Time: Thursday, 13-Apr-2000 17:22:36 PDT
Restart Time: Thursday, 13-Apr-2000 17:15:26 PDT
Parent Server Generation: 14
Server uptime: 7 minutes 10 seconds
Total accesses: 42 - Total Traffic: 187 kB
CPU Usage: u.1 s.1 cu0 cs0 - .0465% CPU load
.0977 requests/sec - 445 B/second - 4559 B/request
3 requests currently being processed, 5 idle servers

The next block represents the server’s current state. Our example server has only been up for a few minutes and hasn’t yet seen much activity. It is currently dealing with three requests, one of which is my request for the server status itself. The message that five servers are idle is a clue that this server is configured to maintain a pool of at least five spare child processes ready to spring into action should the need arise.
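
The derived figures on the requests/sec line follow directly from the totals above it. A quick sketch to recompute them, assuming that kB means 1024 bytes (which is what reproduces the sample numbers) and using a hypothetical helper called status_rates:

```shell
# Recompute the per-second and per-request figures of the status block.
# Arguments: total accesses, total traffic in kB, uptime in seconds.
# Assumes 1 kB = 1024 bytes and truncates the byte rates to whole numbers.
status_rates() {
    awk -v accesses="$1" -v traffic_kb="$2" -v uptime_s="$3" 'BEGIN {
        bytes = traffic_kb * 1024
        printf "%.4f requests/sec - %d B/second - %d B/request\n",
               accesses / uptime_s, bytes / uptime_s, bytes / accesses
    }'
}

# 42 accesses, 187 kB, 7 minutes 10 seconds = 430 seconds
status_rates 42 187 430
```

This prints 0.0977 requests/sec - 445 B/second - 4559 B/request, which matches the sample apart from the leading zero that mod_status leaves out.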

K___K_W_........................................................
................................................................
................................................................
................................................................

Scoreboard Key:
   "_" Waiting for Connection, "S" Starting up, "R" Reading Request,
   "W" Sending Reply, "K" KeepAlive (read), "D" DNS Lookup, "L" Logging,
   "G" Gracefully finishing, "." Open slot with no current process

No, that’s not boring Morse code; it’s the “scoreboard,” a pseudo-graphical representation of the state of the server’s child processes. According to the included Scoreboard Key, our server is replying to one request, maintaining two KeepAlive connections, and keeping five processes idle. A busier server’s scoreboard would look more like:


WWKW__WW_KKKWK__KKKKWKKKKK_WKKK_KK__KRWKKK__KK___K____WKK__KWWKK
_K___K___WWKWWW_W_W_WWWK_WW_WWWLWWW_KWWKKWKWWKWWKKWW_KWKKKKW__WK
WKWWW_KKWKKKWK_KW_KKKK__KK_KKKWWK_KW__K_KKK_K..........W........
................................................................

For more on pool regulation and KeepAlive, see my earlier HTTP Wrangler column, “An Amble Through Apache Configuration.”
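
If you would rather not count characters by eye, a few lines of shell will tally a scoreboard for you. This is only a sketch: count_state is a made-up helper, and the scoreboard string is the active part of the quiet server’s first row shown above:

```shell
# Count how many slots of a scoreboard string ($1) are in the given
# single-character state ($2).
count_state() {
    printf '%s' "$1" | tr -cd "$2" | wc -c | tr -d ' '
}

scoreboard='K___K_W_'
echo "sending reply: $(count_state "$scoreboard" W)"
echo "keepalive:     $(count_state "$scoreboard" K)"
echo "idle:          $(count_state "$scoreboard" _)"
```

The three counts come out as 1, 2 and 5, in line with the reading of the Scoreboard Key given earlier.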

Srv  PID   Acc     M CPU  SS Req Conn Child Slot
0-14 29987 0/24/24 W 0.09 2  0   0.0  0.16  0.16

Client     VHost             Request
127.0.0.1  www.mydomain.net  GET /server-status HTTP/1.0

In addition to a more general overview of your server’s activity, mod_status gets down to the nitty-gritty, displaying a snapshot of the individual requests it is currently handling. Let’s take a gander at a fairly representative request. Please note that the output above has been split in half for display purposes.

0-14 Srv
The ID of the child process and its generation. The generation increases each time a child process is restarted, whether due to a server restart or a limit placed on the number of requests a child is allowed to handle. See the MaxRequestsPerChild directive.
29987 PID
The child’s process ID.
0/24/24 Acc
The first number in this trio is the number of accesses or requests using this connection. For non-KeepAlive connections, this will be 0 since each request makes its own connection and so is always the first (and last). The second is the number of requests handled thus far by this child. The third is the number of requests handled by this slot; the child may have come and gone, its slot taken by another.
W Mode
The child’s mode of operation; one of the following possibilities:

"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" KeepAlive (read), "D" DNS Lookup, "L" Logging,
"G" Gracefully finishing, "." Open slot with no current process
0.09 2 0 0.0 0.16 0.16 CPU SS Req Conn Child Slot
Some of the less useful bits and pieces…

CPU: The child’s CPU usage in number of seconds.
SS: Seconds elapsed since the beginning of the request.
Req: Milliseconds taken to process the request.
Conn: Kilobytes transferred across this connection.
Child: Megabytes transferred by this child process.
Slot: Megabytes transferred by this slot, across children.

www.mydomain.net VHost
Perhaps your server hosts multiple virtual domains; how would you determine which page is being requested by GET /index.html? The VHost column helps you sort out which request is going to which virtual host; in this example, www.mydomain.net.
GET /server-status HTTP/1.0 Request
This particular hit is my request for server-status. The GET bit indicates a simple request for a document (as opposed to sending data to the server using POST). The browser (in this case the Unix command-line wget program) is using HTTP version 1.0.

For more on HTTP, see my earlier HTTP Wrangler column, “Introducing Apache.”

Installation

So how do you install and configure mod_status? I make the assumption here that you built and installed Apache from source. If you’re not familiar with building Apache, may I suggest you read my earlier HTTP Wrangler column, Getting, Installing, and Running Apache.

First, move into your Apache source directory.

% cd /usr/local/src/apache_1.3.x

Thankfully Apache’s configure script creates a cache file, config.status, saving us the bother of completely reconfiguring our Apache build from scratch. All we need to do is run config.status, supplying the one argument necessary to add mod_status.

If you’ve not already done so, now would be the time to become root.

# ./config.status --enable-module=status
Configuring for Apache, Version 1.3.11
...
Creating Makefile
Creating Configuration.apaci in src
Creating Makefile in src
 + configured for Linux platform
 + setting C compiler to gcc
 + setting C pre-processor to gcc -E
 + checking for system header files
 + adding selected modules
 + checking sizeof various data types
 + doing sanity check on compiler and options
...
Creating Makefile in src/modules/standard

Note: Apache’s configure script automagically updates config.status to include mod_status; next time you configure you will not need to enable mod_status again.

Now that we’ve reconfigured Apache, let’s rebuild.

# make

Your screen should look something like:

# make
===> src
make[1]: Entering directory `src/httpd/apache_1.3.11'
make[2]: Entering directory `src/httpd/apache_1.3.11/src'
===> src/regex
...
[several unsightly lines later]
...
gcc  -DLINUX=2 -DUSE_HSREGEX -DUSE_EXPAT -I../lib/expat-lite 
-DNO_DL_NEEDED `../apaci` -o ab   -L../os/unix -L../ap ab.o 
-lap -los  -lm -lcrypt
make[2]: Leaving directory `src/httpd/apache_1.3.11/src/support'
<=== src/support
make[1]: Leaving directory `src/httpd/apache_1.3.11'
<=== src
#

Finally, you’re ready to install your freshly built Apache.

# make install

(While not strictly necessary — reinstalling should only overwrite files that probably haven’t changed since your last install — I always advise backing up your Apache directory.)

Configuration

Mod_status is easy to configure; in fact the directives already exist in your httpd.conf file and simply need to be uncommented and edited slightly. If you’re not familiar with Apache configuration, may I suggest you read my earlier HTTP Wrangler column, An Amble Through Apache Configuration.

# cd /usr/local/apache/conf

(or wherever your Apache installation’s configuration files are located)

Open your httpd.conf file in the text editor of your choice and search for the following set of configuration directives:

# Allow server status reports, with the URL of http://servername/server-status
# Change the ".your_domain.com" to match your domain to enable.
#
#<Location /server-status>
#    SetHandler server-status
#    Order deny,allow
#    Deny from all
#    Allow from .your_domain.com
#</Location>

Uncomment everything from <Location /server-status> to </Location> by removing the # characters from the beginning of each line.

It’s wise to protect your server-status output from prying eyes. The easiest way to do this is to restrict its access to one computer or domain. Change the .your_domain.com to the name of a computer or domain you wish to allow a peek at server-status. For example, if you’re the webmaven for your server, you may want to allow only your computer, mycomputer.mydomain.org, access; your server-status configuration would then look something like:

# Allow server status reports, with the URL of http://servername/server-status
# Change the ".your_domain.com" to match your domain to enable.
#
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from mycomputer.mydomain.org
</Location>

Only one tiny piece left. The default status display isn’t as detailed as what I showed you above. The more abbreviated version looks something like:

PID Key:
   29955 in state: _ ,   29956 in state: _ ,   29957 in state: _
   29958 in state: _ ,   29959 in state: W ,   29978 in state: _

In order to see all the gory details, you need to enable “full” status. Find the following lines and uncomment (remove the initial #) the ExtendedStatus directive; the result should look like:

# ExtendedStatus controls whether Apache will generate "full" status
# information (ExtendedStatus On) or just basic information (ExtendedStatus
# Off) when the "server-status" handler is called. The default is Off.
#
ExtendedStatus On

That’s all there is to mod_status configuration. Save your httpd.conf file, shut down and start Apache.

# /usr/local/apache/sbin/apachectl stop
/usr/local/apache/sbin/apachectl stop: httpd stopped
# /usr/local/apache/sbin/apachectl start
/usr/local/apache/sbin/apachectl start: httpd started
#

Fire up your Web browser on a machine allowed access to your server’s server-status and point it at the URL:

http://servername/server-status

Happy reading! For more information on mod_status and other aspects of Apache we touched on along the way, visit the Resources section below.

Logging the php mail function

As of PHP version 5.3.0 we can use the mail.log directive to log who’s calling the mail() function. When someone calls mail() from a PHP script, we get some info about the sender in our log.

I will enable logging globally. You can choose yourself where to activate it, editing your php.ini for cli, cgi, apache2, fpm…

To enable it globally:

echo "mail.log = /var/log/phpmail.log" | sudo tee /etc/php5/conf.d/mail.ini

phpmail.log is the log filename used in my example. Then create the file:

touch /var/log/phpmail.log

chmod 777 /var/log/phpmail.log

…and restart Apache or the process manager you are using:

/etc/init.d/apache2 restart

or

/etc/init.d/php5-fpm restart

Troubleshooting High I/O Wait in Linux

Linux has many troubleshooting tools available; some are easy to use, others are more advanced.

I/O wait requires some of the more advanced tools, as well as advanced usage of some of the basic ones. It is difficult to troubleshoot because plenty of tools will tell you that your system is I/O bound, but far fewer can narrow the problem down to a specific process or processes.

Answering whether or not I/O is causing system slowness

To identify whether I/O is causing system slowness you can use several commands, but the easiest is the Unix command top.

 # top
 top - 14:31:20 up 35 min, 4 users, load average: 2.25, 1.74, 1.68
 Tasks: 71 total, 1 running, 70 sleeping, 0 stopped, 0 zombie
 Cpu(s): 2.3%us, 1.7%sy, 0.0%ni, 0.0%id, 96.0%wa, 0.0%hi, 0.0%si, 0.0%st
 Mem: 245440k total, 241004k used, 4436k free, 496k buffers
 Swap: 409596k total, 5436k used, 404160k free, 182812k cached

From the Cpu(s) line you can see the current percentage of CPU time spent in I/O wait; the higher the number, the more CPU time is spent idle waiting for I/O to complete.

wa -- iowait
 Amount of time the CPU has been waiting for I/O to complete.
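
The wa figure in top is derived from the kernel's counters in /proc/stat, so the same number can be computed in a script. Below is a rough sketch, assuming the usual field order on the aggregate cpu line (user, nice, system, idle, iowait, irq, softirq, steal); the function name iowait_pct is mine, not a standard tool.

```shell
# Sketch: iowait share of CPU time between two samples of the aggregate
# "cpu" line from /proc/stat.
# Assumed field order: cpu user nice system idle iowait irq softirq steal ...

# iowait_pct LINE1 LINE2: print iowait as a percentage of all CPU time
# that elapsed between the two samples.
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total = 0
          for (i = 2; i <= 9 && i <= NF; i++) total += $i
          iow = $6 }
        NR == 1 { t0 = total; w0 = iow }
        NR == 2 { dt = total - t0
                  if (dt > 0) { printf "%.1f\n", 100 * (iow - w0) / dt }
                  else        { print "0.0" } }'
}

# Usage on a live system (2-second window):
#   a=$(grep '^cpu ' /proc/stat); sleep 2; b=$(grep '^cpu ' /proc/stat)
#   iowait_pct "$a" "$b"
```

Because the counters are cumulative since boot, a single sample is not very useful; the difference between two samples is what shows current behavior.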

Finding which disk is being written to

The above top command shows I/O Wait from the system as a whole but it does not tell you what disk is being affected; for this we will use the iostat command.

 $ iostat -x 2 5
 avg-cpu:   %user   %nice %system %iowait  %steal   %idle
             3.66    0.00   47.64   48.69    0.00    0.00

 Device: rrqm/s wrqm/s    r/s   w/s    rkB/s    wkB/s avgrq-sz avgqu-sz  await r_await w_await svctm  %util
 sda      44.50  39.27 117.28 29.32 11220.94 13126.70   332.17    65.77 462.79    9.80 2274.71  7.60 111.41
 dm-0      0.00   0.00  83.25  9.95 10515.18  4295.29   317.84    57.01 648.54   16.73 5935.79 11.48 107.02
 dm-1      0.00   0.00  57.07 40.84   228.27   163.35     8.00    93.84 979.61   13.94 2329.08 10.93 107.02

The iostat command in the example will print a report every 2 seconds for 5 intervals; the -x tells iostat to print out an extended report.

The first report from iostat prints statistics accumulated since the system was last booted; for this reason, in most circumstances the first report should be ignored. Each subsequent report covers the time since the previous report. For example, our command prints 5 reports: the 2nd contains disk statistics gathered since the 1st, the 3rd covers the time since the 2nd, and so on.

In the above example the %util for sda is 111.41%; this is a good indicator that our problem lies with processes writing to sda. While the test system in my example has only 1 disk, this type of information is extremely helpful when the server has multiple disks, as it can narrow down the search for the process that is generating the I/O.

Aside from %util there is a wealth of information in the output of iostat: items such as read and write requests merged per second (rrqm/s & wrqm/s), reads and writes per second (r/s & w/s) and plenty more. In our example the workload appears to be both read and write heavy; this information will be helpful when trying to identify the offending process.
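
On a server with many disks it can help to filter the iostat report down to the busy devices. The sketch below is one way to do that with awk; the function name busy_devices and the threshold are my own choices, and it assumes the report has a header line starting with Device that contains a %util column, as in the output above.

```shell
# Sketch: list devices whose %util is at or above a threshold.
# Assumes "iostat -x" style input with a "Device" header line that
# contains a %util column; the column position is looked up by name.

# busy_devices MIN: read iostat output on stdin, print "device %util".
busy_devices() {
    awk -v min="$1" '
        $1 ~ /^Device/ { for (i = 1; i <= NF; i++) if ($i == "%util") col = i; next }
        NF == 0        { col = 0; next }   # blank line ends a device table
        col && $col + 0 >= min + 0 { print $1, $col }'
}

# Usage on a live system:
#   iostat -x 2 5 | busy_devices 90
```

Keep in mind that the first report in the stream still covers the time since boot, so a device listed only once at the top may not be busy right now.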

Finding the processes that are causing high I/O

iotop

 # iotop
 Total DISK READ: 8.00 M/s | Total DISK WRITE: 20.36 M/s
  TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
 15758 be/4 root 7.99 M/s 8.01 M/s 0.00 % 61.97 % bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp

The simplest method of finding which process is utilizing storage the most is to use the command iotop. After looking at the statistics it is easy to identify bonnie++ as the process causing the most I/O utilization on this machine.

While iotop is a great command and easy to use, it is not installed by default on all (or even most of the major) Linux distributions, and I personally prefer not to rely on commands that are not installed by default. A systems administrator may find themselves on a system where they simply cannot install non-default packages until a scheduled time, which may be far too late depending on the issue.

If iotop is not available, the steps below will also allow you to narrow down the offending process or processes.

Process list “state”

The ps command has statistics for memory and CPU but it does not have a statistic for disk I/O. While it may not have a statistic for I/O, it does show the process's state, which can be used to indicate whether or not a process is waiting for I/O.

The ps state field provides the process's current state; below is a list of states from the man page.

PROCESS STATE CODES
 D uninterruptible sleep (usually IO)
 R running or runnable (on run queue)
 S interruptible sleep (waiting for an event to complete)
 T stopped, either by a job control signal or because it is being traced.
 W paging (not valid since the 2.6.xx kernel)
 X dead (should never be seen)
 Z defunct ("zombie") process, terminated but not reaped by its parent.

Processes that are waiting for I/O are commonly in an “uninterruptible sleep” state or “D”; given this information we can simply find the processes that are constantly in a wait state.

Example:

 # for x in `seq 1 1 10`; do ps -eo state,pid,cmd | grep "^D"; echo "----"; sleep 5; done
 D 248 [jbd2/dm-0-8]
 D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
 ----
 D 22 [kswapd0]
 D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
 ----
 D 22 [kswapd0]
 D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
 ----
 D 22 [kswapd0]
 D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
 ----
 D 16528 bonnie++ -n 0 -u 0 -r 239 -s 478 -f -b -d /tmp
 ----

The above for loop will print the processes in a “D” state every 5 seconds for 10 intervals.

From the output above, the bonnie++ process with a pid of 16528 is waiting for I/O more often than any other process. At this point bonnie++ seems likely to be causing the I/O wait, but the fact that a process is in an uninterruptible sleep state does not by itself prove that it is the cause.
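
When the samples get long, counting by eye becomes tedious. As a rough sketch, the samples can be piped through a small tally that counts how often each pid showed up in the D state; the function name tally_dstate is mine, and it assumes input in the state,pid,cmd column order used above.

```shell
# Sketch: count how many times each pid was seen in the "D" state
# across repeated "ps -eo state,pid,cmd" samples.
# Output: "count pid command-name", most frequent first.

tally_dstate() {
    awk '$1 ~ /^D/ { count[$2]++; if (!($2 in name)) name[$2] = $3 }
         END { for (p in count) print count[p], p, name[p] }' | sort -rn
}

# Usage on a live system (10 samples, 5 seconds apart):
#   for x in $(seq 1 10); do ps -eo state,pid,cmd; sleep 5; done | tally_dstate
```

A process at the top of this list is only a suspect; short-lived kernel threads such as kswapd0 can appear there too.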

To help confirm our suspicions we can use the /proc file system. Within each process's directory there is a file called “io” which holds the same I/O statistics that iotop uses.

 # cat /proc/16528/io
 rchar: 48752567
 wchar: 549961789
 syscr: 5967
 syscw: 67138
 read_bytes: 49020928
 write_bytes: 549961728
 cancelled_write_bytes: 0

The read_bytes and write_bytes values are the number of bytes that this specific process has read from and written to the storage layer. In this case the bonnie++ process has read about 47 MB and written about 524 MB. While for some processes this may not be a lot, in our example it is enough reading and writing to cause the high I/O wait that this system is seeing.
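
The byte counts are easier to judge once converted. Here is a small sketch that turns the two storage-layer counters into MiB; the helper name io_mib is my own, and it only assumes the read_bytes and write_bytes lines shown above.

```shell
# Sketch: convert read_bytes / write_bytes from a /proc/<pid>/io file
# into MiB (1 MiB = 1048576 bytes).

io_mib() {
    awk '$1 == "read_bytes:"  { r = $2 }
         $1 == "write_bytes:" { w = $2 }
         END { printf "read %.2f MiB, written %.2f MiB\n", r / 1048576, w / 1048576 }'
}

# Usage on a live system (needs root or ownership of the process):
#   io_mib < /proc/16528/io
```

These counters are totals for the life of the process, so running the command twice a few seconds apart and comparing the results gives a better picture of what the process is doing right now.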

Finding what files are being written to heavily

The lsof command will show you all of the files open by a specific process or all processes depending on the options provided. From this list one can make an educated guess as to what files are likely being written to often based on the size of the file and the amounts present in the “io” file within /proc.

To narrow down the output we will use the -p <pid> option to print only the files opened by the specific process id.

 # lsof -p 16528
 COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
 bonnie++ 16528 root cwd DIR 252,0 4096 130597 /tmp
 <truncated>
 bonnie++ 16528 root 8u REG 252,0 501219328 131869 /tmp/Bonnie.16528
 bonnie++ 16528 root 9u REG 252,0 501219328 131869 /tmp/Bonnie.16528
 bonnie++ 16528 root 10u REG 252,0 501219328 131869 /tmp/Bonnie.16528
 bonnie++ 16528 root 11u REG 252,0 501219328 131869 /tmp/Bonnie.16528
 bonnie++ 16528 root 12u REG 252,0 501219328 131869 /tmp/Bonnie.16528
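
For a process with many open files, the listing can be reduced to regular files sorted by size. This is only a sketch under a few assumptions: the nine-column layout shown above, file names without spaces, and a SIZE/OFF column that holds a size rather than an offset. The function name largest_open_files is mine.

```shell
# Sketch: from "lsof -p <pid>" output, print "size name" for regular
# files, largest first, with duplicate descriptors collapsed.
# Assumes columns: COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME

largest_open_files() {
    awk '$5 == "REG" { print $7, $NF }' | sort -rn | uniq
}

# Usage on a live system:
#   lsof -p 16528 | largest_open_files | head
```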

To further confirm that these files are being written to heavily, we can check whether the /tmp filesystem is part of sda.

 # df /tmp
 Filesystem 1K-blocks Used Available Use% Mounted on
 /dev/mapper/workstation-root 7667140 2628608 4653920 37% /

From the output of df we can determine that /tmp is part of the root logical volume in the workstation volume group.

 # pvdisplay
  --- Physical volume ---
  PV Name /dev/sda5
  VG Name workstation
  PV Size 7.76 GiB / not usable 2.00 MiB
  Allocatable yes
  PE Size 4.00 MiB
  Total PE 1986
  Free PE 8
  Allocated PE 1978
  PV UUID CLbABb-GcLB-l5z3-TCj3-IOK3-SQ2p-RDPW5S

Using pvdisplay we can see that /dev/sda5, a partition of the sda disk, is the physical volume that the workstation volume group is using, and in turn is where /tmp exists. Given this information it is safe to say that the large files listed in the lsof output above are likely the files being read from and written to frequently.