Docker Kernel Requirements
1. Introduction
Docker is a containerization technology that provides OS-level virtualization to applications. It isolates processes, storage, and networking, and also provides security to services running within its containers. To enable this, Docker depends on various features of the Linux kernel. This post introduces these Docker kernel requirements.
2. Docker engine dependencies on the Linux kernel
The dependencies on the Linux kernel can be broadly categorized into 4 classes: resource constraining, security, networking, and storage. Resource-constraining features allow container creators to place restrictions on container environments, such as limits on memory and CPU usage. Security features allow security policies to be applied to containers. Networking features underpin the software-defined networking (SDN) that Docker provides. Storage features allow Docker to support volumes and various storage backends.
Let us now examine each of these dependencies in brief.
3. Resource constraining dependencies
3.1 Control groups a.k.a cgroups
Control groups, or cgroups, are a kernel feature that constrains the resource usage of a process or a set of processes. This provides Docker with 4 main features:
- Limit the resources (CPU, memory, network, disk I/O, …) available to user-defined processes.
- Prioritize resources among processes (one set of processes can get more resources than another set).
- Measure resource usage for billing purposes.
- Control a group of processes.
The docker run command is used to control the resources allocated to a container. For instance, docker run --cpu-shares=<value> sets the relative CPU share allocated to a container (every container gets 1024 shares by default), and docker run --cpuset-cpus=<value> sets the CPU cores on which the container is allowed to run. Do look at this insightful article for some examples of manipulating cgroups settings for Docker containers.
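To make the share arithmetic concrete, here is a minimal sketch. It assumes CPU shares behave as relative weights that only matter when containers compete for the same CPU, which is how the cgroups CPU controller is generally described. The helper name cpu_share_percent is hypothetical and not part of Docker.

```shell
# cpu_share_percent MINE OTHERS
# Prints the approximate CPU percentage a container receives under full
# contention, given its own share value and the summed shares of the
# other busy containers. Hypothetical helper; integer math rounds down.
cpu_share_percent() {
    mine=$1
    others=$2
    echo $(( mine * 100 / (mine + others) ))
}

# One container started with --cpu-shares=512 competing with one
# container left at the default of 1024 shares.
cpu_share_percent 512 1024
```

When the other containers are idle, the container can use more than this figure, because shares do not impose a hard cap.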
3.2 Namespaces
Namespaces are a kernel feature that provides lightweight process virtualization to containers. They let Docker isolate the following resources for a container: process IDs, hostnames, user IDs, network access, IPC, and filesystems. Docker combines namespaces and cgroups to isolate resources for containers and to place constraints on resource usage. The namespaces described here are Process ID (pid), Network (net), Mount (mnt), Hostname (uts), and Shared Memory (ipc).
- A pid namespace provides processes running within containers with separate pids isolated from other containers.
- A net namespace creates separate network interfaces, IP addresses and such for each container.
- A mnt namespace creates isolated mounts for each container. Mount points from the host OS may be carried into the container, but any additions to the container mounts are not propagated back to the host.
- A uts namespace creates containers with their own hostnames without affecting other containers or the rest of the system.
- An ipc namespace creates an isolated shared memory space for each container and prevents access between the shared memory of different containers.
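The namespaces a process belongs to can be inspected through the /proc filesystem. The sketch below assumes a Linux system that exposes /proc/<pid>/ns; the helper name ns_id is hypothetical.

```shell
# ns_id NAME [PID]
# Prints the identifier of the namespace of type NAME that process PID
# (default: the calling process) belongs to, in the form "NAME:[number]".
ns_id() {
    readlink "/proc/${2:-self}/ns/$1"
}

# List the five namespaces discussed above for the current process.
# The guard skips the loop on systems without /proc/self/ns.
if [ -d /proc/self/ns ]; then
    for name in pid net mnt uts ipc; do
        ns_id "$name"
    done
fi
```

Two processes share a namespace exactly when the identifiers printed for them match, so running this on the host and again inside a container should normally show different values.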
4. Security dependencies
4.1 AppArmor
AppArmor is a Mandatory Access Control (MAC) tool that restricts programs to a limited set of resources. Restriction policies are written in a simple text file and govern a program’s access to storage, networking, and capabilities. A policy can run in enforcement or complain mode. In enforcement mode, AppArmor enforces the policy and reports violations. In complain mode, it does not enforce restrictions and only reports violations.
Docker installs a default AppArmor profile, /etc/apparmor.d/docker, during installation. This profile is applied to all Docker containers by default. To apply a specific AppArmor profile to a container, pass the --security-opt option before the image name: docker run -it --security-opt apparmor=<profile-name> <image-name>.
Read this page for more details about Docker’s usage of AppArmor.
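To check which mode applies to a running process, the kernel exposes the confinement label in /proc/<pid>/attr/current. The sketch below assumes AppArmor's usual label format of a profile name followed by the mode in parentheses, or the single word unconfined. The helper name apparmor_mode is hypothetical.

```shell
# apparmor_mode LABEL
# Maps a confinement label, as read from /proc/<pid>/attr/current, to the
# mode it indicates. Labels in any other format are reported as unknown.
apparmor_mode() {
    case $1 in
        *"(enforce)")  echo enforce ;;
        *"(complain)") echo complain ;;
        unconfined)    echo unconfined ;;
        *)             echo unknown ;;
    esac
}

# Inspect the current process when the kernel exposes a label.
label=$(cat /proc/self/attr/current 2>/dev/null || true)
if [ -n "$label" ]; then
    echo "label: $label, mode: $(apparmor_mode "$label")"
fi
```

On systems that use a different security module, such as SELinux, the same file holds a label in another format and the helper prints unknown.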
4.2 Security Enhanced Linux a.k.a SELinux
SELinux, like AppArmor, enforces MAC policies on the subsystems of the Linux kernel. Compared to AppArmor, SELinux offers more elaborate policy control, including multi-level security. SELinux was originally developed by the NSA, and Red Hat is among its main maintainers today.
4.3 POSIX capabilities a.k.a Capabilities
Capabilities as implemented in Linux (known as “POSIX capabilities”) partition the root user’s privileges into distinct smaller units called “capabilities”. Each capability can be enabled or disabled independently and is assigned to individual threads. This allows a thread or process to perform a privileged operation with a minimal set of capabilities, without assuming full superuser permissions. See man capabilities on any Linux system for more details. Docker uses capabilities to restrict what a container can do while still providing the features that the service within it needs. As a result, a root user within a Docker container may not have all the privileges of the root user on the host OS.
Read this post for more explanation on Docker’s support of capabilities.
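The kernel reports a process's effective capabilities as a hexadecimal bitmask in the CapEff field of /proc/<pid>/status, with one bit per capability. The sketch below tests individual bits of such a mask. The helper names has_cap and report_cap are hypothetical, and the mask is an assumed example value of the kind one might see for root inside an unprivileged container. The bit numbers come from <linux/capability.h>.

```shell
# has_cap MASK BIT
# Succeeds when bit number BIT is set in the hexadecimal capability
# mask MASK (given without a leading 0x).
has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# report_cap MASK NAME BIT
# Prints whether the named capability is present in MASK.
report_cap() {
    if has_cap "$1" "$3"; then
        echo "$2: granted"
    else
        echo "$2: dropped"
    fi
}

# Assumed example mask, not read from a real container.
mask=00000000a80425fb

report_cap "$mask" CAP_CHOWN 0        # bit 0 is set in this mask
report_cap "$mask" CAP_SYS_ADMIN 21   # bit 21 is clear in this mask
```

With this mask the process may change file ownership but lacks CAP_SYS_ADMIN, which illustrates how root inside a container can hold fewer privileges than root on the host.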
4.4 Secure Computing Mode a.k.a seccomp
Secure Computing Mode, also called seccomp, provides a facility to filter the system calls available to a user-defined process. It is combined with other tools to provide a secure computing sandbox. In its original strict mode, a thread can perform only 4 system calls: read(), write(), sigreturn(), and exit(). The kernel will kill the process if it uses any other system call. In filter mode, which is what Docker’s seccomp profiles use, a profile specifies which system calls are allowed or denied.
A seccomp profile is applied to a Docker container with the --security-opt option of docker run, placed before the image name, like so:
$ docker run -it --security-opt seccomp=<path-to-profile> <image-name>
Read this doc for more information about Docker’s use of seccomp.
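The kernel reports a process's seccomp state in the Seccomp field of /proc/<pid>/status, where 0 means disabled, 1 means strict mode, and 2 means filter mode. The sketch below translates that number into a name for the current shell. The helper name seccomp_mode_name is hypothetical.

```shell
# seccomp_mode_name N
# Translates the numeric Seccomp field of /proc/<pid>/status into a name.
seccomp_mode_name() {
    case $1 in
        0) echo disabled ;;
        1) echo strict ;;
        2) echo filter ;;
        *) echo unknown ;;
    esac
}

# Report the mode of the current shell, if the kernel exposes the field.
if [ -r /proc/self/status ]; then
    while read -r key value; do
        if [ "$key" = "Seccomp:" ]; then
            echo "seccomp mode: $(seccomp_mode_name "$value")"
        fi
    done < /proc/self/status
fi
```

Run inside a container started with a seccomp profile, this would be expected to report filter mode. On a host shell it commonly reports disabled.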
5. Networking dependencies
5.1 Netfilter
Netfilter is a framework provided by the Linux kernel that allows network packets flowing through the machine to be manipulated. Its features include stateless and stateful packet filtering of IPv4 and IPv6 packets, network address translation, port address translation, and extensible APIs for third-party developers. Docker uses Netfilter through its userspace counterpart, iptables.
5.2 IPTables
iptables is the user-space utility counterpart of netfilter. It interacts with netfilter and allows a system administrator to define tables of firewall rules for packet filtering, network address translation (NAT), and so on. The Docker daemon automatically adds firewall rules to iptables if it finds iptables installed on the system. For example, when we publish a container’s port to the outside world, Docker adds a corresponding rule to iptables. To prevent Docker from modifying iptables rules, start the Docker daemon with the iptables option set to false, like so:
$ dockerd --iptables=false
See this blog post for a few examples of how Docker uses iptables.
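As a rough illustration of the kind of rule involved, the sketch below prints (without executing) an iptables command resembling a NAT rule that forwards a published TCP port to a container. It is an approximation under stated assumptions: DOCKER and docker0 are Docker's default chain and bridge names, the container address is made up, and the exact rules Docker writes vary by version and network configuration. The helper name dnat_rule is hypothetical.

```shell
# dnat_rule HOST_PORT CONTAINER_IP CONTAINER_PORT
# Prints an iptables command that would redirect TCP traffic arriving on
# HOST_PORT (from anywhere except the docker0 bridge) to the container.
# Nothing is executed, so no root privileges are needed.
dnat_rule() {
    echo "iptables -t nat -A DOCKER ! -i docker0 -p tcp --dport $1 -j DNAT --to-destination $2:$3"
}

# Roughly what publishing with "docker run -p 8080:80" would need for a
# container whose assumed bridge address is 172.17.0.2.
dnat_rule 8080 172.17.0.2 80
```

On a host running Docker, the rules actually installed can be compared against this by listing the nat table with iptables -t nat -S.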
5.3 Netlink
Netlink provides a mechanism for communication between kernel and userspace components using a socket interface. Userspace components can also use it to communicate with one another. It is an alternative to ioctl and reduces the need for dedicated system calls and ioctl calls. Docker implements its own netlink library to talk to the kernel’s netlink interface to create and configure network devices.
This excellent post has more details about how Docker uses Netlink.
6. File system dependencies
Docker supports several storage drivers through a plug-in architecture, and one can choose which storage driver the Docker daemon runs with. However, Docker Engine supports only one active storage driver at a time, and changing the storage driver requires restarting the Docker daemon.
6.1 Device mapper
The devicemapper framework is provided by the kernel to map physical block devices onto virtual block devices. It provides the foundation for features such as logical volume management, device encryption, and copy-on-write snapshots. Docker uses this framework to support copy-on-write storage in containers.
7. Other non-kernel dependencies
7.1 LibContainer a.k.a RunC
libcontainer (now called opencontainers/RunC) is not exactly a kernel feature. Docker developed it as an execution engine that exposes a consistent, standardized Go API for working with Linux namespaces, cgroups, capabilities, AppArmor, security profiles, network interfaces, and firewall rules. RunC has replaced LXC as the default execution driver of the Docker Engine.
7.2 LXC
Like libcontainer, LXC provides a userspace interface to the Linux kernel’s container-supporting features. LXC was the initial execution engine before Docker moved to RunC.
8. Summary
This post introduced the key kernel features on which Docker depends and builds to enable containerization. Each of these kernel features can be studied further on its own to improve container security in Docker. The Docker docs also contain more information about kernel-level enablers.