Docker Kernel Requirements

1. Introduction

Docker is a containerization technology that provides OS-level virtualization for applications.  It isolates processes, storage, and networking, and also provides security for the services running within its containers.  To enable this, Docker depends on various features of the Linux kernel.  This post introduces these Docker kernel requirements.

2. Docker engine dependencies from the Linux kernel

The dependencies on the Linux kernel can be broadly categorized into 4 classes: resource constraining, security, networking, and storage.  Resource-constraining features allow container creators to place restrictions on container environments, such as limits on memory and CPU usage.  Security features allow security policies to be applied to containers.  Networking features underpin the software-defined networking (SDN) that Docker provides.  Storage features allow Docker to support volumes and various storage backends.

[Figure: Kernel dependencies for Docker]

Let us now examine each of these dependencies in brief.

3. Resource constraining dependencies

3.1 Control groups a.k.a cgroups

Control groups, or cgroups, are a kernel feature that constrains the resource usage of a process or a set of processes.  This provides Docker with 4 main features:

  • Limit resources (CPU, memory, network, disk I/O, …) to user-defined processes.
  • Prioritize resources to processes (a set of processes will get more resources than another set).
  • Measure resource usage for billing purposes.
  • Control a group of processes.

The docker run command accepts options that control the resources allocated to a container.  For instance, docker run --cpu-shares=<value> sets the relative CPU share of a container (every container gets 1024 shares by default), and docker run --cpuset-cpus=<value> sets the CPU cores on which the container is allowed to run.  Do look at this insightful article for some examples of manipulating cgroups settings for Docker containers.
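CPU shares are relative weights rather than absolute limits, so their effect is easiest to see with a little arithmetic. The sketch below is an illustration only: it assumes every container is competing for CPU at the same time, and the container names and the function cpu_fractions are hypothetical.

```python
def cpu_fractions(shares):
    """Map container name -> fraction of CPU time when all containers are busy.

    `shares` maps a (hypothetical) container name to its --cpu-shares value.
    """
    total = sum(shares.values())
    if total <= 0:
        raise ValueError("at least one container needs a positive share")
    return {name: value / total for name, value in shares.items()}


if __name__ == "__main__":
    # One container with the default 1024 shares, two with half of that.
    demo = {"web": 1024, "batch": 512, "cache": 512}
    for name, fraction in cpu_fractions(demo).items():
        print(f"{name}: {fraction:.0%}")
```

When a container is idle, the others may use the CPU time it leaves unused, so these fractions describe the contended case only.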

3.2 Namespaces

Namespaces are a kernel feature that provides lightweight process virtualization to containers.  They let Docker isolate the following resources for a container: process IDs, hostnames, user IDs, network access, IPC, and filesystems.  Docker combines namespaces and cgroups to isolate resources for containers and to constrain their resource usage.  The following namespaces are used to isolate containers: process ID (pid), network (net), mount (mnt), hostname (uts), and shared memory (ipc).

  • A pid namespace provides processes running within a container with their own pids, isolated from other containers.
  • A net namespace creates separate network interfaces, IP addresses, and such for each container.
  • A mnt namespace creates isolated mounts for each container. Mount points from the host OS may be carried into the container, but any additions to the container mounts are not propagated back to the host.
  • A uts namespace creates containers with their own hostnames without affecting other containers or the rest of the system.
  • An ipc namespace creates an isolated shared memory space for each container and prevents access between the shared memory of different containers.
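On Linux, the namespaces a process belongs to are visible as symbolic links under /proc/<pid>/ns, with targets of the form kind:[inode]. Two processes share a namespace when the inode numbers match. The following is a minimal sketch for inspecting them; the function names are hypothetical, and the inode number in the usage note is only an example value.

```python
import os
import re

# Link targets look like "pid:[4026531836]".
NS_LINK = re.compile(r"^(?P<kind>[a-z_]+):\[(?P<inode>\d+)\]$")


def parse_ns_link(target):
    """Split a link target such as 'pid:[4026531836]' into (kind, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group("kind"), int(match.group("inode"))


def list_namespaces(pid="self"):
    """Return {entry name: inode} for a process, or {} if it cannot be read."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    try:
        entries = sorted(os.listdir(ns_dir))
    except OSError:
        return result  # no /proc here, or no permission to look
    for entry in entries:
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        except (OSError, ValueError):
            continue  # unreadable or unexpected entry
        result[entry] = inode
    return result


if __name__ == "__main__":
    for name, inode in list_namespaces().items():
        print(f"{name}: {inode}")
```

Running this on the host and again inside a container should show different inode numbers for the namespaces that the container does not share with the host.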

4. Security dependencies

4.1 AppArmor

AppArmor is a Mandatory Access Control (MAC) tool that restricts programs to a limited set of resources.  Restriction policies are written in a simple text file and govern the storage, networking, and capabilities available to a program.  A policy can run in enforcement or complain mode.  A policy running in enforcement mode enforces its restrictions and reports violations.  A policy running in complain mode does not enforce restrictions but only reports violations.

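Docker installs a default AppArmor profile, /etc/apparmor.d/docker, during installation.  (Recent Docker releases instead generate the default profile, named docker-default, at run time and load it into the kernel.)  This profile is applied to Docker containers by default.  To apply a specific AppArmor profile to a container, pass the --security-opt option before the image name: docker run -it --security-opt apparmor=<profile-name> <image-name>.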

Read this page for more details about Docker’s usage of AppArmor.

4.2 Security Enhanced Linux a.k.a SELinux

SELinux, like AppArmor, enforces MAC policies across the subsystems of the Linux kernel.  Compared to AppArmor, SELinux offers more elaborate policy controls, including multi-level security.  Red Hat is one of its principal developers and maintainers.

4.3 POSIX capabilities a.k.a Capabilities

Capabilities as implemented in Linux (known as “POSIX capabilities”) partition the root user’s privileges into distinct smaller units called “capabilities”.  Each capability can be enabled or disabled independently and is assigned to individual threads.  This allows a thread or process to perform a privileged operation with a minimal set of capabilities, without assuming full superuser permissions.  See man capabilities on any Linux system for more details.  Docker uses capabilities to restrict what a container may do while still providing the features that the service within it needs.  As a result, a root user within a Docker container may not have all the privileges of a root user on the host OS.
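The kernel reports a process's capability sets as hexadecimal bitmasks (for example the CapEff line in /proc/<pid>/status), where bit N stands for capability number N. The sketch below decodes such a mask into names. It covers only capabilities 0 to 31 and ignores any higher bits; the function name decode_caps is hypothetical, and the mask in the usage example is an illustrative value of the kind often reported for root inside a container started with default settings, not a guaranteed one.

```python
# Index in this list == capability number from <linux/capability.h> (0..31).
CAP_NAMES = [
    "CAP_CHOWN", "CAP_DAC_OVERRIDE", "CAP_DAC_READ_SEARCH", "CAP_FOWNER",
    "CAP_FSETID", "CAP_KILL", "CAP_SETGID", "CAP_SETUID",
    "CAP_SETPCAP", "CAP_LINUX_IMMUTABLE", "CAP_NET_BIND_SERVICE",
    "CAP_NET_BROADCAST", "CAP_NET_ADMIN", "CAP_NET_RAW", "CAP_IPC_LOCK",
    "CAP_IPC_OWNER", "CAP_SYS_MODULE", "CAP_SYS_RAWIO", "CAP_SYS_CHROOT",
    "CAP_SYS_PTRACE", "CAP_SYS_PACCT", "CAP_SYS_ADMIN", "CAP_SYS_BOOT",
    "CAP_SYS_NICE", "CAP_SYS_RESOURCE", "CAP_SYS_TIME", "CAP_SYS_TTY_CONFIG",
    "CAP_MKNOD", "CAP_LEASE", "CAP_AUDIT_WRITE", "CAP_AUDIT_CONTROL",
    "CAP_SETFCAP",
]


def decode_caps(mask_hex):
    """Return the names of capabilities 0..31 that are set in a hex bitmask."""
    mask = int(mask_hex, 16)
    return [name for bit, name in enumerate(CAP_NAMES) if mask & (1 << bit)]


if __name__ == "__main__":
    # Illustrative mask only; read the real one from /proc/<pid>/status.
    for name in decode_caps("00000000a80425fb"):
        print(name)
```

In this example mask, CAP_SYS_ADMIN and CAP_NET_ADMIN are absent, which is the kind of restriction that makes root in a container weaker than root on the host.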

Read this post for more explanation on Docker’s support of capabilities.

4.4 Secure Computing Mode a.k.a seccomp

Secure Computing Mode, also called seccomp, provides a facility to filter the system calls available to a process.  It is combined with other tools to build a sandbox that exposes only the system calls a thread needs.  In the original strict mode, a thread can perform only 4 system calls: read(), write(), sigreturn(), and exit().  The kernel kills the thread if it uses any other system call.  In filter mode, which Docker uses, a profile lists the system calls that are allowed or denied.

A seccomp profile is applied to a Docker container with the --security-opt option of docker run, placed before the image name, like so:

$ docker run -it --security-opt seccomp=<profile-file> <image-name>

Read this doc for more information about Docker’s use of seccomp.
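A Docker seccomp profile is a JSON document with a default action and a list of per-syscall rules. As a rough sketch of that shape, the code below builds a deny-by-default profile and evaluates it for a syscall name. The helper names build_profile and action_for are hypothetical, argument-based rules are left out, and a profile this small is far too strict for a real container; it is only meant to show the structure.

```python
import json


def build_profile(allowed_syscalls):
    """Build a deny-by-default profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "syscalls": [
            {"names": sorted(allowed_syscalls), "action": "SCMP_ACT_ALLOW"},
        ],
    }


def action_for(profile, syscall):
    """Return the action this simplified profile assigns to a syscall name."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


if __name__ == "__main__":
    profile = build_profile({"read", "write", "exit", "sigreturn"})
    print(json.dumps(profile, indent=2))
    print("mount ->", action_for(profile, "mount"))
```

In practice it is safer to start from the default profile that ships with Docker and remove entries than to build an allow list from scratch.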

5. Networking dependencies

5.1 Netfilter

Netfilter is a framework provided by the Linux kernel that allows network packets flowing through the machine to be manipulated.  Its features include stateless and stateful packet filtering of IPv4 and IPv6 packets, network address translation, port address translation, and extensible APIs for third-party developers.  Docker uses Netfilter through its userspace counterpart, iptables.

5.2 IPTables

iptables is the user-space utility counterpart of netfilter.  It interacts with netfilter and allows a system administrator to define tables of firewalling rules for packet filtering, network address translation (NAT), and so on.  The Docker daemon automatically adds firewalling rules to iptables if it finds it installed on the system.  For example, when we expose a container’s port to the outside world, Docker adds a corresponding rule to iptables.  To stop Docker from managing iptables rules, start the Docker daemon with the iptables option set to false, like so:

$ dockerd --iptables=false

See this blog post for a few examples of how Docker uses iptables.
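To make the port-exposure case concrete, the sketch below assembles the kind of destination NAT rule that forwards a host port to a container. It is an approximation for illustration, not Docker's actual code: the chain name DOCKER, the bridge name docker0, the addresses, and the function dnat_rule are all assumptions, and the exact rules Docker writes vary by version and network configuration.

```python
def dnat_rule(host_port, container_ip, container_port,
              proto="tcp", bridge="docker0", chain="DOCKER"):
    """Return iptables arguments for a DNAT rule forwarding a published port.

    Traffic that does not arrive on the bridge itself and targets `host_port`
    is rewritten to go to `container_ip:container_port`.
    """
    return [
        "-t", "nat", "-A", chain,
        "!", "-i", bridge,
        "-p", proto, "--dport", str(host_port),
        "-j", "DNAT",
        "--to-destination", f"{container_ip}:{container_port}",
    ]


if __name__ == "__main__":
    # Roughly what publishing host port 8080 to container port 80 implies.
    print("iptables " + " ".join(dnat_rule(8080, "172.17.0.2", 80)))
```

The rules actually present on a host can be listed with iptables -t nat -S and compared against this shape.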

5.3 Netlink

Netlink provides a mechanism for communication between kernel and userspace components using a socket interface.  Userspace components can also use it to communicate with one another.  It is an alternative to ioctl and reduces dependence on direct system calls, ioctl calls, and such.  Docker implements its own netlink libraries to talk to the kernel’s netlink interface to create and configure network devices.

This excellent post has more details about how Docker uses Netlink.
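A netlink message is a small binary header followed by a payload, sent over a socket. The sketch below packs a request that asks the kernel to list network interfaces. It is written in Python purely for illustration and is not taken from Docker's own netlink code; the function name build_getlink_request is hypothetical, while the constants follow the Linux rtnetlink headers.

```python
import struct

RTM_GETLINK = 18        # rtnetlink message type: get link (interface) info
NLM_F_REQUEST = 0x001   # this message is a request
NLM_F_DUMP = 0x300      # return every matching object (ROOT | MATCH)
AF_UNSPEC = 0

# struct nlmsghdr: length, type, flags, sequence number, sender port id
NLMSGHDR = struct.Struct("=IHHII")
# struct ifinfomsg: family, padding byte, device type, index, flags, change
IFINFOMSG = struct.Struct("=BxHiII")


def build_getlink_request(seq=1):
    """Return the bytes of an RTM_GETLINK dump request."""
    payload = IFINFOMSG.pack(AF_UNSPEC, 0, 0, 0, 0)
    header = NLMSGHDR.pack(
        NLMSGHDR.size + len(payload),   # total message length
        RTM_GETLINK,
        NLM_F_REQUEST | NLM_F_DUMP,
        seq,
        0,                              # port id 0: let the kernel assign it
    )
    return header + payload


if __name__ == "__main__":
    print(build_getlink_request().hex())
```

On Linux, these bytes could be sent over a socket created with socket.socket(socket.AF_NETLINK, socket.SOCK_RAW, socket.NETLINK_ROUTE); parsing the reply is omitted here.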

6. File system dependencies

Docker supports several storage drivers through a pluggable architecture.  One can choose the storage driver that the Docker daemon runs with.  However, Docker Engine supports only one active storage driver at a time, and changing the storage driver requires restarting the Docker daemon.

6.1 Device mapper

The devicemapper framework is provided by the kernel to map physical block devices onto higher-level virtual block devices.  It provides the foundation for features such as logical volume management, device encryption, and copy-on-write files.  Docker uses this framework to support copy-on-write files in containers.
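The copy-on-write idea can be shown with a toy model: several writable layers share one read-only base and keep a private copy only of the blocks they change. This is a conceptual sketch, not a description of how devicemapper is implemented; the class CowLayer and its methods are hypothetical.

```python
class CowLayer:
    """A writable layer over a shared, read-only sequence of blocks."""

    def __init__(self, base_blocks):
        self._base = base_blocks   # shared with other layers, never modified
        self._changed = {}         # block index -> this layer's private copy

    def read(self, index):
        """Read a block, preferring this layer's private copy if one exists."""
        return self._changed.get(index, self._base[index])

    def write(self, index, data):
        """Store a private copy of the block; the base stays untouched."""
        if not 0 <= index < len(self._base):
            raise IndexError("block index out of range")
        self._changed[index] = data

    def private_blocks(self):
        """Number of blocks this layer has had to copy so far."""
        return len(self._changed)


if __name__ == "__main__":
    image = (b"kernel", b"libs", b"app")   # stands in for a read-only image
    first, second = CowLayer(image), CowLayer(image)
    first.write(2, b"patched app")
    print(first.read(2), second.read(2), first.private_blocks())
```

Each layer consumes extra space only for the blocks it modifies, which is why many containers can start from the same image cheaply.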

7. Other non-kernel dependencies

7.1 LibContainer a.k.a RunC

libcontainer (now part of opencontainers/runc) is not exactly a kernel feature.  Docker developed it as an execution engine that exposes a consistent, standardized Go API for working with Linux namespaces, cgroups, capabilities, AppArmor, security profiles, network interfaces, and firewalling rules.  RunC has replaced LXC as the default execution driver of the Docker Engine.

7.2 LXC

Like libcontainer, LXC provides a userspace interface for the Linux Kernel’s container supporting features.  LXC was the initial execution engine before Docker moved to RunC.

8. Summary

In this post we were introduced to the key kernel features on which Docker depends and builds to enable containerization.  Each of these kernel features can be studied in more depth to improve container security in Docker.  The Docker docs also contain more information about kernel-level enablers.

Hariharan Narayanan

Hari graduated from the School of Computer and Information Sciences in the University of Hyderabad. Over his career he has been involved in many complex projects in mobile applications, enterprise applications, distributed applications, micro-services, and other platforms and frameworks. He works as a consultant and is mainly involved with projects based on Java, C++ and Big Data technologies.