
The container market is evolving rapidly.

Over the years we have seen new technologies transform data centers and the cloud, and it is fair to argue that none has had faster adoption or greater impact than Linux containers. Today, we cannot talk about modern data center or cloud technologies without including Linux containers.

To understand their power, we need to understand the problems they solve and the technologies behind them.

Traditional virtualization

In traditional virtualization, there is a virtualization host running an operating system and a hypervisor, which abstracts the physical hardware or uses various techniques to mediate access to it.

Each virtual machine (VM), or guest, has a complete operating system installation. This means each VM has to go through a full boot process to start, even if all VMs are identical or are linked to a master template. The boot process delays the start of the services running in that VM, and during node failures or scheduled reboots many VMs boot at once, stressing the underlying infrastructure with these boot storms.


Another disadvantage of VMs is that each hypervisor maintains its own VM format and its own mechanisms to abstract or provide access to the underlying hardware. This makes it difficult for a VM or virtual appliance to be portable across multiple hypervisors. As a result, OEMs providing virtual appliances have to maintain multiple versions of the appliance, increasing their development and maintenance costs.

Containers

One of the most impressive attributes of containers is that they have no boot time and we can run them almost anywhere without modification. Modern container technologies are agnostic to the underlying infrastructure. So how is this possible? What sorcery is this!?

To properly understand containers, we have to understand some capabilities that exist in any modern Linux kernel and have since been adopted by other operating systems as well.

Modern Linux kernel attributes

Back in 2002, Linux kernel 2.4.19 introduced the concept of namespaces; control groups (cgroups) followed a few years later in kernel 2.6.24. These sometimes overlooked kernel capabilities are responsible for some of the greatest advances in modern data center and cloud technologies: advances like tenant-aware networking (think VMware NSX, Cisco ACI, OVS, OpenStack networking, etc.) and the advanced isolation of VMs in hypervisors and of containers. All of these are possible because of the namespaces and cgroups features. So, let's dig in.

There are six types of namespaces:

  1. PID namespace provides isolation for the allocation of process identifiers (PIDs), lists of processes and their details. While the new namespace is isolated from other siblings, processes in its "parent" namespace still see all processes in child namespaces—albeit with different PID numbers.
  2. Network namespace isolates the network interface controllers (physical or virtual), iptables firewall rules, routing tables etc. Network namespaces can be connected with each other using the "veth" virtual Ethernet device.
  3. "UTS" namespace allows changing the hostname.
  4. Mount namespace allows creating a different file system layout, or making certain mount points read-only.
  5. IPC namespace isolates the System V inter-process communication between namespaces.
  6. User namespace isolates the user IDs between namespaces.
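
To make this concrete, here is a minimal sketch in Go (any language that can set the clone flags would do) that starts a shell inside new UTS, PID and mount namespaces. It assumes a Linux host, root privileges and that /bin/sh is the shell we want; none of these choices come from a particular container runtime.

```go
package main

import (
	"os"
	"os/exec"
	"syscall"
)

func main() {
	// Start /bin/sh in its own UTS, PID and mount namespaces.
	// Requires a Linux host and root privileges.
	cmd := exec.Command("/bin/sh")
	cmd.SysProcAttr = &syscall.SysProcAttr{
		Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
	}
	// Wire the shell to our terminal so we can interact with it.
	cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
	if err := cmd.Run(); err != nil {
		panic(err)
	}
}
```

Inside that shell, changing the hostname is invisible to the rest of the host, and echo $$ prints 1, because the shell is the first process in its brand-new PID namespace.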

I like to describe namespaces as slices or parallel universes inside the Linux kernel. I'll explain this with some examples.

Let's take network namespaces as an example. To a networking person, I can say that a network namespace is equivalent to a super VRF, and it will click right away. For everyone else, or to explain the other namespaces, it is far easier to describe them as parallel universes.

Imagine you have a room with a table. This is our node. In one universe, or namespace, there might be a simple vase with flowers on the table and a single person walking around the room; in another universe, there might be food on the table and a group of friends having dinner. None of these universes or namespaces knows about the others, but all of them have full access to the room and can use the table as they wish. This is what Linux namespaces in the kernel do for containers: they create those parallel universes, each with full access to the node's capabilities.

What if we want to restrict the use of the table in one of those universes? That is where control groups come into the picture. Control groups (cgroups) are used for limiting, prioritizing, accounting for and controlling resources (e.g. memory, CPU, I/O, checkpointing) for a collection of processes. Staying with the analogy, this means we can restrict one of those universes to using only part of the table in the room, not the whole table, even when the whole table is available.
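
As a rough illustration of how a runtime applies such limits, the sketch below creates a cgroup, caps its memory at 64 MiB and moves the current process into it. It assumes cgroup v2 is mounted at /sys/fs/cgroup, that we run as root, and the group name "demo" is arbitrary.

```go
package main

import (
	"os"
	"path/filepath"
	"strconv"
)

func main() {
	// Create a cgroup and cap its memory at 64 MiB (cgroup v2 interface).
	cg := "/sys/fs/cgroup/demo"
	if err := os.MkdirAll(cg, 0o755); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(cg, "memory.max"), []byte("67108864"), 0o644); err != nil {
		panic(err)
	}
	// Move the current process (and any children it spawns) into the group;
	// from now on its memory usage is accounted against the 64 MiB cap.
	pid := strconv.Itoa(os.Getpid())
	if err := os.WriteFile(filepath.Join(cg, "cgroup.procs"), []byte(pid), 0o644); err != nil {
		panic(err)
	}
}
```

In the table analogy, this is the rule that says one particular universe only gets a corner of the table.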

Control groups (cgroups) example

You may have noticed that cgroups act on a collection of processes. If we run the pstree command on a Linux or Mac machine, we will see something similar to a tree structure with branches.

In this hypothetical example, line 6 shows process ID 44 running with various child processes. This is exactly what happens with a container: from the kernel's perspective, it is a single process that groups multiple child processes. The namespaces and control groups are applied to this top-level process and affect everything underneath it.
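
We can build a rough version of that view ourselves. The sketch below walks /proc and prints each process with its parent PID, which is enough to see how a container's top-level process groups its children. It assumes a Linux /proc layout and, for simplicity, that command names contain no spaces.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	// Walk /proc and print pid, command name and parent pid for every process.
	entries, err := os.ReadDir("/proc")
	if err != nil {
		panic(err)
	}
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a process directory
		}
		data, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
		if err != nil {
			continue // the process may have exited already
		}
		// /proc/<pid>/stat fields: pid, (comm), state, ppid, ...
		fields := strings.Fields(string(data))
		fmt.Printf("pid=%-6d comm=%-20s ppid=%s\n", pid, fields[1], fields[3])
	}
}
```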

Inside a container, since it exists in its own universe, we only see its running child processes. Back to my earlier analogy: the container only sees the vase with the flowers, not the food and the dinner party from the other universe.

Container example view

So, we can say: Containers = Group of Linux Processes + Namespaces + cgroups

These are the building blocks of containers, but there are many container formats on the market. Docker and rkt containers tend to be the best known.

Docker containers

Docker defines an open format for container images and provides an execution environment for them.

If we want to demystify the Docker container format: an image is a .tar file containing a series of directory structures with the libraries required by the application, plus a description of the overlay filesystem layers that make up that particular Docker image.

Docker containers attributes

Every layer but the topmost one is immutable, meaning the libraries or configurations in those layers cannot be modified without rebuilding the image. With this portable image format, applications are guaranteed to have exactly the library versions and the exact configuration they need to run, without having to worry about the node's operating system or having to modify anything on it. It also means that we can instantiate as many copies of a container image in as many places as we need and get identical results in each and every one of them.
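
For a feel of what is inside such an archive, here is a small sketch that lists the entries of an image exported with docker save; the file name image.tar is just an assumption. You would typically see a manifest.json, a config JSON and one tarball per layer.

```go
package main

import (
	"archive/tar"
	"fmt"
	"io"
	"os"
)

func main() {
	// List the contents of an image archive, e.g. one produced by:
	//   docker save alpine -o image.tar
	f, err := os.Open("image.tar")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	tr := tar.NewReader(f)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			break // end of archive
		}
		if err != nil {
			panic(err)
		}
		// Typical entries: manifest.json, <id>.json (config) and the layer tarballs.
		fmt.Println(hdr.Name)
	}
}
```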

The Open Container Initiative (OCI) maintains the OCI Runtime Specification and the OCI Image Specification. The Runtime Specification outlines how to run a "filesystem bundle" that is unpacked on disk. The Image Specification defines a format for encoding a container as a "filesystem bundle". This is, in part, why we can run unmodified Docker images directly on Linux, Mac and Windows.

Docker engine is one of the implementations of the OCI Runtime Specification that supports the execution of containers following the OCI Image Specification.

CRI-O is a lightweight container runtime for Kubernetes supporting the execution of containers following the OCI Image Specification.

Next in containers

The container market is evolving rapidly. The adoption of microservices architectures and the need to operate containers at scale have driven the need for flexible container orchestration frameworks and techniques. Development cycles for these solutions are short, as they follow agile methodologies, producing noticeable improvements and new features in a matter of weeks.

Let's stay informed and let's keep learning about these new and exciting technologies. Comment below to keep the conversation going!