Kata Containers: When Containers and Virtual Machines Make a Baby
In March this year, we celebrated the 10th anniversary of Docker, a technology that revolutionized the way we build and deploy applications. Adoption was explosive—so explosive that it’s hard to find a developer today who isn’t using containerization technology in some capacity.
The next evolution of containerization technology, Kata Containers, is still relatively unknown. To fully grasp the significance of its emergence, it’s essential to take a journey back to the roots of virtualization technology.
Xen Hypervisor & Hardware-Assisted Virtualization
In simple terms, virtualization allows a single physical computer to be divided into multiple virtual computers. This magic is achieved through a piece of software known as the hypervisor.
While the concept of virtualization dates back to the late 1960s, significant advancements were made in the mid-2000s. In 2005 and 2006, Intel and AMD introduced hardware-assisted virtualization for the x86 architecture with their VT-x and AMD-V technologies, enabling virtual machines (VMs) to run with minimal performance overhead. Around the same time, the Xen hypervisor incorporated support for VT-x and AMD-V, setting the stage for the rapid growth of public cloud platforms like AWS, GCP, Azure, and OCI.
As businesses increasingly adopted cloud services, the need for heightened security became paramount, especially when different clients shared the same physical hardware for their VMs. Hypervisors leverage hardware capabilities to ensure distinct separation between VMs. This means that even if one client’s VM experiences a crash or failure, others remain unaffected.
Going a step further, AMD introduced Secure Encrypted Virtualization (SEV) technology in 2017, enhancing VM isolation. With SEV, not only is data protected from other VMs, but it also introduces a “zero-trust” approach towards the hypervisor itself, ensuring that even the hypervisor cannot access encrypted VM data. This offers an added layer of protection in a shared-resource environment.
The Emergence of Containers
Virtual machines streamlined application deployment by abstracting the underlying hardware. However, they introduced new challenges. For instance, there was still the responsibility of maintaining the operating system (OS) on which the application ran. This involved configuring, updating, and patching security vulnerabilities. Furthermore, installing and configuring all the application’s dependencies remained a tedious task.
Containerization emerged as a solution to these challenges. Containers package the application together with its environment, dependencies, and configurations, ensuring consistency across deployments.
Recognizing the need to simplify OS maintenance further, AWS launched Fargate in 2017. With Fargate, developers can run containers on the cloud without the overhead of OS management. As the popularity of containerized applications surged, orchestrating these containers at scale became a challenge. This was effectively addressed by technologies like Kubernetes, which automate the deployment, scaling, and management of containerized applications.
Building on the Foundations
Having understood the intricate evolution of hardware capabilities over four decades, we appreciate how instrumental these advancements were in enhancing both performance and security for virtual machines. These developments have made it feasible to run multiple virtual machines on a single physical host without significant performance overhead while also maintaining strong isolation between them.
However, when it comes to containers, a different approach is taken. Unlike virtual machines, which rely heavily on these hardware-based virtualization features, containers don’t create whole separate virtualized hardware environments. Instead, they function within the same OS kernel and rely on built-in features of that kernel for their isolation.
At the heart of container isolation is a mechanism called namespaces. Introduced in the Linux kernel, namespaces effectively provide isolated views of system resources to processes. There are several types of namespaces in Linux, each responsible for isolating a particular set of system resources. For example:
- PID namespaces ensure that processes in different containers have separate process ID spaces, preventing them from seeing or signaling processes in other containers.
- Network namespaces give each container its own network stack, ensuring they can have their own private IP addresses and port numbers.
- Mount namespaces allow containers to have their own distinct set of mounted file systems.
And so on, for user, UTS, cgroup, and IPC namespaces.
The beauty of namespaces is their ability to provide a lightweight, efficient, and rapid isolation mechanism. This makes it possible for containers to start almost instantaneously and use minimal overhead, all while operating in isolated environments.
However, it’s essential to understand that while namespaces provide a degree of isolation, they don’t offer the same robust boundary that a virtual machine does with its separate kernel and often hardware-assisted barriers.
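You can see these namespaces directly from a shell: every process's namespace membership is exposed under /proc. A quick illustration:

```shell
# Each namespace a process belongs to appears as a symlink under /proc/<pid>/ns.
ls -l /proc/self/ns
# The link target encodes the namespace type and an inode number,
# e.g. "pid:[4026531836]"; two processes in the same namespace share the inode.
readlink /proc/self/ns/pid
# (With root, `unshare --pid --fork --mount-proc bash` starts a shell in a
# fresh PID namespace, where `ps` shows only the new process tree.)
```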
Kata Containers is Born
Building on the foundation of virtual machines and containers, Kata Containers emerged as a solution that seamlessly fuses the strengths of both worlds.
Traditional containers, with their reliance on kernel namespaces, bring unparalleled agility and efficiency. They can be spun up in fractions of a second and have minimal overhead, making them perfect for dynamic, scalable environments. On the other hand, virtual machines, backed by decades of hardware innovation, offer a more robust isolation boundary, giving a heightened sense of security, especially in multi-tenant environments.
Kata Containers seeks to bridge the gap between these two paradigms. At its core, Kata Containers provides a container runtime that integrates with popular container platforms like Kubernetes. But instead of relying solely on kernel namespaces for isolation, Kata Containers launches each container inside its own lightweight virtual machine.
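In Kubernetes, for example, opting a workload into the Kata runtime is typically done through a RuntimeClass. A minimal sketch, assuming the node's container runtime is configured with a handler named kata:

```yaml
# Register the Kata runtime (the handler name must match the node's runtime config).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata
handler: kata
---
# A pod that opts into VM-backed isolation.
apiVersion: v1
kind: Pod
metadata:
  name: isolated-app
spec:
  runtimeClassName: kata
  containers:
    - name: app
      image: nginx
```

Pods that omit runtimeClassName keep using the default runtime, so conventional and VM-isolated containers can coexist in the same cluster.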
The diagram below illustrates the differences between VMs, Kata Containers, and conventional containers.
Introduction
Today, we’re addressing a common challenge for home lab users: how to make the most of our server with multiple cores and significant memory. Whether you’re running a Plex server, Kubernetes clusters, or VMs for development, a frequent issue is that these VMs are often hidden behind NAT, limiting their accessibility from other devices on your network.
To solve this, we will use two open source projects:
- QEMU/KVM: KVM is the Linux kernel’s hypervisor, and QEMU is the machine emulator and virtual machine monitor built on top of it; together they let us create and manage VMs.
- Open vSwitch (OVS): A software-based switch that enables more complex network configurations in virtualized environments than a traditional Linux bridge.
Understanding the Problem
The main issue is that VMs operating behind NAT are not directly accessible from other machines on your home network, which can restrict your ability to interact with these VMs from other devices.
Implementing the Solution
Here’s how to configure your VMs to be accessible within your network using Open vSwitch and QEMU/KVM.
Step 1: Setting Up Open vSwitch
- Install Open vSwitch:
sudo apt-get install openvswitch-switch
- Create a Virtual Switch:
sudo ovs-vsctl add-br vm_net
- Verify the Bridge:
sudo ovs-vsctl show
Ensure your newly created virtual bridge vm_net is listed in the output.
Step 2: Configuring Network Interface
We now link our network interface to the virtual bridge to allow VMs to communicate with the home network.
- Add Network Interface to the Bridge:
sudo ovs-vsctl add-port vm_net eth0
Replace eth0 with the correct identifier for your network interface.
- Check Configuration:
sudo ovs-vsctl list-ports vm_net
Make sure the network interface is listed as a port on the bridge. Note that once eth0 is attached to the bridge, the host’s own IP configuration typically needs to move to the vm_net interface, or the host may lose connectivity.
Step 3: Adjusting VM Network Settings
We need to ensure that the VMs utilize the Open vSwitch bridge for network communication.
- Update VM Network Config:
Adjust your VM’s network configuration to connect through the vm_net bridge:
<interface type="bridge">
  <source bridge="vm_net"/>
  <virtualport type="openvswitch"/>
  <model type="e1000e"/>
</interface>
- Restart the VM (replace <vm-name> with your VM’s name):
sudo virsh shutdown <vm-name>
sudo virsh start <vm-name>
Verifying the Setup
After these configurations, your VM should receive an IP address from your home DHCP server. Check the VM’s network details and try to ping the VM from another device in your network to ensure connectivity.
Conclusion
By integrating QEMU/KVM with Open vSwitch, you’ve overcome the NAT limitations, making your VMs fully accessible within your network. This configuration not only simplifies network management but also enhances the usability of your home lab.
If you prefer to consume this post as a video, I’ve got you covered:
A year ago, I was exploring a few Kubernetes CNI plugins when I stumbled upon the Cilium project. Cilium uses eBPF and XDP for network traffic control, security, and visibility.
eBPF (extended Berkeley Packet Filter) allows you to attach your own code on the fly to almost any function within the Linux kernel. XDP (eXpress Data Path), on the other hand, enables manipulation of network traffic before it even reaches the Linux kernel’s network stack. Essentially, eBPF and XDP let you dynamically add logic to network traffic control while bypassing the kernel network stack, potentially giving you better performance.
Although I initially considered utilizing these technologies to accelerate Kubernetes workloads using a DPU (a type of smart NIC), I eventually scrapped the XDP offload idea and went in a different direction. Still, the technology has been stuck in my head ever since.
Fast forward to today, I decided to spend a weekend building a functional example that uses most of the basic building blocks of eBPF and XDP.
What does the code do?
- User-space configures IP addresses to which the ping command should be blocked; this configuration can be adjusted on the fly.
- User-space gets notified once ICMP traffic hits the NIC.
How?
- Utilize libbpf to abstract away much of the repetitive eBPF boilerplate code, simplifying the process of writing, loading, and managing the eBPF program.
- Establish communication between the user-space code and the eBPF program.
- Utilize an eBPF ring buffer for communication, where the XDP program is the initiator.
- Use an eBPF hash map that allows user-space code to dynamically define which IPs should be blocked.
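As a rough sketch (the names below are illustrative, not the post's actual code), the two maps and the XDP hook described above would be declared in the eBPF program using libbpf conventions roughly like this:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

// Hash map: user space inserts the IPv4 addresses (network byte order)
// for which ICMP should be dropped; entries can be changed at runtime.
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __uint(max_entries, 1024);
    __type(key, __u32);   // IPv4 address
    __type(value, __u8);  // presence flag
} blocked_ips SEC(".maps");

// Ring buffer: the XDP program is the producer, user space the consumer.
struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 16); // must be a power-of-two multiple of the page size
} events SEC(".maps");

SEC("xdp")
int xdp_icmp_filter(struct xdp_md *ctx)
{
    // The real program would parse the Ethernet/IP headers from ctx->data,
    // check the address against blocked_ips with bpf_map_lookup_elem(),
    // notify user space via the ring buffer, and return XDP_DROP or XDP_PASS.
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
```

Compiled with clang (-target bpf), this object is then loaded and attached from user space via libbpf.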
Let’s break down the main parts of the eBPF code.