#+title: OS-Level Isolation
#+date: 2026-04-27 Mon
#+author: W. Kosior
#+email: wkosior@agh.edu.pl

* Hook-Based Isolation
- ptrace
  - User Mode Linux (UML)
    - UML as a process (under Linux)
    - UML processes as ptrace'd host processes
    - mounting
  - non-isolation use-case: proot (mount + binfmt emulation)

* Hook-Based Isolation, Cont.
- ptrace
  - User Mode Linux (UML)
    - UML as a process (under Linux)
    - UML processes as ptrace'd host processes
    - mounting
  - non-isolation use-case: proot (mount + binfmt emulation)
- library call hooks
  - Sandboxie (Windows)
    - untrusted process with zero-privilege user token
    - calls routed through Sandboxie
    - leverages many non-public Windows APIs
  - non-isolation use-case: fakeroot
    - *useful for security!* (support badly-written programs)

* Hook-Based Isolation — Summary
- userspace re-implementation of OS interfaces
  - easy to use / develop / test / deploy
- ptrace = significant overhead
  - more context switches
- sporadically useful

* =seccomp()=
- 2005 (Linux 2.6.12)
  - restrict process
    - only use =read()=, =write()=, =exit()= and =sigreturn()=
    - KILL non-complying process
- 2012 (Linux 3.5)
  - new =seccomp()= mode
  - filter process' syscalls
    - Berkeley Packet Filter
  - initially: Chromium extension sandboxes
  - container restrictions

* Containers Overview
- lightweight isolation
  - same kernel
  - (to some extent) separate userspace
- =chroot= idea extended
- most popular tooling: Docker + Linux kernel
- years earlier under other OS kernels
- Linux incarnation: namespaces

* From =chroot= to Containers
- 1979 — chroot
  - different view of =/= directory…
  - …otherwise, the same filesystem view
  - same processes view
  - same IPC view
  - same network view
  - …

* From =chroot= to Containers, Cont.
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
  - thick jails
    - traditional jail type
    - separate view of =/=, users, processes, etc.
    - optionally isolated network (VNET jail)
  - thin jails
    - can reuse a template for =/=
    - optionally isolated network (VNET jail)
  - service jails
    - separate view of processes
      - only jail processes visible inside
    - networking blocked or shared
    - shared view of =/=, users, etc.
  - optional Linux syscall emulation

* From =chroot= to Containers, Cont.
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2000 — Virtuozzo
  - patched Linux kernel
  - separate systems, one kernel
    - subdirs in common filesystem
    - chosen network interfaces or devices exposed
    - most of the rest isolated (distinct process trees, no shared IPC, etc.)
  - resource limits
  - 2005 — "OpenVZ", free/libre license

* From =chroot= to Containers, Cont…
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2004/2005 — Solaris Zones
  - also OpenSolaris, Illumos, SmartOS
  - separate systems, one kernel
    - global zone (GZ) and non-global zones (NGZ)
    - shared or isolated networking
    - subdirs in common filesystem
    - most of the rest isolated (distinct process trees, no shared IPC, etc.)
  - optional resource caps
    - reportedly more effective than in OpenVZ
  - optional Linux syscall emulation

* From =chroot= to Containers, Cont…
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2004/2005 — Solaris Zones
- 2002-2014 — Linux namespaces & LXC
  - 2002 — first ns code (2.4.19 kernel)
  - 2013 — usable for security-oriented isolation (3.8 kernel)
  - 2014 — Linux Containers (LXC) 1.0
  - separate view of networking, user IDs, mounts, IPC, etc.
    - either all of them at once
    - or a subset
  - *cross-ns interaction*

* Linux Namespace Types
- mount
- UTS
- IPC
- PID
- network
- *user*
- time
- cgroup

* Namespaces — =/proc= Links
#+begin_example
$ ls /proc/self/ns/
cgroup  net               time               uts
ipc     pid               time_for_children
mnt     pid_for_children  user
#+end_example

* Namespaces — Operation
- =/proc/$PID/ns/…= ← ns links
  - bind-mountable elsewhere

* Namespaces — Operation, Cont.
- =/proc/$PID/ns/…= ← ns links
  - bind-mountable elsewhere
- =clone()= → create process in new ns
- =unshare()= → move current process to new ns
  - *exceptions*
- =setns()= → change process' ns
  - *exceptions*
  - ns link fd

* Namespaces — Operation, Cont…
- =/proc/$PID/ns/…= ← ns files
  - bind-mountable elsewhere
- =clone()= → create process in new ns
- =setns()= → change process' ns
  - *exceptions*
- =unshare()= → move current process to new ns
  - *exceptions*
- =CAP_SYS_ADMIN= to clone
  - *exception*
  - =CAP_SYS_ADMIN= *in target ns*
- last process death → ns death
  - unless: ns file bind-mounted
  - unless: ns file opened

* Mount Namespace
- 2002 (2.4.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- =mount -t tmpfs tmpfs /foo=
  - effect seen by other processes
- unshare mount ns, then =mount -t tmpfs tmpfs /foo= →
  - effect *not* seen by other processes
- sub-mounts propagation?
  - default off
- unmounting in ns?
  - filesystems mounted inside this ns?
    - OK
  - filesystems mounted *in ancestor ns*?
    - *nope*
    - hence the trick: =mount --bind /dev/null /etc/shadow=, then unshare
- =/proc/mounts= → *current ns*

* UTS Namespace
- 2006 (2.6.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- =sethostname()=, =getdomainname()=, etc.

* IPC Namespace
- 2006 (2.6.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- sysvipc (numbers as keys)
  - =msgget()=
  - =semget()=
  - =shmget()=
  - =/proc/sys/kernel/msgmax=, etc.
  - =/proc/sysvipc=
- POSIX message queues (strings as names)
  - =mq_open()=
  - =/proc/sys/fs/mqueue=
- =/proc= files → *current ns*

* PID Namespace
- 2008 (2.6.24 kernel)
- =CAP_SYS_ADMIN= to unshare
- =setns()= / =unshare()= → ns for *children*
  - =/proc/self/ns/pid_for_children= vs =/proc/self/ns/pid=
  - *existing process cannot move*
- first ns process → PID 1 (init)
  - systemd *unsuitable*
  - SysV Init, runit, etc.
  - or… zombie apocalypse 🧟
- init death → all ns processes death
  - ns /existing/ but unusable
- PID translation
- =/proc= dirs → *mounter's ns*

* Network Namespace
- 2008/2009 (2.6.24 - 2.6.29 kernels)
- =CAP_SYS_ADMIN= to unshare
- =/proc/sys/net= files → *current ns*
- use-case: entire network simulation
  - mininet
  - veth pairs / bridges

* User Namespace
- 2007/2013 (2.6.23 - 3.8 kernels)
  - since 3.8: *unprivileged creation*
- separate view of
  - *users*
  - *capabilities*
  - keyrings
- unprivileged user creates ns → *root + all caps inside ns*
  - treated as unprivileged on the outside
  - can unshare other namespaces
- =/etc/subuid=, =/etc/subgid=
- =/proc/$PID/uid_map=, =/proc/$PID/gid_map=
- root outside == root inside?
  - practiced in the past (Docker, etc.)
  - unacceptable today

* Ubuntu & Debian Sysctl Patch (2013)
#+begin_example
add sysctl to disallow unprivileged CLONE_NEWUSER by default

This is a short-term patch.  Unprivileged use of CLONE_NEWUSER
is certainly an intended feature of user namespaces.  However
for at least saucy we want to make sure that, if any security
issues are found, we have a fail-safe.

Signed-off-by: Serge Hallyn
#+end_example

* Kees Cook Sysctl Patch (2016)
#+begin_quote
There continues to be unexpected side-effects and security exposures
via CLONE_NEWUSER. For many end-users running distro kernels with
CONFIG_USER_NS enabled, there is no way to disable this feature when
desired. As such, this creates a sysctl to restrict CLONE_NEWUSER so
admins not running containers or Chrome can avoid the risks of this
feature.
#+end_quote

* Cgroup Namespace
- 2016 (4.6 kernel)
- =/proc/$PID/cgroup=
  - =/proc= mounter's view of PID number
  - *current ns view of cgroup*
    - can read '..' in cgroup path
- =/sys/fs/cgroup= → *mounter's ns*
  - prevents re-configuration of cgroups by guest

* Time Namespace
- 2020 (5.6 kernel)
- =CAP_SYS_ADMIN= to unshare
- *unavailable in =clone()=*
- separate view of
  - =CLOCK_MONOTONIC=
  - =CLOCK_BOOTTIME=
  - *both unsettable*
- =/proc/$PID/timens_offsets=
  - writeable before first process creation
- use-case: process migration

* Namespaces CLI Tools
- =unshare= command
  - =unshare --pid --fork readlink /proc/self= → 9610
  - =unshare --pid --fork --mount-proc readlink /proc/self= → 1
  - =unshare --user --map-current-user --net --keep-caps=
    - =unshare -cn --keep-caps= ← shorter
    - spawn shell in new network ns
    - play with network interfaces without root!
    - privileges *not needed*

* Namespaces CLI Tools, Cont.
- =unshare= command
  - =unshare --pid --fork readlink /proc/self= → 9610
  - =unshare --pid --fork --mount-proc readlink /proc/self= → 1
  - =unshare --user --map-current-user --net --keep-caps=
    - =unshare -cn --keep-caps= ← shorter
    - spawn shell in new network ns
    - play with network interfaces without root!
    - privileges *not needed*
- =nsenter= command
  - =nsenter --all --target=$PID_OF_MY_DOCKER_PROCESS sh=

* Bubblewrap Tool
#+begin_src shell-script
bwrap \
    --unshare-pid \
    --proc /proc --dev /dev \
    --ro-bind /usr /usr --ro-bind /etc /etc \
    --tmpfs /tmp --tmpfs "$HOME" \
    --chdir "$HOME" \
    --unshare-all \
    --clearenv --setenv PATH "$PATH" \
    bash
#+end_src
- Flatpak
- security sandboxes
  - no predefined policies
  - watch out for sockets

* Container Runtimes And Creation / Management Tools
- LXC (Linux Containers)
  - an early execution environment implementation
  - Docker default until 2014
- runC
  - Docker default runtime
- Podman
  - Docker API-compatible, daemonless
  - uses runC as well
- Singularity
- systemd-nspawn
- Nix containers
- Guix containers

* Docker Under Non-Linux Kernels
- Windows → Linux VM
- macOS → Linux VM
- FreeBSD
  - Linux VM
  - Linuxulator (compatibility layer)
- Illumos
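
* Example: Comparing Namespace Links
A minimal sketch, not part of the original material, of how the =/proc/$PID/ns/…= links can be compared from a shell. Each link target reads as =type:[inode]=, and two processes share a namespace when their targets match. The helper names =ns_inode= and =same_ns= are made up for illustration.

#+begin_src shell-script
# Extract the inode number from a namespace link target
# such as "pid:[4026531836]"; prints nothing on malformed input.
ns_inode() {
    # $1: link target text, e.g. output of: readlink /proc/self/ns/pid
    printf '%s\n' "$1" | sed -n 's/^[a-z_]*:\[\([0-9][0-9]*\)\]$/\1/p'
}

# Succeed when two processes share a namespace of the given type.
# Returns 2 when a link cannot be read (no such PID, no permission).
same_ns() {
    # $1: ns type (e.g. net), $2 and $3: PIDs
    a=$(readlink "/proc/$2/ns/$1") || return 2
    b=$(readlink "/proc/$3/ns/$1") || return 2
    [ "$a" = "$b" ]
}

ns_inode 'pid:[4026531836]'    # prints 4026531836
#+end_src

With these in place, =same_ns net $$ 1= would report whether the current shell and PID 1 share a network namespace. Reading the links of another user's process may require privileges.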
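
* Example: Reading =uid_map=
The =/proc/$PID/uid_map= file mentioned on the User Namespace slide holds lines of three numbers: the first ID inside the ns, the first ID as seen from outside, and the length of the range. The sketch below translates an inside UID to the outside one. The function name =map_uid= and the sample mapping are assumptions for illustration, not taken from a real system.

#+begin_src shell-script
# Translate a UID seen inside a user ns to the outside UID.
# $1: inside UID; stdin: uid_map-style lines "inside outside length".
# Exits non-zero when the UID falls in no mapped range.
map_uid() {
    awk -v id="$1" '
        $1 <= id && id < $1 + $3 { print $2 + (id - $1); found = 1; exit }
        END { if (!found) exit 1 }
    '
}

# Hypothetical mapping: inside root is outside UID 1000,
# inside 1..65536 are outside 100000..165535.
printf '0 1000 1\n1 100000 65536\n' | map_uid 5    # prints 100004
#+end_src

On a live system the same function could be fed the output of =cat /proc/self/uid_map=.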