#+title: OS-Level Isolation
#+date: 2026-04-27 Mon
#+author: W. Kosior
#+email: wkosior@agh.edu.pl

* Hook-Based Isolation
- ptrace
  - User Mode Linux (UML)
    - UML as a process (under Linux)
    - UML processes as ptrace'd host processes
    - mounting
  - non-isolation use-case: proot (mount + binfmt emulation)

* Hook-Based Isolation, Cont.
- ptrace
  - User Mode Linux (UML)
    - UML as a process (under Linux)
    - UML processes as ptrace'd host processes
    - mounting
  - non-isolation use-case: proot (mount + binfmt emulation)
- library call hooks
  - Sandboxie (Windows)
    - untrusted process with zero-privilege user token
    - calls routed through Sandboxie
    - leverages many non-public Windows APIs
  - non-isolation use-case: fakeroot
    - *useful for security!* (support badly-written programs)
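
The fakeroot use-case can be tried directly from a shell. This is a minimal sketch, assuming the =fakeroot= command is installed (otherwise the result is reported as skipped); the variable names are illustrative only.

#+begin_src shell-script
  # Real UID of the calling user, reported without any hooks in place.
  real_uid=$(id -u)

  if command -v fakeroot >/dev/null 2>&1; then
      # Under fakeroot, library calls such as getuid() are intercepted,
      # so the wrapped program believes that it runs as root (UID 0).
      seen_uid=$(fakeroot id -u 2>/dev/null) || seen_uid=failed
  else
      seen_uid=skipped    # fakeroot is not installed
  fi

  echo "real UID: $real_uid, UID seen under fakeroot: $seen_uid"
#+end_src

No privilege is gained here: only the answers returned to the wrapped program change.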

* Hook-Based Isolation — Summary
- userspace re-implementation of OS interfaces
  - easy to use / develop / test / deploy
- ptrace = significant overhead
  - more context switches
- sporadically useful

* =seccomp()=
- 2005 (Linux 2.6.12)
  - restrict process
    - only use =read()=, =write()=, =exit()= and =sigreturn()=
    - KILL non-complying process
- 2012 (Linux 3.5)
  - new seccomp mode: filter (=SECCOMP_MODE_FILTER=)
  - filter process' syscalls
    - Berkeley Packet Filter
- initially: Chromium extension sandboxes
- container restrictions
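
The seccomp mode a process currently runs under can be read without privileges. This is a minimal sketch, assuming a kernel that exposes the =Seccomp:= field in =/proc/$PID/status=; the helper name =seccomp_mode= is hypothetical.

#+begin_src shell-script
  # Hypothetical helper: print the seccomp mode of a process (default: this shell).
  # 0 = disabled, 1 = strict mode, 2 = filter (BPF) mode.
  seccomp_mode() {
      awk '/^Seccomp:/ { print $2 }' "/proc/${1:-$$}/status"
  }

  mode=$(seccomp_mode 2>/dev/null || true)
  case "$mode" in
      0) echo "seccomp disabled" ;;
      1) echo "strict mode" ;;
      2) echo "filter mode" ;;
      *) echo "seccomp state unknown" ;;
  esac
#+end_src

Inside a typical container the shell is likely to report filter mode, because container runtimes commonly install a BPF filter.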

* Containers Overview
- lightweight isolation
  - same kernel
  - (to some extent) separate userspace
- =chroot= idea extended
- most popular tooling: Docker + Linux kernel
- years earlier under other OS kernels
- Linux incarnation: namespaces

* From =chroot= to Containers
- 1979 — chroot
  - different view of =/= directory…
    - …otherwise, the same filesystem view
  - same processes view
  - same IPC view
  - same network view
  - …

* From =chroot= to Containers, Cont.
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
  - thick jails
    - traditional jail type
    - separate view of =/=, users, processes, etc.
    - optionally isolated network (VNET jail)
  - thin jails
    - can reuse a template for =/=
    - optionally isolated network (VNET jail)
  - service jails
    - separate view of processes
      - only jail processes visible inside
    - networking blocked or shared
    - shared view of =/=, users, etc.
  - optional Linux syscall emulation

* From =chroot= to Containers, Cont.
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2000 — Virtuozzo
  - patched Linux kernel
  - separate systems, one kernel
    - subdirs in common filesystem
    - chosen network interfaces or devices exposed
    - most of the rest isolated (distinct process trees, no shared IPC, etc.)
  - resource limits
  - 2005 — "OpenVZ", free/libre license

* From =chroot= to Containers, Cont…
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2004/2005 — Solaris Zones
  - also OpenSolaris, Illumos, SmartOS
  - separate systems, one kernel
    - global zone (GZ) and non-global zones (NGZ)
    - shared or isolated networking
    - subdirs in common filesystem
    - most of the rest isolated (distinct process trees, no shared IPC, etc.)
  - optional resource caps
    - reportedly more effective than in OpenVZ
  - optional Linux syscall emulation

* From =chroot= to Containers, Cont…
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2004/2005 — Solaris Zones
- 2002-2014 — Linux namespaces & LXC
  - 2002 — first ns code (2.4.19 kernel)
  - 2013 — usable for security-oriented isolation (3.8 kernel)
  - 2014 — Linux Containers (LXC) 1.0
  - separate view of
    - networking, user IDs, mounts, IPC, etc.
    - either all of them at once or only a subset
  - *cross-ns interaction*

* Linux Namespace Types
- mount
- UTS
- IPC
- PID
- network
- *user*
- time
- cgroup

* Namespaces — =/proc= Links
#+begin_example
$ ls /proc/self/ns/
cgroup             net                time               uts
ipc                pid                time_for_children
mnt                pid_for_children   user
#+end_example
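
Each of these links resolves to a string such as =pid:[…]=, where the number identifies the namespace. Two processes share a namespace exactly when their links resolve to the same string. A minimal sketch of that comparison follows; the helper names =ns_id= and =same_ns= are hypothetical.

#+begin_src shell-script
  # Hypothetical helper: print the identifier of one namespace of a process.
  ns_id() {    # usage: ns_id PID TYPE
      readlink "/proc/$1/ns/$2"
  }

  # Hypothetical helper: succeed if two processes share the namespace of TYPE.
  same_ns() {  # usage: same_ns PID1 PID2 TYPE
      [ "$(ns_id "$1" "$3")" = "$(ns_id "$2" "$3")" ]
  }

  # "self" is resolved by the readlink child, which inherits the shell's ns.
  if same_ns $$ self net; then
      echo "this shell and its child share a network namespace"
  fi
#+end_src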

* Namespaces — Operation
- =/proc/$PID/ns/…= ← ns links
  - bind-mountable elsewhere

* Namespaces — Operation, Cont.
- =/proc/$PID/ns/…= ← ns links
  - bind-mountable elsewhere
- =clone()= → create process in new ns
- =unshare()= → move current process to new ns
  - *exceptions*
- =setns()= → change process' ns
  - *exceptions*
  - ns link fd

* Namespaces — Operation, Cont…
- =/proc/$PID/ns/…= ← ns files
  - bind-mountable elsewhere
- =clone()= → create process in new ns
- =setns()= → change process' ns
  - *exceptions*
- =unshare()= → move current process to new ns
  - *exceptions*
- =CAP_SYS_ADMIN= to clone
  - *exception*
- =CAP_SYS_ADMIN= *in target ns*
- last process death → ns death
  - unless: ns file bind-mounted
  - unless: ns file opened
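
The effect of =unshare()= can be observed by comparing namespace links before and after the call. This is a minimal sketch, assuming the util-linux =unshare= command and a system that permits unprivileged user namespaces (otherwise it reports skipped); =can_unshare= is a hypothetical helper.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  # UTS namespace link as seen by an ordinary child of this shell.
  outer=$(readlink /proc/self/ns/uts)

  if can_unshare; then
      # unshare(1) calls unshare() and then executes readlink in the new ns.
      inner=$(unshare --user --map-root-user --uts readlink /proc/self/ns/uts)
      if [ "$inner" != "$outer" ]; then
          verdict="new namespace"
      else
          verdict="same namespace"
      fi
  else
      verdict=skipped
  fi
  echo "$verdict"
#+end_src

The user namespace is unshared together with the UTS one, which is what provides the required =CAP_SYS_ADMIN= to an unprivileged caller.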

* Mount Namespace
- 2002 (2.4.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- =mount -t tmpfs tmpfs /foo=
  - effect seen by other processes
- unshare mount ns, then =mount -t tmpfs tmpfs /foo= →
  - effect *not* seen by other processes
- sub-mounts propagation?
  - default off
- unmounting in ns?
  - filesystems mounted inside this ns?
    - OK
  - filesystems mounted *in ancestor ns*?
    - *nope*
    - hence the trick: =mount --bind /dev/null /etc/shadow=, then unshare
- =/proc/mounts= → *current ns*

* UTS Namespace
- 2006 (2.6.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- =sethostname()=, =getdomainname()=, etc.
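
A host name set inside a new UTS namespace does not leak out. This is a minimal sketch, assuming util-linux =unshare=, the =hostname= command and working unprivileged user namespaces; the name =sandbox= is arbitrary.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  before=$(uname -n)

  if can_unshare; then
      # Set and read back the host name inside a fresh UTS namespace.
      inner=$(unshare --user --map-root-user --uts sh -c \
          'hostname sandbox && uname -n' 2>/dev/null) || inner=failed
  else
      inner=skipped
  fi

  after=$(uname -n)
  echo "outside: $before, inside: $inner, outside afterwards: $after"
#+end_src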

* IPC Namespace
- 2006 (2.6.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- sysvipc (numbers as keys)
  - =msgget()=
  - =semget()=
  - =shmget()=
  - =/proc/sys/kernel/msgmax=, etc.
  - =/proc/sysvipc=
- POSIX message queues (strings as names)
  - =mq_open()=
  - =/proc/sys/fs/mqueue=
- =/proc= files → *current ns*
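
The per-namespace view of =/proc/sysvipc= can be observed by creating a message queue inside a new IPC namespace. This is a minimal sketch, assuming util-linux =unshare= and =ipcmk= plus working unprivileged user namespaces; the helper names are hypothetical.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  # Lines in the SysV message queue table: one header plus one per queue.
  count_lines() { cat /proc/sysvipc/msg 2>/dev/null | wc -l; }

  before=$(count_lines)

  if can_unshare && command -v ipcmk >/dev/null 2>&1; then
      # Create one queue in a fresh IPC namespace and count lines there.
      inner=$(unshare --user --map-root-user --ipc sh -c \
          'ipcmk -Q >/dev/null && wc -l < /proc/sysvipc/msg' 2>/dev/null) \
          || inner=failed
  else
      inner=skipped
  fi

  after=$(count_lines)
  echo "lines outside: $before -> $after, lines inside: $inner"
#+end_src

A fresh IPC namespace starts empty, so the inner count should be 2 (header and the new queue), while the outer table stays unchanged unless another process creates a queue concurrently.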

* PID Namespace
- 2008 (2.6.24 kernel)
- =CAP_SYS_ADMIN= to unshare
- =setns()= / =unshare()= → ns for *children*
  - =/proc/self/ns/pid_for_children= vs =/proc/self/ns/pid=
  - *existing process cannot move*
- first ns process → PID 1 (init)
  - systemd *unsuitable*
  - SysV Init, runit, etc.
  - or… zombie apocalypse 🧟
  - init death → all ns processes death
    - ns /existing/ but unusable
- PID translation
- =/proc= dirs → *mounter's ns*
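
PID translation and the =/proc= rule can be seen side by side. This is a minimal sketch, assuming util-linux =unshare= and working unprivileged user namespaces; the inner shell prints its own PID and the PID that the pre-existing =/proc= mount reports for one of its children.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  if can_unshare; then
      # $$ is expanded by the inner shell, the first process of the new ns.
      # /proc was mounted in the outer ns, so it reports outer PID numbers.
      out=$(unshare --user --map-root-user --pid --fork sh -c \
          'echo "$$ $(readlink /proc/self)"' 2>/dev/null) || out="failed failed"
  else
      out="skipped skipped"
  fi

  pid_in_ns=${out%% *}       # PID as seen inside the new namespace
  pid_via_proc=${out##* }    # PID as reported through the outer /proc
  echo "inside: $pid_in_ns, via outer /proc: $pid_via_proc"
#+end_src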

* Network Namespace
- 2008/2009 (2.6.24 - 2.6.29 kernels)
- =CAP_SYS_ADMIN= to unshare
- =/proc/sys/net= files → *current ns*
- use-case: entire network simulation
  - mininet
  - veth pairs / bridges
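
A new network namespace starts without the host's interfaces. This is a minimal sketch, assuming util-linux =unshare= and working unprivileged user namespaces; it lists interface names from =/proc/net/dev=, which reflects the reader's own namespace.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  # awk program: skip the two header lines, print each interface name.
  prog='NR > 2 { sub(/:.*/, "", $1); print $1 }'

  outer=$(awk "$prog" /proc/net/dev 2>/dev/null || true)

  if can_unshare; then
      inner=$(unshare --user --map-root-user --net \
          awk "$prog" /proc/net/dev 2>/dev/null) || inner=failed
  else
      inner=skipped
  fi

  echo "interfaces outside:" $outer
  echo "interfaces inside:" $inner
#+end_src

The inner list typically contains only =lo=; some kernels with tunnel modules loaded may add fallback tunnel devices.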

* User Namespace
- 2007/2013 (2.6.23 - 3.8 kernels)
- since 3.8: *unprivileged creation*
- separate view of
  - *users*
  - *capabilities*
  - keyrings
- unprivileged user creates ns → *root + all caps inside ns*
  - treated as unprivileged on the outside
  - can unshare other namespaces
- =/etc/subuid=, =/etc/subgid=
- =/proc/$PID/uid_map=, =/proc/$PID/gid_map=
- root outside == root inside?
  - practiced in the past (Docker, etc.)
  - unacceptable today
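
The "root inside, unprivileged outside" behaviour is visible in =uid_map=. This is a minimal sketch, assuming util-linux =unshare= and working unprivileged user namespaces; the variable names are illustrative.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  outer_uid=$(id -u)

  if can_unshare; then
      # UID reported inside the new user namespace.
      inner_uid=$(unshare --user --map-root-user id -u)
      # uid_map columns: ID inside the ns, ID outside the ns, range length.
      mapped_from=$(unshare --user --map-root-user \
          awk '{ print $2 }' /proc/self/uid_map)
  else
      inner_uid=skipped; mapped_from=skipped
  fi

  echo "UID outside: $outer_uid, inside: $inner_uid, mapped from: $mapped_from"
#+end_src

With =--map-root-user= the caller's own UID is mapped to 0, so files created inside still belong to the unprivileged user on the outside.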

* Ubuntu & Debian Sysctl Patch (2013)
#+begin_example
add sysctl to disallow unprivileged CLONE_NEWUSER by default
This is a short-term patch. Unprivileged use of CLONE_NEWUSER
is certainly an intended feature of user namespaces. However
for at least saucy we want to make sure that, if any security
issues are found, we have a fail-safe.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
#+end_example

* Kees Cook Sysctl Patch (2016)
#+begin_quote
There continues to be unexpected side-effects and security exposures via
CLONE_NEWUSER. For many end-users running distro kernels with CONFIG_USER_NS
enabled, there is no way to disable this feature when desired. As such, this
creates a sysctl to restrict CLONE_NEWUSER so admins not running containers or
Chrome can avoid the risks of this feature.
#+end_quote

* Cgroup Namespace
- 2016 (4.6 kernel)
- =/proc/$PID/cgroup=
  - =/proc= mounter's view of PID number
  - *current ns view of cgroup*
    - can read '..' in cgroup path
- =/sys/fs/cgroup= → *mounter's ns*
- prevents re-configuration of cgroups by guest
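
The namespaced view of =/proc/$PID/cgroup= can be compared with the outer one. This is a minimal sketch, assuming a util-linux =unshare= that supports =--cgroup= and working unprivileged user namespaces.

#+begin_src shell-script
  # Hypothetical helper: succeed if unprivileged user namespaces work here.
  can_unshare() { unshare --user --map-root-user true 2>/dev/null; }

  # Third colon-separated field of each line is the cgroup path.
  outer=$(cut -d: -f3 /proc/self/cgroup 2>/dev/null || true)

  if can_unshare; then
      inner=$(unshare --user --map-root-user --cgroup \
          cut -d: -f3 /proc/self/cgroup 2>/dev/null) || inner=failed
  else
      inner=skipped
  fi

  echo "cgroup path(s) outside:" $outer
  echo "cgroup path(s) inside:" $inner
#+end_src

Inside the new namespace the process's cgroup at creation time becomes the root, so the path is expected to read =/=.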

* Time Namespace
- 2020 (5.6 kernel)
- =CAP_SYS_ADMIN= to unshare
  - *unavailable in =clone()=*
- separate view of
  - =CLOCK_MONOTONIC=
  - =CLOCK_BOOTTIME=
  - *both unsettable*
- =/proc/$PID/timens_offsets=
  - writeable before first process creation
- use-case: process migration
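
A boot-time offset can be observed through =/proc/uptime=. This is a minimal sketch, assuming a recent util-linux =unshare= with the =--time= and =--boottime= options, a kernel with time namespaces, and working unprivileged user namespaces; anything missing is reported as skipped. The offset value is arbitrary.

#+begin_src shell-script
  # Seconds to add to CLOCK_BOOTTIME inside the new time namespace.
  offset=86400

  outer=$(cut -d. -f1 /proc/uptime)

  # --fork is used because only children enter the new time namespace.
  inner=$(unshare --user --map-root-user --time --fork --boottime "$offset" \
      cut -d. -f1 /proc/uptime 2>/dev/null) || inner=skipped

  if [ "$inner" != skipped ]; then
      shift_seen=$((inner - outer))    # should be close to $offset
  else
      shift_seen=skipped
  fi
  echo "uptime shift seen inside: $shift_seen"
#+end_src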

* Namespaces CLI Tools
- =unshare= command
  - =unshare --pid --fork readlink /proc/self= → 9610
  - =unshare --pid --fork --mount-proc readlink /proc/self= → 1
  - =unshare --user --map-current-user --net --keep-caps=
    - =unshare -cn --keep-caps= ← shorter
    - spawn shell in new network ns
    - play with network interfaces without root!
    - privileges *not needed*

* Namespaces CLI Tools, Cont.
- =unshare= command
  - =unshare --pid --fork readlink /proc/self= → 9610
  - =unshare --pid --fork --mount-proc readlink /proc/self= → 1
  - =unshare --user --map-current-user --net --keep-caps=
    - =unshare -cn --keep-caps= ← shorter
    - spawn shell in new network ns
    - play with network interfaces without root!
    - privileges *not needed*
- =nsenter= command
  - =nsenter --all --target=$PID_OF_MY_DOCKER_PROCESS sh=

* Bubblewrap Tool
#+begin_src shell-script
  bwrap  \
      --unshare-pid \
      --proc /proc --dev /dev \
      --ro-bind /usr /usr --ro-bind /etc /etc \
      --tmpfs /tmp --tmpfs "$HOME" \
      --chdir "$HOME" \
      --unshare-all \
      --clearenv --setenv PATH "$PATH" \
      bash
#+end_src

- Flatpak
- security sandboxes
- no predefined policies
- watch out for sockets

* Container Runtimes And Creation / Management Tools
- LXC (Linux Containers)
  - an early execution environment implementation
  - Docker default until 2014
- runC
  - Docker default runtime
- Podman
  - Docker API-compatible, daemonless
  - uses runC as well
- Singularity
- systemd-nspawn
- Nix containers
- Guix containers

* Docker Under Non-Linux Kernels
- Windows → Linux VM
- macOS → Linux VM
- FreeBSD
  - Linux VM
  - Linuxulator (compatibility layer)
- Illumos