#+title: OS-Level Isolation
#+date: 2026-04-27 Mon
#+author: W. Kosior
#+email: wkosior@agh.edu.pl

* Hook-Based Isolation
- ptrace
  - User Mode Linux (UML)
    - UML as a process (under Linux)
    - UML processes as ptrace'd host processes
    - mounting
  - non-isolation use-case: proot (mount + binfmt emulation)

* Hook-Based Isolation, Cont.
- ptrace
  - User Mode Linux (UML)
    - UML as a process (under Linux)
    - UML processes as ptrace'd host processes
    - mounting
  - non-isolation use-case: proot (mount + binfmt emulation)
- library call hooks
  - Sandboxie (Windows)
    - untrusted process with zero-privilege user token
    - calls routed through Sandboxie
    - leverages many non-public Windows APIs
  - non-isolation use-case: fakeroot
    - *useful for security!* (support badly-written programs)

* Hook-Based Isolation — Summary
- userspace re-implementation of OS interfaces
  - easy to use / develop / test / deploy
- ptrace = significant overhead
  - more context switches
- sporadically useful

* =seccomp()=
- 2005 (Linux 2.6.12)
  - restrict process
    - only use =read()=, =write()=, =exit()= and =sigreturn()=
    - KILL non-complying process
- 2012 (Linux 3.5)
  - new =seccomp()= mode
  - filter process' syscalls
    - Berkeley Packet Filter
- initially: Chromium extension sandboxes
- container restrictions
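
One way to observe the modes listed above is the =Seccomp:= field of =/proc/$PID/status= (0 = disabled, 1 = strict, 2 = filter). The sketch below assumes a Linux =/proc=; =seccomp_mode= is a hypothetical helper name, not an existing tool.

#+begin_src shell-script
  # Print the seccomp mode recorded in a /proc/<pid>/status-style file.
  # seccomp_mode is a hypothetical helper; the "Seccomp:" field and its
  # values are described in proc(5).
  seccomp_mode() {
      status_file="${1:-/proc/self/status}"
      # With the default path, "self" is the awk child, which inherits
      # the seccomp mode of this shell.
      mode=$(awk '$1 == "Seccomp:" { print $2 }' "$status_file" 2>/dev/null) \
          || mode=""
      case "$mode" in
          0) echo disabled ;;
          1) echo strict ;;    # only read(), write(), exit(), sigreturn()
          2) echo filter ;;    # a BPF program decides per syscall
          *) echo unknown ;;   # field missing (old kernel, no /proc)
      esac
  }

  seccomp_mode
#+end_src

Run inside a typical container, this is likely to print =filter=, since container runtimes commonly install a seccomp filter.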

* Containers Overview
- lightweight isolation
  - same kernel
  - (to some extent) separate userspace
- =chroot= idea extended
- most popular tooling: Docker + Linux kernel
- years earlier under other OS kernels
- Linux incarnation: namespaces

* From =chroot= to Containers
- 1979 — chroot
  - different view of =/= directory…
    - …otherwise, the same filesystem view
  - same processes view
  - same IPC view
  - same network view
  - …

* From =chroot= to Containers, Cont.
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
  - thick jails
    - traditional jail type
    - separate view of =/=, users, processes, etc.
    - optionally isolated network (VNET jail)
  - thin jails
    - can reuse a template for =/=
    - optionally isolated network (VNET jail)
  - service jails
    - separate view of processes
      - only jail processes visible inside
    - networking blocked or shared
    - shared view of =/=, users, etc.
  - optional Linux syscall emulation

* From =chroot= to Containers, Cont.
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2000 — Virtuozzo
  - patched Linux kernel
  - separate systems, one kernel
    - subdirs in common filesystem
    - chosen network interfaces or devices exposed
    - most of the rest isolated (distinct process trees, no shared IPC, etc.)
  - resource limits
  - 2005 — "OpenVZ", free/libre license

* From =chroot= to Containers, Cont…
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2004/2005 — Solaris Zones
  - also OpenSolaris, Illumos, SmartOS
  - separate systems, one kernel
    - global Zone (GZ) and non-global zones (NGZ)
    - shared or isolated networking
    - subdirs in common filesystem
    - most of the rest isolated (distinct process trees, no shared IPC, etc.)
  - optional resource caps
    - reportedly more effective than in OpenVZ
  - optional Linux syscall emulation

* From =chroot= to Containers, Cont…
- 1979 — chroot (without steroids)
- 1999/2000 — FreeBSD jails
- 2004/2005 — Solaris Zones
- 2002-2014 — Linux namespaces & LXC
  - 2002 — first ns code (2.4.19 kernel)
  - 2013 — usable for security-oriented isolation (3.8 kernel)
  - 2014 — Linux Containers (LXC) 1.0
  - separate view of
    - networking, user IDs, mounts, IPC, etc.
    - either all of them at once or only a subset
  - *cross-ns interaction*

* Linux Namespace Types
- mount
- UTS
- IPC
- PID
- network
- *user*
- time
- cgroup

* Namespaces — =/proc= Links
#+begin_example
$ ls /proc/self/ns/
cgroup             net                time               uts
ipc                pid                time_for_children
mnt                pid_for_children   user
#+end_example
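
Each of these links resolves to a string such as =net:[4026531992]=, and two processes share a namespace exactly when the strings match. A minimal sketch of such a comparison follows; =ns_id=, =same_ns= and the =PROC_ROOT= override are hypothetical names introduced only for illustration.

#+begin_src shell-script
  # ns_id PID TYPE: print the namespace identifier, e.g. "net:[4026531992]".
  # PROC_ROOT is a hypothetical override; it defaults to the real /proc.
  ns_id() {
      readlink "${PROC_ROOT:-/proc}/$1/ns/$2"
  }

  # same_ns TYPE PID_A PID_B: 0 if both share the namespace, 1 if not,
  # 2 if a link could not be read (missing process or no permission).
  same_ns() {
      a=$(ns_id "$2" "$1" 2>/dev/null) || return 2
      b=$(ns_id "$3" "$1" 2>/dev/null) || return 2
      [ "$a" = "$b" ]
  }

  if same_ns net self 1; then
      echo "same network ns as PID 1"
  else
      echo "different network ns, or link not readable"
  fi
#+end_src

Reading the links of another user's process generally requires extra privileges, which is why the sketch treats an unreadable link as a separate outcome.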

* Namespaces — Operation
- =/proc/$PID/ns/…= ← ns links
  - bind-mountable elsewhere

* Namespaces — Operation, Cont.
- =/proc/$PID/ns/…= ← ns links
  - bind-mountable elsewhere
- =clone()= → create process in new ns
- =unshare()= → move current process to new ns
  - *exceptions*
- =setns()= → change process' ns
  - *exceptions*
  - ns link fd

* Namespaces — Operation, Cont…
- =/proc/$PID/ns/…= ← ns files
  - bind-mountable elsewhere
- =clone()= → create process in new ns
- =setns()= → change process' ns
  - *exceptions*
- =unshare()= → move current process to new ns
  - *exceptions*
- =CAP_SYS_ADMIN= to clone
  - *exception*
- =CAP_SYS_ADMIN= *in target ns*
- last process death → ns death
  - unless: ns file bind-mounted
  - unless: ns file opened

* Mount Namespace
- 2002 (2.4.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- =mount -t tmpfs tmpfs /foo=
  - effect seen by other processes
- unshare mount ns, then =mount -t tmpfs tmpfs /foo= →
  - effect *not* seen by other processes
- sub-mounts propagation?
  - default off
- unmounting in ns?
  - filesystems mounted inside this ns?
    - OK
  - filesystems mounted *in ancestor ns*?
    - *nope*
    - hence the trick: =mount --bind /dev/null /etc/shadow=, then unshare
- =/proc/mounts= → *current ns*
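
The visibility rule above can be tried without root by combining a user namespace with the mount namespace. This is a sketch under assumptions: util-linux =unshare= is installed and the kernel permits unprivileged user namespaces; =mount_ns_demo= is a hypothetical helper name. When those assumptions do not hold, it reports =skipped=.

#+begin_src shell-script
  # Mount a tmpfs inside a fresh mount ns and check where it is visible.
  mount_ns_demo() {
      dir=$(mktemp -d) || return 1
      # Inside: new user + mount ns; count /proc/mounts lines for $dir.
      inside=$(unshare --user --map-root-user --mount \
                   sh -c 'mount -t tmpfs tmpfs "$1" &&
                          grep -cF " $1 " /proc/mounts' sh "$dir" \
                   2>/dev/null) || inside=skipped
      # Outside: /proc/mounts describes the current (original) mount ns.
      outside=$(grep -cF " $dir " /proc/mounts 2>/dev/null) || outside=0
      rmdir "$dir"
      echo "inside=$inside outside=$outside"
  }

  mount_ns_demo
#+end_src

The directory can be removed afterwards because the tmpfs mount disappears together with the short-lived namespace.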

* UTS Namespace
- 2006 (2.6.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- =sethostname()=, =getdomainname()=, etc.
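
A small experiment showing the effect: set the hostname inside a new UTS namespace and confirm the host's name is untouched. The sketch assumes util-linux =unshare=, a =hostname= command and unprivileged user namespaces; =uts_demo= and the name =uts-demo= are made up for illustration.

#+begin_src shell-script
  # Change the hostname in a fresh UTS ns; the outer hostname must survive.
  uts_demo() {
      before=$(uname -n)
      inside=$(unshare --user --map-root-user --uts \
                   sh -c 'hostname uts-demo && uname -n' 2>/dev/null) \
          || inside=skipped    # unshare or hostname unavailable/forbidden
      after=$(uname -n)
      if [ "$before" = "$after" ]; then unchanged=yes; else unchanged=no; fi
      echo "inside=$inside host_unchanged=$unchanged"
  }

  uts_demo
#+end_src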

* IPC Namespace
- 2006 (2.6.19 kernel)
- =CAP_SYS_ADMIN= to unshare
- sysvipc (numbers as keys)
  - =msgget()=
  - =semget()=
  - =shmget()=
  - =/proc/sys/kernel/msgmax=, etc.
  - =/proc/sysvipc=
- POSIX message queues (strings as names)
  - =mq_open()=
  - =/proc/sys/fs/mqueue=
- =/proc= files → *current ns*

* PID Namespace
- 2008 (2.6.24 kernel)
- =CAP_SYS_ADMIN= to unshare
- =setns()= / =unshare()= → ns for *children*
  - =/proc/self/ns/pid_for_children= vs =/proc/self/ns/pid=
  - *existing process cannot move*
- first ns process → PID 1 (init)
  - systemd *unsuitable*
  - SysV Init, runit, etc.
  - or… zombie apocalypse 🧟
  - init death → all ns processes death
    - ns /existing/ but unusable
- PID translation
- =/proc= dirs → *mounter's ns*
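
The "first ns process → PID 1" rule can be checked from an unprivileged shell, assuming util-linux =unshare= and a kernel that allows unprivileged user namespaces. =pidns_first_pid= is a hypothetical helper name.

#+begin_src shell-script
  # Print the PID that the first process of a new PID ns sees for itself.
  # --fork is needed because unshare() only affects future children.
  pidns_first_pid() {
      unshare --user --map-root-user --pid --fork sh -c 'echo $$' \
          2>/dev/null || echo skipped
  }

  pidns_first_pid
#+end_src

The shell reports its PID through =getpid()=, so the result does not depend on =/proc=, which here would still show the mounter's namespace.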

* Network Namespace
- 2008/2009 (2.6.24 - 2.6.29 kernels)
- =CAP_SYS_ADMIN= to unshare
- =/proc/sys/net= files → *current ns*
- use-case: entire network simulation
  - mininet
  - veth pairs / bridges
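
As a starting point for such experiments, the sketch below lists the interfaces visible in a freshly created network namespace by reading =/proc/net/dev= from inside it. It assumes util-linux =unshare= and unprivileged user namespaces; =netns_ifaces= is a hypothetical helper name.

#+begin_src shell-script
  # List interface names seen inside a brand-new network namespace.
  netns_ifaces() {
      dev=$(unshare --user --map-root-user --net cat /proc/net/dev \
                2>/dev/null) || { echo skipped; return 0; }
      # Skip the two header lines; the name is the text before the colon.
      printf '%s\n' "$dev" |
          awk -F: 'NR > 2 { gsub(/ /, "", $1); print $1 }'
  }

  netns_ifaces
#+end_src

A new namespace typically contains only the loopback interface =lo=; some kernels also add tunnel fallback devices, so the list may be longer.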

* User Namespace
- 2007/2013 (2.6.23 - 3.8 kernels)
- since 3.8: *unprivileged creation*
- separate view of
  - *users*
  - *capabilities*
  - keyrings
- unprivileged user creates ns → *root + all caps inside ns*
  - treated as unprivileged on the outside
  - can unshare other namespaces
- =/etc/subuid=, =/etc/subgid=
- =/proc/$PID/uid_map=, =/proc/$PID/gid_map=
- root outside == root inside?
  - practiced in the past (Docker, etc.)
  - unacceptable today
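
Each line of =uid_map= has three fields: the first ID inside the namespace, the first ID it maps to outside, and the length of the range. The following sketch translates an inside UID using such a file; =outside_uid= is a hypothetical helper, and the sample values in the usage note are invented.

#+begin_src shell-script
  # outside_uid INSIDE_UID [MAP_FILE]
  # Print the UID that INSIDE_UID maps to, or fail (status 1) if unmapped.
  outside_uid() {
      awk -v id="$1" '
          # $1 = first inside ID, $2 = first outside ID, $3 = range length
          id >= $1 && id < $1 + $3 {
              print $2 + (id - $1); found = 1; exit
          }
          END { if (!found) exit 1 }
      ' "${2:-/proc/self/uid_map}"
  }

  outside_uid "$(id -u)" 2>/dev/null || echo "no mapping found"
#+end_src

For example, with a map containing the lines =0 1000 1= and =1 100000 65536=, inside UID 0 translates to 1000 and inside UID 42 translates to 100041.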

* Ubuntu & Debian Sysctl Patch (2013)
#+begin_example
add sysctl to disallow unprivileged CLONE_NEWUSER by default
This is a short-term patch. Unprivileged use of CLONE_NEWUSER
is certainly an intended feature of user namespaces. However
for at least saucy we want to make sure that, if any security
issues are found, we have a fail-safe.
Signed-off-by: Serge Hallyn <serge.hallyn@ubuntu.com>
#+end_example

* Kees Cook Sysctl Patch (2016)
#+begin_quote
There continues to be unexpected side-effects and security exposures via
CLONE_NEWUSER. For many end-users running distro kernels with CONFIG_USER_NS
enabled, there is no way to disable this feature when desired. As such, this
creates a sysctl to restrict CLONE_NEWUSER so admins not running containers or
Chrome can avoid the risks of this feature.
#+end_quote

* Cgroup Namespace
- 2016 (4.6 kernel)
- =/proc/$PID/cgroup=
  - =/proc= mounter's view of PID number
  - *current ns view of cgroup*
    - can read '..' in cgroup path
- =/sys/fs/cgroup= → *mounter's ns*
- prevents re-configuration of cgroups by guest

* Time Namespace
- 2020 (5.6 kernel)
- =CAP_SYS_ADMIN= to unshare
  - *unavailable in =clone()=*
- separate view of
  - =CLOCK_MONOTONIC=
  - =CLOCK_BOOTTIME=
  - *both unsettable*
- =/proc/$PID/timens_offsets=
  - writeable before first process creation
- use-case: process migration

* Namespaces CLI Tools
- =unshare= command
  - =unshare --pid --fork readlink /proc/self= → 9610
  - =unshare --pid --fork --mount-proc readlink /proc/self= → 1
  - =unshare --user --map-current-user --net --keep-caps=
    - =unshare -cn --keep-caps= ← shorter
    - spawn shell in new network ns
    - play with network interfaces without root!
    - privileges *not needed*

* Namespaces CLI Tools, Cont.
- =unshare= command
  - =unshare --pid --fork readlink /proc/self= → 9610
  - =unshare --pid --fork --mount-proc readlink /proc/self= → 1
  - =unshare --user --map-current-user --net --keep-caps=
    - =unshare -cn --keep-caps= ← shorter
    - spawn shell in new network ns
    - play with network interfaces without root!
    - privileges *not needed*
- =nsenter= command
  - =nsenter --all --target=$PID_OF_MY_DOCKER_PROCESS sh=

* Bubblewrap Tool
#+begin_src shell-script
  bwrap  \
      --unshare-pid \
      --proc /proc --dev /dev \
      --ro-bind /usr /usr --ro-bind /etc /etc \
      --tmpfs /tmp --tmpfs "$HOME" \
      --chdir "$HOME" \
      --unshare-all \
      --clearenv --setenv PATH "$PATH" \
      bash
#+end_src

- Flatpak
- security sandboxes
- no predefined policies
- watch out for sockets

* Container Runtimes And Creation / Management Tools
- LXC (Linux Containers)
  - an early execution environment implementation
  - Docker default until 2014
- runC
  - Docker default runtime
- Podman
  - Docker API-compatible, daemonless
  - uses runC as well
- Singularity
- systemd-nspawn
- Nix containers
- Guix containers

* Docker Under Non-Linux Kernels
- Windows → Linux VM
- macOS → Linux VM
- FreeBSD
  - Linux VM
  - Linuxulator (compatibility layer)
- Illumos