docs/MMU-explained.md



## MMU
This document explains the steps we took to enable the MMU and how the MMU works in general.

MMU stands for Memory Management Unit. It does 2 important things:

1. It allows programs to use virtual memory addressing. The MMU translates virtual addresses to physical addresses with the help of a translation table.
2. It guards against disallowed memory accesses. A component that implements only this functionality is called an MPU (Memory Protection Unit) and is also found in some ARM cores.

Without an MMU, code executing on a processor sees the memory as it really is.

When it tries to load data from address 0x00AA0F3C, it indeed loads data from 0x00AA0F3C. This doesn't mean address 0x00AA0F3C is in RAM: RAM can be mapped into the address space in an arbitrary way.

The MMU can be configured to "redirect" one range of addresses to another. Let's assume we configured the MMU to translate the 1MB address range 0x00A00000 - 0x00AFFFFF to the range 0x00200000 - 0x002FFFFF. Now, code trying to perform an operation on address 0x00AA0F3C would have the address transparently translated to 0x002A0F3C, and the operation would actually take place there.
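
The redirect from this example can be modelled in a few lines of C. This is only a software sketch of the arithmetic, not how the MMU is programmed. The names model_translate, VIRT_BASE, PHYS_BASE and RANGE_SIZE are made up for illustration, and returning out-of-range addresses unchanged is a simplifying assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical constants mirroring the example in the text. */
#define VIRT_BASE  0x00A00000u  /* start of the redirected virtual range */
#define PHYS_BASE  0x00200000u  /* start of the physical range it maps to */
#define RANGE_SIZE 0x00100000u  /* both ranges are 1MB long */

/* Software model of the redirect. Addresses outside the range are
 * returned unchanged (an assumption made to keep the model small). */
uint32_t model_translate(uint32_t virt)
{
    /* Unsigned wrap-around makes this test fail for virt < VIRT_BASE too. */
    if (virt - VIRT_BASE < RANGE_SIZE)
        return PHYS_BASE + (virt - VIRT_BASE);
    return virt;
}

/* Self-check using the address from the text. */
void check_text_example(void)
{
    assert(model_translate(0x00AA0F3Cu) == 0x002A0F3Cu);
}
```

Note that only the upper part of the address changes. The offset inside the range (0xA0F3C here) is carried over as-is.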

The translation affects all data accesses (stack and non-stack) as well as instruction fetches. Hence, an entire program can be made to work as if it were running from one memory address, while in fact it runs from a different one!

The addresses used by program code are referred to as virtual addresses, while the addresses that actually reach the memory after translation are referred to as physical addresses.

This aids the operating system's memory management in several ways:

1. A program may be compiled to run from some fixed address and the OS is still free to choose any physical location to store that program's code - only a translation from the program's required address to that location's address has to be configured. The problem of simultaneously executing multiple programs compiled for the same address is also avoided this way.
2. A program might require a contiguous memory region when, due to earlier allocations and deallocations, there isn't a big enough (no pun intended) free contiguous region of physical memory. Smaller regions can be mapped to become accessible as a single region in virtual address space, thus avoiding the need for defragmentation.

A given mapping can be made valid for only one execution mode (e.g. a region accessible only from privileged mode) or for only certain types of accesses. A memory region can be made non-executable, which guards against program code accidentally jumping there. That is important for countering buffer-overflow exploits. A disallowed access triggers a processor exception, which passes control to the appropriate exception handler.


The Raspberry Pi boards we use have ARMv7-A compatible processors, which we currently use only in 32-bit mode. The information here is relevant to those systems (there are Pi boards with both older and newer processors, with fewer or more features available).

If an MMU is present, its general configuration is done through registers of the appropriate coprocessor (cp15). Translations are managed through a translation table. It is an array of 32-bit or 64-bit entries (also called descriptors) describing how their corresponding memory regions should be mapped. A number of leftmost bits of a virtual address constitute the index of the translation table entry used to translate it. This way, no virtual addresses need to be stored in the table and the MMU can find the right entry in O(1) time.
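
To illustrate the indexing, here is a minimal sketch assuming the 32-bit entry format with 4096 entries of 1MB each (described in the translation table section of this document). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SECTION_SHIFT 20u    /* 1MB = 2^20 bytes per table entry */
#define TABLE_ENTRIES 4096u  /* 2^12 entries cover the 4GB address space */

/* Index of the table entry responsible for a virtual address:
 * simply its 12 leftmost bits, so no search is needed. */
uint32_t table_index(uint32_t virt)
{
    uint32_t index = virt >> SECTION_SHIFT;
    assert(index < TABLE_ENTRIES);  /* always true for a 32-bit address */
    return index;
}

/* Offset of the address inside its 1MB region; translation keeps it as-is. */
uint32_t section_offset(uint32_t virt)
{
    return virt & ((1u << SECTION_SHIFT) - 1u);
}
```

For the address 0x00AA0F3C used earlier, table_index() gives 0x00A and section_offset() gives 0xA0F3C.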

### Coprocessor 15
Coprocessor 15 contains several registers that control the behaviour of the MMU. They are all accessed through the mcr and mrc ARM instructions.
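
As an illustration, the following sketch wraps mrc and mcr in GCC inline assembly to read and write SCTLR (the first register listed below). The function names are hypothetical and not taken from the project. The non-ARM branch is an assumption added only so the sketch can be compiled and tried on a development machine. On real hardware these instructions have to be executed in a privileged mode.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__arm__)
/* mrc copies a coprocessor register into a core register. */
static inline uint32_t sctlr_read(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c1, c0, 0" : "=r"(value));
    return value;
}

/* mcr copies a core register into a coprocessor register. */
static inline void sctlr_write(uint32_t value)
{
    __asm__ volatile("mcr p15, 0, %0, c1, c0, 0" : : "r"(value) : "memory");
}
#else
/* Host fallback (illustration only): a variable stands in for the register. */
static uint32_t fake_sctlr;
static inline uint32_t sctlr_read(void) { return fake_sctlr; }
static inline void sctlr_write(uint32_t value) { fake_sctlr = value; }
#endif

/* Example: set the given bits with a read-modify-write. */
static inline void sctlr_set_bits(uint32_t mask)
{
    assert(mask != 0);  /* a zero mask would be a pointless register write */
    sctlr_write(sctlr_read() | mask);
}
```

Other cp15 registers are accessed the same way, with different c-register and opcode operands in the instruction.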

1. SCTLR, System Control Register - "provides the top level control of the system, including its memory system". Bits of this register control, among other things, whether the following are enabled:

    1. the MMU
    2. data cache
    3. instruction cache
    4. TEX remap (changes how some translation table entry bit fields, called C, B and TEX, are used - not used in the project)
    5. access flag (when enabled, one translation table descriptor bit normally used to specify a region's access permissions is used as an access flag instead - not used in the project either)

2. DACR, Domain Access Control Register - "defines the access permission for each of the sixteen memory domains". Entries in the translation table define which of the 16 available memory domains a memory region belongs to. Bits of DACR specify what permissions apply to each of the domains. For each domain, the possible settings are: allow accesses based on the access permission bits in the translation table descriptor, allow all accesses regardless of those bits, or disallow all accesses regardless of those bits.
   
3. TTBR0, Translation Table Base Register 0 - "holds the base address of translation table 0, and information about the memory it occupies". The system programmer can choose (subject to some alignment requirements) where in physical memory to put the translation table. The chosen address (actually, only a number of its leftmost bits) has to be put in TTBR0 for the MMU to know where the table lies. Other bits of this register control some memory attributes relevant for accesses to table entries by the MMU.
   
4. TTBR1, Translation Table Base Register 1 - similar function to TTBR0 (see below for an explanation of the dual TTBR setup)
5. TTBCR, Translation Table Base Control Register, which controls:

    1. How TLBs (Translation Lookaside Buffers) are used. TLBs are a mechanism of caching translation table entries.
    2. Whether to use an extension feature that changes the lengths of translation table entries and TTBR* registers to 64 bits (we're not using this, so we won't go into details)
    3. How a translation table is selected. 
    
There can be 2 translation tables and there are 2 cp15 registers (TTBR0 and TTBR1) to hold their base addresses. When 2 tables are in use, then on each memory access some leftmost bits of the virtual address determine which one should be used. If the bits are all 0s, the table pointed to by TTBR0 is used. Otherwise, the table pointed to by TTBR1 is used. This allows an OS developer to use separate translation tables for kernelspace and userspace (e.g. by having the kernelspace code run from virtual addresses whose leftmost bit is 1 and userspace code run from virtual addresses whose leftmost bit is 0). A field of TTBCR determines how many leftmost bits of the virtual address are used for that (and also affects the TTBR0 format). In the simplest setup (as in our project) this number is 0, so only the table specified in TTBR0 is used.
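
The selection rule can be sketched as a small C function. This is a software model only, and the name selected_ttbr is hypothetical. It assumes the TTBCR field in question is 3 bits wide, so the number of examined bits ranges from 0 to 7.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 0 when the table pointed to by TTBR0 would be used and 1 when
 * the table pointed to by TTBR1 would be used. n is the number of
 * leftmost virtual address bits examined, as set in TTBCR. */
int selected_ttbr(uint32_t virt, unsigned n)
{
    assert(n <= 7);
    if (n == 0)
        return 0;                     /* no bits examined: always TTBR0 */
    return (virt >> (32u - n)) != 0;  /* any examined bit set: TTBR1 */
}
```

With n equal to 1, addresses from 0x80000000 upwards go through TTBR1 and lower ones through TTBR0, which matches the kernelspace/userspace split mentioned above.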

### Translation table

The translation table consists of 4096 entries, each describing a 1MB memory region. An entry can be of several types:

1. Invalid entry - the corresponding virtual addresses cannot be used
2. Section - description of a mapping of a 1MB memory region
3. Supersection - description of a mapping of a 16MB memory region, which has to be repeated in 16 consecutive table entries. This can be used to map to physical addresses beyond the 32-bit range.
4. Page table - no mapping is given yet, but a page table is pointed to. See below.

A translation table descriptor also specifies:

1. Access permissions.
2. Other memory attributes (cacheability, shareability).
3. Which domain the memory belongs to.
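
As a rough illustration of how such a descriptor is laid out, the sketch below builds and classifies 32-bit entries using plain bit operations. The macro and function names are hypothetical (the project itself uses bitfield structs), and only a few of the descriptor's fields are modelled.

```c
#include <assert.h>
#include <stdint.h>

#define DESC_TYPE_MASK     0x3u         /* entry type lives in bits [1:0] */
#define DESC_TYPE_SECTION  0x2u         /* 0b10: section or supersection */
#define DESC_SUPERSECTION  (1u << 18)   /* set for supersections */
#define DESC_DOMAIN_SHIFT  5u           /* domain number in bits [8:5] */
#define DESC_AP_SHIFT      10u          /* AP[1:0] in bits [11:10] */
#define AP_PL1_ONLY        0x1u         /* with AP[2] = 0: access from PL1 only */
#define SECTION_BASE_MASK  0xFFF00000u  /* physical base in bits [31:20] */

enum desc_type { DESC_INVALID, DESC_PAGE_TABLE, DESC_SECTION,
                 DESC_SUPERSEC, DESC_OTHER };

/* Build a section descriptor for a 1MB-aligned physical base address. */
uint32_t make_section(uint32_t phys_base, unsigned domain, unsigned ap)
{
    assert((phys_base & ~SECTION_BASE_MASK) == 0);
    assert(domain < 16 && ap < 4);
    return phys_base | (ap << DESC_AP_SHIFT)
                     | (domain << DESC_DOMAIN_SHIFT) | DESC_TYPE_SECTION;
}

/* Tell the entry types listed above apart. */
enum desc_type classify(uint32_t desc)
{
    switch (desc & DESC_TYPE_MASK) {
    case 0x0: return DESC_INVALID;
    case 0x1: return DESC_PAGE_TABLE;
    case 0x2: return (desc & DESC_SUPERSECTION) ? DESC_SUPERSEC : DESC_SECTION;
    default:  return DESC_OTHER;  /* 0b11: meaning depends on the implementation */
    }
}
```

For example, make_section(0x00200000, 0, AP_PL1_ONLY) yields 0x00200402: a section at physical address 0x00200000, in domain 0, accessible from PL1 only.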

### Page Table

A page table is similar to the translation table, but its entries define smaller regions (called, well - pages). When a translation table descriptor pointing to a page table gets used for translation, an entry of that page table is fetched, with some middle bits of the virtual address used as the index, and that entry determines the mapping. This allows finer granularity of mappings where it is needed, while no page tables have to occupy space where small pages are not needed. We could say that 2-level translations are performed. On some versions of ARM, translations can have more levels than that. This means the MMU might sometimes need to fetch several entries from tables of different levels to compute the physical address. This is called a translation table walk.
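
For illustration, this is how a virtual address splits up in a 2-level translation, assuming the 32-bit table format with 4KB small pages and 256-entry page tables. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

struct va_parts {
    uint32_t l1_index;  /* bits [31:20]: 1 of 4096 translation table entries */
    uint32_t l2_index;  /* bits [19:12]: 1 of 256 page table entries */
    uint32_t offset;    /* bits [11:0]: byte offset inside a 4KB page */
};

/* Split a virtual address into the pieces used by a 2-level table walk. */
struct va_parts split_va(uint32_t virt)
{
    struct va_parts p;
    p.l1_index = virt >> 20;
    p.l2_index = (virt >> 12) & 0xFFu;
    p.offset   = virt & 0xFFFu;
    /* The three pieces together carry every bit of the address. */
    assert(((p.l1_index << 20) | (p.l2_index << 12) | p.offset) == virt);
    return p;
}
```

For 0x00AA0F3C the walk would use translation table entry 0x00A, then page table entry 0xA0, and finally add the offset 0xF3C to the page's physical base.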

As of 15.01.2020 page tables and small pages are not used in the project (although programming them is on the TODO list).

### Project specific info

Despite the overwhelming number of configuration options available, most can be left at their defaults, and this is how it's done in this project. Those default settings usually make the MMU behave like it did in older ARM versions, when some options were not yet available and, hence, the entire system was simpler.

Our project uses C bitfield structs for operating on SCTLR and TTBCR contents and on translation table descriptors. This is an elegant and readable approach, yet not very portable across compilers. The current struct definitions work properly with GCC. For DACR, bit shifts are more appropriate, and for TTBCR, our default configuration means we simply write 0 to that register.
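
To show what the bitfield approach looks like, here is a minimal sketch for SCTLR. It is a hypothetical layout written for this explanation, not the definition from cp_regs.h, and it names only the bits discussed in this document. It relies on GCC allocating bitfields starting from the least significant bit on little-endian targets, which is exactly the portability caveat mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of SCTLR; only bits mentioned in the text are named. */
typedef union {
    uint32_t raw;
    struct {
        uint32_t M      : 1;   /* bit 0: MMU enable */
        uint32_t A      : 1;   /* bit 1: alignment checking */
        uint32_t C      : 1;   /* bit 2: data cache enable */
        uint32_t other1 : 9;   /* bits 3-11: not modelled here */
        uint32_t I      : 1;   /* bit 12: instruction cache enable */
        uint32_t other2 : 15;  /* bits 13-27: not modelled here */
        uint32_t TRE    : 1;   /* bit 28: TEX remap enable */
        uint32_t AFE    : 1;   /* bit 29: access flag enable */
        uint32_t other3 : 2;   /* bits 30-31: not modelled here */
    } fields;
} sctlr_model_t;

/* Turn on the MMU and both caches in a previously read SCTLR value. */
uint32_t sctlr_enable_mmu_and_caches(uint32_t old)
{
    sctlr_model_t s;
    assert(sizeof(s) == sizeof(uint32_t));  /* the fields must fill one word */
    s.raw = old;
    s.fields.M = 1;
    s.fields.C = 1;
    s.fields.I = 1;
    return s.raw;
}
```

Compared with masks and shifts, the field names make the intent obvious at the point of use, at the price of depending on the compiler's bitfield layout.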

Structs describing SCTLR, DACR and TTBCR are defined in src/arm/PL1/kernel/cp_regs.h.
Structs describing translation table descriptors are defined in src/arm/PL1/kernel/translation_table_descriptors.h.

Before the MMU is enabled, all memory is seen as it really is. The code that enables the MMU keeps executing from the same addresses afterwards. Therefore, the only feasible way of enabling it is to first set the descriptors in the translation table to map all addresses to themselves (mapping just the addresses used by the kernel would be enough). Such a mapping is called a flat map.
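
A flat map over 1MB sections can be sketched as a single loop. The function name fill_flat_map and the attrs parameter are hypothetical. In the real system the table lives at an address chosen in the linker script, while here it is just an array passed by the caller.

```c
#include <assert.h>
#include <stdint.h>

#define TABLE_ENTRIES 4096u

/* Fill the table so that every 1MB section maps to itself.
 * attrs holds the low descriptor bits (type, permissions, domain). */
void fill_flat_map(uint32_t table[TABLE_ENTRIES], uint32_t attrs)
{
    assert((attrs & 0xFFF00000u) == 0);  /* must not overlap the base address */
    for (uint32_t i = 0; i < TABLE_ENTRIES; i++)
        table[i] = (i << 20) | attrs;    /* entry i covers addresses i * 1MB onwards */
}
```

With attrs equal to 0x402 (section, PL1-only access, domain 0), entry 0x00A becomes 0x00A00402, so virtual address 0x00AA0F3C translates to itself.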

### Setting up MMU and FlatMap

Here is how setting up a flat map, turning on the MMU and managing memory sections are done in our project:

1. The translation table is defined in the linker script src/arm/PL1/kernel/kernel_stage2.ld as a NOLOAD section. C code gets the table's start and end addresses from symbols defined in that linker script (see arm/PL1/kernel/memory.h).
2. The function setup_flat_map(), defined in arm/PL1/kernel/paging.c, enables the MMU with a flat map. It prints relevant information to the UART while performing the following procedure:

    1. In a loop, write all descriptors to the translation table, setting them up as sections accessible from PL1 only and belonging to domain 0.
    2. Set DACR to allow domain 0 memory accesses based on translation table descriptor permissions and to block accesses to other domains, as only domain 0 is used in this project.
    3. Make sure TEX remap, access flag, caches and the MMU are disabled in SCTLR. Disabling some of them might be unnecessary, because the MMU is assumed to be disabled from the start and enabled caches might cause no problems as long as only a flat map is used. Still, the way it is done right now is known to work well and optimizations are not needed.
    4. Clear all caches and TLBs (again, it is suspected that some of this is unnecessary).
    5. Write a TTBCR setting such that only the 32-bit translation table format is used.
    6. Make TTBR0 point to the start of the translation table. The rest of the attributes in TTBR0 (concerning how table entries are accessed) are left as 0s (defaults).
    7. Enable the MMU and caches by setting the appropriate bits in SCTLR.
    
After some cp15 register writes, the isb assembly instruction is used, which causes the ARM core to wait until the changes take effect. This prevents later instructions from being executed before the changes are applied.
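
As an example of the bit shifts used for DACR, the value written in step 2 of the procedure above can be computed as sketched below. The enum and function names are hypothetical. The sketch only computes the value and leaves the actual cp15 write out.

```c
#include <assert.h>
#include <stdint.h>

enum domain_access {
    DOMAIN_NO_ACCESS = 0x0,  /* every access to the domain faults */
    DOMAIN_CLIENT    = 0x1,  /* accesses are checked against descriptor permissions */
    DOMAIN_MANAGER   = 0x3   /* accesses are allowed without permission checks */
};

/* DACR holds one 2-bit field per domain; domain d occupies bits [2d+1:2d]. */
uint32_t dacr_set(uint32_t dacr, unsigned domain, enum domain_access access)
{
    assert(domain < 16);
    dacr &= ~(0x3u << (2u * domain));                   /* clear the old setting */
    return dacr | ((uint32_t)access << (2u * domain));  /* insert the new one */
}
```

Starting from 0 (no access to any domain), dacr_set(0, 0, DOMAIN_CLIENT) gives 0x1, which is the setting described in step 2: domain 0 is checked against descriptor permissions and all other domains are blocked.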

In arm/PL1/kernel/paging.c, the function claim_and_map_section() can be used to modify an entry in the translation table to create a new mapping. Memory allocation, also implemented in that source file, uses some lists to track free and taken sections, but otherwise has nothing to do with the MMU.