How we were writing it, from the beginning
So, the goal is to learn and program the MMU of the RaspberryPI.
To be able to start quickly, we, errrmm... copy-pasted some code from RaspberryPI bare metal tutorial on wiki OSdev with plans to replace these bits with our own later. "An easy way" - one would say. Not really. The wiki code, although useful, is far from working. Consider this part:
enum
{
// The GPIO registers base address.
switch (raspi) {
case 2:
case 3: GPIO_BASE = 0x3F200000; break; // for raspi2 & 3
case 4: GPIO_BASE = 0xFE200000; break; // for raspi4
default: GPIO_BASE = 0x20200000; break; // for raspi1, raspi zero etc.
}
// more stuff here
};
Switch statement inside of an enum?! First thought? That it's some kind of nonstandard extension to C. Well, no. No matter how many of its own extensions gcc adds to the language, the ability to do THIS is not one of them. This is just one example.
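For comparison, here is one way to express in valid C what the snippet seems to intend. This is only a sketch: the function name `gpio_base_for()` is made up for illustration, and the addresses are simply the ones quoted from the wiki code.

```c
#include <stdint.h>

/* Hypothetical helper: returns the GPIO register base address for a given
 * board generation. The addresses are the ones quoted from the wiki code. */
uint32_t gpio_base_for(int raspi)
{
    switch (raspi) {
    case 2:
    case 3:  return 0x3F200000u; /* for raspi2 & 3 */
    case 4:  return 0xFE200000u; /* for raspi4 */
    default: return 0x20200000u; /* for raspi1, raspi zero etc. */
    }
}
```

The choice is made at run time here. If the board is known at build time, the same choice could be made with the preprocessor instead.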
It's really weird that someone wrote this and made it available online. There were also other, similar problems with that code. E.g. there was double padding, which resulted in the initial routine being loaded at address 0x10000, not 0x8000...
Maybe the wiki OSdev authors don't actually want people to directly reuse their code or they want to stop inexperienced programmers from doing this kind of low-level bare-metal stuff too easily.
Nevertheless, we had to fix the bugs and then we could run the kernel.elf under qemu emulating the RPI2 and receive the (virtual) uart output.
The real hardware we have is RPI3B, but the versions of qemu available in some distributions don't yet have support for emulating RPI3 and compiling from source seemed like a good way to waste time we don't have (+ we were not going to use aarch64 until we got the 32-bit version working).
There were more problems with running the raw binary version of the kernel (which we had to get working if we ever wanted to use real hardware). As it turned out after 1.5 hours of static analysis of the image in radare2 - qemu doesn't load the binary image at 0x8000 as a real RPI would, but rather at 0x10000. This might also explain the double padding that the wiki asm and linker code caused. After finding that out, we could finally understand how the ld script and objcopy work and we didn't need to use dd on the image anymore. We temporarily changed the entry point address to 0x10000.
We wrote a few lines to check for paging support based on wiki info and we moved the uart code into separate files. Then we came up with the idea of a bootloader.
Once we started working on a real RPI, we would have to take the SD card out of it, put it into the PC, write the kernel to it and move it back to the PI on every kernel compilation. Sounds terrible, doesn't it? And aside from taking a lot of time, it also kills the SD card.
We decided to send the kernel through UART. In fact, there already exists a bootloader, called Raspbootin, that does exactly that. This time, however, we wanted to write this ourselves (especially since it seemed rather easy). So, the PI (or qemu) boots the loader instead of the actual kernel, the kernel (prepended with 4 bytes describing its size) is piped through the uart (or to qemu's stdin), the loader writes the received data into the memory and jumps to it. One problem is that if the loader gets loaded at 0x8000, then it cannot just write the kernel at 0x8000, because it would overwrite its own code. That's why we made a "2nd stage" of the bootloader, which is embedded in the main loader executable. The loader copies the stage2 to some other address, e.g. 0x4000, and jumps to it. The 2nd stage then initializes the uart, receives the kernel, writes it at 0x8000 and jumps to it.
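The receiving side of such a size-prefixed transfer could look roughly like the sketch below. It is not our actual loader code. The names `receive_image()`, `mem_stream` and `mem_getc()` are made up for illustration, and the byte order of the 4-byte size (least significant byte first) is an assumption. The memory-backed byte source stands in for a blocking UART read, so that the logic can be tried on a PC.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte source: blocks until one byte is available.
 * On the board this would wrap the UART read routine. */
typedef uint8_t (*getc_fn)(void *ctx);

/* Memory-backed stand-in for the UART, for trying the logic on a PC. */
struct mem_stream {
    const uint8_t *data;
    size_t         len;
    size_t         pos;
};

uint8_t mem_getc(void *ctx)
{
    struct mem_stream *s = ctx;
    /* Return 0 once the stream is exhausted instead of reading past it. */
    return s->pos < s->len ? s->data[s->pos++] : 0;
}

/*
 * Receives a size-prefixed image: 4 bytes of length, then the payload.
 * ASSUMPTION: the length is sent least significant byte first.
 * Returns the payload size, or 0 if the image would not fit in `capacity`
 * bytes (nothing beyond the 4 size bytes is consumed in that case).
 */
uint32_t receive_image(getc_fn getc, void *ctx, uint8_t *dest, uint32_t capacity)
{
    uint32_t size = 0;

    for (int i = 0; i < 4; i++)
        size |= (uint32_t)getc(ctx) << (8 * i);

    if (size > capacity)
        return 0;

    for (uint32_t i = 0; i < size; i++)
        dest[i] = getc(ctx);

    return size;
}
```

In a real 2nd stage, `dest` would be the kernel load address (0x8000 in our case) and the function would be followed by a jump to it.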
By writing the bootloader we also removed the need to change the kernel entry address depending on the environment. The bootloader entry address has to be changed instead, but this is less problematic, since the bootloader can just sit there on the SD card undisturbed, while we're working on the kernel.
We also finally ran the code on real hardware. And we used RPI Open Firmware for that (at last something it CAN be used for). Aside from the kernel (which, btw., is supposed to be linux and has to be named "zImage"), the firmware expected 2 files (device tree, cmdline.txt). It also turned out to load the kernel at yet another address. All this should be changed in the firmware itself (TODO), but the original version is enough to work on for now.
Another problem appeared when trying to use the bootloader on a real RPi. Mainly - how does one pipe the image through UART? GNU Screen, which we successfully used for communicating with the board, doesn't seem to support this. We found a tool called "socat", which was available from the repo and could be used instead. So the makefile rule would first pipe the image using socat and then run screen for the usual io. An additional uart_getc() had to be added at the beginning of the kernel main function, so that its first output wouldn't get lost before screen started. The socat solution also required the UART USB adapter and the PI's power supply to be replugged in a specific order to work, so we started working on a different solution using libRS232 for UART communication from the PC.
Only at this point did we eventually start working on the relevant part - the MMU. It was surprisingly difficult to achieve something in this field. Information on the wiki was incomplete, just as information in the first few reference manuals we picked. The source that eventually proved to be good enough is the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition. Also, the configuration of the MMU turned out to be way more complicated than first thought. After some hours of digging through dozens of options changed in various coprocessor registers we eventually came up with some code to enable the MMU, with a simple, (obviously) flat mapping of memory, and after finding the bugs (forgetting to also map the part of memory where UART peripherals are accessible, forgetting to mark the descriptor as describing a section, creating the translation table at the same place the stack was) we got it working in qemu.
The above could possibly have been achieved more easily by using others' existing code, but doing the whole project with the Copy-Paste method seemed like a bad idea.
Knowing a good, working sequence of actions needed for enabling the MMU, we could start writing it in a cleaner way - using unions and structs with bitfields, which make the code a lot more readable compared to when bit masks and bit shifts are used for work on coprocessor register contents and translation table entries.
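To give an idea of the bitfield approach, here is a sketch of a section entry of the ARMv7-A short-descriptor translation table, together with a flat (identity) mapping. It is an illustration, not our actual definitions. All type, field and function names are made up, and it assumes the compiler allocates bitfields starting from the least significant bit (as gcc does on little-endian targets) and that access permissions use the full AP[2:0] model.

```c
#include <stdint.h>

/*
 * One "section" entry (maps 1 MB) of the short-descriptor translation table.
 * ASSUMPTION: bitfields are allocated from the least significant bit up,
 * as gcc does on little-endian targets.
 */
typedef union {
    uint32_t raw;
    struct {
        uint32_t pxn          : 1;  /* bit 0: PXN where supported, else 0 */
        uint32_t is_section   : 1;  /* bit 1: must be 1 for a section */
        uint32_t b            : 1;  /* bit 2: bufferable */
        uint32_t c            : 1;  /* bit 3: cacheable */
        uint32_t xn           : 1;  /* bit 4: execute-never */
        uint32_t domain       : 4;  /* bits 8:5 */
        uint32_t impl_def     : 1;  /* bit 9: implementation defined */
        uint32_t ap_1_0       : 2;  /* bits 11:10: access permissions */
        uint32_t tex          : 3;  /* bits 14:12 */
        uint32_t ap_2         : 1;  /* bit 15 */
        uint32_t s            : 1;  /* bit 16: shareable */
        uint32_t ng           : 1;  /* bit 17: not global */
        uint32_t supersection : 1;  /* bit 18: 0 for an ordinary section */
        uint32_t ns           : 1;  /* bit 19: non-secure */
        uint32_t base         : 12; /* bits 31:20: physical address bits */
    } fields;
} section_entry_t;

/* Builds an entry that maps section number `index` onto itself. */
section_entry_t flat_section(uint32_t index, int pl0_access)
{
    section_entry_t e = { .raw = 0 };

    e.fields.is_section = 1;
    /* AP[2:0] = 0b001: PL1 read/write only; 0b011: read/write for PL0 too */
    e.fields.ap_1_0 = pl0_access ? 3 : 1;
    e.fields.base = index;
    return e;
}

/* Fills all 4096 entries so that every virtual address maps to itself. */
void fill_flat_table(section_entry_t table[4096])
{
    for (uint32_t i = 0; i < 4096; i++)
        table[i] = flat_section(i, 0);
}
```

With masks and shifts, the body of `flat_section()` would be a single expression like `(index << 20) | (1 << 10) | (1 << 1)`, which is shorter, but says much less about what each bit means.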
The next step was switching to PL0 (unprivileged) mode under MMU-mapped address space. For that, we embedded in our kernel image a binary that is supposed to run in PL0 mode. It did the same simple thing the kernel used to do - echoing everything on uart. The privileged code would mark a memory section entry in the translation table as accessible for PL0 and then copy the embedded blob to that section. It would then jump to that just-copied code. Switching from PL1 to PL0 would be done on the blob side (at the very beginning of its execution).
This was also an opportunity for us to check that the memory mapping truly works. We mapped virtual addresses 0xAAA00000 - 0xAAAFFFFF to physical addresses just after our kernel image and translation table (probably 0x00100000 - 0x001FFFFF given the small size of our kernel). The virtual address range 0xAAA00000 - 0xAAAFFFFF was also marked available for use by PL0 code. We then made the kernel write the blob at 0x00100000 knowing it will also appear at virtual 0xAAA00000. Then, successfully running the unprivileged code from that address confirmed that the mapping really works.
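The arithmetic behind such a mapping is simple: with 1 MB sections, the top 12 bits of the virtual address select the translation table entry and the low 20 bits are the offset within the section. The sketch below models this in plain C using masks. The names `map_section()` and `translate()` are hypothetical, and `translate()` only imitates in software what the MMU does for a section entry.

```c
#include <stdint.h>

#define SECTION_SHIFT    20u
#define SECTION_BASE     0xFFF00000u /* bits 31:20 of an entry or address */
#define SECTION_OFFSET   0x000FFFFFu /* bits 19:0 of an address */
#define SECTION_TYPE     (1u << 1)   /* marks the entry as a section */
#define AP_FULL_ACCESS   (3u << 10)  /* read/write for PL1 and PL0 */

/* Makes the 1 MB section containing `virt` point at the one containing `phys`. */
void map_section(uint32_t *table, uint32_t virt, uint32_t phys, uint32_t flags)
{
    table[virt >> SECTION_SHIFT] = (phys & SECTION_BASE) | flags | SECTION_TYPE;
}

/* Software model of the section lookup the MMU performs. */
uint32_t translate(const uint32_t *table, uint32_t virt)
{
    uint32_t entry = table[virt >> SECTION_SHIFT];

    return (entry & SECTION_BASE) | (virt & SECTION_OFFSET);
}
```

For the ranges mentioned above, `map_section(table, 0xAAA00000, 0x00100000, AP_FULL_ACCESS)` writes entry number 0xAAA, after which virtual 0xAAA00123 corresponds to physical 0x00100123.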
There were 2 important things that, if forgotten, would stop us from succeeding in this step. The first one was the stack. The kernel used to use the memory just below itself (physical 0x8000) as the stack and since these addresses would not be available for PL0 code, a new stack had to be chosen - we set it somewhere in the high part of our unprivileged memory. The second important thing was marking the section in which the memory-mapped uart registers reside as accessible from PL0. This is because we were going to have unprivileged code write to uart by itself (and use the same uart code the kernel uses...). Once we had exceptions programmed, this demo was to be improved to actually call the privileged code for writing and reading.
The switch to unprivileged user mode was at that point done by the code loaded as the user program. The goal was to have 100% of that code execute without privileges, so a short mode-switching routine was separated to execute from its own memory section (executable, but non-writable from PL0). We called that piece of code "libkernel" and embedded it in the actual kernel image.
We also introduced another blob in the kernel, which would contain the exception vector table and some exception handlers and would be copied to address 0x0.
As so many pieces of code had to be embedded one in another, we hoped to find a cleaner way than the current objcopy trick for achieving this. We came up with 2. The first one is using the .incbin directive in assembly source, which we didn't do at that time. The second one is linking together pieces of code that are supposed to work as separate programs (i.e. kernel and PL0 code), but putting their code in different elf sections, so that later, at runtime, such a piece of code can be copied out and run from another location. -fPIC and later -fPIE were added to the compile options to allow code pieces to run from different addresses than the kernel was compiled for. At first, that worked (for exception handlers and libkernel), but at some point something broke and after investigating we found out that position independent code relies on having a global offset table filled by some environment and it cannot work in a bare-metal case. All the changes with embedding had to be reverted. It is worth noting that truly position independent code can be produced for arm, it's just not supported by most toolchains. Although the changes undertaken proved disastrous, while making them we learned linker script syntax and started rewriting the contents of the wiki.osdev-derived linker scripts in a more concise way, which was an important improvement to the project.
To otherwise cope with embedding, we implemented a very simple ram filesystem to be able to easily embed many files at once in the kernel.
We also found out that switching from system mode to user mode is illegal in ARM. This explained some of the weird bugs we had. We then made the kernel use supervisor mode most of the time. In fact, system mode ended up being used only to set the sp and lr of user mode (those 2 registers are shared between user and system modes).
The problem with having exception-related code at 0x0 pushed us to make the decision to split the kernel into 2 stages, just as with the bootloader. We also wanted the kernel and loader to be able to run from any address, and bare-metal position independent code could not be generated by the compiler, so we wrote the 1st stages of the kernel and loader in careful, fully-pic assembly, which succeeded splendidly. At that point, we also got rid of the old assembly boot code taken from wiki.osdev. We then used .incbin to embed the second stages of the loader and kernel in their first stages, which, together with the inclusion of the exception vector in the kernel's stage2, reduced the need to use objcopy for embedding of code and simplified linking.
With all this done, we could then extend and more easily test exception-handling routines. We implemented the uart io of the PL0 process in terms of a supervisor call, which allowed us to make the memory region with mapped peripherals inaccessible to unprivileged code as planned.
To make debugging easier, we wrote some functions for printing of numbers and strings. We then also added some basic utilities, like memcpy().
To know the memory size at runtime, we implemented handling of atags - a structure with information that is passed to the kernel at boot. Our solution had to involve copying the entire atags from the initial location of 0x100 to some other one that would not be overwritten by stage 2 of the kernel (which gets copied to 0x0). Later, C code in the kernel would parse the atags and get the ram size from it. No changes were required in the bootloader, as its second stage would be copied to 0x4000 and atags is guaranteed to end below that address.
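Walking the atags is straightforward, because every tag starts with a 2-word header: its size in 32-bit words (header included) and its identifier. The list ends with a tag of identifier 0. The sketch below shows the idea and is not our actual parser. The function names are made up, and it sums all memory tags on the assumption that there may be more than one.

```c
#include <stddef.h>
#include <stdint.h>

#define ATAG_NONE 0x00000000u /* terminates the list */
#define ATAG_CORE 0x54410001u /* first tag of a valid list */
#define ATAG_MEM  0x54410002u /* payload: size in bytes, start address */

/* Returns the sum of the sizes of all ATAG_MEM regions in the list. */
uint32_t atags_mem_size(const uint32_t *atags)
{
    uint32_t total = 0;
    const uint32_t *tag = atags;

    /* tag[0] is the size in words, tag[1] the identifier. A size below 2
     * on a non-terminating tag would loop forever, so stop on it. */
    while (tag[1] != ATAG_NONE && tag[0] >= 2) {
        if (tag[1] == ATAG_MEM && tag[0] >= 4)
            total += tag[2]; /* tag[2] is the size, tag[3] the start */
        tag += tag[0];
    }
    return total;
}

/* Returns the length of the whole list in bytes, including the header of
 * the terminating tag. Useful for copying the list elsewhere. */
size_t atags_length_bytes(const uint32_t *atags)
{
    const uint32_t *tag = atags;

    while (tag[1] != ATAG_NONE && tag[0] >= 2)
        tag += tag[0];
    return (size_t)(tag - atags + 2) * sizeof(uint32_t);
}
```

On the board, the pointer passed in would be the location the atags were copied to (originally 0x100), and `atags_length_bytes()` tells how much has to be copied in the first place.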
Unfortunately, rpi-open-firmware doesn't pass atags to the kernel, so this feature would only be useful in qemu. To otherwise learn the memory size, the flattened device tree would need to be parsed - something that replaced atags in recent years. We did not, however, do it at that time.
We then wrote code to dynamically allocate and deallocate memory pages and made the physical memory section for our only process be obtained dynamically. This feature would later be needed to implement multiple processes and their management.
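As an illustration of the idea only (our allocator is organized differently in details), a page allocator can be as small as a table of "used" flags. Everything in this sketch is hypothetical: the names, the number of pages and the choice of a linear search.

```c
#include <stdint.h>

#define PAGE_COUNT 64u        /* hypothetical number of managed pages */
#define NO_PAGE    UINT32_MAX /* returned when no page is free */

/* One flag per page: 0 means free, 1 means in use. */
static uint8_t page_used[PAGE_COUNT];

/* Returns the index of a free page and marks it used, or NO_PAGE. */
uint32_t page_alloc(void)
{
    for (uint32_t i = 0; i < PAGE_COUNT; i++) {
        if (!page_used[i]) {
            page_used[i] = 1;
            return i;
        }
    }
    return NO_PAGE;
}

/* Marks a previously allocated page as free again. */
void page_free(uint32_t page)
{
    if (page < PAGE_COUNT)
        page_used[page] = 0;
}
```

The physical address of a page would then be the start of the managed memory plus the index times the page size, e.g. 1 MB if pages are mapped as sections.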
After doing some minor changes like creating separate stacks for supervisor, IRQ and FIQ, rewriting some peripheral address definitions or fixing a bug with non-aligned memory accesses in memcpy(), we moved to programming timer interrupts. To our current knowledge there are 3 timers available on the Pi. We first wanted to use the System Timer (the one connected to the GPU) and wrote some code to configure it and manage the interrupts. It turned out the interrupts from that timer are not routed to the ARM core under rpi-open-firmware. The GPU received the IRQ instead. We eventually settled for programming the ARM Timer (the SP804-based one) and managed to get an IRQ on the ARM core in the last minutes of 2019.
At the same time we started investigating the uart and how to use it with interrupts.
To make use of timer and uart IRQs, we needed some kind of process management. It took a bit of thinking and trying different approaches until we wrote a (for now simple, yet extendable) one-process scheduler (it'd be reasonable to rename it to process manager) that integrated well with IRQ and svc handlers. Now, the process could be preempted with a timer interrupt and its uart io was working in terms of interrupts, without polling of peripheral registers.
At that point we rearranged files, as it was becoming pretty unreadable having over 50 files cluttered in one directory. Now we could start writing proper docs and modularize the project. We also refreshed our long-unused TODOs list. Another step to make the project cleaner was moving all products of compilation (.o, .elf, .img files) to a separate directory - build/. We came up with a 2-Makefile solution. The main Makefile is placed in build/ and the other one in the project's root directory forwards all calls to it. However weird this solution is, thanks to it we avoided having to specify the build directory throughout all the Makefile rules (which, even with the use of a variable, would be a bit messy).
We also managed to use patterns for a large share of the rules in the main Makefile and make it more readable in general.
We then moved on to writing the proper documentation, which we decided to use markdown for. Finally, we also updated this diary.