<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#sec-1">1. Building the project</a>
<ul>
<li><a href="#sec-1-1">1.1. Dependencies</a></li>
</ul>
</li>
<li><a href="#sec-2">2. Running</a>
<ul>
<li><a href="#sec-2-1">2.1. Running in Qemu</a></li>
<li><a href="#sec-2-2">2.2. Running on real hardware.</a></li>
</ul>
</li>
<li><a href="#sec-3">3. Makefile</a>
<ul>
<li><a href="#sec-3-1">3.1. Targets</a></li>
<li><a href="#sec-3-2">3.2. Aliased Rules</a></li>
</ul>
</li>
<li><a href="#sec-4">4. Project structure</a>
<ul>
<li><a href="#sec-4-1">4.1. Most significant directories and files</a></li>
</ul>
</li>
<li><a href="#sec-5">5. Boot Process</a>
<ul>
<li><a href="#sec-5-1">5.1. Loader</a></li>
<li><a href="#sec-5-2">5.2. Kernel</a>
<ul>
<li><a href="#sec-5-2-1">5.2.1. Stage 1</a></li>
<li><a href="#sec-5-2-2">5.2.2. Stage 2</a></li>
</ul>
</li>
<li><a href="#sec-5-3">5.3. Notes</a></li>
</ul>
</li>
<li><a href="#sec-6">6. MMU</a>
<ul>
<li><a href="#sec-6-1">6.1. Coprocessor 15</a></li>
<li><a href="#sec-6-2">6.2. Translation table</a></li>
<li><a href="#sec-6-3">6.3. Page Table</a></li>
<li><a href="#sec-6-4">6.4. Project specific information</a></li>
<li><a href="#sec-6-5">6.5. Setting up MMU and FlatMap</a></li>
</ul>
</li>
<li><a href="#sec-7">7. Program Status Register</a></li>
<li><a href="#sec-8">8. Ramfs</a>
<ul>
<li><a href="#sec-8-1">8.1. Specification</a></li>
<li><a href="#sec-8-2">8.2. Implementations</a></li>
</ul>
</li>
<li><a href="#sec-9">9. IRQ</a></li>
<li><a href="#sec-10">10. Processor modes</a></li>
<li><a href="#sec-11">11. Process management</a>
<ul>
<li><a href="#sec-11-1">11.1. Scheduler functions</a></li>
</ul>
</li>
<li><a href="#sec-12">12. Linking</a></li>
<li><a href="#sec-13">13. Miscellaneous topics</a>
<ul>
<li><a href="#sec-13-1">13.1. Supervisor calls</a></li>
<li><a href="#sec-13-2">13.2. Utilities</a></li>
<li><a href="#sec-13-3">13.3. Timers</a></li>
<li><a href="#sec-13-4">13.4. UARTs</a></li>
</ul>
</li>
<li><a href="#sec-14">14. Afterword</a></li>
<li><a href="#sec-15">15. Sources of Information</a></li>
</ul>
</div>
</div>


# Building the project<a id="sec-1" name="sec-1"></a>

## Dependencies<a id="sec-1-1" name="sec-1-1"></a>

1.  Native GCC (+ binutils)
2.  ARM cross-compiler GCC (+ binutils) (arm-none-eabi works - others
    might or might not)
3.  GNU Make
4.  rpi-open-firmware (for running on the Pi)
5.  GNU screen (for communicating with the kernel when running on the Pi)
6.  socat (for communicating with the bootloader when running on the Pi)
7.  Qemu ARM (for emulating the Pi).

For building rpi-open-firmware one will need more tools (not listed
here).

The project has been tested only in Qemu emulating Pi 2 and on real Pi 3 model B.

Running on Pis other than Pi 2 and Pi 3 is sure to require changing the definition in global.h (because peripheral base addresses differ between Pi versions) and might also require other modifications, not known at this time.

Assuming make, gcc, arm-none-eabi-gcc and its binutils are in the PATH, the kernel can be built with:

    $ make kernel.img

which is the same as:

    $ make

The bootloader can be built with:

    $ make loader.img

Both loader and kernel can then be found in build/

# Running<a id="sec-2" name="sec-2"></a>

## Running in Qemu<a id="sec-2-1" name="sec-2-1"></a>

To run the kernel (passed as elf file) in qemu:

    $ make qemu-elf

If you want to pass a binary image to qemu:

    $ make qemu-bin

To pass loader image to qemu and pipe kernel to it through emulated uart:

    $ make qemu-loader

With qemu-loader the kernel will run, but will be unable to receive any keyboard input.

The timer used by this project is the ARM timer ("based on an ARM
AP804", with registers mapped at 0x7E00B000 in the GPU address space).
It is absent from the emulated environment, so no timer interrupts can be
observed in qemu.
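
To illustrate how the bus address quoted above relates to the address the ARM core uses, here is a minimal sketch in C. It assumes the commonly documented peripheral bases 0x20000000 (Pi 1) and 0x3F000000 (Pi 2 and Pi 3); the macro and function names are hypothetical and are not taken from global.h.

```c
#include <assert.h>
#include <stdint.h>

/* Base of the peripheral window in the GPU (bus) address space. */
#define BUS_PERIPHERAL_BASE     0x7E000000u
/* Assumed ARM physical bases; names are illustrative only. */
#define PI1_PERIPHERAL_BASE     0x20000000u
#define PI2_PI3_PERIPHERAL_BASE 0x3F000000u

/* Convert a peripheral bus address to the ARM physical address
 * for a board whose peripherals start at peripheral_base. */
uint32_t bus_to_arm_physical(uint32_t bus_addr, uint32_t peripheral_base)
{
    assert(bus_addr >= BUS_PERIPHERAL_BASE);
    return peripheral_base + (bus_addr - BUS_PERIPHERAL_BASE);
}
```

Under these assumptions, the ARM timer registers at bus address 0x7E00B000 would be reached at 0x3F00B000 on a Pi 2 or Pi 3.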

## Running on real hardware.<a id="sec-2-2" name="sec-2-2"></a>

First, rpi-open-firmware has to be built. Then, kernel.img (or
loader.img) should be copied to the SD card (next to bootcode.bin) and renamed to
zImage. Also, the .dtb file corresponding to the Pi model (actually, any .dtb
would do, as it is not used right now) from the stock firmware files has to be placed on the SD
card and renamed to rpi.dtb. Finally, a cmdline.txt has to be present on the SD card
(its content doesn't matter).

Now, the RaspberryPi can be connected via UART to the development machine. GPIO on the Pi works
at 3.3V, so one should make sure that the UART device on the other end
also works at 3.3V. This is the pinout of the RaspberryPi 3 model B
that has been used for testing so far:

    Top left of the board is here
        |
        V
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+  
        | 2| 4| 6| 8|10|12|14|16|18|20|22|24|26|28|30|32|34|36|38|40|  
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+  
        | 1| 3| 5| 7| 9|11|13|15|17|19|21|23|25|27|29|31|33|35|37|39|  
        +--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+

Under rpi-open-firmware (stock firmware might map UARTs differently):

1.  pin 6 is Ground
2.  pin 8 is TX
3.  pin 10 is RX

Once UART is connected, the board can be powered on.

It is assumed that a USB to UART adapter is used and that it is seen by the system as /dev/ttyUSB0.

If one copied the kernel to the SD card, they can start communicating
with the board by running:

    $ screen /dev/ttyUSB0 115200,cs8,-parenb,-cstopb,-hupcl

If one copied the loader, they can send it the kernel image and start
communicating with the system by running:

    $ make run-on-rpi

To run again, one can replug the USB to UART adapter and the Pi's power supply (order
matters!) and re-enter the command.

Running under stock firmware has not been tested. In particular, the
default configuration on RaspberryPi 3 seems to map a different UART
(the so-called miniUART) than the one used by the kernel to pins 6, 8 and 10. This is supposed
to be configurable through the use of overlays.

# Makefile<a id="sec-3" name="sec-3"></a>

To maintain order, all files created by make (binaries, object
files, natively executed helper programs, etc.) are placed in build/.

Our project contains 2 Makefiles: one in its root directory and one in
build/. The reason is that a Makefile most simply and efficiently
produces files in the directory where it resides. Producing files in
another directory requires that directory to be specified in many rules
across the Makefile and complicates things in general. Also, a problem arises when trying to
link objects from outside the current directory. If an object is
referenced by name in a linker script (which is a frequent practice in our
scripts) and is passed to gcc with a path, then it would also need to appear
with that path in the linker script. Because of that, a Makefile is present in
build/ that produces files into its own directory, and the
Makefile in the project's root is used as a proxy to it: it
calls make recursively in build/ with the same target it was called
with. This arrangement keeps the Makefiles easier to read.

From now on only Makefile in build/ will be discussed.

In the Makefile, variables with the names of certain tools and their
command line flags are defined (using ?= assignment, which keeps any
value already supplied through the environment or on the command line). In case a
cross-compiler with a different triple should be used, ARM\_BASE,
normally set to arm-none-eabi, can be set to something like
arm-linux-gnueabi or even /usr/local/bin/arm-none-eabi.

All variables discussed below are defined using := assignment, which
causes them to only be evaluated once instead of on every reference to
them.

Objects that should be linked together to create each of the .elf files
are listed in their respective variables. For example, objects to be used for
creating kernel\_stage2.elf are all listed in KERNEL\_STAGE2\_OBJECTS.
When adding a new source file to the kernel, it is enough to add its
respective .o file to that list to make it compile and link properly. No
other Makefile modifications are needed. In a similar fashion, the
RAMFS\_FILES variable specifies files that should be put in the ramfs
image that will be embedded in the kernel. Adding another file only
requires listing it there. However, if the file is to be found somewhere
other than build/, it might be useful to use the vpath directive to tell
make where to look for it.

Variables dirs and dirs\_colon are defined to store the list of all
directories within src/, separated with spaces and colons, respectively.
dirs\_colon is used for the vpath directive. The dirs variable is used in
ARM\_FLAGS to pass all the directories as include search paths to gcc.
empty and space are helper variables - defining dirs\_colon could be
achieved without them (but it's clearer this way).

The vpath directive tells make to look for assembler sources, C sources
and linker scripts in all direct and indirect subdirectories of src/
(including itself). All other files shall be found/created in build/.

## Targets<a id="sec-3-1" name="sec-3-1"></a>

The default target is the binary image of the kernel.

The generic rule for compiling C sources uses the cross-compiler or the native
compiler with appropriate flags depending on whether the source file is
located somewhere under the arm/ directory (which lies in src/) or anywhere
else.

The generic rules for making a stripped binary image out of elf file,
for assembling an assembly file, for making an arbitrary file a linkable
object and for linking objects are ARM-only.

In C world it is possible to embed a file in an executable by using
objcopy to create an object file from it and then linking that object
file into the executable. In this project, at the current time, this is
used only for embedding ramfs in the kernel (incbin is used for
embedding kernel and loader second stages in their first stages).
Generic rule for making a binary image into object file is present, in
case it is needed somewhere else again.

To link elf files, the generic rule is combined with a rule that
specifies the elf's objects. Objects are listed in variables whenever
more than one of them is needed.

At this point in the Makefile, the dependence of objects created from
assembly on files referenced in the assembly source via incbin is
marked.

Simple ram filesystem is created from files it should contain with the
use of our own simple tool - makefs.

Another 2 rules specify how native programs (for the machine we're
working on) are to be linked.

## Aliased Rules<a id="sec-3-2" name="sec-3-2"></a>

Rule qemu-elf runs the kernel in qemu emulating RaspberryPi 2 with
256MiB of memory by passing the elf file of the kernel to the emulator.

Rule qemu-bin does the same, but passes the binary image of the kernel
to qemu.

Rule qemu-loader does the same, but first passes the binary image of the
bootloader to qemu and the actual kernel is piped to qemu's standard
input, received by bootloader as uart data and run. This method
currently makes it impossible to pass any keyboard input to kernel once
it's running.

Rule run-on-rpi pipes the kernel through uart, assuming it is available
under /dev/ttyUSB0, and then opens a screen session on that interface.
This allows for executing the kernel on the Pi connected through UART,
provided that our bootloader is running on the board.

Rule clean removes all the files generated in build/.

Rules that don't generate files are marked as PHONY.

# Project structure<a id="sec-4" name="sec-4"></a>

Directory structure of the project:

    doc/
    build/
          Makefile
    Makefile
    src/
        lib/
            rs232/
                  rs232.c
                  rs232.h
        host/
             pipe_image.c
             makefs.c
        arm/
            common/
                   svc_interface.h
                   strings.c
                   io.h
                   io.c
                   strings.h
            PL0/
                PL0_utils.h
                svc.S
                PL0_utils.c
                PL0_test.c
                PL0_test.ld
            PL1/
                loader/
                       loader_stage2.ld
                       loader_stage2.c
                       loader_stage1.S
                       loader.ld
                kernel/
                       demo_functionality.c
                       paging.h
                       setup.c
                       interrupts.h
                       interrupt_vector.S
                       kernel.ld
                       scheduler.h
                       atags.c
                       translation_table_descriptors.h
                       bcmclock.h
                       ramfs.c
                       kernel_stage1.S
                       paging.c
                       ramfs.h
                       interrupts.c
                       armclock.h
                       atags.h
                       kernel_stage2.ld
                       cp_regs.h
                       psr.h
                       scheduler.c
                       memory.h
                       demo_functionality.h
                PL1_common/
                           global.h
                           uart.h
                           uart.c

## Most significant directories and files<a id="sec-4-1" name="sec-4-1"></a>

doc/ Contains documentation of the project.

build/ Contains main Makefile of the project. All objects created during
the build process are placed there.

Makefile Proxies all calls to Makefile in build/.

src/ Contains all sources of the project.

src/host/ Contains sources of helper programs to be compiled using
native GCC and run on the machine where development takes place.

src/arm/ Contains sources to be compiled using ARM cross-compiler GCC
and run on the RaspberryPi.

src/arm/common Contains sources used in both: privileged mode and
unprivileged mode.

src/arm/PL0 Contains sources used exclusively in unprivileged, user-mode
(PL0) program, as well as the program's linker script.

src/arm/PL1 Contains sources used exclusively in privileged (PL1) mode.

src/arm/PL1/loader Contains sources used exclusively in the bootloader,
as well as linker scripts for stages 1 and 2 of this bootloader.

src/arm/PL1/kernel Contains sources used exclusively in the kernel, as
well as linker scripts for stages 1 and 2 of this kernel.

src/arm/PL1/PL1\_common Contains sources used in both: kernel and
bootloader.

TODOs Contains what the name suggests, in plain text. It lists things
that still can be implemented or improved, as well as tasks, that were
once listed and have since been completed (in which case they're marked
as done).

# Boot Process<a id="sec-5" name="sec-5"></a>

When the RaspberryPi boots, it searches the first
partition on the SD card (which should be formatted as FAT) for its firmware
and configuration files, loads them and executes them. The firmware then
searches for the kernel image file. The name of the file it looks for can
be kernel.img, kernel7.img, kernel8.img (for 64-bit mode) or something
else, depending on the configuration and firmware used (rpi-open-firmware
looks for zImage).

The image is then copied to some address and jumped to on all cores.
Address should be 0x8000 for 32-bit kernel, but in reality is 0x2000000
in rpi-open-firmware and 0x10000 in qemu (version 2.9.1). 3 arguments
are passed to the kernel: first (passed in r0) is 0; second (passed in
r1) is machine type; third (passed in r2) is the address of FDT or ATAGS
structure describing the system or 0 as default.

Pis that support aarch64 can also boot directly into 64-bit mode. Then,
the image gets loaded at 0x80000. We're not using 64-bit mode in this
project.

Qemu can be used to emulate RaspberryPi, in which case kernel image and
memory size are provided to the emulator on the command line. Qemu can
also load kernel in the form of an elf file, in which case its load
address is determined based on information in the elf.

Our kernel has been executed on qemu emulating RaspberryPi 2 as well as
on a real RaspberryPi 3 running rpi-open-firmware (although not every
functionality works everywhere).

## Loader<a id="sec-5-1" name="sec-5-1"></a>

To speed up running new images of the
kernel on the board, we have written a simple bootloader, which
can be run from the SD card instead of the actual kernel. It reads the
kernel image from uart and executes it. The bootloader can also be used
within qemu, but there are several problems with passing keyboard input
to the kernel once it's running.

It is worth noting that a project named raspbootin (<https://github.com/mrvn/raspbootin>) exists, which does a very similar thing.
We chose, however, to write our own bootloader.

The bootloader is split into 2 stages.

This is because the actual kernel
read by it from UART is supposed to be written at 0x8000. If the loader
also ran from 0x8000 or a nearby address, it could overwrite
its own code while writing the kernel to memory. To avoid this, the first
stage of the loader first copies its embedded second stage to
address 0x4000. Then, it jumps to that second stage, which reads the kernel
image from uart, writes it at 0x8000 and jumps to it. Arguments (r0, r1,
r2) are preserved and passed to the kernel. The second stage of the
bootloader is intended to be kept small enough to fit between 0x4000 and
0x8000. The ATAGS structure, if present, is guaranteed to end below 0x4000,
so it should not get overwritten by the loader's stage 2.
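
The size constraint on the second stage follows from the two addresses: the window between 0x4000 and 0x8000 is 0x4000 bytes (16 KiB). A minimal sketch of that check in C, with hypothetical macro and function names that do not come from the loader sources:

```c
#include <stdint.h>

/* Addresses named in the text; identifiers are illustrative only. */
#define LOADER_STAGE2_BASE 0x4000u  /* where stage 2 of the loader is copied */
#define KERNEL_LOAD_BASE   0x8000u  /* where the received kernel is written  */

/* Returns 1 if a stage 2 image of the given size ends at or before
 * the address where the kernel will be written, 0 otherwise. */
int loader_stage2_fits(uint32_t stage2_size)
{
    return stage2_size <= KERNEL_LOAD_BASE - LOADER_STAGE2_BASE;
}
```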

The loader protocol is simple: first, the size of the kernel is sent through
UART (4 bytes, little endian). Then, the actual kernel image is sent. Our
program pipe\_image is used to prepend the kernel image with its size.
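
The framing described above can be sketched in C as follows. This is only an illustration of the protocol, not the code of pipe\_image or of the loader; all function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Encode a 32-bit size as 4 bytes, least significant byte first. */
void encode_size_le(uint32_t size, uint8_t out[4])
{
    out[0] = (uint8_t)(size & 0xFFu);
    out[1] = (uint8_t)((size >> 8) & 0xFFu);
    out[2] = (uint8_t)((size >> 16) & 0xFFu);
    out[3] = (uint8_t)((size >> 24) & 0xFFu);
}

/* Decode the 4-byte header, as the receiving side would. */
uint32_t decode_size_le(const uint8_t in[4])
{
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}

/* Write the size header followed by the image into out.
 * Returns the number of bytes written, or 0 if out is too small. */
size_t frame_image(const uint8_t *image, uint32_t image_size,
                   uint8_t *out, size_t out_capacity)
{
    if (out_capacity < 4 || out_capacity - 4 < image_size)
        return 0;
    encode_size_le(image_size, out);
    memcpy(out + 4, image, image_size);
    return (size_t)image_size + 4;
}
```

Encoding the size byte by byte keeps the sender independent of the endianness of the development machine.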

## Kernel<a id="sec-5-2" name="sec-5-2"></a>

The kernel is, just like the bootloader, split into 2 stages.
It is desirable to have the image run from 0x0, because that is where the exception vector table resides under default
settings. This was the main reason for splitting the kernel into 2 parts.

### Stage 1<a id="sec-5-2-1" name="sec-5-2-1"></a>

Stage 1 is loaded at some higher address. It has the second stage
image embedded in it. It copies it to 0x0 and jumps to it. What is
more complicated than in the loader is the handling of the ATAGS structure.
Before copying stage 2 to 0x0, stage 1 first checks if ATAGS is present
and, if so, copies it to a location high enough that it won't be
overwritten by the stage 2 image. Whenever the memory layout is modified, it
should be checked whether there is a danger of ATAGS being overwritten by
some kernel operations before it is used. In the current setup, the new location
chosen for ATAGS is always below the memory later used as the stack and
it might overlap memory later used for the translation table, which is not a
problem, since the kernel only uses ATAGS before filling that table.

When stage 1 of the kernel jumps to the second stage, it passes modified
arguments: the first argument (r0) remains 0 if ATAGS was found and is set
to 3 to indicate that ATAGS was not found. The second argument (r1) remains
unchanged. The third argument (r2) is the current address of ATAGS (or
remains unchanged if no ATAGS was found). If support for FDT is added in
the future, it must also be done carefully, so that the FDT doesn't get
overwritten.
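
On the receiving side, this convention could be decoded roughly as in the sketch below. It only restates the r0 and r2 rules from the paragraph above; the identifiers are hypothetical and the kernel's real setup code may be organised differently.

```c
#include <stdint.h>

/* Values of r0 as described in the text; names are illustrative only. */
#define STAGE2_ATAGS_FOUND   0u
#define STAGE2_ATAGS_MISSING 3u

/* If r0 signals that ATAGS was found, store its (relocated) address
 * taken from r2 and return 1. Otherwise return 0 and leave
 * *atags_addr untouched. */
int stage2_atags_address(uint32_t r0, uint32_t r2, uint32_t *atags_addr)
{
    if (r0 != STAGE2_ATAGS_FOUND)
        return 0;
    *atags_addr = r2;
    return 1;
}
```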

### Stage 2<a id="sec-5-2-2" name="sec-5-2-2"></a>

At the start of stage 2 of the kernel,
there is the interrupt vector table. Its first entry is the reset
vector, which is not normally used. In our case, when stage 1 jumps to
0x0, the first instruction of stage 2, it lands on that vector, which then
calls the setup routine.

## Notes<a id="sec-5-3" name="sec-5-3"></a>

In both the loader and the kernel, stage 1 begins by ensuring
that only one ARM core is executing.

It's worth noting, that in first stages the loop that copies the
embedded second stage is intentionally situated after the blob in the
image. This way, this loop will not overwrite itself with the data it is
copying, since the stage 2 is always copied to some lower address. It
copies to 0x0 in case of kernel and to 0x4000 in case of loader - we
assume stage 1 won't be loaded below 0x4000.
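
The layout argument can be checked with a small host-side simulation in C. The real first stages are written in assembly; the function below is a hypothetical stand-in that copies bytes in ascending order, which is correct for overlapping regions as long as the destination lies below the source.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy n bytes in ascending address order. When dst < src, each
 * source byte is read before the write position can reach it, so
 * overlapping regions are handled correctly. Bytes located after the
 * source blob (where the copy loop itself would sit) are never
 * written, because the last write goes to dst + n - 1, which is
 * below src + n. */
void copy_ascending(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

For example, with an 8-byte blob at offset 4 of a buffer and two marker bytes right after it, copying the blob to offset 0 overwrites part of the blob's old location but leaves the markers intact.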

Qemu, stock RaspberryPi firmware and rpi-open-firmware all load image at
different addresses. Although stock firmware is not used in this
project, our loader loads kernel at 0x8000, where the stock firmware
would. Because of that, it is desired, that image is able to run,
regardless of where it was loaded at. This was realized by writing first
stages of loader and kernel in careful, position-independent assembly.
The starting address in corresponding linker scripts is irrelevant. The
stage 2 blobs are embedded using .incbin assembly directive. Second
stages are written normally in C and compiled as position-dependent for
their respective addresses.

# MMU<a id="sec-6" name="sec-6"></a>

Here's an explanation of the steps we took to enable the MMU and of how the MMU
works in general.

MMU stands for Memory Management Unit. It does 2 important things:

1.  It allows programs to use virtual memory addressing. Virtual
    addresses are translated by the MMU to physical addresses with the
    help of translation table.
2.  It guards against disallowed memory accesses. A unit that implements
    only this functionality is called an MPU (Memory Protection Unit)
    and is also found in some ARM cores.

Without an MMU, code executing on a processor sees the memory as it really
is.

When it tries to load data from address 0x00AA0F3C it indeed loads data
from 0x00AA0F3C. This doesn't mean address 0x00AA0F3C is in RAM: RAM can
be mapped into the address space in an arbitrary way.

MMU can be configured to "redirect" some range of addresses to some
other range. Let's assume we configured the MMU to translate address
range 0x00A00000 - 0x00B00000 to range 0x00200000 - 0x00300000. Now,
code trying to perform operation on address 0x00AA0F3C would have the
address transparently translated to 0x002A0F3C, on which the operation
would actually take place.
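
The arithmetic of this example can be written down as a short C sketch. It models a single redirected range with a hypothetical structure and treats the upper bound 0x00B00000 as exclusive (a 1MB range); a real MMU does not store ranges like this but uses a translation table.

```c
#include <stdint.h>

/* One redirected range; a purely illustrative model. */
struct range_mapping {
    uint32_t virt_base;  /* first virtual address of the range   */
    uint32_t phys_base;  /* physical address it is redirected to */
    uint32_t size;       /* length of the range in bytes         */
};

/* If va lies inside the mapped range, store the translated address
 * in *pa and return 1; otherwise return 0. */
int translate(const struct range_mapping *m, uint32_t va, uint32_t *pa)
{
    if (va < m->virt_base || va - m->virt_base >= m->size)
        return 0;
    *pa = m->phys_base + (va - m->virt_base);
    return 1;
}
```

With virt_base 0x00A00000, phys_base 0x00200000 and size 0x00100000, the address 0x00AA0F3C keeps its offset 0x000A0F3C within the range and becomes 0x002A0F3C.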

The translation affects all (stack and non-stack) data accesses as well
as instruction fetches, hence an entire program can be made to work as
if it was running from some memory address, while in fact it runs from a
different one!

The addresses used by program code are referred to as virtual addresses,
while addresses actually used by the processor - as physical addresses.

This aids the operating system's memory management in several ways:

1.  A program may be compiled to run from some fixed address and the OS
    is still free to choose any physical location to store that program's
    code - only a translation of the program's required address to that
    location's address has to be configured. The problem of simultaneous
    execution of multiple programs compiled for the same address is also
    avoided in this way.
2.  A consecutive memory region might be required by some program while,
    due to earlier allocations and deallocations, there isn't a
    big enough (no pun intended) free consecutive region of physical
    memory. Smaller regions can be mapped to become accessible as a
    single region in virtual address space, thus avoiding the need for
    defragmentation.

A given mapping can be made valid for only one execution mode (e.g. a
region only accessible from privileged mode) or only certain types of
accesses. A memory region can be made non-executable, which guards
against program code accidentally jumping there. That is important for
countering buffer-overflow exploits. A disallowed access triggers a
processor exception, which passes control to an appropriate interrupt
service routine.

In RaspberryPi environments used by us, there are ARMv7-A compatible
processors, which we currently use only in 32-bit mode. Information here
is relevant to those systems (there are Pi boards with both older and
newer processors, with more or less functionality and features
available).

If MMU is present, general configuration of it is done through registers
of the appropriate coprocessor (cp15). Translations are managed through
translation table. It is an array of 32-bit or 64-bit entries (also
called descriptors) describing how their corresponding memory regions
should be mapped. A number of leftmost bits of a virtual address
constitutes an index into the translation table to be used for
translating it. This way no virtual addresses need to be stored in the
table and MMU can perform translations in O(1) time.
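
A sketch of such an indexed lookup in C is shown below. It assumes the layout used later in this document (4096 entries, each covering 1MB, so the index is the top 12 bits of the address) and a toy table whose entries hold nothing but the 1MB-aligned physical base; real descriptors carry additional bit fields. All names are hypothetical.

```c
#include <stdint.h>

#define SECTION_SHIFT       20u                           /* 1MB regions */
#define SECTION_OFFSET_MASK ((1u << SECTION_SHIFT) - 1u)
#define TABLE_ENTRIES       4096u

/* The leftmost 12 bits of the virtual address select the entry. */
uint32_t table_index(uint32_t va)
{
    return va >> SECTION_SHIFT;
}

/* Toy lookup: one array access, independent of the table's contents.
 * Each entry is assumed to be a 1MB-aligned physical base address. */
uint32_t toy_lookup(const uint32_t table[TABLE_ENTRIES], uint32_t va)
{
    return table[table_index(va)] | (va & SECTION_OFFSET_MASK);
}
```

Because the entry position is computed from the address itself, no search through the table is needed.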

## Coprocessor 15<a id="sec-6-1" name="sec-6-1"></a>

Coprocessor 15 contains several registers, that control the behaviour of
the MMU. They are all accessed through mcr and mrc arm instructions.

1.  SCTLR, System Control Register - "provides the top level control of
    the system, including its memory system". Bits of this register
    control, among other things, whether the following are enabled:
    1.  the MMU
    2.  data cache
    3.  instruction cache
    4.  TEX remap (changes how some translation table entry bit fields
        (called C, B and TEX) are used - not used in the project)
    5.  access flags (enabling causes one translation table descriptor bit
        normally used to specify access permissions of a region to be used
        as access flag - not used either)

2.  DACR, Domain Access Control Register - "defines the access permission
    for each of the sixteen memory domains". Entries in translation table
    define which of available 16 memory domains a memory region belongs
    to. Bits of DACR specify what permissions apply to each of the
    domains. Possible settings are to allow accesses to regions based on
    settings in translation table descriptor or to allow/disallow all
    accesses regardless of access permission bits in translation table.

3.  TTBR0, Translation Table Base Register 0 - "holds the base address of
    translation table 0, and information about the memory it occupies".
    A system programmer can choose (subject to some alignment
    requirements) where in the physical memory to put the translation
    table. The chosen address (actually, only a number of its leftmost
    bits) has to be put in TTBR0 for the MMU to know where the table lies.
    Other bits of this register control some memory attributes relevant
    for accesses to table entries by the MMU.

4.  TTBR1, Translation Table Base Register 1 - similar function to TTBR0
    (see below for an explanation of the dual TTBR)
5.  TTBCR, Translation Table Base Control Register, which controls:
    1.  How TLBs (Translation Lookaside Buffers) are used. TLBs are a
        mechanism of caching translation table entries.
    2.  Whether to use some extension feature that changes translation
        table entries and TTBR\* lengths to 64-bit (we're not using this,
        so we won't go into details)
    3.  How a translation table is selected.

There can be 2 translation tables and there are 2 cp15 registers (TTBR0
and TTBR1) to hold their base addresses. When 2 tables are in use, then
on each memory access some leftmost bits of virtual address determine
which one should be used. If the bits are all 0s - TTBR0-pointed table
is used. Otherwise - TTBR1 is used. This allows an OS developer to use
separate translation tables for kernelspace and userspace (e.g. by
having the kernelspace code run from virtual addresses starting with 1
and userspace code run from virtual addresses starting with 0). A field
of TTBCR determines how many leftmost bits of virtual address are used
for that (and also affects TTBR0 format). In the simplest setup (as in
our project) this number is 0, so only the table specified in TTBR0 is
used.
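
The selection rule above can be sketched as a plain C function. This is only a model of what the MMU does in hardware; select_ttbr() and its parameter names are hypothetical, and n stands for the value of the TTBCR field mentioned above (assumed here to be a 3-bit field, so 0-7):

```c
#include <assert.h>
#include <stdint.h>

/* Returns 0 if the TTBR0-pointed table translates the address
 * and 1 if the TTBR1-pointed one does. */
static int select_ttbr(uint32_t virtual_address, unsigned n)
{
    assert(n <= 7u);

    if (n == 0u)
        return 0;   /* no bits are inspected, only TTBR0 is ever used */

    /* Keep only the n leftmost bits of the address. */
    uint32_t top_bits = virtual_address >> (32u - n);

    return top_bits == 0u ? 0 : 1;
}
```

With n equal to 1 this gives the split described above: addresses below 0x80000000 go through TTBR0 and the rest through TTBR1.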

## Translation table<a id="sec-6-2" name="sec-6-2"></a>

Translation table consists of 4096 entries, each describing a 1MB memory
region. An entry can be of several types:

1.  Invalid entry - the corresponding virtual addresses can not be used
2.  Section - description of a mapping of 1MB memory region
3.  Supersection - description of a mapping of a 16MB memory region, that
    has to be repeated in 16 consecutive table entries. This can
    be used to map to physical addresses higher than 2<sup>32</sup>.
4.  Page table - no mapping is given yet, but a page table is pointed.
    See below.

Besides, translation table descriptor also specifies:

1.  Access permissions.
2.  Other memory attributes (cacheability, shareability).
3.  Which domain the memory belongs to.
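
As an illustration, an entry's type and domain can be read with a few masks. The bit positions used here follow our reading of the ARMv7-A short-descriptor format (type in bits 1:0, supersection flag in bit 18, domain in bits 8:5) and should be checked against the architecture manual. The names are invented for this sketch, which also assumes an implementation where the bit pattern 0b11 is reserved:

```c
#include <stdint.h>

enum entry_type {
    ENTRY_INVALID,
    ENTRY_PAGE_TABLE,
    ENTRY_SECTION,
    ENTRY_SUPERSECTION,
    ENTRY_RESERVED
};

/* Decide what kind of first-level entry a 32-bit descriptor is. */
static enum entry_type classify_entry(uint32_t descriptor)
{
    switch (descriptor & 3u) {
    case 0u:
        return ENTRY_INVALID;
    case 1u:
        return ENTRY_PAGE_TABLE;
    case 2u:
        /* Bit 18 tells sections and supersections apart. */
        return (descriptor & (1u << 18)) ? ENTRY_SUPERSECTION : ENTRY_SECTION;
    default:
        return ENTRY_RESERVED;
    }
}

/* Domain number of a section or page table entry (not meaningful
 * for supersections, which reuse those bits). */
static unsigned entry_domain(uint32_t descriptor)
{
    return (descriptor >> 5) & 0xFu;
}
```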

## Page Table<a id="sec-6-3" name="sec-6-3"></a>

A page table is similar to a translation table, but its entries
define smaller regions (called, well - pages). When a translation table
descriptor describing a page table gets used for translation, an entry
of that page table is fetched, with some middle bits of the virtual
address used as the index. This allows for better granularity of
mappings, while not requiring page tables to occupy space where small
pages are not needed. We could say that 2-level translations are
performed. On some versions of ARM translations can have more levels
than that. This means the MMU might sometimes need to fetch several
entries from different level tables to compute the physical address.
This is called a translation table walk.

As of 15.01.2020 page tables and small pages are not used in the project
(although programming them is on the TODO list).

## Project specific information<a id="sec-6-4" name="sec-6-4"></a>

Despite the overwhelming number of configuration options available, most
can be left at their defaults and this is how it's done in this project. Those
default settings usually make the MMU behave like it did in older ARM
versions, when some options were not yet available and hence, the entire
system was simpler.

Our project uses C bitfield structs for operating on SCTLR and TTBCR
contents and translation table descriptors. With DACR - bit shifts are
more appropriate and with TTBCR - our default configuration means we're
writing '0' to that register. Bitfield structs are an elegant and
readable approach, yet not very portable across compilers. Current
struct definitions work properly with GCC.
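
A sketch of the bit-shift approach for DACR, assuming the usual layout of 2 bits per domain (0b00 - no access, 0b01 - client, meaning permissions from descriptors are checked, 0b11 - manager, meaning no checks are made). The helper and enum names are hypothetical and this is not the project's actual code:

```c
#include <assert.h>
#include <stdint.h>

enum domain_access {
    DOMAIN_NO_ACCESS = 0,   /* every access faults */
    DOMAIN_CLIENT    = 1,   /* descriptor permissions are checked */
    DOMAIN_MANAGER   = 3    /* descriptor permissions are ignored */
};

/* Returns a copy of a DACR value with one domain's 2-bit field replaced. */
static uint32_t dacr_set_domain(uint32_t dacr, unsigned domain,
                                enum domain_access access)
{
    assert(domain < 16u);

    unsigned shift = 2u * domain;

    dacr &= ~(3u << shift);                     /* clear the old field */
    return dacr | ((uint32_t)access << shift);  /* put the new one in */
}
```

Under these assumptions, a configuration that uses only domain 0 with descriptor-based permissions would be dacr_set_domain(0, 0, DOMAIN_CLIENT), which yields 1.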

Structs describing SCTLR, DACR and TTBCR are defined in
src/arm/PL1/kernel/cp_regs.h. Structs describing translation table
descriptors are defined in
src/arm/PL1/kernel/translation_table_descriptors.h.

Before the MMU is enabled, all memory is seen as it really is.
Therefore, the only feasible way of enabling it is by initially setting
the descriptors in translation table to map all addresses (mapping just
addresses used by the kernel would be enough) to themselves. It is
called a flat map.

## Setting up MMU and FlatMap<a id="sec-6-5" name="sec-6-5"></a>

This is how setting up a flat map, turning on the MMU and managing
memory sections are done in our project:

1.  Translation table is defined in the linker script
    src/arm/PL1/kernel/kernel_stage2.ld as a NOLOAD section. C code gets
    the table's start and end addresses from symbols defined in that
    linker script (see arm/PL1/kernel/memory.h).
2.  Function setup_flat_map() defined in arm/PL1/kernel/paging.c
    enables MMU with a flat map. It prints relevant information to UART
    while performing the following procedure:
    1.  In a loop write all descriptors to the translation table, set them
        as sections, accessible from PL1 only, belonging to domain 0.
    2.  Set DACR to allow domain 0 memory accesses, based on translation
        table descriptor permissions and block accesses to other domains,
        as only domain 0 is used in this project.
    3.  Make sure TEX remap, access flag, caches and the MMU are disabled
        in SCTLR. Disabling some of them might be unnecessary, because MMU
        is assumed to be disabled from the start and enabled caches might
        cause no problems as long as only flat map is used. Still, the way
        it is done right now is known to work well and optimizations are
        not needed.
    4.  Clear all caches and TLBs (again, it is suspected that some of
        this is unnecessary).
    5.  Write TTBCR setting such that only 32-bit translation table is
        used.
    6.  Make TTBR0 point to the start of translation table. Rest of
        attributes in TTBR0 (concerning how table entries are being
        accessed) are left as 0s (defaults).
    7.  Enable the MMU and caches by setting the appropriate bits in
        SCTLR.
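
Step 1 of the procedure can be sketched in portable C. The sketch uses masks instead of the project's bitfield structs, so that it doesn't depend on the compiler's bitfield layout. Macro and function names are made up, and the bit positions (type in bits 1:0, domain in bits 8:5, AP[1:0] in bits 11:10, section base in bits 31:20) are our reading of the short-descriptor format rather than something taken from the project's sources:

```c
#include <stdint.h>

#define TABLE_ENTRIES     4096u

#define DESC_SECTION      2u                       /* entry type: section */
#define DESC_DOMAIN(d)    ((uint32_t)(d) << 5)     /* domain number */
#define DESC_AP_PL1_ONLY  (1u << 10)               /* PL1 read/write, PL0 none */
#define DESC_BASE(i)      ((uint32_t)(i) << 20)    /* physical base of the MB */

/* Make entry i map the i-th megabyte of virtual memory to the
 * i-th megabyte of physical memory. */
static void fill_flat_map(uint32_t table[TABLE_ENTRIES])
{
    for (uint32_t i = 0; i < TABLE_ENTRIES; i++)
        table[i] = DESC_BASE(i) | DESC_AP_PL1_ONLY |
                   DESC_DOMAIN(0) | DESC_SECTION;
}
```

The remaining steps write cp15 registers and therefore need ARM-specific instructions, so they are not shown here.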

After some cp15 register writes, the isb assembly instruction is used,
which causes ARM core to wait until changes take effect. This is done to
prevent some later instructions from being executed before the changes
are applied.

In arm/PL1/kernel/paging.c the function claim_and_map_section() can
be used to modify an entry in the translation table to create a new mapping.
Memory allocation, also done in that source file, uses some lists to
describe free and taken sections, but has nothing to do with the
MMU.

# Program Status Register<a id="sec-7" name="sec-7"></a>

CPSR (Current Program Status Register) is a register, bits of which
contain and/or determine various aspects of execution, e.g. condition
flags, execution state (ARM, Thumb or Jazelle), endianness state,
execution mode and interrupt mask. This register is readable and
writable with the mrs and msr instructions from any PL1 mode, thus it is
possible to change things like mode or interrupt mask by writing to this
register.

Additionally, there are other registers with the same or similar bit
fields as CPSR. Those PSRs (Program Status Registers) are:

1.  APSR (Application Program Status Register)
2.  SPSRs (Saved Program Status Registers)

APSR can be considered the same as CPSR or a view of CPSR, with some
limitations - some bit fields from CPSR are missing (reserved) in APSR.
APSR can be accessed from PL0, while CPSR should only be accessed from
PL1. This way an application program executing in user mode can learn
some of the settings in CPSR without accessing CPSR directly.

SPSR is used for exception handling. Each exception-taking mode has its
own SPSR (they can be called SPSR_svc, SPSR_irq, etc.). On exception
entry, old contents of CPSR are backed up in the entered mode's SPSR.
Instructions used for exception return (subs and ldm ^), when writing
to the pc, have the important additional effect of copying the SPSR to
CPSR. This way, on return from an exception, the processor returns to the
state from before the exception. That includes endianness settings,
execution state, etc.

In our project, the structure of PSRs is defined in terms of C bitfield
structs in src/arm/PL1/kernel/psr.h.
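
For comparison, the same fields can also be picked out of a PSR value with masks. This is a sketch, not the contents of psr.h: the macro and function names are ours, and the bit positions (Z flag in bit 30, IRQ mask in bit 7, FIQ mask in bit 6) are stated from our understanding of the ARMv7-A PSR layout:

```c
#include <stdbool.h>
#include <stdint.h>

#define PSR_FLAG_Z      (1u << 30)  /* zero condition flag */
#define PSR_IRQ_MASKED  (1u << 7)   /* set = IRQs are not taken */
#define PSR_FIQ_MASKED  (1u << 6)   /* set = FIQs are not taken */

/* True if the last flag-setting instruction produced zero. */
static bool psr_zero_flag(uint32_t psr)
{
    return (psr & PSR_FLAG_Z) != 0u;
}

/* True if an IRQ could be taken with this PSR value. */
static bool psr_irqs_enabled(uint32_t psr)
{
    return (psr & PSR_IRQ_MASKED) == 0u;
}

/* Value to write back in order to mask IRQs, leaving other bits alone. */
static uint32_t psr_mask_irqs(uint32_t psr)
{
    return psr | PSR_IRQ_MASKED;
}
```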

# Ramfs<a id="sec-8" name="sec-8"></a>

A simple ram file system has been introduced to avoid having to embed
too many files in the kernel in the future.

The ram filesystem is created on the development machine and then
embedded into the kernel. Kernel can then parse the ramfs and access
files in it.

Ramfs contains a mapping from a file's name to its size and contents.
Directories, file permissions, etc. as well as writing to filesystem are
not supported.

Currently this is used to access the code of PL0 test program by the
kernel, which it then copies to the appropriate memory location. In case
more user mode programs are later written, they can all be added to
ramfs to enable the kernel to access them easily.

## Specification<a id="sec-8-1" name="sec-8-1"></a>

When ramfs is accessed in memory, it MUST be aligned to a multiple of 4.

The filesystem itself consists of blocks of data, each containing one
file. Blocks of data in the ramfs come one after another, with the
requirement, that each block starts at a 4-aligned offset/address. If a
block doesn't end at a 4-aligned address, there shall be up to 3
null-bytes of padding after it, so that the next block is properly
aligned.

Each block starts with a C (null-terminated) string with the name of the
file it contains. At the first 4-aligned offset after the string, file
size is stored in 4 bytes, little-endian. Null-bytes are used for
padding between file name and file size if necessary. Immediately after
the file size reside file contents, that take exactly the number of
bytes specified in file size.
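
A reader for this format could look roughly as follows. This is a sketch written from the specification alone: the kernel driver's actual interface may differ, the names ramfs_lookup() and struct ramfs_file are invented here, and passing the total image size as a parameter is our assumption (the specification does not say how the end of the filesystem is detected):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct ramfs_file {
    const char    *name;
    uint32_t       size;
    const uint8_t *contents;
};

/* Round an offset up to the next multiple of 4. */
static size_t align4(size_t offset)
{
    return (offset + 3u) & ~(size_t)3u;
}

/* Returns 0 and fills *out when the file is found; returns -1 when it
 * is missing or the image is malformed. */
static int ramfs_lookup(const uint8_t *fs, size_t fs_size,
                        const char *wanted, struct ramfs_file *out)
{
    size_t pos = 0;

    while (pos < fs_size) {
        const char *name = (const char *)fs + pos;
        const void *nul = memchr(name, '\0', fs_size - pos);
        if (nul == NULL)
            return -1;                  /* name is not terminated */

        size_t name_len = (size_t)((const char *)nul - name);
        size_t size_pos = align4(pos + name_len + 1u);
        if (size_pos > fs_size || fs_size - size_pos < 4u)
            return -1;                  /* no room for the size field */

        /* File size: 4 bytes, little-endian. */
        uint32_t size = (uint32_t)fs[size_pos]
                      | (uint32_t)fs[size_pos + 1u] << 8
                      | (uint32_t)fs[size_pos + 2u] << 16
                      | (uint32_t)fs[size_pos + 3u] << 24;

        size_t data_pos = size_pos + 4u;
        if (size > fs_size - data_pos)
            return -1;                  /* contents run past the image */

        if (strcmp(name, wanted) == 0) {
            out->name = name;
            out->size = size;
            out->contents = fs + data_pos;
            return 0;
        }

        pos = align4(data_pos + size);  /* skip padding to the next block */
    }

    return -1;
}
```

The size is assembled byte by byte, so the sketch does not rely on the processor being little-endian or on unaligned loads.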

As is obvious from the specification, files bigger than 4GB are not
supported, which is not a problem in the case of this project.

## Implementations<a id="sec-8-2" name="sec-8-2"></a>

Creation of ramfs is done by the makefs program (src/host/makefs.c). The
program accepts file names as command line arguments, creates a ramfs
containing all those files and writes it to stdout. As makefs is a very
simple tool (just as our ramfs is a simple format), it puts files in
ramfs under the names it got on the command line. No stripping or
normalizing of paths is performed. In case of errors (e.g. I/O errors)
makefs prints information to stderr and exits.
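
To illustrate what has to be produced for each file, here is a sketch that serializes one block into a memory buffer. It is not the code of makefs (which writes to stdout): the function name is hypothetical, the caller is assumed to provide a large enough buffer, and the block is assumed to start at a 4-aligned offset:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes one ramfs block to dst and returns the number of bytes
 * written, trailing padding included. */
static size_t ramfs_emit_block(uint8_t *dst, const char *name,
                               const uint8_t *data, uint32_t size)
{
    size_t pos = 0;
    size_t name_bytes = strlen(name) + 1u;  /* terminator included */

    memcpy(dst + pos, name, name_bytes);
    pos += name_bytes;
    while (pos % 4u != 0u)                  /* pad up to the size field */
        dst[pos++] = 0;

    for (unsigned i = 0; i < 4u; i++)       /* size, little-endian */
        dst[pos++] = (uint8_t)(size >> (8u * i));

    if (size > 0u)
        memcpy(dst + pos, data, size);
    pos += size;
    while (pos % 4u != 0u)                  /* pad up to the next block */
        dst[pos++] = 0;

    return pos;
}
```

A file named "a.txt" holding the 3 bytes "hey" takes 16 bytes this way: 6 bytes of name, 2 of padding, 4 of size, 3 of contents and 1 of padding.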

Parsing/reading of ramfs is done by a kernel driver
(src/arm/PL1/kernel/ramfs.c). The driver allows for finding a file in
ramfs by name. File size and pointers to file name string and file
contents are returned through a structure from function find_file.

As ramfs is embedded in kernel image, it is easily accessible to kernel
code. The alignment of ramfs to a multiple of 4 is assured in kernel's
linker script (src/arm/PL1/kernel/kernel_stage2.ld).

## Exceptions

Whenever some illegal operation (attempt to execute undefined
instruction, attempt to access memory with insufficient permission,
etc.) happens or some peripheral device "messages" the ARM core, that
something important happened, an exception occurs. Exception is
something, that pauses normal execution and passes control to the
(specific part of) operating system. Upon an exception, several things
happen:

1.  Change of processor mode.
2.  CPSR gets saved into new mode's [SPSR](./PSRs-explained.txt).
3.  pc (incremented by some value) is saved into new mode's lr.
4.  Execution jumps to an entry in the exception vectors table specific
    to the exception.

Each exception type is taken to its specific mode. Types and their
modes are:

1.  Reset and supervisor mode.
2.  Undefined instruction and undefined mode.
3.  Supervisor call and supervisor mode.
4.  Prefetch abort and abort mode.
5.  Data abort and abort mode.
6.  Hypervisor trap and hypervisor mode (not used normally, only with
    extensions).
7.  IRQ and IRQ mode.
8.  FIQ and FIQ mode.

The new value of the pc (the address, to which the exception "jumps") is
the address of the nth instruction from the exception base address, which,
under the simplest settings, is 0x0 (bottom of virtual address space). N
depends on the exception type. It is:

1.  reset
2.  undefined instruction
3.  supervisor call
4.  prefetch abort
5.  data abort
6.  hypervisor trap (not used here)
7.  IRQ
8.  FIQ
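
Since each of those instructions is assumed to be a 4-byte ARM instruction, the address of a vector follows directly from n. A small sketch (the enum and function names are ours):

```c
#include <assert.h>
#include <stdint.h>

/* Values match the numbering of the list above. */
enum exception_type {
    EXC_RESET = 1,
    EXC_UNDEFINED_INSTRUCTION,
    EXC_SUPERVISOR_CALL,
    EXC_PREFETCH_ABORT,
    EXC_DATA_ABORT,
    EXC_HYPERVISOR_TRAP,
    EXC_IRQ,
    EXC_FIQ
};

/* Address the pc is set to when exception number n is taken. */
static uint32_t vector_address(uint32_t base, enum exception_type n)
{
    assert(n >= EXC_RESET && n <= EXC_FIQ);

    return base + 4u * ((uint32_t)n - 1u);
}
```

With the base address at 0x0, the supervisor call vector is at 0x8 and the IRQ vector at 0x18.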

Those 8 instructions constitute the exception vectors table. As the
instructions follow one another, each of them should be a branch to some
exception-handling routine. In fact, on other architectures the
exception vector table often holds raw addresses of where to jump instead
of actual instructions, as here.

The exception base address can be changed from the bottom of virtual
address space to some other value by manipulating the contents of the
SCTLR and VBAR coprocessor registers.

On exception entry, the registers r0-r12 contain values used by the code
that was executing before. In order for the exception handler to perform
some action and return to that code, those registers can be preserved
in memory. Some compilers can automatically generate an appropriate
prologue and epilogue for handler functions, that will preserve the
right registers (we're not using this feature in our project).

Having old CPSR in SPSR and old pc in lr is helpful, when after handling
the exception, the handler needs to return to the code that was
executing before. There are 2 special instructions, subs and ldm ^
(load multiple with a caret ^), that, when used to change the pc (and
therefore perform a jump) cause the SPSR to be copied into CPSR. As bits
of CPSR determine the current execution mode, this causes the mode to be
changed to that from before the exception. In short, subs and ldm ^ are
the instructions to use to return from exceptions.

As noted earlier, upon exception entry an incremented value of pc is
stored in lr. By how much it is incremented depends on exception type
and execution state. For example, entering undefined instruction
exception from Thumb state places in undef's lr the problematic
instruction's address + 2, while taking this exception from ARM state
places in undef's lr that instruction's address + 4 (see full table in
paragraph B1.8.3 of [ARMv7-ar_arm](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf)).

It's worth noting, that while our implementation of exception handlers
also sets the stack pointer (sp) upon each exception entry, a kernel
could be written, where this wouldn't be done, as each mode enterable by
exception has its own sp.

# IRQ<a id="sec-9" name="sec-9"></a>

Two of all possible exceptions in ARM are IRQ (Interrupt Request) and FIQ
(Fast Interrupt Request). They can be caused by an external source, such as
peripheral devices, and they can be used to inform the kernel about some
action that happened.

Interrupts offer an economical way of interacting with peripheral devices.
For example, code can probe UART memory-mapped registers in a loop to
see whether transmitting/receiving of a character finished. However,
this causes the processor to needlessly execute the loop and makes it
impossible or difficult to perform other tasks at the same time.
An interrupt can be used instead of probing to "notify" the kernel, that
something it was waiting for just happened. While waiting for an interrupt,
the system can be put to halt (e.g. with the wfi instruction), which helps
save power, or it can perform other actions without wasting processor
cycles in a loop.

An interrupt, that is normally IRQ, can be made into FIQ by ARM system
dependent means. FIQ is meant to be handled faster, by not
having to back up registers r8-r12, of which FIQ mode has its own copies.
This project only uses IRQ.

Some peripheral devices can be configured (through their memory-mapped
registers) to generate an interrupt under certain conditions (e.g. UART
can generate interrupt when received characters queue fills). The
interrupt can then be either masked or unmasked (sometimes in more than
one peripheral register). If interrupts are enabled in CPSR and a
peripheral device tries to generate one, that is not masked, IRQ (or
FIQ) exception occurs (which causes interrupts to be temporarily masked
in CPSR). The code can usually check, whether an interrupt of given kind
from given device is **pending**, by looking at the appropriate bit of the
appropriate peripheral register (mmio). As long as an interrupt is
pending, re-enabling interrupts (for example via return from IRQ
handler) shall cause the exception to occur again. Removing the source
of the interrupt (e.g. removing characters from UART fifo, that filled)
doesn't usually cause the interrupt to stop pending, in which case a
pending-bit has to be cleared, usually by writing to the appropriate
peripheral register (mmio).

IRQs and FIQs can be configured as vectored - the processor then, upon
interrupt, jumps to different location depending on which interrupt
occurred, instead of jumping to the standard IRQ/FIQ vector. This can be used
to speed up interrupt handling. Our simple project does not, however,
use this feature.

Currently, IRQs from 2 sources are used:
[ARM timer IRQ](https://www.raspberrypi.org/app/uploads/2012/02/BCM2835-ARM-Peripherals.pdf) and UART IRQs. The kernel makes sure, that timer IRQ only
occurs when processor is in user mode. IRQ handler does not return in
this case - it calls scheduler. The kernel makes sure, that UART IRQ
only occurs, when a process is blocked and is waiting for UART IO
operation. The interrupt handler, when called, checks what type of UART
action happened and tries (through calling of appropriate function from
scheduler.c) to handle that action and, possibly, to unblock the waiting
process. UART IRQ might occur when another process is executing (not
possible now, with only one process, but shall be possible when more
processes are added to the project), in which case the handler
returns, or when the kernel is explicitly waiting for interrupts (because
all processes are blocked), in which case it calls schedule() instead of
returning.

# Processor modes<a id="sec-10" name="sec-10"></a>

ARMv7-A core can be executing in one of several modes (not to be
confused with instruction set states or endianness execution state).
Those are:

1.  User
2.  FIQ
3.  IRQ
4.  Supervisor
5.  Abort
6.  Undefined
7.  System

In fact, there are more if the processor implements some extensions, but
this is irrelevant here.

Current processor mode is encoded in the lowest five bits of the CPSR register.

Processor can operate in one of 2 privilege levels (although, again,
extensions exist, that add more levels):

1.  PL0 - privilege level 0
2.  PL1 - privilege level 1

Processor modes have their assigned privilege levels. User mode has
privilege level 0 and all other modes have privilege level 1. Code
executing in one of the privileged modes is allowed to do more things than
user mode code, e.g. writing and reading some of the coprocessor
registers, executing some privileged instructions (e.g. mrs and msr,
when used to reference CPSR, as well as other modes' registers),
accessing privileged memory and changing the mode (without causing an
exception). Attempts to perform those actions in user mode result either
in undefined (within some limits) behaviour or an exception (depending
on what action is considered).
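
The relation between the mode field of CPSR and the privilege level can be sketched as follows. The enum and function names are invented for this example, and the numeric mode encodings are given from our understanding of the ARMv7-A manual:

```c
#include <stdint.h>

/* Encodings of the 5 lowest CPSR bits for the modes listed above. */
enum cpu_mode {
    MODE_USER       = 0x10,
    MODE_FIQ        = 0x11,
    MODE_IRQ        = 0x12,
    MODE_SUPERVISOR = 0x13,
    MODE_ABORT      = 0x17,
    MODE_UNDEFINED  = 0x1B,
    MODE_SYSTEM     = 0x1F
};

/* Extract the mode field from a CPSR value. */
static enum cpu_mode cpsr_mode(uint32_t cpsr)
{
    return (enum cpu_mode)(cpsr & 0x1Fu);
}

/* 0 for user mode, 1 for every other mode. */
static int privilege_level(uint32_t cpsr)
{
    return cpsr_mode(cpsr) == MODE_USER ? 0 : 1;
}
```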

User mode is the one, in which application programs usually run. Other
modes are usually used by the operating system's kernel. Lack of
privileges in user mode allows PL1 code to control execution of PL0
code.

While code executing in PL1 can freely (except switching from system to
user mode, which produces undefined behaviour) change mode by either
writing the CPSR or executing the cps instruction, user mode can only be
exited by means of an exception.

Some ARM core registers (e.g. r0 - r7) are shared between modes, while
some are not. In the latter case, separate modes have their private copies
of those registers. For example, lr and sp in supervisor mode are different
from lr and sp in user mode. For full information about shared and not
shared (banked) registers, see paragraph B9.2.1 in the
[armv7-a manual](https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf). The most important things are that user mode and system mode
share all registers with each other and they don't have their own SPSR
(which is used for returning from exceptions and exceptions are
never taken to those 2 modes) and that all other modes have their own
SPSR, sp and lr.

The reason for having multiple copies of the same register in different
modes is that it simplifies writing exception handlers. For example,
supervisor mode code can safely use sp and lr without destroying the
contents of user mode's sp and lr.

The large number of PL1 modes is supposed to aid in handling of
exceptions. Each kind of exception is taken to its specific mode.

Supervisor mode, in addition to being the mode supervisor calls are
taken to, is the mode the processor is in when the kernel boots.

System mode, which uses the same registers as user mode, is said to have
been added to ARM architecture to ease accessing the unprivileged
registers. For example, setting user mode's sp from supervisor mode can
be done by switching to system mode, setting the sp and switching back
to supervisor mode. Other modes' registers can alternatively be accessed
with the use of mrs and msr assembly instructions (but not from user
mode).

Despite the name, system mode doesn't have to be the mode used most
often by operating system's kernel. In fact, prohibition of direct
switching from system mode to user mode would make extensive use of
system mode impractical. This project, for example, uses supervisor mode
for most of the privileged tasks.

# Process management<a id="sec-11" name="sec-11"></a>

An operating system has
to manage user processes. Our system only has one process right now, but
usual actions, such as context saving or context restoring, are
implemented anyway. The following few paragraphs contain information on
what process management looks like in operating systems in general.

A process might return control to the system by executing the svc (earlier
called swi) instruction. The system would then perform some action on behalf
of the process and either return from the supervisor call exception or
attempt to schedule another process to run, in which case context of the
old process would need to be saved for later and context of the new
process would need to be restored.

A process has data in memory (such as its stack and code) as well as data in
registers (r0-r15, CPSR). Together they constitute the process' context.
From process' perspective, context should not unexpectedly change, so
when control is taken away from user mode code (via an exception) and
later (possibly after execution of some other processes) given back, it
should be transparent to the process (except when kernel does something
for the process in terms of supervisor call). In particular, the
contents of core registers should be the same as before. For this to be
achievable, the operating system has to back up process' registers
somewhere in memory and later restore them from that memory.

Operating system kernel maintains a queue of processes waiting for
execution. When a process blocks (for example by waiting for IO), it is
removed from the queue. If a process unblocks (for example because IO
completed) it is added back to the queue. In general, some systems might
complicate it, for example by having more queues, but discussing those
variations is out of scope of this documentation. When processor is
free, one of the processes from the queue (determined by some scheduling
algorithm implemented in the kernel) gets
chosen and run on the processor.

As a process might never make a supervisor call, it could occupy the
processor forever. To remedy this, timer interrupts can be used by the
kernel to interrupt the execution of a process after some time. The
process would then have its context saved and go to the end of the
queue. Another process would be scheduled to run.
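
The queue and the effect of a timer interrupt can be modelled with a small ring buffer of process identifiers. This is a generic round-robin sketch, not the project's scheduler (which currently handles a single process). All names, as well as the limit of 8 processes, are assumptions made for the example:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCESSES 8

struct run_queue {
    int    pids[MAX_PROCESSES];
    size_t head;    /* position of the process to run next */
    size_t count;   /* number of processes waiting for execution */
};

/* Add a process (new, unblocked or preempted) to the end of the queue. */
static void queue_push(struct run_queue *q, int pid)
{
    assert(q->count < MAX_PROCESSES);

    q->pids[(q->head + q->count) % MAX_PROCESSES] = pid;
    q->count++;
}

/* Take the first process off the queue; -1 means nothing is runnable. */
static int queue_pop(struct run_queue *q)
{
    if (q->count == 0)
        return -1;

    int pid = q->pids[q->head];

    q->head = (q->head + 1) % MAX_PROCESSES;
    q->count--;
    return pid;
}

/* Timer interrupt: the running process goes to the end of the queue
 * and the next one is chosen. Returns the pid to run now. */
static int on_timer_tick(struct run_queue *q, int running_pid)
{
    queue_push(q, running_pid);
    return queue_pop(q);
}
```

A process that blocks is simply not pushed back, and is pushed again once the thing it waited for happens. Saving and restoring of contexts is left out of the sketch.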

Other exceptions might occur when process is running. Depending on
kernel design, handler of an exception (such as IRQ) might return to the
process or cause another one to be scheduled.

If at some time all processes are blocked waiting, the kernel can wait
for some interrupt to happen, which could possibly unblock some process
(e.g. because IO completed).

While not mentioned earlier, switching between processes' contexts
involves not only saving and restoring of registers, but also changing
the translation table entries to properly map memory regions used by
current process.

In our project, process management is implemented in
src/arm/PL1/kernel/scheduler.c.

A "queue" contains data of the only process (variables PL0_regs[],
PL0_sp, PL0_lr and PL0_PSR).

## Scheduler functions<a id="sec-11-1" name="sec-11-1"></a>

Function setup_scheduler_structures is supposed to be called before
scheduler is used in any way.

Function schedule_new() creates and runs a new process.

Function schedule_wait_for_output() causes the current process to
have its context saved and get blocked waiting for UART to send data.
It is called from supervisor call handler. Function
schedule_wait_for_input() is similar, but process waits for UART to
receive data.

Function schedule() attempts to select a process (currently the only
one) and run it. If process cannot be run, schedule() waits for
interrupt, that could unblock the process. The interrupt handler would
not return in this case, but rather call schedule() again.

Function scheduler_try_output() is supposed to be called by IRQ
handler when UART is ready to transmit more data. It can cause a process
to get unblocked. scheduler_try_input() is similar, but relates to
receiving data.

The following are assured in our design:

1.  When processor is in user mode, interrupts are enabled.
2.  When processor is in system mode, interrupts are disabled, except
    when explicitly waiting for the interrupt when process is blocked.
3.  When a process is waiting for input/output, the corresponding IRQ is
    unmasked. Otherwise, that IRQ is masked.
4.  If an interrupt from UART occurs during execution of user mode code
    (not possible here, as we only have one process, but shall become
    possible when proper processes are implemented), the handler shall
    return. If that interrupt occurs during execution of PL1 code, it
    means it occurred in the scheduler, which was explicitly waiting for
    it, and the handler calls schedule() again instead of returning.
5.  Interrupt from timer is unmasked and set to come whenever a process
    gets scheduled to run. Timer interrupt is disabled when in PL1 (when
    scheduler is waiting for interrupt, only UART one can come).
6.  A supervisor call requesting a UART operation, that can not be
    completed immediately, causes the process to block.

# Linking<a id="sec-12" name="sec-12"></a>

[Linking](https://en.wikipedia.org/wiki/Linker_%28computing%29) is a process of creating an executable, library or another
object file out of object files.
During linking, values previously unknown to the compiler (e.g. what
will be the addresses of external functions/variables, from what address
will the code be executing) might be injected into the code.

A linker script is, among other things, used to tell the linker where in
memory the specific parts of the executable should lie.

In a hosted environment (when building a program to run under a
full-featured operating system, like GNU/Linux), a linker script is
usually provided by the toolchain and used if no other script is
provided. In a bare-metal project, the developer usually has to write
their own linker script, in which they specify the binary image's **load
address** and section layout.

Contents of an object code file or executable (our .o or .elf) are
grouped into sections. Sections have names. Common names are .text
(usually contains code), .data (usually contains statically-allocated
variables initialized to non-zero values), .bss (usually used to reserve
memory for statically allocated variables initialized to zero), .rodata
(usually contains statically-allocated variables, that are not going to
be modified).

In a hosted environment, when an executable (say, of elf format) is
executed, contents of its sections are usually placed in different
memory segments with different access privileges, so that, for example,
code is not writable and variable contents are not executable. This
helps reduce the risk of buffer overflow exploits.

In a bare-metal environment like ours, we don't execute an elf file directly
(except in qemu, which is the unpreferred approach anyway), but rather a
raw binary image created from an elf file. Still, the notion of section
is used along the way.

During linking, one or more object code files are combined into one file
(in our case an executable). Section contents of input files land in
some sections of the output file, in a way defined in the linker script.
In a hosted environment, a linker script would likely put contents of
input .text sections in a .text section, contents of input .data
sections in a .data section, etc. The developer can, however, use
sections with different names (although weird behaviour of some linkers
might occur) and assign their contents in their preferred way using a
linker script.

In linker script it is possible to specify a section as NOLOAD (usually
used for .bss), which, in our case, causes that section not to be
included in the binary image later created with objcopy.

It is also possible to treat same-named input sections differently
depending on what file they came from and even use wildcards when
specifying file names.

Variables can be created, as well as new symbols, which can then be
referenced from C code.

Defining alignment of specific parts of future image is also easily
achievable.

We made use of all those possibilities in our scripts.

In src/arm/PL1/kernel/kernel_stage2.ld the physical memory layout of
the kernel is defined. Symbols defined there, such as _stack_end, are
referenced in C header src/arm/PL1/kernel/memory.h.

While src/arm/PL1/kernel/kernel.ld and src/arm/PL1/loader/loader.ld
define the starting address, it is irrelevant, as the assembly-written
position-independent code for first stages of loader and kernel does not depend on that address.

At the beginning of this project, we had very little understanding of
linker scripts' syntax.
[This article](https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/4/html/Using_ld_the_GNU_Linker/sections.html#OUTPUT-SECTION-DESCRIPTION) proved useful and allowed us to learn the required parts in a
short time. As discussing the entire syntax of linker scripts is beyond
the scope of this documentation, we refer the reader to that resource.

# Miscellaneous topics<a id="sec-13" name="sec-13"></a>

## Supervisor calls<a id="sec-13-1" name="sec-13-1"></a>

A supervisor call happens when the svc (previously called swi)
instruction gets executed. An exception is then entered. A supervisor
call is the standard way for a user process to ask the kernel for
something. As user code might request many different things, the kernel
must somehow know which one was requested. The svc instruction takes one
immediate operand. The supervisor call exception handler can check at
what address the execution was, read the svc instruction from there and
inspect its bytes. This way, by executing svc with different immediate
values, user mode code can request different things from the kernel -
the value in svc encodes the request's type.
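
The decoding step described above can be sketched as follows. This is only an illustration and is not taken from this project; the function names are hypothetical. It assumes the A32 (ARM state) encoding, in which the immediate occupies the low 24 bits of the instruction word, and relies on the fact that on exception entry the link register points at the instruction following svc, so the svc instruction itself sits one word earlier.

```c
#include <stdint.h>

/* Extract the 24-bit immediate from an A32 svc instruction word. */
static inline uint32_t svc_immediate(uint32_t instruction)
{
    return instruction & 0x00FFFFFFu;
}

/*
 * Hypothetical helper: given the return address seen by the handler
 * (the value of lr on entry), read the svc instruction that precedes it
 * and return its immediate. Valid for ARM state only; Thumb uses a
 * 16-bit encoding with an 8-bit immediate.
 */
static inline uint32_t svc_immediate_from_return_address(const uint32_t *return_address)
{
    return svc_immediate(return_address[-1]);
}
```

For example, `svc #0x42` executed unconditionally is encoded as the word 0xEF000042, from which the sketch recovers 0x42.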

To save time and for the sake of simplicity, we don't make use of
immediates in svc and instead encode the call's type in r0. In our
implementation we decided that a supervisor call preserves and clobbers
the same registers as a function call and returns values through r0,
just as a function call does. This enables us to actually perform the
supervisor call as a call to a function defined in src/arm/PL0/svc.S.
Calls from C are performed in src/arm/PL0/PL0\_utils.c and request type
encodings are defined in src/arm/common/svc\_interface.h (they must be
known to both user mode code and handler code).
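
A minimal sketch of how a handler might dispatch on the request type passed in r0 is given below. The request names, their numeric values, the error value and the use of a second register for the argument are all assumptions made for illustration; the project's real encodings are those in the header mentioned above. The kernel-side I/O routines are replaced by stand-ins so that the sketch can be compiled on any host.

```c
#include <stdint.h>

/* Hypothetical request types (not the project's actual encodings). */
enum svc_request {
    SVC_REQ_PUTCHAR = 1,
    SVC_REQ_GETCHAR = 2
};

/* Hypothetical value returned for an unrecognised request. */
#define SVC_ERR_UNKNOWN 0xFFFFFFFFu

/* Stand-ins for the privileged I/O routines the kernel would call. */
static char last_written;

static uint32_t kernel_putchar(uint32_t c)
{
    last_written = (char)c;
    return c;
}

static uint32_t kernel_getchar(void)
{
    return 'x';
}

/*
 * request_type is the value user code placed in r0; argument is assumed
 * to arrive in r1. The return value is handed back to user code in r0,
 * exactly as an ordinary function result would be.
 */
uint32_t svc_dispatch(uint32_t request_type, uint32_t argument)
{
    switch (request_type) {
    case SVC_REQ_PUTCHAR:
        return kernel_putchar(argument);
    case SVC_REQ_GETCHAR:
        return kernel_getchar();
    default:
        return SVC_ERR_UNKNOWN;
    }
}
```

Because the convention matches that of a function call, user code can treat the assembly stub that executes svc as an ordinary C function taking the request type as its first argument.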

## Utilities<a id="sec-13-2" name="sec-13-2"></a>

We've compiled useful utilities (e.g. memcpy() and strlen()) in
src/arm/common/strings.c. Those do not depend on the environment and can
be used by user mode code, kernel code and even bootloader code.
Functions used for I/O (like puts()) are also defined in a common way
for privileged and unprivileged code. They do, however, rely on the
existence of putchar() and getchar(). In PL0 code
(src/arm/PL0/PL0\_utils.c), putchar() and getchar() are defined to
perform a supervisor call that does that operation. In the PL1 code,
they are defined as operations on UART.
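
The layering described above can be illustrated with a short sketch. It is not the project's code: puts\_sketch() and env\_putchar() are hypothetical names, and appending a newline merely mirrors the standard puts(). The output routine never touches hardware or the kernel directly; it only calls the single-character hook, which each environment supplies in its own way. Here the hook writes to a buffer so that the example runs on a host.

```c
#include <stddef.h>

/*
 * Hypothetical environment hook. In unprivileged code it would perform
 * a supervisor call, in privileged code it would write to the UART.
 * This host-side stand-in appends to a buffer instead.
 */
static char output[64];
static size_t output_len;

static int env_putchar(int c)
{
    if (output_len + 1 >= sizeof(output))
        return -1; /* buffer full */
    output[output_len++] = (char)c;
    output[output_len] = '\0';
    return c;
}

/* Environment-independent: needs nothing but env_putchar(). */
int puts_sketch(const char *s)
{
    while (*s) {
        if (env_putchar(*s++) < 0)
            return -1;
    }
    return env_putchar('\n') < 0 ? -1 : 0;
}
```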

## Timers<a id="sec-13-3" name="sec-13-3"></a>

Several timers are available on the RaspberryPi:

1.  System Timer (with 4 interrupt lines, regarded as the most reliable,
    as it is not derived from the system clock and hence is not affected
    by processor power mode changes),
    [BCM2837 ARM Peripherals, Chapter 12](https://cs140e.sergio.bz/docs/BCM2837-ARM-Peripherals.pdf)
2.  ARM side Timer (based on an ARM AP804)
    [BCM2837 ARM Peripherals, Chapter 14](https://cs140e.sergio.bz/docs/BCM2837-ARM-Peripherals.pdf)
3.  ARM Generic Timer (optional extension to ARMv7-A and ARMv7-R,
    configured through coprocessor registers)

At first, we attempted to use the System Timer, some code for which is
still present in src/arm/PL1/kernel/bcmclock.h. The interrupts from that
timer are not, however, routed to any ARM core under rpi-open-firmware,
but rather to the GPU. Because of that, we ended up using the ARM side
Timer (programmed in src/arm/PL1/kernel/armclock.h). The ARM side Timer
based on ARM AP804 is currently only available on real hardware and not
in qemu. Programming the ARM Generic Timer (listed in TODOs) could
enable the use of timer interrupts in qemu.

## UARTs<a id="sec-13-4" name="sec-13-4"></a>

src/arm/PL1/PL1\_common/uart.c implements putchar() and getchar() in
terms of UART. Those implementations are blocking - they poll UART
peripheral registers in a loop, checking whether the device is ready to
perform the operation. They are, however, accompanied by functions
getchar\_non\_blocking() and putchar\_non\_blocking(), which check **once**
if the device is ready and only perform the operation if it is.
Otherwise, they return an error value. Their purpose is to be used with
interrupts. In interrupt-driven UART we avoid waiting in a loop -
instead, an IRQ comes when the desired UART operation completes. The
code that wants to write to or read from UART does, however, need to tie
its operation to the IRQ handler and scheduler. Blocking versions should
not be used once UART interrupts are enabled, nor in exception handlers,
which should always run quickly. However, doing this does not break UART
and might be justified for debugging purposes (like the error() function
defined in src/arm/common/io.c and used throughout the kernel code).
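
The difference between the two flavours can be sketched as below. This is a simplified illustration, not the project's uart.c: the function names, the struct and the -1 error value are assumptions, and the registers are modelled as a plain struct rather than the real memory-mapped layout. The flag bits are those of the PL011 flag register: bit 4 signals an empty receive FIFO and bit 5 a full transmit FIFO.

```c
#include <stdint.h>

/* PL011 flag register bits. */
#define FR_RXFE (1u << 4) /* receive FIFO empty */
#define FR_TXFF (1u << 5) /* transmit FIFO full */

/*
 * Hypothetical, simplified register view holding only the data (dr)
 * and flag (fr) registers. On hardware this would be a pointer to the
 * memory-mapped peripheral with the real register offsets.
 */
struct uart_regs {
    volatile uint32_t dr;
    volatile uint32_t fr;
};

/* Check once; write only if there is room. Returns 0, or -1 if busy. */
int putchar_non_blocking_sketch(struct uart_regs *uart, char c)
{
    if (uart->fr & FR_TXFF)
        return -1;
    uart->dr = (unsigned char)c;
    return 0;
}

/* Check once; returns the received byte, or -1 if nothing is waiting. */
int getchar_non_blocking_sketch(struct uart_regs *uart)
{
    if (uart->fr & FR_RXFE)
        return -1;
    return (int)(uart->dr & 0xFFu);
}

/* Blocking variant: poll until the non-blocking attempt succeeds. */
void putchar_blocking_sketch(struct uart_regs *uart, char c)
{
    while (putchar_non_blocking_sketch(uart, c) < 0)
        ;
}
```

In an interrupt-driven design, the non-blocking variants would be called from the IRQ handler, where the device is expected to be ready, while the caller that initiated the transfer waits in the scheduler rather than in a polling loop.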

There are 2 UARTs in the RaspberryPi: one mini UART (also called UART 1)
and one PL011 UART (also called UART 0). Only the PL011 UART is used in
this project. The hardware allows some degree of configuration of which
pins each UART is routed to (via so-called alternative functions). In
our project it is assumed that UART 0's TX and RX are routed to GPIO
pins 14 & 15 by the firmware, which is true for rpi-open-firmware. With
stock Broadcom firmware, either changing the default configuration
(config.txt) or selecting alternative functions as part of UART
initialization (present in the TODOs list) might be required.

Before UART can be used, GPIO pins 14 and 15 should have pull up/down
disabled. This is done as part of UART initialization in uart\_init() in
src/arm/PL1/PL1\_common/uart.c. There is a requirement that UART is
disabled when being configured, which is also fulfilled by uart\_init().
The PL011 is thoroughly described in
[BCM2837 ARM Peripherals](https://cs140e.sergio.bz/docs/BCM2837-ARM-Peripherals.pdf) as well as [PrimeCell UART (PL011) Technical Reference Manual](http://infocenter.arm.com/help/topic/com.arm.doc.ddi0183f/DDI0183.pdf).

# Afterword<a id="sec-14" name="sec-14"></a>

This project was done as part of the Embedded Systems course at
[AGH University of Science and Technology](https://www.agh.edu.pl/en/). The goal of the project was to investigate and program the
MMU (Memory Management Unit) of the RaspberryPi, but it ended up forming
the basis of a small operating system.
[RaspberryPi 3 model B](https://www.raspberrypi.org/products/raspberry-pi-3-model-b/) was the hardware platform used, with stock firmware replaced
with
[rpi-open-firmware](https://github.com/christinaa/rpi-open-firmware).
An emulator, [qemu](https://www.qemu.org/download/) (version 2.9.1),
capable of emulating an older RaspberryPi 2, was also used extensively.

The project was written in the C programming language and ARM assembly.
Knowledge of C is required to understand the code. Knowledge of ARM
assembly is useful, but it should be considered something that can be
learned **while** working with it. Still, the reader should at least have
an idea of what assembly language is and how it is used.

This documentation is intended to provide information on bare-metal
programming on the RaspberryPi and ARM in general, as well as a
description of our solutions and implementations. There is a lot of
information available on the topic in online sources, yet it is not
always in an easy-to-understand form and the number of different options
described in manuals might be overwhelming for people new to the topic.
That's why we attempted to describe our work in a way that newcomers to
bare-metal programming will find useful. External resources we used are
listed at the end of the documentation.

It is planned that students of the Embedded Systems course in future
years will have the option to continue or reuse previous projects, such
as this one. We hope this documentation will prove useful to our younger
colleagues who happen to work with the codebase.

In case of any bugs or questions, the authors can be contacted at kwojtus@protonmail.com.

# Sources of Information<a id="sec-15" name="sec-15"></a>

-   wiki.osdev.org
-   ARM GCC Inline Assembler Cookbook - <http://www.ethernut.de/en/documents/arm-inline-asm.html>
-   ARM Architecture Reference Manual® ARMv7-A and ARMv7-R edition - <https://static.docs.arm.com/ddi0406/c/DDI0406C_C_arm_architecture_reference_manual.pdf> (probably the most useful document of all)
-   dwelch67 repository - <https://github.com/dwelch67/raspberrypi>
-   Booting ARM Linux - <http://www.simtec.co.uk/products/SWLINUX/files/booting_article.html> - very good description of atags
-   BCM2835 ARM Peripherals - <https://github.com/raspberrypi/documentation/blob/master/hardware/raspberrypi/bcm2835/BCM2835-ARM-Peripherals.pdf>
    -   BCM2835 datasheet errata -  <https://elinux.org/BCM2835_datasheet_errata>
-   Device Tree Specification - <https://buildmedia.readthedocs.org/media/pdf/devicetree-specification/latest/devicetree-specification.pdf>
-   online ARM Compiler toolchain Assembler Reference - <http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/index.html> - useful for its descriptions of ARM instructions, often shows high in search results
-   Christina Brook's rpi-open-firmware - <https://github.com/christinaa/rpi-open-firmware>
-   PrimeCell UART (PL011) Technical Reference Manual - <http://infocenter.arm.com/help/topic/com.arm.doc.ddi0183g/DDI0183G_uart_pl011_r1p5_trm.pdf>
-   GNU Make Manual - <https://www.gnu.org/software/make/manual/>
-   Red Hat Enterprise Linux 4: Using ld, the Gnu Linker - <https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/4/html/Using_ld_the_GNU_Linker/sections.html>