Fossil SCM
Worked out how to get systemd-container (a.k.a. nspawn + machinectl) working with the stock Fossil container. Following the above commits, it's pure documentation. Removed the runc and crun docs at the same time, since this method is as small as crun while being more functional; there's zero reason to push through all the additional complexity of those even lower-level tools now that it's debugged and documented.
Commit
930a655a14e9b040fcb4156bdda544ea9f6b684e2756ed15eb2ffb2bd4f6306a
Parent
0733be502bdab6a…
1 file changed
+319
-245
| --- www/containers.md | ||
| +++ www/containers.md | ||
| @@ -484,11 +484,11 @@ | ||
| 484 | 484 | that’s still a big chunk of your storage budget. It takes 100:1 overhead |
| 485 | 485 | just to run a 4 MiB Fossil server container? Once again, I wouldn’t |
| 486 | 486 | blame you if you noped right on out of here, but if you will be patient, |
| 487 | 487 | you will find that there are ways to run Fossil inside a container even |
| 488 | 488 | on entry-level cloud VPSes. These are well-suited to running Fossil; you |
| 489 | -don’t have to resort to [raw Fossil service](./server/) to succeed, | |
| 489 | +don’t have to resort to [raw Fossil service][srv] to succeed, | |
| 490 | 490 | leaving the benefits of containerization to those with bigger budgets. |
| 491 | 491 | |
| 492 | 492 | For the sake of simple examples in this section, we’ll assume you’re |
| 493 | 493 | integrating Fossil into a larger web site, such as with our [Debian + |
| 494 | 494 | nginx + TLS][DNT] plan. This is why all of the examples below create |
| @@ -521,10 +521,11 @@ | ||
| 521 | 521 | this idea to the rest of your site.) |
| 522 | 522 | |
| 523 | 523 | [DD]: https://www.docker.com/products/docker-desktop/ |
| 524 | 524 | [DE]: https://docs.docker.com/engine/ |
| 525 | 525 | [DNT]: ./server/debian/nginx.md |
| 526 | +[srv]: ./server/ | |
| 526 | 527 | |
| 527 | 528 | |
| 528 | 529 | ### 6.1 <a id="nerdctl" name="containerd"></a>Stripping Docker Engine Down |
| 529 | 530 | |
| 530 | 531 | The core of Docker Engine is its [`containerd`][ctrd] daemon and the |
| @@ -556,12 +557,12 @@ | ||
| 556 | 557 | give up the image builder is [Podman]. Initially created by |
| 557 | 558 | Red Hat and thus popular on that family of OSes, it will run on |
| 558 | 559 | any flavor of Linux. It can even be made to run [on macOS via Homebrew][pmmac] |
| 559 | 560 | or [on Windows via WSL2][pmwin]. |
| 560 | 561 | |
| 561 | -On Ubuntu 22.04, it’s about a quarter the size of Docker Engine, or half | |
| 562 | -that of the “full” distribution of `nerdctl` and all its dependencies. | |
| 562 | +On Ubuntu 22.04, the installation size is about 38 MiB, roughly a | |
| 563 | +tenth the size of Docker Engine. | |
| 563 | 564 | |
| 564 | 565 | Although Podman [bills itself][whatis] as a drop-in replacement for the |
| 565 | 566 | `docker` command and everything that sits behind it, some of the tool’s |
| 566 | 567 | design decisions affect how our Fossil containers run, as compared to |
| 567 | 568 | using Docker. The most important of these is that, by default, Podman |
| @@ -703,251 +704,322 @@ | ||
| 703 | 704 | container images across the Internet, it can be a net win in terms of |
| 704 | 705 | build time. |
| 705 | 706 | |
| 706 | 707 | |
| 707 | 708 | |
| 708 | -### 6.3 <a id="barebones"></a>Bare-Bones OCI Bundle Runners | |
| 709 | - | |
| 710 | -If even the Podman stack is too big for you, you still have options for | |
| 711 | -running containers that are considerably slimmer, at a high cost to | |
| 712 | -administration complexity and loss of features. | |
| 713 | - | |
| 714 | -Part of the OCI standard is the notion of a “bundle,” being a consistent | |
| 715 | -way to present a pre-built and configured container to the runtime. | |
| 716 | -Essentially, it consists of a directory containing a `config.json` file | |
| 717 | -and a `rootfs/` subdirectory containing the root filesystem image. Many | |
| 718 | -tools can produce these for you. We’ll show only one method in the first | |
| 719 | -section below, then reuse that in the following sections. | |
| 720 | - | |
| 721 | - | |
| 722 | -#### 6.3.1 <a id="runc"></a>`runc` | |
| 723 | - | |
| 724 | -We mentioned `runc` [above](#nerdctl), but it’s possible to use it | |
| 725 | -standalone, without `containerd` or its CLI frontend `nerdctl`. You also | |
| 726 | -lose the build engine, intelligent image layer sharing, image registry | |
| 727 | -connections, and much more. The plus side is that `runc` alone is | |
| 728 | -18 MiB. | |
| 729 | - | |
| 730 | -Using it without all the support tooling isn’t complicated, but it *is* | |
| 731 | -cryptic enough to want a shell script. Let’s say we want to build on our | |
| 732 | -big desktop machine but ship the resulting container to a small remote | |
| 733 | -host. This should serve: | |
| 734 | - | |
| ----- | ||
| 735 | - | |
| 736 | -```shell | |
| 737 | -#!/bin/bash -ex | |
| 738 | -c=fossil | |
| 739 | -b=/var/lib/machines/$c | |
| 740 | -h=my-host.example.com | |
| 741 | -m=/run/containerd/io.containerd.runtime.v2.task/moby | |
| 742 | -t=$(mktemp -d /tmp/$c-bundle.XXXXXX) | |
| 743 | - | |
| 744 | -if [ -d "$t" ] | |
| 745 | -then | |
| 746 | - docker container start $c | |
| 747 | - docker container export $c > $t/rootfs.tar | |
| 748 | - id=$(docker inspect --format="{{.Id}}" $c) | |
| 749 | - sudo cat $m/$id/config.json \ | |
| 750 | - | jq '.root.path = "'$b/rootfs'"' | |
| 751 | - | jq '.linux.cgroupsPath = ""' | |
| 752 | - | jq 'del(.linux.sysctl)' | |
| 753 | - | jq 'del(.linux.namespaces[] | select(.type == "network"))' | |
| 754 | - | jq 'del(.mounts[] | select(.destination == "/etc/hostname"))' | |
| 755 | - | jq 'del(.mounts[] | select(.destination == "/etc/resolv.conf"))' | |
| 756 | - | jq 'del(.mounts[] | select(.destination == "/etc/hosts"))' | |
| 757 | - | jq 'del(.hooks)' > $t/config.json | |
| 758 | - scp -r $t $h:tmp | |
| 759 | - ssh -t $h "{ | |
| 760 | - mv ./$t/config.json $b && | |
| 761 | - sudo tar -C $b/rootfs -xf ./$t/rootfs.tar && | |
| 762 | - rm -r ./$t | |
| 763 | - }" | |
| 764 | - rm -r $t | |
| 765 | -fi | |
| 766 | -``` | |
| 767 | - | |
| ----- | ||
| 768 | - | |
| 769 | -The first several lines list configurables: | |
| 770 | - | |
| 771 | -* **`c`**: the name of the Docker container you’re bundling up for use | |
| 772 | - with `runc` | |
| 773 | -* **`b`**: the path of the exported container, called the “bundle” in | |
| 774 | - OCI jargon; we’re using the [`nspawn`](#nspawn) convention, a | |
| 775 | - reasonable choice under the [Linux FHS rules][LFHS] | |
| 776 | -* **`h`**: the remote host name | |
| 777 | -* **`m`**: the local directory holding the running machines, configurable | |
| 778 | - because: | |
| 779 | - * the path name is longer than we want to use inline | |
| 780 | - * it’s been known to change from one version of Docker to the next | |
| 781 | - * you might be building and testing with [Podman](#podman), so it | |
| 782 | - has to be “`/run/user/$UID/crun`” instead | |
| 783 | -* **`t`**: the temporary bundle directory we populate locally, then | |
| 784 | - `scp` to the remote machine, where it’s unpacked | |
| 785 | - | |
| 786 | -[LFHS]: https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard | |
| 787 | - | |
| 788 | - | |
| 789 | -##### Why All That `sudo` Stuff? | |
| 790 | - | |
| 791 | -This script uses `sudo` for two different purposes: | |
| 792 | - | |
| 793 | -1. To read the local `config.json` file out of the `containerd` managed | |
| 794 | - directory, which is owned by `root` on Docker systems. Additionally, | |
| 795 | - that input file is only available while the container is started, so | |
| 796 | - we must ensure that before extracting it. | |
| 797 | - | |
| 798 | -2. To unpack the bundle onto the remote machine. If you try to get | |
| 799 | - clever and unpack it locally, then `rsync` it to the remote host to | |
| 800 | - avoid re-copying files that haven’t changed since the last update, | |
| 801 | - you’ll find that it fails when it tries to copy device nodes, to | |
| 802 | - create files owned only by the remote root user, and so forth. If the | |
| 803 | - container bundle is small, it’s simpler to re-copy and unpack it | |
| 804 | - fresh each time. | |
| 805 | - | |
| 806 | -I point all this out because it might ask for your password twice: once for | |
| 807 | -the local sudo command, and once for the remote. | |
| 808 | - | |
| 809 | - | |
| 810 | - | |
| 811 | -##### Why All That `jq` Stuff? | |
| 812 | - | |
| 813 | -We’re using [jq] for two separate purposes: | |
| 814 | - | |
| 815 | -1. To automatically transmogrify Docker’s container configuration so it | |
| 816 | - will work with `runc`: | |
| 817 | - | |
| 818 | - * point it where we unpacked the container’s exported rootfs | |
| 819 | - * accede to its wish to [manage cgroups by itself][ecg] | |
| 820 | - * remove the `sysctl` calls that will break after… | |
| 821 | - * …we remove the network namespace to allow Fossil’s TCP listening | |
| 822 | - port to be available on the host; `runc` doesn’t offer the | |
| 823 | - equivalent of `docker create --publish`, and we can’t be | |
| 824 | - bothered to set up a manual mapping from the host port into the | |
| 825 | - container | |
| 826 | - * remove file bindings that point into the local runtime managed | |
| 827 | - directories; one of the things we give up by using a bare | |
| 828 | - container runner is automatic management of these files | |
| 829 | - * remove the hooks for essentially the same reason | |
| 830 | - | |
| 831 | -2. To make the Docker-managed machine-readable `config.json` more | |
| 832 | - human-readable, in case there are other things you want changed in | |
| 833 | - this version of the container. Exposing the `config.json` file like | |
| 834 | - this means you don’t have to rebuild the container merely to change | |
| 835 | - a value like a mount point, the kernel capability set, and so forth. | |
| 836 | - | |
| 837 | - | |
| 838 | -##### Running the Bundle | |
| 839 | - | |
| 840 | -With the container exported to a bundle like this, you can start it as: | |
| 841 | - | |
| 842 | -``` | |
| 843 | - $ cd /path/to/bundle | |
| 844 | - $ c=fossil-runc ← …or anything else you prefer | |
| 845 | - $ sudo runc create $c | |
| 846 | - $ sudo runc start $c | |
| 847 | - $ sudo runc exec $c -t sh -l | |
| 848 | - ~ $ ls museum | |
| 849 | - repo.fossil | |
| 850 | - ~ $ ps -eaf | |
| 851 | - PID USER TIME COMMAND | |
| 852 | - 1 fossil 0:00 bin/fossil server --create … | |
| 853 | - ~ $ exit | |
| 854 | - $ sudo runc kill $c | |
| 855 | - $ sudo runc delete $c | |
| 856 | -``` | |
| 857 | - | |
| 858 | -If you’re doing this on the export host, the first command is “`cd $b`” | |
| 859 | -if we’re using the variables from the shell script above. Alternately, | |
| 860 | -the `runc` subcommands that need to read the bundle files take a | |
| 861 | -`--bundle/-b` flag to let you avoid switching directories. | |
| 862 | - | |
| 863 | -The rest should be straightforward: create and start the container as | |
| 864 | -root so the `chroot(2)` call inside the container will succeed, then get | |
| 865 | -into it with a login shell and poke around to prove to ourselves that | |
| 866 | -everything is working properly. It is. Yay! | |
| 867 | - | |
| 868 | -The remaining commands show shutting the container down and destroying | |
| 869 | -it, simply to show how these commands change relative to using the | |
| 870 | -Docker Engine commands. It’s “kill,” not “stop,” and it’s “delete,” not | |
| 871 | -“rm.” | |
| 872 | - | |
| 873 | -[ecg]: https://github.com/opencontainers/runc/pull/3131 | |
| 874 | -[jq]: https://stedolan.github.io/jq/ | |
| 875 | - | |
| 876 | - | |
| 877 | -##### Lack of Layer Sharing | |
| 878 | - | |
| 879 | -The bundle export process collapses Docker’s union filesystem down to a | |
| 880 | -single layer. Atop that, it makes all files mutable. | |
| 881 | - | |
| 882 | -All of this is fine for tiny remote hosts with a single container, or at | |
| 883 | -least one where none of the containers share base layers. Where it | |
| 884 | -becomes a problem is when you have multiple Fossil containers on a | |
| 885 | -single host, since they all derive from the same base image. | |
| 886 | - | |
| 887 | -The full-featured container runtimes above will intelligently share | |
| 888 | -these immutable base layers among the containers, storing only the | |
| 889 | -differences in each individual container. More, when pulling images from | |
| 890 | -a registry host, they’ll transfer only the layers you don’t have copies | |
| 891 | -of locally, so you don’t have to burn bandwidth sending copies of Alpine | |
| 892 | -and BusyBox each time, even though they’re unlikely to change from one | |
| 893 | -build to the next. | |
| 894 | - | |
| 895 | - | |
| 896 | -#### 6.3.2 <a id="crun"></a>`crun` | |
| 897 | - | |
| 898 | -In the same way that [Docker Engine is based on `runc`](#runc), Podman’s | |
| 899 | -engine is based on [`crun`][crun], a lighter-weight alternative to | |
| 900 | -`runc`. It’s only 1.4 MiB on the system I tested it on, yet it will run | |
| 901 | -the same container bundles as in my `runc` examples above. We saved | |
| 902 | -more than that by compressing the container’s Fossil executable with | |
| 903 | -UPX, making the runtime virtually free in this case. The only question | |
| 904 | -is whether you can put up with its limitations, which are the same as | |
| 905 | -for `runc`. | |
| 906 | - | |
| 907 | -[crun]: https://github.com/containers/crun | |
| 908 | - | |
| 909 | - | |
| 910 | -#### 6.3.3 <a id="nspawn"></a>`systemd-nspawn` | |
| 911 | - | |
| 912 | -As of `systemd` version 242, its optional `nspawn` piece | |
| 913 | -[reportedly](https://www.phoronix.com/news/Systemd-Nspawn-OCI-Runtime) | |
| 914 | -got the ability to run OCI bundles directly. You might | |
| 915 | -have it installed already, but if not, it’s only about 2 MiB. It’s | |
| 916 | -in the `systemd-containers` package as of Ubuntu 22.04 LTS: | |
| 917 | - | |
| 918 | -``` | |
| 919 | - $ sudo apt install systemd-containers | |
| 920 | -``` | |
| 921 | - | |
| 922 | -It’s also in CentOS Stream 9, under the same name. | |
| 923 | - | |
| 924 | -You create the bundles the same way as with [the `runc` method | |
| 925 | -above](#runc). The only thing that changes are the top-level management | |
| 926 | -commands: | |
| 927 | - | |
| 928 | -``` | |
| 929 | - $ sudo systemd-nspawn \ | |
| 930 | - --oci-bundle=/var/lib/machines/fossil \ | |
| 931 | - --machine=fossil \ | |
| 932 | - --network-veth \ | |
| 933 | - --port=127.0.0.1:127.0.0.1:9999:8080 | |
| 934 | - $ sudo machinectl list | |
| 935 | - No machines. | |
| 936 | -``` | |
| 937 | - | |
| 938 | -This is why I wrote “reportedly” above: I couldn’t get it to work on two different | |
| 939 | -Linux distributions, and I can’t see why. I’m leaving this here to give | |
| 940 | -someone else a leg up, with the hope that they will work out what’s | |
| 941 | -needed to get the container running and registered with `machinectl`. | |
| 942 | - | |
| 943 | -As of this writing, the tool expects an OCI container version of | |
| 944 | -“1.0.0”. I had to edit this at the top of my `config.json` file to get | |
| 945 | -the first command to read the bundle. The fact that it errored out when | |
| 946 | -I had “`1.0.2-dev`” in there proves it’s reading the file, but it | |
| 947 | -doesn’t seem able to make sense of what it finds there, and it doesn’t | |
| 948 | -give any diagnostics to say why. | |
| 949 | - | |
| 709 | +### 6.3 <a id="nspawn"></a>`systemd-container` | |
| 710 | + | |
| 711 | +If even the Podman stack is too big for you, the next-best option I’m | |
| 712 | +aware of is the `systemd-container` infrastructure on modern Linuxes, | |
| 713 | +available since version 239 or so. Its runtime tooling requires only | |
| 714 | +about 1.4 MiB of disk space: | |
| 715 | + | |
| 716 | +``` | |
| 717 | + $ sudo apt install systemd-container btrfs-progs |
| 718 | +``` | |
| 719 | + | |
| 720 | +That command assumes the primary test environment for | |
| 721 | +this guide, Ubuntu 22.04 LTS with `systemd` 249. For best | |
| 722 | +results, `/var/lib/machines` should be a btrfs volume, because | |
| 723 | +[`$REASONS`][mcfad]. (For CentOS Stream 9 and other Red Hattish | |
| 724 | +systems, you will have to make several adjustments, which we’ve |
| 725 | +collected [below](#nspawn-rhel) to keep these examples clear.) |
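
To check whether that volume is already btrfs, `stat -f` prints the filesystem type; anything other than `btrfs` (e.g. `ext2/ext3`, which is how it reports ext4) means you won’t get the cheap copy-on-write behavior `machinectl` wants:

```
 $ stat -f -c %T /var/lib/machines
 btrfs
```
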
| 726 | + | |
| 727 | +The first configuration step is to convert the Docker container into | |
| 728 | +a “machine”, as systemd calls it. The easiest method is: | |
| 729 | + | |
| 730 | +``` | |
| 731 | + $ make container-run | |
| 732 | + $ docker container export fossil-e119d5983620 | | |
| 733 | + machinectl import-tar - myproject | |
| 734 | +``` | |
| 735 | + | |
| 736 | +Copy the container name from the first step to the second. Yours will | |
| 737 | +almost certainly be named after a different Fossil commit ID. | |
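
If retyping the generated name annoys you, `docker container ls --latest` can capture it for you. This sketch assumes the Fossil container is the one you created most recently and that it is still running:

```
 $ c=$(docker container ls --latest --format '{{.Names}}')
 $ docker container export $c | machinectl import-tar - myproject
```
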
| 738 | + | |
| 739 | +It’s important that the name of the machine you create — | |
| 740 | +“`myproject`” in this example — matches the base name | |
| 741 | +of the nspawn configuration file you create as the next step. | |
| 742 | +Therefore, to extend the example, the following file needs to be | |
| 743 | +called `/etc/systemd/nspawn/myproject.nspawn`, and it will contain | |
| 744 | +something like: | |
| 745 | + | |
| 746 | +---- | |
| 747 | + | |
| 748 | +``` | |
| 749 | +[Exec] | |
| 750 | +WorkingDirectory=/jail | |
| 751 | +Parameters=bin/fossil server \ | |
| 752 | + --baseurl https://example.com/myproject \ | |
| 753 | + --chroot /jail \ | |
| 754 | + --create \ | |
| 755 | + --jsmode bundled \ | |
| 756 | + --localhost \ | |
| 757 | + --port 9000 \ | |
| 758 | + --scgi \ | |
| 759 | + --user admin \ | |
| 760 | + museum/repo.fossil | |
| 761 | +DropCapability= \ | |
| 762 | + CAP_AUDIT_WRITE \ | |
| 763 | + CAP_CHOWN \ | |
| 764 | + CAP_FSETID \ | |
| 765 | + CAP_KILL \ | |
| 766 | + CAP_MKNOD \ | |
| 767 | + CAP_NET_BIND_SERVICE \ | |
| 768 | + CAP_NET_RAW \ | |
| 769 | + CAP_SETFCAP \ | |
| 770 | + CAP_SETPCAP | |
| 771 | +ProcessTwo=yes | |
| 772 | +LinkJournal=no | |
| 773 | +Timezone=no | |
| 774 | + | |
| 775 | +[Files] | |
| 776 | +Bind=/home/fossil/museum/myproject:/jail/museum | |
| 777 | + | |
| 778 | +[Network] | |
| 779 | +VirtualEthernet=no | |
| 780 | +``` | |
| 781 | + | |
| 782 | +---- | |
| 783 | + | |
| 784 | +If you recognize most of that from the `Dockerfile` discussion above, | |
| 785 | +congratulations, you’ve been paying attention. The rest should also | |
| 786 | +be clear from context. | |
| 787 | + | |
| 788 | +Some of this is expected to vary. For one, the command given in the | |
| 789 | +`Parameters` directive assumes [SCGI proxying via nginx][DNT]. For | |
| 790 | +other use cases, see our collection of [Fossil server configuration | |
| 791 | +guides][srv], then adjust the command to your local needs. | |
| 792 | +For another, you will likely have to adjust the `Bind` value to | |
| 793 | +point at the directory containing the `repo.fossil` file referenced | |
| 794 | +in the command. | |
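
The host side of that `Bind=` mapping must exist before the first start. A minimal sketch, assuming you’re seeding the service with a repository copied from elsewhere; the source path is a placeholder:

```
 $ sudo mkdir -p /home/fossil/museum/myproject
 $ sudo cp /path/to/repo.fossil /home/fossil/museum/myproject/
```
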
| 795 | + | |
| 796 | +We also need a generic systemd unit file called | |
| 797 | +`/etc/systemd/system/fossil@.service`, containing: |
| 798 | + | |
| 799 | +---- | |
| 800 | + | |
| 801 | +``` | |
| 802 | +[Unit] | |
| 803 | +Description=Fossil %i Repo Service | |
| 804 | +Requires=modprobe@tun.service modprobe@loop.service |
| 805 | +After=network.target systemd-resolved.service modprobe@tun.service modprobe@loop.service |
| 806 | + | |
| 807 | +[Service] | |
| 808 | +ExecStart=systemd-nspawn --settings=override --read-only --machine=%i bin/fossil | |
| 809 | + | |
| 810 | +[Install] | |
| 811 | +WantedBy=multi-user.target | |
| 812 | +``` | |
| 813 | + | |
| 814 | +---- | |
| 815 | + | |
| 816 | +You shouldn’t have to change any of this because we’ve given the | |
| 817 | +`--settings=override` flag, meaning any setting in the nspawn file |
| 818 | +overrides the setting passed to `systemd-nspawn`. This arrangement | |
| 819 | +not only keeps the unit file simple, it allows multiple services to | |
| 820 | +share the base configuration, varying on a per-repo level. | |
| 821 | + | |
| 822 | +Start the service in the normal way: | |
| 823 | + | |
| 824 | +``` | |
| 825 | + $ sudo systemctl enable fossil@myproject | |
| 826 | + $ sudo systemctl start fossil@myproject | |
| 827 | +``` | |
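
To check that it came up under the SCGI configuration above, something like this should suffice; the last command should show a listener on localhost port 9000 (output elided):

```
 $ sudo systemctl status fossil@myproject
 $ sudo ss -tln | grep 9000
```
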
| 828 | + | |
| 829 | +You should find it running on localhost port 9000 per the nspawn | |
| 830 | +configuration file above, suitable for proxying Fossil out to the | |
| 831 | +public using nginx, via SCGI. If you aren’t using a front-end proxy | |
| 832 | +and want Fossil exposed to the world, you might say this instead in | |
| 833 | +the `nspawn` file: | |
| 834 | + | |
| 835 | +``` | |
| 836 | +Parameters=bin/fossil server \ | |
| 837 | + --cert /path/to/my/fullchain.pem \ | |
| 838 | + --chroot /jail \ | |
| 839 | + --create \ | |
| 840 | + --jsmode bundled \ | |
| 841 | + --port 443 \ | |
| 842 | + --user admin \ | |
| 843 | + museum/repo.fossil | |
| 844 | +``` | |
| 845 | + | |
| 846 | +You would also need to un-drop the `CAP_NET_BIND_SERVICE` capability | |
| 847 | +to allow Fossil to bind to this low-numbered port. | |
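
For completeness, the nginx side of the SCGI arrangement might look like the following, borrowing the shape of our [Debian + nginx + TLS][DNT] guide; the location path and port are the examples used on this page, so adjust both to your site:

```
location /myproject {
    include scgi_params;
    scgi_pass 127.0.0.1:9000;
    scgi_param SCRIPT_NAME "/myproject";
}
```
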
| 848 | + | |
| 849 | +We use systemd’s template file feature to allow multiple Fossil | |
| 850 | +servers to run on a single machine, each on a different TCP port, |
| 851 | +as when proxying them out as subdirectories of a larger site. | |
| 852 | +To add another project, you must first clone the base “machine” layer: | |
| 853 | + | |
| 854 | +``` | |
| 855 | + $ sudo machinectl clone myproject otherthing | |
| 856 | +``` | |
| 857 | + | |
| 858 | +That will not only create a clone of `/var/lib/machines/myproject` | |
| 859 | +as `../otherthing`, it will also create a matching `nspawn` file for you |
| 860 | +as a copy of the first one. Adjust its contents to suit, then enable | |
| 861 | +and start it as above. | |
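
After the clone, the copied nspawn file still names the first project. A hedged sketch of the retargeting, run here against a scratch copy so it’s safe to experiment with; in real use the file is `/etc/systemd/nspawn/otherthing.nspawn` and needs `sudo`, and the `sed` patterns assume the example file above, so review the result rather than trusting them blindly:

```shell
# Work on a scratch copy; the real file lives in /etc/systemd/nspawn/.
demo=$(mktemp -d)
cat > "$demo/otherthing.nspawn" <<'EOF'
Parameters=bin/fossil server --port 9000 museum/repo.fossil
Bind=/home/fossil/museum/myproject:/jail/museum
EOF

# Retarget the clone: new project name in the Bind path, fresh TCP port.
sed -i \
    -e 's/myproject/otherthing/g' \
    -e 's/--port 9000/--port 9001/' \
    "$demo/otherthing.nspawn"

cat "$demo/otherthing.nspawn"
```

In real use you would follow this with `sudo systemctl enable --now fossil@otherthing`.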
| 862 | + | |
| 863 | +[mcfad]: https://www.freedesktop.org/software/systemd/man/machinectl.html#Files%20and%20Directories | |
| 864 | + | |
| 865 | + | |
| 866 | +#### 6.3.1 <a id="nspawn-rhel"></a>Getting It Working on a RHEL Clone |
| 867 | + | |
| 868 | +The biggest difference between doing this on OSes like CentOS versus | |
| 869 | +Ubuntu is that RHEL (thus also its clones) doesn’t ship btrfs in | |
| 870 | +its kernel, and so has no option for installing `mkfs.btrfs`, which |
| 871 | +[`machinectl`][mctl] needs for various purposes. | |
| 872 | + | |
| 873 | +Fortunately, there are workarounds. | |
| 874 | + | |
| 875 | +First, the `apt install` command above becomes: | |
| 876 | + | |
| 877 | +``` | |
| 878 | + $ sudo dnf install systemd-container | |
| 879 | +``` | |
| 880 | + | |
| 881 | +Second, you have to hack around the lack of `machinectl import-tar` like so: |
| 882 | + | |
| 883 | +``` | |
| 884 | + $ rootfs=/var/lib/machines/fossil | |
| 885 | + $ sudo mkdir -p $rootfs | |
| 886 | + $ docker container export fossil | sudo tar -C $rootfs -xf - |
| 887 | +``` | |
| 888 | + | |
| 889 | +The parent directory path in the `rootfs` variable is important, | |
| 890 | +because although we aren’t using `machinectl`, the `systemd-nspawn` | |
| 891 | +developers assume you’re using them together. Thus, when you give | |
| 892 | +`--machine`, it assumes the `machinectl` directory scheme. You could | |
| 893 | +instead use `--directory`, allowing you to store the rootfs wherever |
| 894 | +you like, but why make things difficult? It’s a perfectly sensible | |
| 895 | +default, consistent with the [LHS] rules. | |
| 896 | + | |
| 897 | +The final element — the machine name — can be anything | |
| 898 | +you like so long as it matches the nspawn file’s base name. | |
| 899 | + | |
| 900 | +Finally, since you can’t use `machinectl clone`, you have to make | |
| 901 | +a wasteful copy of `/var/lib/machines/myproject` when standing up | |
| 902 | +multiple Fossil repo services on a single machine. (This is one | |
| 903 | +of the reasons `machinectl` depends on `btrfs`: cheap copy-on-write | |
| 904 | +subvolumes.) Because we give the `--read-only` flag, you can simply | |
| 905 | +`cp -r` one machine to a new name rather than go through the | |
| 906 | +export-and-import dance you used to create the first one. | |
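
Concretely, the manual equivalent of `machinectl clone` on these systems might be, with paths per the examples above:

```
 $ sudo cp -r /var/lib/machines/myproject /var/lib/machines/otherthing
 $ sudo cp /etc/systemd/nspawn/myproject.nspawn /etc/systemd/nspawn/otherthing.nspawn
```

…followed by editing the new nspawn file to suit, as before.
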
| 907 | + | |
| 908 | +[LHS]: https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html | |
| 909 | +[mctl]: https://www.freedesktop.org/software/systemd/man/machinectl.html | |
| 910 | + | |
| 911 | + | |
| 912 | +#### 6.3.2 <a id="nspawn-weaknesses"></a>What Am I Missing Out On? |
| 913 | + | |
| 914 | +For all the runtime size savings in this method, you may be wondering | |
| 915 | +what you’re missing out on relative to Podman, which takes up | |
| 916 | +roughly 27× more disk space. Short answer: lots. Long answer: | |
| 917 | + | |
| 918 | +1. **Build system.** You’ll have to build and test your containers | |
| 919 | + some other way. This method is only suitable for running them | |
| 920 | + once they’re built. | |
| 921 | + | |
| 922 | +2. **Orchestration.** All of the higher-level things like | |
| 923 | + “compose” files, Docker Swarm mode, and Kubernetes are | |
| 924 | + unavailable to you at this level. You can run multiple | |
| 925 | + instances of Fossil, but on a single machine only and with a | |
| 926 | + static configuration. | |
| 927 | + | |
| 928 | +3. **Image layer sharing.** When you update an image using one of the | |
| 929 | + above methods, Docker and Podman are smart enough to copy only | |
| 930 | + changed layers. Furthermore, when you base multiple containers | |
| 931 | + on a single image, they don’t make copies of the base layers; | |
| 932 | + they can share them, because base layers are immutable, thus | |
| 933 | + cannot cross-contaminate. | |
| 934 | + | |
| 935 | +   Because we use `systemd-nspawn --read-only`, we get *some* |
| 936 | + of this benefit, particularly when using `machinectl` with | |
| 937 | + `/var/lib/machines` as a btrfs volume. Even so, the disk space | |
| 938 | + and network I/O optimizations go deeper in the Docker and Podman | |
| 939 | + worlds. | |
| 940 | + | |
| 941 | +4. **Tooling.** Hand-creating and modifying those systemd | |
| 942 | +   files sucks compared to “`podman container create ...`”. This |
| 943 | + is but one of many affordances you will find in the runtimes | |
| 944 | + aimed at daily-use devops warriors. | |
| 945 | + | |
| 946 | +5. **Network virtualization.** In the scheme above, we turn off the | |
| 947 | +   `systemd` virtual networking support because in its default mode, |
| 948 | + it wants to hide the service entirely. | |
| 949 | + | |
| 950 | + Another way to put this is that `systemd-nspawn --port` does | |
| 951 | + approximately *nothing* of what `docker create --publish` does | |
| 952 | + despite their superficial similarities. | |
| 953 | + | |
| 954 | + For this container, it doesn’t much matter, since it exposes | |
| 955 | + only a single port, and we do want that one port exposed, one way | |
| 956 | + or another. Beyond that, we get all the control we need using | |
| 957 | + Fossil options like `--localhost`. I point this out because in | |
| 958 | + more complex situations, the automatic network setup features of | |
| 959 | + the more featureful runtimes can save a lot of time and hassle. | |
| 960 | + They aren’t doing anything you couldn’t do by hand, but why | |
| 961 | + would you want to, given the choice? | |
| 962 | + | |
| 963 | +I expect there’s a lot more I neglected to think of when creating | |
| 964 | +this list, but I think it suffices to make my case as it is. If you | |
| 965 | +can afford the space of Podman or Docker, I strongly recommend using | |
| 966 | +either of them over the much lower-level `systemd-container` | |
| 967 | +infrastructure. | |
| 968 | + | |
| 969 | +(Incidentally, these are essentially the same reasons why we no longer | |
| 970 | +talk about the `crun` tool underpinning Podman in this document. It’s | |
| 971 | +even more limited, making it even more difficult to administer while | |
| 972 | +providing no runtime size advantage. The `runc` tool underpinning | |
| 973 | +Docker is even worse on this score, being scarcely easier to use than | |
| 974 | +`crun` while having a much larger footprint.) | |
| 975 | + | |
| 976 | + | |
| 977 | +#### 6.3.3 <a id="nspawn-assumptions"></a>Violated Assumptions |
| 978 | + | |
| 979 | +The `systemd-container` infrastructure has a bunch of hard-coded | |
| 980 | +assumptions baked into it. We papered over these problems above, | |
| 981 | +but if you’re using these tools for other purposes on the machine | |
| 982 | +you’re serving Fossil from, you may need to know which assumptions | |
| 983 | +our container violates and the resulting consequences: | |
| 984 | + | |
| 985 | +1. `systemd-nspawn` works best with `machinectl`, but if you haven’t | |
| 986 | + got `btrfs` available, you run into [trouble](#nspawn-rhel). | |
| 987 | + | |
| 988 | +2. Our stock container starts a single static executable inside | |
| 989 | + a stripped-to-the-bones container rather than “boot” an OS | |
| 990 | + image, causing a bunch of commands to fail: | |
| 991 | + | |
| 992 | + * **`machinectl poweroff`** will fail because the container | |
| 993 | + isn’t running dbus. | |
| 994 | + * **`machinectl start`** will try to find an `/sbin/init` | |
| 995 | + program in the rootfs, which we haven’t got. We could | |
| 996 | + rename `/jail/bin/fossil` to `/sbin/init` and then hack | |
| 997 | + the chroot scheme to match, but ick. (This, incidentally, | |
| 998 | + is why we set `ProcessTwo=yes` above even though Fossil is | |
| 999 | + perfectly capable of running as PID 1, a fact we depend on | |
| 1000 | + in the other methods above.) | |
| 1001 | + * **`machinectl shell`** will fail because there is no login | |
| 1002 | + daemon running, which we purposefully avoided adding by | |
| 1003 | + creating a “`FROM scratch`” container. (If you need a | |
| 1004 | + shell, say: `sudo systemd-nspawn --machine=myproject /bin/sh`) | |
| 1005 | + * **`machinectl status`** won’t give you the container logs | |
| 1006 | + because we disabled the shared journal, which was in turn | |
| 1007 | + necessary because we don’t run `systemd` *inside* the | |
| 1008 | + container, just outside. | |
| 1009 | + | |
| 1010 | + If these are problems for you, you may wish to build a | |
| 1011 | + fatter container using `debootstrap` or similar. ([External | |
| 1012 | + tutorial][medtut].) | |
| 1013 | + | |
| 1014 | +3. We disable the “private networking” feature since the whole | |
| 1015 | + point of this container is to expose a network service to the | |
| 1016 | + public, one way or another. If you do things the way the defaults | |
| 1017 | + (and thus the official docs) expect, you must push through | |
| 1018 | + [a whole lot of complexity][ndcmp] to re-expose this single | |
| 1019 | + network port. That complexity is justified only if your service | |
| 1020 | + is itself complex, having both private and public service ports. | |
| 1021 | + | |
| 1022 | +[medtut]: https://medium.com/@huljar/setting-up-containers-with-systemd-nspawn-b719cff0fb8d | |
| 1023 | +[ndcmp]: https://wiki.archlinux.org/title/systemd-networkd#Usage_with_containers | |
| 950 | 1024 | |
| 951 | 1025 | <div style="height:50em" id="this-space-intentionally-left-blank"></div> |
| 952 | 1026 |
| 709 | |
| 710 | If even the Podman stack is too big for you, you still have options for |
| 711 | running containers that are considerably slimmer, at a high cost to |
| 712 | administration complexity and loss of features. |
| 713 | |
| 714 | Part of the OCI standard is the notion of a “bundle,” being a consistent |
| 715 | way to present a pre-built and configured container to the runtime. |
| 716 | Essentially, it consists of a directory containing a `config.json` file |
| 717 | and a `rootfs/` subdirectory containing the root filesystem image. Many |
| 718 | tools can produce these for you. We’ll show only one method in the first |
| 719 | section below, then reuse that in the following sections. |
| 720 | |
| 721 | |
| 722 | #### 6.3.1 <a id="runc"></a>`runc` |
| 723 | |
| 724 | We mentioned `runc` [above](#nerdctl), but it’s possible to use it |
| 725 | standalone, without `containerd` or its CLI frontend `nerdctl`. You also |
| 726 | lose the build engine, intelligent image layer sharing, image registry |
| 727 | connections, and much more. The plus side is that `runc` alone is |
| 728 | 18 MiB. |
| 729 | |
| 730 | Using it without all the support tooling isn’t complicated, but it *is* |
| 731 | cryptic enough to want a shell script. Let’s say we want to build on our |
| 732 | big desktop machine but ship the resulting container to a small remote |
| 733 | host. This should serve: |
| 734 | |
| 735 | |
| 736 | ```shell |
| 737 | #!/bin/bash -ex |
| 738 | c=fossil |
| 739 | b=/var/lib/machines/$c |
| 740 | h=my-host.example.com |
| 741 | m=/run/containerd/io.containerd.runtime.v2.task/moby |
| 742 | t=$(mktemp -d /tmp/$c-bundle.XXXXXX) |
| 743 | |
| 744 | if [ -d "$t" ] |
| 745 | then |
| 746 | docker container start $c |
| 747 | docker container export $c > $t/rootfs.tar |
| 748 | id=$(docker inspect --format="{{.Id}}" $c) |
| 749 | sudo cat $m/$id/config.json \ |
| 750 | | jq '.root.path = "'$b/rootfs'"' \ |
| 751 | | jq '.linux.cgroupsPath = ""' \ |
| 752 | | jq 'del(.linux.sysctl)' \ |
| 753 | | jq 'del(.linux.namespaces[] | select(.type == "network"))' \ |
| 754 | | jq 'del(.mounts[] | select(.destination == "/etc/hostname"))' \ |
| 755 | | jq 'del(.mounts[] | select(.destination == "/etc/resolv.conf"))' \ |
| 756 | | jq 'del(.mounts[] | select(.destination == "/etc/hosts"))' \ |
| 757 | | jq 'del(.hooks)' > $t/config.json |
| 758 | scp -r $t $h:tmp |
| 759 | ssh -t $h "{ |
| 760 | mv ./$t/config.json $b && |
| 761 | sudo tar -C $b/rootfs -xf ./$t/rootfs.tar && |
| 762 | rm -r ./$t |
| 763 | }" |
| 764 | rm -r $t |
| 765 | fi |
| 766 | ``` |
| 767 | |
| 768 | |
| 769 | The first several lines list configurables: |
| 770 | |
| 771 | * **`c`**: the name of the Docker container you’re bundling up for use |
| 772 | with `runc` |
| 773 | * **`b`**: the path of the exported container, called the “bundle” in |
| 774 | OCI jargon; we’re using the [`nspawn`](#nspawn) convention, a |
| 775 | reasonable choice under the [Linux FHS rules][LFHS] |
| 776 | * **`h`**: the remote host name |
| 777 | * **`m`**: the local directory holding the running machines, configurable |
| 778 | because: |
| 779 | * the path name is longer than we want to use inline |
| 780 | * it’s been known to change from one version of Docker to the next |
| 781 | * you might be building and testing with [Podman](#podman), so it |
| 782 | has to be “`/run/user/$UID/crun`” instead |
| 783 | * **`t`**: the temporary bundle directory we populate locally, then |
| 784 | `scp` to the remote machine, where it’s unpacked |
| 785 | |
| 786 | [LFHS]: https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard |
| 787 | |
| 788 | |
| 789 | ##### Why All That `sudo` Stuff? |
| 790 | |
| 791 | This script uses `sudo` for two different purposes: |
| 792 | |
| 793 | 1. To read the local `config.json` file out of the `containerd` managed |
| 794 | directory, which is owned by `root` on Docker systems. Additionally, |
| 795 | that input file is only available while the container is started, so |
| 796 | we must ensure that before extracting it. |
| 797 | |
| 798 | 2. To unpack the bundle onto the remote machine. If you try to get |
| 799 | clever and unpack it locally, then `rsync` it to the remote host to |
| 800 | avoid re-copying files that haven’t changed since the last update, |
| 801 | you’ll find that it fails when it tries to copy device nodes, to |
| 802 | create files owned only by the remote root user, and so forth. If the |
| 803 | container bundle is small, it’s simpler to re-copy and unpack it |
| 804 | fresh each time. |
| 805 | |
| 806 | I point all this out because it might ask for your password twice: once for |
| 807 | the local sudo command, and once for the remote. |
| 808 | |
| 809 | |
| 810 | |
| 811 | ##### Why All That `jq` Stuff? |
| 812 | |
| 813 | We’re using [jq] for two separate purposes: |
| 814 | |
| 815 | 1. To automatically transmogrify Docker’s container configuration so it |
| 816 | will work with `runc`: |
| 817 | |
| 818 | * point it where we unpacked the container’s exported rootfs |
| 819 | * accede to its wish to [manage cgroups by itself][ecg] |
| 820 | * remove the `sysctl` calls that will break after… |
| 821 | * …we remove the network namespace to allow Fossil’s TCP listening |
| 822 | port to be available on the host; `runc` doesn’t offer the |
| 823 | equivalent of `docker create --publish`, and we can’t be |
| 824 | bothered to set up a manual mapping from the host port into the |
| 825 | container |
| 826 | * remove file bindings that point into the local runtime managed |
| 827 | directories; one of the things we give up by using a bare |
| 828 | container runner is automatic management of these files |
| 829 | * remove the hooks for essentially the same reason |
| 830 | |
| 831 | 2. To make the Docker-managed machine-readable `config.json` more |
| 832 | human-readable, in case there are other things you want changed in |
| 833 | this version of the container. Exposing the `config.json` file like |
| 834 | this means you don’t have to rebuild the container merely to change |
| 835 | a value like a mount point, the kernel capability set, and so forth. |
| 836 | |
| 837 | |
| 838 | ##### Running the Bundle |
| 839 | |
| 840 | With the container exported to a bundle like this, you can start it as: |
| 841 | |
| 842 | ``` |
| 843 | $ cd /path/to/bundle |
| 844 | $ c=fossil-runc ← …or anything else you prefer |
| 845 | $ sudo runc create $c |
| 846 | $ sudo runc start $c |
| 847 | $ sudo runc exec $c -t sh -l |
| 848 | ~ $ ls museum |
| 849 | repo.fossil |
| 850 | ~ $ ps -eaf |
| 851 | PID USER TIME COMMAND |
| 852 | 1 fossil 0:00 bin/fossil server --create … |
| 853 | ~ $ exit |
| 854 | $ sudo runc kill $c |
| 855 | $ sudo runc delete $c |
| 856 | ``` |
| 857 | |
| 858 | If you’re doing this on the export host, the first command is “`cd $b`” |
| 859 | if we’re using the variables from the shell script above. Alternately, |
| 860 | the `runc` subcommands that need to read the bundle files take a |
| 861 | `--bundle/-b` flag to let you avoid switching directories. |
| 862 | |
| 863 | The rest should be straightforward: create and start the container as |
| 864 | root so the `chroot(2)` call inside the container will succeed, then get |
| 865 | into it with a login shell and poke around to prove to ourselves that |
| 866 | everything is working properly. It is. Yay! |
| 867 | |
| 868 | The remaining commands show shutting the container down and destroying |
| 869 | it, simply to show how these commands change relative to using the |
| 870 | Docker Engine commands. It’s “kill,” not “stop,” and it’s “delete,” not |
| 871 | “rm.” |
| 872 | |
| 873 | [ecg]: https://github.com/opencontainers/runc/pull/3131 |
| 874 | [jq]: https://stedolan.github.io/jq/ |
| 875 | |
| 876 | |
| 877 | ##### Lack of Layer Sharing |
| 878 | |
| 879 | The bundle export process collapses Docker’s union filesystem down to a |
| 880 | single layer. Atop that, it makes all files mutable. |
| 881 | |
| 882 | All of this is fine for tiny remote hosts with a single container, or at |
| 883 | least one where none of the containers share base layers. Where it |
| 884 | becomes a problem is when you have multiple Fossil containers on a |
| 885 | single host, since they all derive from the same base image. |
| 886 | |
| 887 | The full-featured container runtimes above will intelligently share |
| 888 | these immutable base layers among the containers, storing only the |
| 889 | differences in each individual container. More, when pulling images from |
| 890 | a registry host, they’ll transfer only the layers you don’t have copies |
| 891 | of locally, so you don’t have to burn bandwidth sending copies of Alpine |
| 892 | and BusyBox each time, even though they’re unlikely to change from one |
| 893 | build to the next. |
| 894 | |
| 895 | |
| 896 | #### 6.3.2 <a id="crun"></a>`crun` |
| 897 | |
| 898 | In the same way that [Docker Engine is based on `runc`](#runc), Podman’s |
| 899 | engine is based on [`crun`][crun], a lighter-weight alternative to |
| 900 | `runc`. It’s only 1.4 MiB on the system I tested it on, yet it will run |
| 901 | the same container bundles as in my `runc` examples above. We saved |
| 902 | more than that by compressing the container’s Fossil executable with |
| 903 | UPX, making the runtime virtually free in this case. The only question |
| 904 | is whether you can put up with its limitations, which are the same as |
| 905 | for `runc`. |
| 906 | |
| 907 | [crun]: https://github.com/containers/crun |
| 908 | |
| 909 | |
| 910 | #### 6.3.3 <a id="nspawn"></a>`systemd-nspawn` |
| 911 | |
| 912 | As of `systemd` version 242, its optional `nspawn` piece |
| 913 | [reportedly](https://www.phoronix.com/news/Systemd-Nspawn-OCI-Runtime) |
| 914 | got the ability to run OCI bundles directly. You might |
| 915 | have it installed already, but if not, it’s only about 2 MiB. It’s |
| 916 | in the `systemd-containers` package as of Ubuntu 22.04 LTS: |
| 917 | |
| 918 | ``` |
| 919 | $ sudo apt install systemd-containers |
| 920 | ``` |
| 921 | |
| 922 | It’s also in CentOS Stream 9, under the same name. |
| 923 | |
| 924 | You create the bundles the same way as with [the `runc` method |
| 925 | above](#runc). The only thing that changes are the top-level management |
| 926 | commands: |
| 927 | |
| 928 | ``` |
| 929 | $ sudo systemd-nspawn \ |
| 930 | --oci-bundle=/var/lib/machines/fossil \ |
| 931 | --machine=fossil \ |
| 932 | --network-veth \ |
| 933 | --port=127.0.0.1:127.0.0.1:9999:8080 |
| 934 | $ sudo machinectl list |
| 935 | No machines. |
| 936 | ``` |
| 937 | |
| 938 | This is why I wrote “reportedly” above: I couldn’t get it to work on two different |
| 939 | Linux distributions, and I can’t see why. I’m leaving this here to give |
| 940 | someone else a leg up, with the hope that they will work out what’s |
| 941 | needed to get the container running and registered with `machinectl`. |
| 942 | |
| 943 | As of this writing, the tool expects an OCI container version of |
| 944 | “1.0.0”. I had to edit this at the top of my `config.json` file to get |
| 945 | the first command to read the bundle. The fact that it errored out when |
| 946 | I had “`1.0.2-dev`” in there proves it’s reading the file, but it |
| 947 | doesn’t seem able to make sense of what it finds there, and it doesn’t |
| 948 | give any diagnostics to say why. |
| 949 | |
| 950 | |
| 951 | <div style="height:50em" id="this-space-intentionally-left-blank"></div> |
| 952 |
| --- www/containers.md | |
| +++ www/containers.md | |
| @@ -484,11 +484,11 @@ | |
| 484 | that’s still a big chunk of your storage budget. It takes 100:1 overhead |
| 485 | just to run a 4 MiB Fossil server container? Once again, I wouldn’t |
| 486 | blame you if you noped right on out of here, but if you will be patient, |
| 487 | you will find that there are ways to run Fossil inside a container even |
| 488 | on entry-level cloud VPSes. These are well-suited to running Fossil; you |
| 489 | don’t have to resort to [raw Fossil service][srv] to succeed, |
| 490 | leaving the benefits of containerization to those with bigger budgets. |
| 491 | |
| 492 | For the sake of simple examples in this section, we’ll assume you’re |
| 493 | integrating Fossil into a larger web site, such as with our [Debian + |
| 494 | nginx + TLS][DNT] plan. This is why all of the examples below create |
| @@ -521,10 +521,11 @@ | |
| 521 | this idea to the rest of your site.) |
| 522 | |
| 523 | [DD]: https://www.docker.com/products/docker-desktop/ |
| 524 | [DE]: https://docs.docker.com/engine/ |
| 525 | [DNT]: ./server/debian/nginx.md |
| 526 | [srv]: ./server/ |
| 527 | |
| 528 | |
| 529 | ### 6.1 <a id="nerdctl" name="containerd"></a>Stripping Docker Engine Down |
| 530 | |
| 531 | The core of Docker Engine is its [`containerd`][ctrd] daemon and the |
| @@ -556,12 +557,12 @@ | |
| 557 | give up the image builder is [Podman]. Initially created by |
| 558 | Red Hat and thus popular on that family of OSes, it will run on |
| 559 | any flavor of Linux. It can even be made to run [on macOS via Homebrew][pmmac] |
| 560 | or [on Windows via WSL2][pmwin]. |
| 561 | |
| 562 | On Ubuntu 22.04, the installation size is about 38 MiB, roughly a |
| 563 | tenth the size of Docker Engine. |
| 564 | |
| 565 | Although Podman [bills itself][whatis] as a drop-in replacement for the |
| 566 | `docker` command and everything that sits behind it, some of the tool’s |
| 567 | design decisions affect how our Fossil containers run, as compared to |
| 568 | using Docker. The most important of these is that, by default, Podman |
| @@ -703,251 +704,322 @@ | |
| 704 | container images across the Internet, it can be a net win in terms of |
| 705 | build time. |
| 706 | |
| 707 | |
| 708 | |
| 709 | ### 6.3 <a id="nspawn"></a>`systemd-container` |
| 710 | |
| 711 | If even the Podman stack is too big for you, the next-best option I’m |
| 712 | aware of is the `systemd-container` infrastructure on modern Linuxes, |
| 713 | available since version 239 or so. Its runtime tooling requires only |
| 714 | about 1.4 MiB of disk space: |
| 715 | |
| 716 | ``` |
| 717 | $ sudo apt install systemd-container btrfs-progs |
| 718 | ``` |
| 719 | |
| 720 | That command assumes the primary test environment for |
| 721 | this guide, Ubuntu 22.04 LTS with `systemd` 249. For best |
| 722 | results, `/var/lib/machines` should be a btrfs volume, because |
| 723 | [`$REASONS`][mcfad]. (For CentOS Stream 9 and other Red Hattish |
| 724 | systems, you will have to make several adjustments, which we’ve |
| 725 | collected [below](#nspawn-rhel) to keep these examples clear.) |
| 726 | |
| 727 | The first configuration step is to convert the Docker container into |
| 728 | a “machine”, as systemd calls it. The easiest method is: |
| 729 | |
| 730 | ``` |
| 731 | $ make container-run |
| 732 | $ docker container export fossil-e119d5983620 | |
| 733 | machinectl import-tar - myproject |
| 734 | ``` |
| 735 | |
| 736 | Copy the container name from the first step to the second. Yours will |
| 737 | almost certainly be named after a different Fossil commit ID. |
| 738 | |
| 739 | It’s important that the name of the machine you create — |
| 740 | “`myproject`” in this example — matches the base name |
| 741 | of the nspawn configuration file you create as the next step. |
| 742 | Therefore, to extend the example, the following file needs to be |
| 743 | called `/etc/systemd/nspawn/myproject.nspawn`, and it will contain |
| 744 | something like: |
| 745 | |
| 746 | ---- |
| 747 | |
| 748 | ``` |
| 749 | [Exec] |
| 750 | WorkingDirectory=/jail |
| 751 | Parameters=bin/fossil server \ |
| 752 | --baseurl https://example.com/myproject \ |
| 753 | --chroot /jail \ |
| 754 | --create \ |
| 755 | --jsmode bundled \ |
| 756 | --localhost \ |
| 757 | --port 9000 \ |
| 758 | --scgi \ |
| 759 | --user admin \ |
| 760 | museum/repo.fossil |
| 761 | DropCapability= \ |
| 762 | CAP_AUDIT_WRITE \ |
| 763 | CAP_CHOWN \ |
| 764 | CAP_FSETID \ |
| 765 | CAP_KILL \ |
| 766 | CAP_MKNOD \ |
| 767 | CAP_NET_BIND_SERVICE \ |
| 768 | CAP_NET_RAW \ |
| 769 | CAP_SETFCAP \ |
| 770 | CAP_SETPCAP |
| 771 | ProcessTwo=yes |
| 772 | LinkJournal=no |
| 773 | Timezone=no |
| 774 | |
| 775 | [Files] |
| 776 | Bind=/home/fossil/museum/myproject:/jail/museum |
| 777 | |
| 778 | [Network] |
| 779 | VirtualEthernet=no |
| 780 | ``` |
| 781 | |
| 782 | ---- |
| 783 | |
| 784 | If you recognize most of that from the `Dockerfile` discussion above, |
| 785 | congratulations, you’ve been paying attention. The rest should also |
| 786 | be clear from context. |
| 787 | |
| 788 | Some of this is expected to vary. For one, the command given in the |
| 789 | `Parameters` directive assumes [SCGI proxying via nginx][DNT]. For |
| 790 | other use cases, see our collection of [Fossil server configuration |
| 791 | guides][srv], then adjust the command to your local needs. |
| 792 | For another, you will likely have to adjust the `Bind` value to |
| 793 | point at the directory containing the `repo.fossil` file referenced |
| 794 | in the command. |
| 795 | |
| 796 | We also need a generic systemd unit file called |
| 797 | `/etc/systemd/system/fossil@.service`, containing: |
| 798 | |
| 799 | ---- |
| 800 | |
| 801 | ``` |
| 802 | [Unit] |
| 803 | Description=Fossil %i Repo Service |
| 804 | Requires=modprobe@tun.service modprobe@loop.service |
| 805 | After=network.target systemd-resolved.service modprobe@tun.service modprobe@loop.service |
| 806 | |
| 807 | [Service] |
| 808 | ExecStart=systemd-nspawn --settings=override --read-only --machine=%i bin/fossil |
| 809 | |
| 810 | [Install] |
| 811 | WantedBy=multi-user.target |
| 812 | ``` |
| 813 | |
| 814 | ---- |
| 815 | |
| 816 | You shouldn’t have to change any of this because we’ve given the |
| 817 | `--settings=override` flag, meaning any setting in the nspawn file |
| 818 | overrides the setting passed to `systemd-nspawn`. This arrangement |
| 819 | not only keeps the unit file simple, it allows multiple services to |
| 820 | share the base configuration, varying on a per-repo level. |
| 821 | |
| 822 | Start the service in the normal way: |
| 823 | |
| 824 | ``` |
| 825 | $ sudo systemctl enable fossil@myproject |
| 826 | $ sudo systemctl start fossil@myproject |
| 827 | ``` |
| 828 | |
| 829 | You should find it running on localhost port 9000 per the nspawn |
| 830 | configuration file above, suitable for proxying Fossil out to the |
| 831 | public using nginx, via SCGI. If you aren’t using a front-end proxy |
| 832 | and want Fossil exposed to the world, you might say this instead in |
| 833 | the `nspawn` file: |
| 834 | |
| 835 | ``` |
| 836 | Parameters=bin/fossil server \ |
| 837 | --cert /path/to/my/fullchain.pem \ |
| 838 | --chroot /jail \ |
| 839 | --create \ |
| 840 | --jsmode bundled \ |
| 841 | --port 443 \ |
| 842 | --user admin \ |
| 843 | museum/repo.fossil |
| 844 | ``` |
| 845 | |
| 846 | You would also need to un-drop the `CAP_NET_BIND_SERVICE` capability |
| 847 | to allow Fossil to bind to this low-numbered port. |
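In nspawn-file terms, that means deleting one line from the `DropCapability=` list shown earlier, leaving something like:

```
DropCapability= \
    CAP_AUDIT_WRITE \
    CAP_CHOWN \
    CAP_FSETID \
    CAP_KILL \
    CAP_MKNOD \
    CAP_NET_RAW \
    CAP_SETFCAP \
    CAP_SETPCAP
```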
| 848 | |
| 849 | We use systemd’s template file feature to allow running multiple |
| 850 | Fossil servers on a single machine, each on a different TCP port, |
| 851 | as when proxying them out as subdirectories of a larger site. |
| 852 | To add another project, you must first clone the base “machine” layer: |
| 853 | |
| 854 | ``` |
| 855 | $ sudo machinectl clone myproject otherthing |
| 856 | ``` |
| 857 | |
| 858 | That will not only create a clone of `/var/lib/machines/myproject` |
| 859 | as `../otherthing`, it will create a matching `nspawn` file for you |
| 860 | as a copy of the first one. Adjust its contents to suit, then enable |
| 861 | and start it as above. |
| 862 | |
| 863 | [mcfad]: https://www.freedesktop.org/software/systemd/man/machinectl.html#Files%20and%20Directories |
| 864 | |
| 865 | |
| 866 | #### 6.3.1 <a id="nspawn-rhel"></a>Getting It Working on a RHEL Clone |
| 867 | |
| 868 | The biggest difference between doing this on OSes like CentOS versus |
| 869 | Ubuntu is that RHEL (thus also its clones) doesn’t ship btrfs in |
| 870 | its kernel, thus has no option for installing `mkfs.btrfs`, which |
| 871 | [`machinectl`][mctl] needs for various purposes. |
| 872 | |
| 873 | Fortunately, there are workarounds. |
| 874 | |
| 875 | First, the `apt install` command above becomes: |
| 876 | |
| 877 | ``` |
| 878 | $ sudo dnf install systemd-container |
| 879 | ``` |
| 880 | |
| 881 | Second, you have to hack around the lack of `machinectl import-tar`, like so: |
| 882 | |
| 883 | ``` |
| 884 | $ rootfs=/var/lib/machines/fossil |
| 885 | $ sudo mkdir -p $rootfs |
| 886 | $ docker container export fossil | sudo tar -C $rootfs -xf - |
| 887 | ``` |
| 888 | |
| 889 | The parent directory path in the `rootfs` variable is important, |
| 890 | because although we aren’t using `machinectl`, the `systemd-nspawn` |
| 891 | developers assume you’re using them together. Thus, when you give |
| 892 | `--machine`, it assumes the `machinectl` directory scheme. You could |
| 893 | instead use `--directory`, allowing you to store the rootfs wherever |
| 894 | you like, but why make things difficult? It’s a perfectly sensible |
| 895 | default, consistent with the [LHS] rules. |
| 896 | |
| 897 | The final element — the machine name — can be anything |
| 898 | you like so long as it matches the nspawn file’s base name. |
| 899 | |
| 900 | Finally, since you can’t use `machinectl clone`, you have to make |
| 901 | a wasteful copy of `/var/lib/machines/myproject` when standing up |
| 902 | multiple Fossil repo services on a single machine. (This is one |
| 903 | of the reasons `machinectl` depends on `btrfs`: cheap copy-on-write |
| 904 | subvolumes.) Because we give the `--read-only` flag, you can simply |
| 905 | `cp -r` one machine to a new name rather than go through the |
| 906 | export-and-import dance you used to create the first one. |
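A sketch of that manual clone, rehearsed under `/tmp/machines-demo` standing in for `/var/lib/machines` so it can be tried without root; substitute the real path (and `sudo`) when doing it for real:

```shell
#!/bin/sh -e
# Stand-in for /var/lib/machines; use the real path plus sudo in production.
machines=/tmp/machines-demo
rm -rf $machines
mkdir -p $machines/myproject/jail/museum   # pretend: the imported rootfs

# The manual equivalent of "machinectl clone myproject otherthing".
cp -r $machines/myproject $machines/otherthing

# Unlike machinectl clone, this does NOT create the matching nspawn file;
# copy and edit /etc/systemd/nspawn/myproject.nspawn yourself.
ls $machines
```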
| 907 | |
| 908 | [LHS]: https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html |
| 909 | [mctl]: https://www.freedesktop.org/software/systemd/man/machinectl.html |
| 910 | |
| 911 | |
| 912 | #### 6.3.2 <a id="nspawn-weaknesses"></a>What Am I Missing Out On? |
| 913 | |
| 914 | For all the runtime size savings in this method, you may be wondering |
| 915 | what you’re missing out on relative to Podman, which takes up |
| 916 | roughly 27× more disk space. Short answer: lots. Long answer: |
| 917 | |
| 918 | 1. **Build system.** You’ll have to build and test your containers |
| 919 | some other way. This method is only suitable for running them |
| 920 | once they’re built. |
| 921 | |
| 922 | 2. **Orchestration.** All of the higher-level things like |
| 923 | “compose” files, Docker Swarm mode, and Kubernetes are |
| 924 | unavailable to you at this level. You can run multiple |
| 925 | instances of Fossil, but on a single machine only and with a |
| 926 | static configuration. |
| 927 | |
| 928 | 3. **Image layer sharing.** When you update an image using one of the |
| 929 | above methods, Docker and Podman are smart enough to copy only |
| 930 | changed layers. Furthermore, when you base multiple containers |
| 931 | on a single image, they don’t make copies of the base layers; |
| 932 | they can share them, because base layers are immutable, thus |
| 933 | cannot cross-contaminate. |
| 934 | |
| 935 | Because we use `systemd-nspawn --read-only`, we get *some* |
| 936 | of this benefit, particularly when using `machinectl` with |
| 937 | `/var/lib/machines` as a btrfs volume. Even so, the disk space |
| 938 | and network I/O optimizations go deeper in the Docker and Podman |
| 939 | worlds. |
| 940 | |
| 941 | 4. **Tooling.** Hand-creating and modifying those systemd |
| 942 | files sucks compared to “`podman container create ...`” This |
| 943 | is but one of many affordances you will find in the runtimes |
| 944 | aimed at daily-use devops warriors. |
| 945 | |
| 946 | 5. **Network virtualization.** In the scheme above, we turn off the |
| 947 | `systemd` virtual networking support because in its default mode, |
| 948 | it wants to hide the service entirely. |
| 949 | |
| 950 | Another way to put this is that `systemd-nspawn --port` does |
| 951 | approximately *nothing* of what `docker create --publish` does |
| 952 | despite their superficial similarities. |
| 953 | |
| 954 | For this container, it doesn’t much matter, since it exposes |
| 955 | only a single port, and we do want that one port exposed, one way |
| 956 | or another. Beyond that, we get all the control we need using |
| 957 | Fossil options like `--localhost`. I point this out because in |
| 958 | more complex situations, the automatic network setup features of |
| 959 | the more featureful runtimes can save a lot of time and hassle. |
| 960 | They aren’t doing anything you couldn’t do by hand, but why |
| 961 | would you want to, given the choice? |
| 962 | |
| 963 | I expect there’s a lot more I neglected to think of when creating |
| 964 | this list, but I think it suffices to make my case as it is. If you |
| 965 | can afford the space of Podman or Docker, I strongly recommend using |
| 966 | either of them over the much lower-level `systemd-container` |
| 967 | infrastructure. |
| 968 | |
| 969 | (Incidentally, these are essentially the same reasons why we no longer |
| 970 | talk about the `crun` tool underpinning Podman in this document. It’s |
| 971 | even more limited, making it even more difficult to administer while |
| 972 | providing no runtime size advantage. The `runc` tool underpinning |
| 973 | Docker is even worse on this score, being scarcely easier to use than |
| 974 | `crun` while having a much larger footprint.) |
| 975 | |
| 976 | |
| 977 | #### 6.3.3 <a id="nspawn-assumptions"></a>Violated Assumptions |
| 978 | |
| 979 | The `systemd-container` infrastructure has a bunch of hard-coded |
| 980 | assumptions baked into it. We papered over these problems above, |
| 981 | but if you’re using these tools for other purposes on the machine |
| 982 | you’re serving Fossil from, you may need to know which assumptions |
| 983 | our container violates and the resulting consequences: |
| 984 | |
| 985 | 1. `systemd-nspawn` works best with `machinectl`, but if you haven’t |
| 986 | got `btrfs` available, you run into [trouble](#nspawn-rhel). |
| 987 | |
| 988 | 2. Our stock container starts a single static executable inside |
| 989 | a stripped-to-the-bones container rather than “boot” an OS |
| 990 | image, causing a bunch of commands to fail: |
| 991 | |
| 992 | * **`machinectl poweroff`** will fail because the container |
| 993 | isn’t running dbus. |
| 994 | * **`machinectl start`** will try to find an `/sbin/init` |
| 995 | program in the rootfs, which we haven’t got. We could |
| 996 | rename `/jail/bin/fossil` to `/sbin/init` and then hack |
| 997 | the chroot scheme to match, but ick. (This, incidentally, |
| 998 | is why we set `ProcessTwo=yes` above even though Fossil is |
| 999 | perfectly capable of running as PID 1, a fact we depend on |
| 1000 | in the other methods above.) |
| 1001 | * **`machinectl shell`** will fail because there is no login |
| 1002 | daemon running, which we purposefully avoided adding by |
| 1003 | creating a “`FROM scratch`” container. (If you need a |
| 1004 | shell, say: `sudo systemd-nspawn --machine=myproject /bin/sh`) |
| 1005 | * **`machinectl status`** won’t give you the container logs |
| 1006 | because we disabled the shared journal, which was in turn |
| 1007 | necessary because we don’t run `systemd` *inside* the |
| 1008 | container, just outside. |
| 1009 | |
| 1010 | If these are problems for you, you may wish to build a |
| 1011 | fatter container using `debootstrap` or similar. ([External |
| 1012 | tutorial][medtut].) |
| 1013 | |
| 1014 | 3. We disable the “private networking” feature since the whole |
| 1015 | point of this container is to expose a network service to the |
| 1016 | public, one way or another. If you do things the way the defaults |
| 1017 | (and thus the official docs) expect, you must push through |
| 1018 | [a whole lot of complexity][ndcmp] to re-expose this single |
| 1019 | network port. That complexity is justified only if your service |
| 1020 | is itself complex, having both private and public service ports. |
| 1021 | |
| 1022 | [medtut]: https://medium.com/@huljar/setting-up-containers-with-systemd-nspawn-b719cff0fb8d |
| 1023 | [ndcmp]: https://wiki.archlinux.org/title/systemd-networkd#Usage_with_containers |
| 1024 | |
| 1025 | <div style="height:50em" id="this-space-intentionally-left-blank"></div> |
| 1026 |