# Fossil and the CAP Theorem

[The CAP theorem][cap] is a fundamental mathematical result about
distributed systems. A software system can no more get around it than a
physical system can get past *c*, the [speed of light][sol] constant.

Fossil is a distributed system, so it can be useful to think about it in
terms of the CAP theorem. We won’t discuss the theorem itself or how you
reason using its results here. For that, we recommend [this article][tut].

[cap]: https://en.wikipedia.org/wiki/CAP_theorem
[sol]: https://en.wikipedia.org/wiki/Speed_of_light
[tut]: https://www.ibm.com/cloud/learn/cap-theorem


<a id="ap"></a>
## Fossil Is an AP-Mode System

As with all common [DVCSes][dvcs], Fossil is an AP-mode system, meaning
that your local clone isn’t necessarily consistent with all other clones
(C), but the system is always available for use (A) and
partition-tolerant (P). This is what allows you to turn off Fossil’s
autosync mode, go off-network, and continue working with Fossil, even
though only a single node (your local repo clone) is accessible at the
time.

You may consider that going back online restores “C”, because upon sync,
you’re now consistent with the repo you cloned from. But if another
user has gone offline in the meantime and made commits to their
disconnected repo, *you* aren’t consistent with *them.* Besides which,
if another user commits to the central repo, that doesn’t push the
change down to you automatically: even if all users of a Fossil system
are online at the same instant, and they’re all using autosync, Fossil
doesn’t guarantee consistency across the network.
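
The divergence described above can be sketched with a toy model. To be
clear, this is purely illustrative and not Fossil’s actual sync
protocol; the `Clone` class and commit names are invented:

```python
# Toy AP-mode model: each clone keeps its own always-available history,
# so two clones that commit while disconnected hold different histories.

class Clone:
    def __init__(self, name):
        self.name = name
        self.commits = []      # local history; commits never wait on the network

    def commit(self, artifact):
        # AP mode: committing is a purely local operation.
        self.commits.append(artifact)

    def sync(self, other):
        # Exchange everything either side is missing (ordering simplified).
        mine = self.commits + [c for c in other.commits if c not in self.commits]
        theirs = other.commits + [c for c in self.commits if c not in other.commits]
        self.commits, other.commits = mine, theirs

central = Clone("central")
laptop = Clone("laptop")
laptop.commit("fix typo")       # made while off-network
central.commit("add feature")   # someone else pushed to central meanwhile

# Before syncing, the two clones are *not* consistent:
assert set(central.commits) != set(laptop.commits)

laptop.sync(central)            # going back online
assert set(central.commits) == set(laptop.commits)
```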

There’s no getting around the CAP theorem!

[dvcs]: https://en.wikipedia.org/wiki/Distributed_version_control


<a id="ca"></a>
## CA-Mode Fossil

What would it mean to redesign Fossil to be CA-mode?

It means we get a system that is always consistent (C) and available (A)
as long as there are no partitions (P).

That’s basically [CVS] and [Subversion][svn]: you can keep working with
the repository only as long as your connection to the central repo
server functions.

It’s rather trivial to describe single-point-of-failure systems like
CVS or Subversion as CA-mode. Another common example used this way is a
classical RDBMS, but aren’t we here to talk about distributed systems?
What’s a good example of a *distributed* CA-mode system?

A better example is [Kafka], which in its default configuration assumes
it is running on a corporate LAN in a single data center, where network
partitions are exceedingly rare. It therefore sacrifices partition
tolerance to get the advantages of CA-mode operation. In its particular
application of this mode, a message isn’t “committed” until all running
brokers have a copy of it, at which point the message becomes visible to
the client(s). In that way, all clients always see the same message
store as long as all of the Kafka servers are up and communicating.
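
That visibility rule can be modeled in a few lines. This is a hedged
sketch of the rule as described above, not Kafka’s real replication
code; the `Broker` and `Cluster` names and the message are invented:

```python
# Sketch of "visible only once every running broker has a copy":
# a produced message becomes client-visible only after full replication.

class Broker:
    def __init__(self):
        self.log = []              # this broker's copy of the message store

class Cluster:
    def __init__(self, n_brokers):
        self.brokers = [Broker() for _ in range(n_brokers)]
        self.committed = []        # messages clients are allowed to read

    def produce(self, msg):
        # Replicate to every broker first...
        for b in self.brokers:
            b.log.append(msg)
        # ...and only then mark the message "committed" (client-visible).
        self.committed.append(msg)

cluster = Cluster(3)
cluster.produce("order-42")

# All clients see the same store, because commit waited on all brokers:
assert all(b.log == ["order-42"] for b in cluster.brokers)
assert cluster.committed == ["order-42"]
```

The CA trade-off is visible in the structure: `produce` cannot finish if
any broker is unreachable, which is exactly the availability the design
gives up when a partition does occur.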

How would that work in Fossil terms?

If there is only one central server and I clone it on my local laptop,
then CA mode means I can only commit if the remote Fossil is available,
so in that sense, it devolves to the old CVS model.

What if there are three clones? Perhaps there is a central server *A*,
the clone *B* on my laptop, and the clone *C* on your laptop. Doesn’t CA
mode now mean that my commit on *B* doesn’t exist after I commit it to
the central repo *A* until you, my coworker, *also* pull down the copy
of that commit to your laptop *C*, validating the commit through the
network?

That’s one way to design the system, but another way would be to scope
the system to only talk about proper *servers*, not about the clients.
In that model, a CA-mode Fossil alternative might require 2+ servers to
be running for proper replication. When I make a commit, if all of the
configured servers aren’t online, I can’t commit. This is basically CVS
with replication, but without any useful amount of failover.
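
The commit rule for that hypothetical server-scoped CA-mode design
reduces to a one-line check. The function and server names below are
invented for illustration:

```python
# CA-mode commit rule: *every* configured server must be reachable.
# One missing server blocks all commits -- no useful failover.

def ca_commit_allowed(reachable, configured):
    """True only when all configured servers are among the reachable ones."""
    return set(configured) <= set(reachable)

servers = ["srv1", "srv2"]
assert ca_commit_allowed(["srv1", "srv2"], servers) is True
assert ca_commit_allowed(["srv1"], servers) is False   # srv2 down: no commits
```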

[CVS]: https://en.wikipedia.org/wiki/Concurrent_Versions_System
[Kafka]: https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
[svn]: https://en.wikipedia.org/wiki/Apache_Subversion


<a id="cp"></a>
## CP-Mode Fossil

What if we modify our CA-mode system above with “warm spares”? We can
say that commits must go to all of the spares as well as the active
servers, but a loss of one active server requires that one warm spare
come into active state, and all of the clients learn that the spare is
now considered “active.” At this point, you have a CP-mode system, not a
CA-mode system, because it’s now partition-tolerant (P) but it becomes
unavailable when there aren’t enough active servers or warm spares to
promote to active status.
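
A toy model of the warm-spare scheme makes the CP behavior concrete.
This is a sketch of the hypothetical design above, not a real Fossil
feature, and all names are invented:

```python
# Warm-spare CP model: losing an active server consumes a spare by
# promoting it; once the spares run out, the system becomes unavailable
# (rather than inconsistent).

class WarmSpareCluster:
    def __init__(self, actives, spares):
        self.actives = set(actives)
        self.spares = set(spares)

    def lose(self, server):
        # An active server drops off the network; promote a spare if any remain.
        self.actives.discard(server)
        if self.spares:
            self.actives.add(self.spares.pop())

    def available(self, required_actives):
        # CP behavior: refuse commits when too few actives remain.
        return len(self.actives) >= required_actives

c = WarmSpareCluster(actives={"a1", "a2"}, spares={"s1"})
c.lose("a1")
assert c.available(2) is True    # the spare was promoted in a1's place
c.lose("a2")
assert c.available(2) is False   # no spares left: down, but never inconsistent
```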

CP is your classical [BFT]-style distributed consensus system, where the
system is available only if the client can contact a *majority* of the
servers. This is a formalization of the warm-spare concept above: with
*N* server nodes, you need at least ⌊*N* / 2⌋ + 1 of them to be online
for a commit to succeed.
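
In code, that majority rule is a one-liner; the `cp_commit_allowed`
helper wrapped around it here is invented for illustration:

```python
# Majority quorum: with N servers, a commit needs floor(N/2) + 1 of
# them reachable.

def quorum(n_servers):
    return n_servers // 2 + 1

def cp_commit_allowed(n_reachable, n_servers):
    return n_reachable >= quorum(n_servers)

assert quorum(1) == 1   # single server: CVS-like behavior
assert quorum(2) == 2   # two servers: both must be up
assert quorum(3) == 2   # three servers: any two suffice
assert cp_commit_allowed(2, 3) is True
assert cp_commit_allowed(1, 3) is False   # minority side of a partition
```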

Many distributed database systems run in CP mode because consistency (C)
and partition-tolerance (P) is a useful combination. What you lose is
always-available (A) operation: with a suitably bad partition, the
system goes down for users on the small side of that partition.

An optional CP mode for Fossil would be attractive in some ways, since
in some sense Fossil is a distributed DBMS. But in practical terms, it
means Fossil would then not be a [DVCS] in the most useful sense of that
term: the ability to keep working while your client is disconnected from
the remote Fossil it cloned from.

A fraught question is whether the non-server Fossil clones count as
“nodes” in this sense.

If they do count, and there are only two systems, the central server
and the clone on my laptop, then it stands to reason from the formula
above that I can only commit if the central server is available. In that
scheme, a CP-mode Fossil is basically like CVS.

But what happens if my company hires a coworker to help me with the
project, and this person makes their own clone of the central repo? The
equation says I still need 2 nodes to be available for a commit, so if
my new coworker goes off-network, that doesn’t affect whether I can make
commits. Likewise, if I go off-network, my coworker can make commits to
the central server.

But what happens if the central server goes down? The equation says we
still have 2 nodes, so we should be able to commit, right? Sure, but
only if my laptop can communicate directly with my coworker’s laptop! If
it can’t, that’s also a network partition, so *N=1* on both sides in
that case. The implication is that for a true CP-mode Fossil, we’d need
some kind of peer-to-peer networking layer so that our laptops can
accept commits from each other, and so that when the central server
comes back online, one of us can send the results up to it to get it
caught up.

But doesn’t that then mean there is no security? How does [Fossil’s RBAC
system][caps] work if peer-to-peer commits are allowed?

You can instead reconceptualize the system as “node” meaning only server
nodes, so that client-only systems don’t count. This allows you to have
an RBAC system again.

With just one central server, ⌊1/2⌋ + 1 = 1, so you get CVS-like
behavior: if the server’s up, you can commit.

If you set up 2 servers for redundancy, both must be up for commits to
be allowed, since otherwise you could end up with half the commits going
to the server on one side of a network partition, half going to the
other, and no way to arbitrate between the two once the partition is
lifted.

(Today’s AP-mode Fossil has this capability, but the necessary cost is
“C”, consistency! Once again, you can’t get around the CAP theorem.)

3 servers is more sensible: any client that can see at least 2 of them
can commit.

Will there ever be a CP-mode Fossil? This author doubts it, but as shown
above, it would be useful in contexts where you’d rather have a
guarantee of consistency than of availability.

[BFT]: https://en.wikipedia.org/wiki/Byzantine_fault
[caps]: ./caps/
