# Fossil and the CAP Theorem

[The CAP theorem][cap] is a fundamental mathematical proof about
distributed systems. A software system can no more get around it than a
physical system can get past *c*, the [speed of light][sol] constant.

Fossil is a distributed system, so it can be useful to think about it in
terms of the CAP theorem. We won’t discuss the theorem itself or how you
reason using its results here. For that, we recommend [this article][tut].

[cap]: https://en.wikipedia.org/wiki/CAP_theorem
[sol]: https://en.wikipedia.org/wiki/Speed_of_light
[tut]: https://www.ibm.com/cloud/learn/cap-theorem


<a id="ap"></a>
## Fossil Is an AP-Mode System

As with all common [DVCSes][dvcs], Fossil is an AP-mode system, meaning
that your local clone isn’t necessarily consistent with all other clones
(C), but the system is always available for use (A) and
partition-tolerant (P). This is what allows you to turn off Fossil’s
autosync mode, go off-network, and continue working with Fossil, even
though only a single node (your local repo clone) is accessible at the
time.
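
The AP trade-off can be sketched as a toy model. Everything here (the
`Node` class, the `sync` function) is invented for illustration and is
not Fossil’s actual sync protocol:

```python
class Node:
    """A toy model of one Fossil clone: a set of known commit IDs."""
    def __init__(self, name):
        self.name = name
        self.commits = set()
        self.online = False

    def commit(self, artifact):
        # Availability (A): a local commit always succeeds,
        # whether or not any other node is reachable.
        self.commits.add(artifact)

def sync(a, b):
    # Two clones converge only when both are reachable; until then
    # they may disagree, which is the "C" an AP system gives up.
    if a.online and b.online:
        merged = a.commits | b.commits
        a.commits, b.commits = set(merged), set(merged)

server, laptop = Node("central"), Node("laptop")
laptop.commit("abc123")                 # off-network: commit still works
assert "abc123" not in server.commits   # ...but the clones now disagree
laptop.online = server.online = True
sync(laptop, server)                    # back online: pairwise agreement
```

Note that even after this sync, a third clone that stayed offline still
disagrees with both nodes: reconnecting restores consistency only
pairwise, as described below.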

You may consider that going back online restores “C”, because upon sync,
you’re now consistent with the repo you cloned from. But if another user
has gone offline in the meantime and made commits to their disconnected
repo, *you* aren’t consistent with *them.* Moreover, if another user
commits to the central repo, that doesn’t push the change down to you
automatically: even if all users of a Fossil system are online at the
same instant, and they’re all using autosync, Fossil doesn’t guarantee
consistency across the network.

There’s no getting around the CAP theorem!

[dvcs]: https://en.wikipedia.org/wiki/Distributed_version_control


<a id="ca"></a>
## CA-Mode Fossil

What would it mean to redesign Fossil to be CA-mode?

It means we get a system that is always consistent (C) and available (A)
as long as there are no partitions (P).

That’s basically [CVS] and [Subversion][svn]: you can only continue
working with the repository itself as long as your connection to the
central repo server functions.

It’s rather trivial to talk about single-point-of-failure systems like
CVS or Subversion as CA-mode. Another common example used this way is a
classical RDBMS, but aren’t we here to talk about distributed systems?
What’s a good example of a *distributed* CA-mode system?

A better example is [Kafka], which in its default configuration assumes
it is being run on a corporate LAN in a single data center, so network
partitions are exceedingly rare. It therefore sacrifices partition
tolerance to get the advantages of CA-mode operation. In its particular
application of this mode, a message isn’t “committed” until all running
brokers have a copy of it, at which point the message becomes visible to
the client(s). In that way, all clients always see the same message
store as long as all of the Kafka servers are up and communicating.
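
That “all running brokers” rule can be sketched as follows. This is a
deliberately simplified model of CA-style replication, not Kafka’s
actual protocol; the class and method names are invented:

```python
class Broker:
    """A toy replica: just an append-only log and an up/down flag."""
    def __init__(self):
        self.log = []
        self.up = True

class Cluster:
    def __init__(self, brokers):
        self.brokers = brokers
        self.committed = []   # messages clients are allowed to read

    def publish(self, msg):
        # CA-style rule: every replica must take the write, or the
        # write fails outright. There is no partition tolerance here.
        if not all(b.up for b in self.brokers):
            raise RuntimeError("a replica is down; cannot commit")
        for b in self.brokers:
            b.log.append(msg)
        self.committed.append(msg)  # now visible, identically, everywhere

cluster = Cluster([Broker() for _ in range(3)])
cluster.publish("m1")            # all replicas up: commit succeeds
cluster.brokers[2].up = False
try:
    cluster.publish("m2")        # one replica down: the write is refused
except RuntimeError:
    pass
```

Every reader sees the same `committed` list (C), and writes succeed
whenever the cluster is healthy (A), but a single unreachable replica
stops all writes: the P in CAP is what got sacrificed.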

How would that work in Fossil terms?

If there is only one central server and I clone it on my local laptop,
then CA mode means I can only commit if the remote Fossil is available,
so in that sense, it devolves to the old CVS model.

What if there are three clones? Perhaps there is a central server *A*,
the clone *B* on my laptop, and the clone *C* on your laptop. Doesn’t CA
mode now mean that my commit on *B* doesn’t exist after I commit it to
the central repo *A* until you, my coworker, *also* pull down the copy
of that commit to your laptop *C*, validating the commit through the
network?

That’s one way to design the system, but another way would be to scope
the system to only talk about proper *servers*, not about the clients.
In that model, a CA-mode Fossil alternative might require 2+ servers to
be running for proper replication. When I make a commit, if all of the
configured servers aren’t online, I can’t commit. This is basically CVS
with replication, but without any useful amount of failover.

[CVS]: https://en.wikipedia.org/wiki/Concurrent_Versions_System
[Kafka]: https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
[svn]: https://en.wikipedia.org/wiki/Apache_Subversion


<a id="cp"></a>
## CP-Mode Fossil

What if we modify our CA-mode system above with “warm spares”? We can
say that commits must go to all of the spares as well as the active
servers, but a loss of one active server requires that one warm spare
come into active state, and all of the clients learn that the spare is
now considered “active.” At this point, you have a CP-mode system, not a
CA-mode system, because it’s now partition-tolerant (P) but it becomes
unavailable when there aren’t enough active servers or warm spares to
promote to active status.

CP is your classical [BFT]-style distributed consensus system, where the
system is available only if the client can contact a *majority* of the
servers. This is a formalization of the warm spare concept above: with
*N* server nodes, you need at least ⌊*N* / 2⌋ + 1 of them to be online
for a commit to succeed.
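
That majority rule is simple to express in code. A minimal sketch (the
function names here are ours, not part of Fossil or of any consensus
library):

```python
def quorum(n):
    # Majority quorum for n server nodes: floor(n/2) + 1.
    return n // 2 + 1

def can_commit(total_servers, reachable_servers):
    # A client may commit only if it can reach a majority of servers.
    return reachable_servers >= quorum(total_servers)

# 1 server: quorum is 1 (CVS-like: that one server must be up).
# 2 servers: quorum is 2 (no failover: both must be up).
# 3 servers: quorum is 2 (one server may fail or be partitioned away).
sizes = {n: quorum(n) for n in (1, 2, 3, 4, 5)}
```

Note that the marginal value of an even-numbered server is nil: going
from 3 servers to 4 raises the quorum from 2 to 3 without letting the
system tolerate any additional failures.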

Many distributed database systems run in CP mode because consistency (C)
and partition-tolerance (P) is a useful combination. What you lose is
always-available (A) operation: with a suitably bad partition, the
system goes down for users on the small side of that partition.

An optional CP mode for Fossil would be attractive in some ways since in
some sense Fossil is a distributed DBMS, but in practical terms, it
means Fossil would then not be a [DVCS] in the most useful sense of that
term: the ability to keep working while your client is disconnected from
the remote Fossil it cloned from.

A fraught question is whether the non-server Fossil clones count as
“nodes” in this sense.

If they do count, then if there are only two systems, the central server
and the clone on my laptop, it stands to reason from the formula above
that I can only commit if the central server is available. In that
scheme, a CP-mode Fossil is basically like CVS.

But what happens if my company hires a coworker to help me with the
project, and this person makes their own clone of the central repo? The
equation says I still need 2 nodes to be available for a commit, so if
my new coworker goes off-network, that doesn’t affect whether I can make
commits. Likewise, if I go off-network, my coworker can make commits to
the central server.

But what happens if the central server goes down? The equation says we
still have 2 nodes, so we should be able to commit, right? Sure, but
only if my laptop can communicate directly with my coworker’s laptop! If
it can’t, that’s also a network partition, so *N=1* on both sides in
that case. The implication is that for a true CP-mode Fossil, we’d need
some kind of peer-to-peer networking layer so that our laptops can
accept commits from each other, and so that when the central server
comes back online, one of us can send the results up to it to get it
caught up.

But doesn’t that then mean there is no security? How does [Fossil’s RBAC
system][caps] work if peer-to-peer commits are allowed?

You can instead reconceptualize the system as “node” meaning only server
nodes, so that client-only systems don’t count. This allows you to have
an RBAC system again.

With just one central server, ⌊1/2⌋ + 1 = 1, so you get CVS-like
behavior: if the server’s up, you can commit.

If you set up 2 servers for redundancy, both must be up for commits to
be allowed, since otherwise you could end up with half the commits going
to the server on one side of a network partition, half going to the
other, and no way to arbitrate between the two once the partition is
lifted.

(Today’s AP-mode Fossil has this capability, but the necessary cost is
“C”, consistency! Once again, you can’t get around the CAP theorem.)

3 servers is more sensible: any client that can see at least 2 of them
can commit.

Will there ever be a CP-mode Fossil? This author doubts it, but as shown
above, it would be useful in contexts where you’d rather have a
guarantee of consistency than of availability.

[BFT]: https://en.wikipedia.org/wiki/Byzantine_fault
[caps]: ./caps/