Fossil SCM

fossil-scm / www / cap-theorem.md

Source Blame History 173 lines

3ddd56d…	wyoung	1	# Fossil and the CAP Theorem
3ddd56d…	wyoung	2
3ddd56d…	wyoung	3	[The CAP theorem][cap] is a fundamental mathematical proof about
3ddd56d…	wyoung	4	distributed systems. A software system can no more get around it than a
3ddd56d…	wyoung	5	physical system can get past c, the [speed of light][sol] constant.
3ddd56d…	wyoung	6
3ddd56d…	wyoung	7	Fossil is a distributed system, so it can be useful to think about it in
3ddd56d…	wyoung	8	terms of the CAP theorem. We won’t discuss the theorem itself or how you
3ddd56d…	wyoung	9	reason using its results here. For that, we recommend [this article][tut].
3ddd56d…	wyoung	10
3ddd56d…	wyoung	11	[cap]: https://en.wikipedia.org/wiki/CAP_theorem
3ddd56d…	wyoung	12	[sol]: https://en.wikipedia.org/wiki/Speed_of_light
3ddd56d…	wyoung	13	[tut]: https://www.ibm.com/cloud/learn/cap-theorem
3ddd56d…	wyoung	14
3ddd56d…	wyoung	15
991b089…	wyoung	16	<a id="ap"></a>
3ddd56d…	wyoung	17	## Fossil Is an AP-Mode System
3ddd56d…	wyoung	18
3ddd56d…	wyoung	19	As with all common [DVCSes][dvcs], Fossil is an AP-mode system, meaning
3ddd56d…	wyoung	20	that your local clone isn’t necessarily consistent with all other clones
3ddd56d…	wyoung	21	(C), but the system is always available for use (A) and
3ddd56d…	wyoung	22	partition-tolerant (P). This is what allows you to turn off Fossil’s
3ddd56d…	wyoung	23	autosync mode, go off-network, and continue working with Fossil, even
3ddd56d…	wyoung	24	though only a single node (your local repo clone) is accessible at the
3ddd56d…	wyoung	25	time.
3ddd56d…	wyoung	26
3ddd56d…	wyoung	27	You may consider that going back online restores “C”, because upon sync,
3ddd56d…	wyoung	28	you’re now consistent with the repo you cloned from. But, if another
3ddd56d…	wyoung	29	user has gone offline in the meantime, and they’ve made commits to their
67a4c1d…	wyoung	30	disconnected repo, you aren’t consistent with them. Besides which,
67a4c1d…	wyoung	31	if another user commits to the central repo, that doesn’t push the
67a4c1d…	wyoung	32	change down to you automatically: even if all users of a Fossil system
67a4c1d…	wyoung	33	are online at the same instant, and they’re all using autosync, Fossil
67a4c1d…	wyoung	34	doesn’t guarantee consistency across the network.
3ddd56d…	wyoung	35
3ddd56d…	wyoung	36	There’s no getting around the CAP theorem!
3ddd56d…	wyoung	37
3ddd56d…	wyoung	38	[dvcs]: https://en.wikipedia.org/wiki/Distributed_version_control
3ddd56d…	wyoung	39
3ddd56d…	wyoung	40
991b089…	wyoung	41	<a id="ca"></a>
3ddd56d…	wyoung	42	## CA-Mode Fossil
3ddd56d…	wyoung	43
3ddd56d…	wyoung	44	What would it mean to redesign Fossil to be CA-mode?
3ddd56d…	wyoung	45
3ddd56d…	wyoung	46	It means we get a system that is always consistent (C) and available (A)
3ddd56d…	wyoung	47	as long as there are no partitions (P).
3ddd56d…	wyoung	48
3ddd56d…	wyoung	49	That’s basically [CVS] and [Subversion][svn]: you can only continue
67a4c1d…	wyoung	50	working with the repository itself as long as your connection to the central repo server functions.
3ddd56d…	wyoung	51
67a4c1d…	wyoung	52	It’s rather trivial to talk about single-point-of-failure systems like
67a4c1d…	wyoung	53	CVS or Subversion as
3ddd56d…	wyoung	54	CA-mode. Another common example used this way is a classical RDBMS, but
3ddd56d…	wyoung	55	aren’t we here to talk about distributed systems? What’s a good example
3ddd56d…	wyoung	56	of a distributed CA-mode system?
3ddd56d…	wyoung	57
3ddd56d…	wyoung	58	A better example is [Kafka], which in its default configuration assumes
3ddd56d…	wyoung	59	it being run on a corporate LAN in a single data center, so network
67a4c1d…	wyoung	60	partitions are exceedingly rare. It therefore sacrifices partition
67a4c1d…	wyoung	61	tolerance to get the advantages of CA-mode operation. In its particular application of
67a4c1d…	wyoung	62	this mode, a
3ddd56d…	wyoung	63	message isn’t “committed” until all running brokers have a copy of it,
67a4c1d…	wyoung	64	at which point the message becomes visible to the client(s). In that
67a4c1d…	wyoung	65	way, all clients always see the same message store as long as all of the
67a4c1d…	wyoung	66	Kafka servers are up and communicating.
3ddd56d…	wyoung	67
3ddd56d…	wyoung	68	How would that work in Fossil terms?
3ddd56d…	wyoung	69
3ddd56d…	wyoung	70	If there is only one central server and I clone it on my local laptop,
3ddd56d…	wyoung	71	then CA mode means I can only commit if the remote Fossil is available,
3ddd56d…	wyoung	72	so in that sense, it devolves to the old CVS model.
3ddd56d…	wyoung	73
3ddd56d…	wyoung	74	What if there are three clones? Perhaps there is a central server A,
3ddd56d…	wyoung	75	the clone B on my laptop, and the clone C on your laptop. Doesn’t CA
3ddd56d…	wyoung	76	mode now mean that my commit on B doesn’t exist after I commit it to
3ddd56d…	wyoung	77	the central repo A until you, my coworker, also pull down the copy
3ddd56d…	wyoung	78	of that commit to your laptop C, validating the commit through the
3ddd56d…	wyoung	79	network?
3ddd56d…	wyoung	80
3ddd56d…	wyoung	81	That’s one way to design the system, but another way would be to scope
3ddd56d…	wyoung	82	the system to only talk about proper servers, not about the clients.
3ddd56d…	wyoung	83	In that model, a CA-mode Fossil alternative might require 2+ servers to
3ddd56d…	wyoung	84	be running for proper replication. When I make a commit, if all of the
3ddd56d…	wyoung	85	configured servers aren’t online, I can’t commit. This is basically CVS
3ddd56d…	wyoung	86	with replication, but without any useful amount of failover.
3ddd56d…	wyoung	87
3ddd56d…	wyoung	88	[CVS]: https://en.wikipedia.org/wiki/Concurrent_Versions_System
3ddd56d…	wyoung	89	[Kafka]: https://engineering.linkedin.com/kafka/intra-cluster-replication-apache-kafka
3ddd56d…	wyoung	90	[svn]: https://en.wikipedia.org/wiki/Apache_Subversion
3ddd56d…	wyoung	91
3ddd56d…	wyoung	92
991b089…	wyoung	93	<a id="cp"></a>
3ddd56d…	wyoung	94	## CP-Mode Fossil
3ddd56d…	wyoung	95
3ddd56d…	wyoung	96	What if we modify our CA-mode system above with “warm spares”? We can
3ddd56d…	wyoung	97	say that commits must go to all of the spares as well as the active
3ddd56d…	wyoung	98	servers, but a loss of one active server requires that one warm spare
3ddd56d…	wyoung	99	come into active state, and all of the clients learn that the spare is
3ddd56d…	wyoung	100	now considered “active.” At this point, you have a CP-mode system, not a
3ddd56d…	wyoung	101	CA-mode system, because it’s now partition-tolerant (P) but it becomes
67a4c1d…	wyoung	102	unavailable when there aren’t enough active servers or warm
3ddd56d…	wyoung	103	spares to promote to active status.
3ddd56d…	wyoung	104
3ddd56d…	wyoung	105	CP is your classical [BFT] style distributed consensus system, where the
3ddd56d…	wyoung	106	system is available only if the client can contact a majority of the
3ddd56d…	wyoung	107	servers. This is a formalization of the warm spare concept above: with
3ddd56d…	wyoung	108	N server nodes, you need at least ⌊N / 2⌋ + 1 of them to be online
3ddd56d…	wyoung	109	for a commit to succeed.
3ddd56d…	wyoung	110
8a9facb…	wyoung	111	Many distributed database systems run in CP mode because consistency (C) and
3ddd56d…	wyoung	112	partition-tolerance (P) is a useful combination. What you lose is
3ddd56d…	wyoung	113	always-available (A) operation: with a suitably bad partition, the
3ddd56d…	wyoung	114	system goes down for users on the small side of that partition.
3ddd56d…	wyoung	115
3ddd56d…	wyoung	116	An optional CP mode for Fossil would be attractive in some ways since in
3ddd56d…	wyoung	117	some sense Fossil is a distributed DBMS, but in practical terms, it
3ddd56d…	wyoung	118	means Fossil would then not be a [DVCS] in the most useful sense, being
3ddd56d…	wyoung	119	that you could work while your client is disconnected from the remote
3ddd56d…	wyoung	120	Fossil it cloned from.
3ddd56d…	wyoung	121
3ddd56d…	wyoung	122	A fraught question is whether the non-server Fossil clones count as
3ddd56d…	wyoung	123	“nodes” in this sense.
3ddd56d…	wyoung	124
3ddd56d…	wyoung	125	If they do count, then if there are only two systems, the central server
3ddd56d…	wyoung	126	and the clone on my laptop, then it stands to reason from the formula
3ddd56d…	wyoung	127	above that I can only commit if the central server is available. In that
3ddd56d…	wyoung	128	scheme, a CP-mode Fossil is basically like CVS.
3ddd56d…	wyoung	129
3ddd56d…	wyoung	130	But what happens if my company hires a coworker to help me with the
3ddd56d…	wyoung	131	project, and this person makes their own clone of the central repo? The
3ddd56d…	wyoung	132	equation says I still need 2 nodes to be available for a commit, so if
3ddd56d…	wyoung	133	my new coworker goes off-network, that doesn’t affect whether I can make
3ddd56d…	wyoung	134	commits. Likewise, if I go off-network, my coworker can make commits to
3ddd56d…	wyoung	135	the central server.
3ddd56d…	wyoung	136
3ddd56d…	wyoung	137	But what happens if the central server goes down? The equation says we
3ddd56d…	wyoung	138	still have 2 nodes, so we should be able to commit, right? Sure, but
3ddd56d…	wyoung	139	only if my laptop and communicate directly to my coworker’s laptop! If
3ddd56d…	wyoung	140	it can’t, that’s also a network partition, so N=1 on both sides in
3ddd56d…	wyoung	141	that case. The implication is that for a true CP-mode Fossil, we’d need
3ddd56d…	wyoung	142	some kind of peer-to-peer networking layer so that our laptops can
3ddd56d…	wyoung	143	accept commits from the other, so that when the central server comes
3ddd56d…	wyoung	144	online, one of us can send the results up to it to get it caught up.
3ddd56d…	wyoung	145
3ddd56d…	wyoung	146	But doesn’t that then mean there is no security? How does [Fossil’s RBAC
3ddd56d…	wyoung	147	system][caps] work if peer-to-peer commits are allowed?
3ddd56d…	wyoung	148
3ddd56d…	wyoung	149	You can instead reconceptualize the system as “node” meaning only server
3ddd56d…	wyoung	150	nodes, so that client-only systems don’t count. This allows you to have
3ddd56d…	wyoung	151	an RBAC system again.
3ddd56d…	wyoung	152
3ddd56d…	wyoung	153	With just one central server, ⌊1/2⌋+1=1, so you get CVS-like behavior:
3ddd56d…	wyoung	154	if the server’s up, you can commit.
3ddd56d…	wyoung	155
3ddd56d…	wyoung	156	If you set up 2 servers for redundancy, both must be up for commits to
3ddd56d…	wyoung	157	be allowed, since otherwise you could end up with half the commits going
3ddd56d…	wyoung	158	to the server on one side of a network partition, half going to the
3ddd56d…	wyoung	159	other, and no way to arbitrate among the two once the partition is
3ddd56d…	wyoung	160	lifted.
3ddd56d…	wyoung	161
3ddd56d…	wyoung	162	(Today’s AP-mode Fossil has this capability, but the necessary cost is
3ddd56d…	wyoung	163	“C”, consistency! Once again, you can’t get around the CAP theorem.)
3ddd56d…	wyoung	164
3ddd56d…	wyoung	165	3 servers is more sensible: any client that can see at least 2 of them
3ddd56d…	wyoung	166	can commit.
3ddd56d…	wyoung	167
3ddd56d…	wyoung	168	Will there ever be a CP-mode Fossil? This author doubts it, but as I’ve
3ddd56d…	wyoung	169	shown, it would be useful in contexts where you’d rather have a
3ddd56d…	wyoung	170	guarantee of consistency than availability.
3ddd56d…	wyoung	171
3ddd56d…	wyoung	172	[BFT]: https://en.wikipedia.org/wiki/Byzantine_fault
3ddd56d…	wyoung	173	[caps]: ./caps/

Fossil SCM

Keyboard Shortcuts