Fossil SCM

fossil-scm / www / theory1.wiki

Source Blame History 142 lines

a630398…	drh	1	<title>Thoughts On The Design Of The Fossil DVCS</title>
a630398…	drh	2
a630398…	drh	3	Two questions (or criticisms) that arise frequently regarding Fossil
a630398…	drh	4	can be summarized as follows:
a630398…	drh	5
a630398…	drh	6	1. Why is Fossil based on SQLite instead of a distributed NoSQL database?
a630398…	drh	7
a630398…	drh	8	2. Why is Fossil written in C instead of a modern high-level language?
a630398…	drh	9
a630398…	drh	10	Neither question can be answered directly because they are both
a630398…	drh	11	based on false assumptions. We claim that Fossil is not based on SQLite
a630398…	drh	12	at all and that Fossil is not based on a distributed NoSQL database
a630398…	drh	13	because Fossil is a distributed NoSQL database. And, Fossil does use
a630398…	drh	14	a modern high-level language for its implementation, namely SQL.
a630398…	drh	15
a630398…	drh	16	<h2>Fossil Is A NoSQL Database</h2>
a630398…	drh	17
a630398…	drh	18	We begin with the first question: Fossil is not based on a distributed
a630398…	drh	19	NoSQL database because Fossil <u><i>is</i></u> a distributed NoSQL database.
f47b705…	jan.nijtmans	20	Fossil is <u>not</u> based on SQLite.
a630398…	drh	21	The current implementation of Fossil uses
a630398…	drh	22	SQLite as a local store for the content of the distributed database and as
a630398…	drh	23	a cache for meta-information about the distributed database that is precomputed
a630398…	drh	24	for quick and easy presentation. But the use of SQLite in this role is an
a630398…	drh	25	implementation detail and is not fundamental to the design. Some future
a630398…	drh	26	version of Fossil might do away with SQLite and substitute a pile-of-files or
f47b705…	jan.nijtmans	27	a key/value database in place of SQLite.
a630398…	drh	28	(Actually, that is very unlikely
a630398…	drh	29	to happen since SQLite works amazingly well in its current role, but the point
a630398…	drh	30	is that omitting SQLite from Fossil is a theoretical possibility.)
a630398…	drh	31
a630398…	drh	32	The underlying database that Fossil implements has nothing to do with
a630398…	drh	33	SQLite, or SQL, or even relational database theory. The underlying
a630398…	drh	34	database is very simple: it is an unordered collection of "artifacts".
a630398…	drh	35	An artifact is a list of bytes - a "file" in the usual manner of thinking.
a630398…	drh	36	Many artifacts are simply the content of source files that have
a630398…	drh	37	been checked into the Fossil repository. Call these "content artifacts".
a630398…	drh	38	Other artifacts, known as
a630398…	drh	39	"control artifacts", contain ASCII text in a particular format that
a630398…	drh	40	defines relationships between other artifacts, such as which
a630398…	drh	41	content artifacts that go together to form a particular version of the
796db89…	drh	42	project. Each artifact is named by its SHA1 or SHA3-256 hash and is
796db89…	drh	43	thus immutable.
a630398…	drh	44	Artifacts can be added to the database but not removed (if we ignore
a630398…	drh	45	the exceptional case of [./shunning.wiki \| shunning].) Repositories
a630398…	drh	46	synchronize by computing the union of their artifact sets. SQL and
a630398…	drh	47	relation theory play no role in any of this.
a630398…	drh	48
a630398…	drh	49	SQL enters the picture only in the implementation details. The current
a630398…	drh	50	implementation of Fossil stores each artifact as a BLOB in an SQLite
a630398…	drh	51	database.
a630398…	drh	52	The current implementation also parses up each control artifact as it
a630398…	drh	53	arrives and stores the information discovered from that parse in various
a630398…	drh	54	other SQLite tables to facilitate rapid generation of reports such as
a630398…	drh	55	timelines, file histories, file lists, branch lists, and so forth. Note
a630398…	drh	56	that all of this additional information is derived from the artifacts.
a630398…	drh	57	The artifacts are canonical. The relational tables serve only as a cache.
a630398…	drh	58	Everything in the relational tables can be recomputed
a630398…	drh	59	from the artifacts, and in fact that is exactly what happens when one runs
a630398…	drh	60	the "fossil rebuild" command on a repository.
a630398…	drh	61
a630398…	drh	62	So really, Fossil works with two separate databases. There is the
a630398…	drh	63	bag-of-artifacts database which is non-relational and distributed (like
a630398…	drh	64	a NoSQL database) and there is the local relational database. The
a630398…	drh	65	bag-of-artifacts database has a fixed format and is what defines a Fossil
a630398…	drh	66	repository. Fossil will never modify the file format of the bag-of-artifacts
a630398…	drh	67	database in an incompatible way because to do so would be to make something
a630398…	drh	68	that is no longer "Fossil". The local relational database, on the other hand,
f47b705…	jan.nijtmans	69	is a cache that contains information derived from the bag-of-artifacts.
a630398…	drh	70	The schema of the local relational database changes from time to time as
a630398…	drh	71	the Fossil implementation is enhanced, and the content is recomputed from
a630398…	drh	72	the unchanging bag of artifacts. The local relational database is an
a630398…	drh	73	implementation detail which currently happens to use SQLite.
a630398…	drh	74
a630398…	drh	75	Another way to think of the relational tables in a Fossil repository is
a630398…	drh	76	as an index for the artifacts. Without the relational tables,
a630398…	drh	77	to generate a report like a timeline would require scanning every artifact -
a3be0b8…	drh	78	the equivalent of a full table scan. The relational tables hold pointers to
44b02c3…	BMorgat	79	the relevant artifacts in presorted order so that generating a timeline
a630398…	drh	80	is much more efficient. So like an index in a relational database, the
a3be0b8…	drh	81	relational tables in a Fossil repository do not add any new information,
a630398…	drh	82	they merely make the information in the artifacts faster and easier to
a630398…	drh	83	look up.
a630398…	drh	84
a630398…	drh	85	Fossil is not "based" on SQLite. Fossil simply exploits SQLite as
a630398…	drh	86	a powerful tool to make the implementation easier.
a630398…	drh	87	And Fossil doesn't use a distributed
a630398…	drh	88	NoSQL database because Fossil is a distributed NoSQL database. That answers
a630398…	drh	89	the first question.
a630398…	drh	90
a630398…	drh	91	<h2>SQL Is A High-Level Scripting Language</h2>
a630398…	drh	92
a630398…	drh	93	The second concern states that Fossil does not use a high-level scripting
f47b705…	jan.nijtmans	94	language. But that is not true. Fossil uses SQL (as implemented by SQLite)
ce6b68c…	drh	95	as its scripting language.
a630398…	drh	96
a630398…	drh	97	This misunderstanding likely arises because people fail
a630398…	drh	98	to appreciate that SQL is a programming language. People are taught that SQL
ce6b68c…	drh	99	is a "query language" as if that were somehow different from a
7b011ab…	stephan	100	"programming language". But they really are two different flavors of the
a630398…	drh	101	same thing. I find that people do better with SQL if they think of
a630398…	drh	102	SQL as a programming language and each statement
81b0597…	drh	103	of SQL is a separate program. SQL is a peculiar programming language
8748d75…	eric	104	in that one uses SQL to specify <i>what</i> to compute whereas in
a630398…	drh	105	most other programming languages one specifies <i>how</i>
a630398…	drh	106	to carry out the computation.
a630398…	drh	107	This difference means that SQL
a630398…	drh	108	is an extraordinary high-level programming language, but it is still
a630398…	drh	109	just a programming language.
a630398…	drh	110
a630398…	drh	111	For certain types of problems, SQL has a huge advantage over other
a630398…	drh	112	programming languages because it is so high level and because it allows
4d1ac68…	drh	113	programmers to focus more on the <i>what</i> and less on the <i>how</i>
a630398…	drh	114	of a computation. In other words,
a630398…	drh	115	programmers tend to think about problems at a much higher level when
16e703b…	drh	116	using SQL; this can result in better applications.
a630398…	drh	117	SQL is also very dense.
a630398…	drh	118	In practice, this often means that a few
a630398…	drh	119	lines of SQL can often replace hundreds or thousands of lines of
a630398…	drh	120	procedural code, with a corresponding decrease in programming effort
a630398…	drh	121	and opportunities to introduce bugs.
a630398…	drh	122	Fossil happens to be one of those problems for which SQL is well suited.
a630398…	drh	123
a630398…	drh	124	Much of the "heavy lifting" within the Fossil implementation is carried
a630398…	drh	125	out using SQL statements. It is true that these SQL statements are glued
a630398…	drh	126	together with C code, but it turns out that C works surprisingly well in
a630398…	drh	127	that role. Several early prototypes of Fossil were written in a scripting
a630398…	drh	128	language (TCL). We normally find that TCL programs are shorter than the
a630398…	drh	129	equivalent C code by a factor of 10 or more. But in the case of Fossil,
f47b705…	jan.nijtmans	130	the use of TCL was actually making the code longer and more difficult to
a630398…	drh	131	understand.
a630398…	drh	132	And so in the final design, we switched from TCL to C in order to make
a630398…	drh	133	the code easier to implement and debug.
a630398…	drh	134
a630398…	drh	135	Without the advantages of having SQLite built in, the design might well
a630398…	drh	136	have followed a different path. Most reports generated by Fossil involve
a630398…	drh	137	a complex set of queries against the relational tables of the repository
a630398…	drh	138	database. These queries are normally implemented in only a few dozen
a630398…	drh	139	lines of SQL code. But if those queries had been implemented procedurally
a630398…	drh	140	using a key/value or pile-of-files database, it
a630398…	drh	141	may have well been the case that a high-level scripting language such as
a630398…	drh	142	Tcl, Python, or Ruby may have worked out better than C.

Fossil SCM

Keyboard Shortcuts